[ckan-dev] Adding harvesters to thedatahub.org

Tim McNamara tim.mcnamara at okfn.org
Tue Oct 11 09:15:52 UTC 2011


This post sketches out a few ideas for adding metadata harvesters to
thedatahub.org. The role of a harvester is to crawl a remote source
once to begin with, enter the datasets it finds into thedatahub.org,
and then poll that source periodically to keep the records up to date.
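
To make that lifecycle concrete, here is a minimal sketch of the shape
I have in mind. Every name in it (fetch_remote_records, upsert_dataset)
is a hypothetical placeholder rather than an existing CKAN or harvester
API:

    # Hypothetical harvester lifecycle -- the function names are
    # placeholders, not existing CKAN APIs.

    def fetch_remote_records():
        """Return a list of dataset-like dicts from the remote source."""
        raise NotImplementedError  # depends entirely on the source

    def upsert_dataset(record):
        """Create the dataset on thedatahub.org, or update it if it exists."""
        raise NotImplementedError  # would talk to the CKAN API

    def harvest():
        # The first run imports everything; later runs (e.g. from cron)
        # repeat the same keyed upsert, which keeps records up to date.
        for record in fetch_remote_records():
            upsert_dataset(record)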

Other search engines are now providing access to hundreds of thousands
of datasets:

 - http://open.mflask.com/
 - http://logd.tw.rpi.edu/demo/international_dataset_catalog_search

What I think the thedatahub.org community could do is focus on
precision. Those sites both index catalogue pages; they don't go the
extra step of gathering direct links to the resources themselves and
adding those as well.

Here is a general approach:

 - an enthusiastic hacker builds a scraper in ScraperWiki to pull
metadata from the remote source
 - that ScraperWiki data is then retrieved by a harvester living at
thedatahub.org, run via cron (a sketch of one is below)
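
Very roughly, such a harvester could look like the Python 2 sketch
below. The ScraperWiki export URL, the CKAN REST endpoint and the
field names are all assumptions of mine and would need checking
against the actual APIs before use:

    import json
    import urllib2

    # Assumed URLs -- check the ScraperWiki and CKAN API docs before
    # relying on these.
    SCRAPER_JSON_URL = "https://api.scraperwiki.com/..."  # scraper's JSON export
    CKAN_API_URL = "http://thedatahub.org/api/rest/package"
    CKAN_API_KEY = "your-api-key"

    def fetch_scraper_rows():
        # Pull the scraped metadata out of ScraperWiki as a list of dicts.
        return json.load(urllib2.urlopen(SCRAPER_JSON_URL))

    def push_to_datahub(row):
        # Map a scraped row onto a CKAN package dict and POST it.
        # The row field names are invented for illustration.
        package = {
            "name": row["name"],
            "title": row["title"],
            "url": row.get("source_page", ""),
            "resources": [{"url": row["resource_url"],
                           "format": row.get("format", "")}],
        }
        request = urllib2.Request(
            CKAN_API_URL,
            data=json.dumps(package),
            headers={"Authorization": CKAN_API_KEY,
                     "Content-Type": "application/json"},
        )
        urllib2.urlopen(request)

    def main():
        for row in fetch_scraper_rows():
            push_to_datahub(row)

    if __name__ == "__main__":
        main()

A real harvester would also check whether the package already exists
and update it rather than re-creating it, but that is the general shape.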

The reason I like this approach is that no approval is required to
keep the scraper maintained: the fragile, source-specific scraping
logic lives on ScraperWiki, while the harvester code at thedatahub.org
is likely to be quite stable.

I've created a few scrapers to give people a feel for what it's like:

https://scraperwiki.com/scrapers/nation_vegetation_survey_metadata/
https://scraperwiki.com/scrapers/linguistic_data_consortium_catalog/
https://scraperwiki.com/scrapers/preview_global_risk_data_platform/
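
For anyone who hasn't written one, a ScraperWiki scraper is just a
short Python script. The skeleton below shows the general shape; the
URL, the CSS selectors and the field names are invented, and the
scraperwiki library calls are from memory, so treat it as a sketch
rather than a copy-paste recipe:

    import lxml.html
    import scraperwiki

    # Hypothetical catalogue -- the URL and selectors are placeholders.
    CATALOGUE_URL = "http://example.org/catalogue"

    html = scraperwiki.scrape(CATALOGUE_URL)
    root = lxml.html.fromstring(html)

    for entry in root.cssselect("div.dataset"):
        record = {
            "name": entry.get("id"),
            "title": entry.cssselect("h2")[0].text_content().strip(),
            # The direct link to the data file -- the "precision" step above.
            "resource_url": entry.cssselect("a.download")[0].get("href"),
        }
        # Keyed save: re-running the scraper updates rows instead of
        # duplicating them.
        scraperwiki.sqlite.save(unique_keys=["name"], data=record)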

Two questions: What do people think? Would anyone like to help?



