[ckan-dev] Harvesting data catalogs - proposal for datamap.json

David Read david.read at hackneyworkshop.com
Thu Mar 7 10:43:54 UTC 2013


Dmitry,

I guess it's more a political question of whether the metadata is
'pushed' from the agency or 'pulled' from data.gov.

The current harvesting is more of a 'push' and makes the agency
responsible for what ends up on data.gov. You're proposing more of a
'pull' where data.gov takes more responsibility.

The harvesting was designed in the UK where the publishers are
(legally) responsible for getting their metadata onto data.gov.uk and
from there onto the EU site. We have issues such as publishers
struggling to publish to the correct XML specifications, putting the
info into the wrong fields, and changing their metadata IDs, which
requires withdrawal from data.gov.uk and a reharvest. All these things
need to be sorted out and managed on the publisher side, not
centrally. So it would not make sense for us to move the
responsibility to data.gov.uk.

So although I understand what you propose technically, I don't yet see
the reasons for doing it. Perhaps you can elaborate on why this is
"sensible"?

David

On 1 March 2013 17:52, Dmitry Kachaev <dmitry.kachaev at gmail.com> wrote:
> Hi everyone,
> I'm thinking about ways to improve data.gov. Recently it was announced that
> data.gov is moving towards the CKAN platform -
> http://ckan.org/2013/02/04/us-data-gov-to-use-ckan/
>
> It seems that a sensible approach is to have a federated data.gov that
> harvests and indexes various agencies' data catalogs into the main data.gov
> catalog/index.
>
> Currently, AFAIK, the CKAN harvester is essentially manually fed with a
> list of CKAN API endpoints to harvest from. What we were thinking is to
> introduce an automated approach to building such an index.
>
> Here is a quick and dirty description:
>
> Every agency/organization that runs data catalog(s) will create a single,
> easily discoverable file, datamap.json, that lists information about its
> data catalog API/endpoints for harvesting. Such a file will be put into
> the root of the agency website, similar to robots.txt/sitemap.xml, e.g.
> agency.gov/datamap.json
>
> Example of the datamap.json file:
> {
>     "data-catalogs": [
>         {
>             "api-name": "data-json",
>             "version": "v1.0",
>             "endpoint": "http://data.mcc.gov/raw/index.json",
>             "contact": "MCC Open Data Initiative",
>             "email": "opendata at mcc.gov"
>         },
>         {
>             "api-name": "ogc-csw",
>             "version": "v2.0.2",
>             "endpoint": "http://geo.data.gov/geoportal/csw/discovery?Request=GetCapabilities&Service=CSW&Version=2.0.2",
>             "contact": "Geo Spatial One Stop Team",
>             "email": "onestop at fgdc.gov"
>         },
>         {
>             "api-name": "socrata-api",
>             "version": "v1.0",
>             "endpoint": "http://explore.data.gov/api/",
>             "contact": "Data.gov team",
>             "email": "contact at data.gov"
>         }
>     ]
> }
>
> Such an approach would allow us to enumerate all .gov websites, build an
> index of all data catalog endpoints, and then harvest them in a unified way
> (using the CKAN Harvester with extra plugins for different types of
> catalogs, like Socrata or geo catalogs supporting the CSW standard).
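>
> As an illustration, here is a minimal sketch (Python) of how a crawler
> could discover and enumerate such files, assuming a hypothetical list of
> .gov domains and the field names from the datamap.json example above:
>
> import json
> from urllib.request import urlopen
>
> # Hypothetical list of agency domains to scan; in practice this could
> # come from a registry of .gov domains.
> AGENCY_DOMAINS = ["mcc.gov", "fgdc.gov"]
>
> def discover_catalogs(domain):
>     """Fetch the agency's datamap.json and return its catalog entries."""
>     url = "http://%s/datamap.json" % domain
>     try:
>         response = urlopen(url, timeout=10)
>         datamap = json.loads(response.read().decode("utf-8"))
>     except Exception:
>         return []  # no datamap.json published, or unreachable
>     return datamap.get("data-catalogs", [])
>
> # Build a flat index of every discovered endpoint; each entry tells the
> # harvester which plugin to use ("api-name") and where to point it
> # ("endpoint").
> for domain in AGENCY_DOMAINS:
>     for catalog in discover_catalogs(domain):
>         print(domain, catalog["api-name"], catalog["endpoint"])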
>
> What are your thoughts on this approach? Are we reinventing the wheel? Is
> this the right place to ask this question?
>
> Thanks,
> Dmitry
>
> Dmitry Kachaev
> voice: (202) 527-9423
> twitter: @kachok
> mail: dmitry.kachaev at gmail.com
>



