[ckan-dev] [ckan-discuss] Harvesting Dublin Core documents

William Waites ww at eris.okfn.org
Thu Nov 25 16:04:22 UTC 2010

* [2010-11-25 15:23:21 +0000] John Bywater <john.bywater at appropriatesoftware.net> wrote:

] >In any event, I would suggest that the different translators stay
] >different. The "trigger" of a harvesting job should not assume the
] >programming language they're written in and instead interact with them
] >at the unix command level. The trigger mechanism, either cron or a
] >queue listener, would be configured beforehand to say what script to
] >use according to what type of data.
] Thanks. Let's keep this discussion open. I'm not entirely sure what you 
] mean in the last paragraph. CKAN is written in Python, which could run 
] on Windows (or something else). That is, there might not be a unix 
] command level available. But do tell me more about this if you'd like to.

I've had experiences where the only document converter that was easily
available was a proprietary blob of compiled C++ code. There are also
some tools written in Java (intriguing, and probably not worth
reimplementing in Python) that do some interesting transformations
(cf. XLWrap [1], which we learned about at OGDC). Other people might be
more comfortable writing harvesting jobs in PHP or whatever their
favourite language is. This goes to the "ecosystem of tools" argument
advanced in the other mail.
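
A rough sketch of what such a language-agnostic trigger could look
like is below. It only assumes that each translator is an executable
that takes the source as its last argument and writes to stdout. The
names TRANSLATORS, run_translator, dc2ckan and xlwrap-job.jar are
hypothetical placeholders, not existing CKAN tools.

```python
import subprocess
import sys

# Hypothetical configuration: data type -> argv of the translator to run.
# The script names are placeholders, not real CKAN tools.
TRANSLATORS = {
    "dublin-core": ["dc2ckan"],
    "spreadsheet": ["java", "-jar", "xlwrap-job.jar"],
    # Portable demo entry so the sketch can be tried anywhere Python runs.
    "demo": [sys.executable, "-c", "import sys; print('harvested', sys.argv[1])"],
}


def run_translator(data_type, source, translators=TRANSLATORS):
    """Run the configured translator for data_type on source, return its stdout."""
    if data_type not in translators:
        raise KeyError("no translator configured for %r" % data_type)
    argv = list(translators[data_type]) + [source]
    # check=True raises CalledProcessError if the translator exits non-zero.
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return done.stdout
```

A cron job or queue listener would call run_translator() with the
data type it was configured for. It does not need to know what
language the translator itself is written in.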

For example, I might have a process that listens to the recent
changes RSS feed, looks for a tag called "rdf", then fetches some
part of the dataset (e.g. its voiD [2] description) and updates the
package, or even creates new packages for the subsets where
necessary, using the API.
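
The filtering step of such a process might be sketched as follows.
This assumes the feed is RSS 2.0 and that package tags show up as
<category> elements on each <item>, which may not match the actual
CKAN feed. items_with_tag and SAMPLE_FEED are made-up names for
illustration, and the fetching and API update steps are left out.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; the real recent-changes feed may be shaped differently.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Recent changes</title>
<item><title>a</title><link>http://example.org/package/a</link><category>rdf</category></item>
<item><title>b</title><link>http://example.org/package/b</link><category>csv</category></item>
</channel></rss>"""


def items_with_tag(feed_xml, tag="rdf"):
    """Return the links of feed items that carry the given tag."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        # Assumption: each tag appears as one <category> element.
        tags = {(c.text or "").strip().lower() for c in item.findall("category")}
        if tag in tags:
            links.append(item.findtext("link", default="").strip())
    return links
```

Each returned link would then be handed to whatever fetches the voiD
description and updates the package through the API.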

Or one might do the same thing, but generate a voiD description and
publish it somewhere (this is the use case that cygri and I discussed
in London).

Or listen to the RSS feed and submit some information to an
aggregation or indexing service.

All of these sorts of things are possible now, of course. What I am
saying is that we should not be encouraging the CKAN codebase to grow
(in fact I think it should shrink). When we implement new functions,
it should be done in this sort of way, as separate processes that use
the feeds and the API. Extensions of the models or APIs should be
contemplated only when it appears impossible or very inconvenient to
accomplish something this way.


[1] https://github.com/markbirbeck/xlwrap
[2] where DCat is for describing catalogues and datasets, voiD can be
thought of as a subclass for describing RDF datasets.

William Waites
9C7E F636 52F6 1004 E40A  E565 98E3 BBF3 8320 7664
