[ckan-dev] [ckan-discuss] Harvesting Dublin Core documents
john.bywater at appropriatesoftware.net
Thu Nov 25 17:43:06 UTC 2010
Regarding simplifying the CKAN codebase, I'm totally with you.
Despite having added quite a lot, I've always tried to do that. And the
precedent within CKAN for doing what you seem to be proposing with the
harvesting sub-domain model was set by me when I extracted the licenses
sub-domain model into a separate Python package, to be run as a separate
service, so that we could simplify CKAN software and services.
So I'm very happy to consider repeating that game with CKAN's harvesting
sub-domain model. More details below...
William Waites wrote:
> * [2010-11-25 15:23:21 +0000] John Bywater <john.bywater at appropriatesoftware.net> wrote:
> ] >In any event, I would suggest that the different translators stay
> ] >different. The "trigger" of a harvesting job should not assume the
> ] >programming language they're written in and instead interact with them
> ] >at the unix command level. The trigger mechanism, either cron or a
> ] >queue listener, would be configured beforehand to say what script to
> ] >use according to what type of data.
> ] Thanks. Let's keep this discussion open. I'm not entirely sure what you
> ] mean in the last paragraph. CKAN is written in Python, which could run
> ] on Windows (or something else). That is, there might not be a unix
> ] command level available. But do tell me more about this if you'd like to.
> I've had experiences where the only document converter that was easily
> available was a proprietary blob of compiled C++ code. There are some
> (intriguing and probably not worth reimplementing in python) tools
> that are written in Java that do some interesting transformations
> (cf. XLWrap, which we learned about at OGDC). Other people might be
> more comfortable writing harvesting jobs in PHP or whatever their
> favourite language is. This goes to the "ecosystem of tools" argument
> advanced in the other mail.
> For example, I might have a process that listened to the recent
> changes RSS feed, looked for a tag called "rdf" then went and fetched
> some bit of the dataset (e.g. its voiD description) and then
> updated the package or even created (if necessary) new packages for
> the subsets using the API.
> Or one might do the same thing, but generate a voiD description and
> publish it somewhere (this is the use case that cygri and I discussed
> in London).
> Or listen to the RSS feed and submit some information to an
> aggregation or indexing service.
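The feed-driven workflow sketched in the quote above could be an entirely external script. A minimal sketch follows; the feed URL, API paths, title format, and field names are all illustrative assumptions, not CKAN's actual interface:

```python
# Sketch of the kind of external script described above: poll a
# recent-changes feed, pick out packages tagged "rdf", and push an
# update back through the catalogue's HTTP API. The feed URL, API
# paths, title format, and field names are all illustrative
# assumptions, not CKAN's actual interface.
import json
import urllib.request
import xml.etree.ElementTree as ET

API_BASE = "http://catalogue.example.org/api"   # hypothetical endpoint
API_KEY = "my-api-key"                          # hypothetical key

def packages_tagged(feed_xml, tag="rdf"):
    # Pull package names out of simple RSS <item><title> entries,
    # assuming titles of the form "<package-name> <tags...>".
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        title = item.findtext("title", "")
        if tag in title.split():
            yield title.split()[0]

def update_package(name, extras):
    # Read-modify-write over HTTP: the package crosses the wire as a
    # JSON string in both directions, authenticated with an API key.
    with urllib.request.urlopen(f"{API_BASE}/rest/package/{name}") as r:
        pkg = json.loads(r.read())
    pkg.setdefault("extras", {}).update(extras)
    req = urllib.request.Request(
        f"{API_BASE}/rest/package/{name}",
        data=json.dumps(pkg).encode(),
        headers={"Authorization": API_KEY,
                 "Content-Type": "application/json"})
    urllib.request.urlopen(req)
```

Because the script only speaks HTTP and XML/JSON, the same shape works whether it is written in Python, PHP, or anything else, which is the "ecosystem of tools" point exactly.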
I agree with you. All those are interesting, very useful, and reasonable
propositions. And it would be hateful to clutter CKAN up with a
hundred-and-one variations on a theme.
At the same time, CKAN could have standards-based harvesting
capabilities without anybody getting too out of shape about it.
The story of how CKAN got its standards-based harvesting capabilities
(including why the harvesting sub-domain isn't already wrapped up inside
a separate software application/service), and how that story could be
continued, is below.
> All of these sorts of things are possible now of course. What I am
> saying is that we should not be encouraging the CKAN codebase to grow
> (in fact I think it should shrink) and when we implement new functions
> it should be done in this sort of way and extensions of the models or
> APIs only contemplated when it appears impossible or very inconvenient
> to accomplish something this way.
I'm curious how you measure "more" and "less" when a codebase grows or
shrinks. I'm also curious where your threshold lies for a codebase
being too big.
I feel the most important thing is not size, but rather mobility. A
relatively large codebase that is clean and well tested can be very
workable. A relatively small codebase that is not clean and is not well
tested can be totally unworkable.
Story-wise, the harvesting sub-domain model is currently incorporated
into the CKAN domain model because:
- there is only one CKAN API
- the OKF didn't have an empty harvesting application
- the data.gov.uk Drupal front-end uses CKAN for its catalogue service
- hence, the data.gov.uk Drupal front-end uses the CKAN API
- now, the UKLII is being developed into data.gov.uk
- the UKLP is not the simplest multi-lateral programme ever established
- the data.gov.uk site is not the simplest Web site ever established
- the UKLII is required to support distributed metadata
- the UKLII mandates support for harvesting GEMINI 2 documents only
- the UKLII mandates support for harvesting from CSW and WAF sources
- the UKLII mandates support for harvesting on demand
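To make the last three requirements concrete, the front-end needed to be able to register a harvest source and request a job over the API, roughly along these lines (the payload fields below are assumptions for illustration, not the real UKLII schema):

```python
import json

# Hypothetical payloads a front-end might POST to a harvesting API:
# first register a CSW endpoint as a harvest source, then request an
# on-demand harvesting job against it.
source = {
    "url": "http://gis.example.gov.uk/csw",  # illustrative endpoint
    "type": "CSW",          # or "WAF" for a web-accessible folder
    "format": "GEMINI 2",   # the only mandated metadata format
}
job = {
    "source_id": "source-123",   # id returned when the source was created
    "requested_by": "frontend",  # supports harvesting on demand
}

# Both would travel as JSON strings over the API:
source_body = json.dumps(source)
job_body = json.dumps(job)
```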
Therefore, the data.gov.uk Drupal front-end needed to put harvest source
and harvesting jobs into the CKAN API.
Therefore, the harvester was behind the CKAN API.
Therefore, the harvester has access to the CKAN presentation layer.
Therefore, the harvester doesn't need to use the CKAN API (converting
its CKAN Package dicts to strings, passing them to the API with an API
key, and then converting them back to CKAN Package dicts would be
entirely redundant).
Therefore, the harvester uses the CKAN presentation layer.
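The round trip avoided here can be made concrete in a few lines (the dict shape is hypothetical):

```python
import json

# A package as the in-process harvester already holds it: a Python dict.
pkg = {"name": "gemini-dataset", "extras": {"harvest_source": "csw"}}

# Going through the API from inside the same process would mean
# serialising the dict to a JSON string, sending it with an API key,
# and parsing it straight back into an identical dict on the other side:
wire = json.dumps(pkg)       # dict -> string
received = json.loads(wire)  # string -> the same dict again
assert received == pkg       # a pure round trip; nothing is gained

# Behind the API, the harvester can instead hand the dict directly to
# the presentation layer, skipping the serialisation entirely.
```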
Of course, from this position, we could establish separate harvester
services, create a new software application, register it on PyPI, have a
new repository, have a new virtual environment and configure Apache, fix
it up with an API key, and so on. We could facade it into the CKAN API,
so that the data.gov.uk Drupal front-end doesn't need to change.
The harvester is well factored. It hardly has any dependencies on the
rest of the CKAN codebase. So, by design, separating the harvester into
a new component would require minimal effort, relative to a more
coupled implementation.
Now, we could first have created a new and empty software application,
and then started to build the harvester. But what do you think the
burndown chart would have looked like? There would have been a
significant period of time where no functionality that is useful to the
customer would have been created. For a period of time, all effort would
have gone into overhead. So, the first achievement would be to undermine
the customer's confidence, and therefore increase the probability that
the project would have been cancelled before it had "properly" started.
Anyway, that's not the best way to develop software professionally. :-)
>  https://github.com/markbirbeck/xlwrap
>  Where DCat is for describing catalogues and datasets, voiD can be
> thought of as a subclass for describing RDF datasets.