[ckan-dev] [ckan-discuss] Harvesting Dublin Core documents

John Bywater john.bywater at appropriatesoftware.net
Thu Nov 25 17:43:06 UTC 2010


Hi Will,

Regarding simplifying the CKAN codebase, I'm totally with you.

Despite having added quite a lot, I've always tried to do that. And the 
precedent within CKAN for what you seem to be proposing for the 
harvesting sub-domain model was set by me when I extracted the licenses 
sub-domain model into a separate Python package, run as a separate 
service, so that we could simplify the CKAN software and services.

So I'm very happy to consider repeating that game with CKAN's harvesting 
sub-domain model. More details below...


William Waites wrote:
> * [2010-11-25 15:23:21 +0000] John Bywater <john.bywater at appropriatesoftware.net> writes:
> 
> ] >In any event, I would suggest that the different translators stay
> ] >different. The "trigger" of a harvesting job should not assume the
> ] >programming language they're written in and instead interact with them
> ] >at the unix command level. The trigger mechanism, either cron or a
> ] >queue listener, would be configured beforehand to say what script to
> ] >use according to what type of data.
> ] 
> ] Thanks. Let's keep this discussion open. I'm not entirely sure what you 
> ] mean in the last paragraph. CKAN is written in Python, which could run 
> ] on Windows (or something else). That is, there might not be a unix 
> ] command level available. But do tell me more about this if you'd like to.
> 
> I've had experiences where the only document converter that was easily
> available was a proprietary blob of compiled C++ code. There are some
> (intriguing and probably not worth reimplementing in Python) tools
> that are written in Java that do some interesting transformations
> (cf. XLWrap [1] that we learned about at OGDC). Other people might be
> more comfortable writing harvesting jobs in PHP or whatever their
> favourite language is. This goes to the "ecosystem of tools" argument
> advanced in the other mail.
> 
> For example, I might have a process that listened to the recent
> changes RSS feed, looked for a tag called "rdf" then went and fetched
> some bit of the dataset (e.g. its voiD [2] description) and then
> updated the package or even created (if necessary) new packages for
> the subsets using the API.
> 
> Or one might do the same thing, but generate a voiD description and
> publish it somewhere (this is the use case that cygri and I discussed
> in London).
> 
> Or listen to the RSS feed and submit some information to an
> aggregation or indexing service.
> 

I agree with you. All those are interesting, very useful, and reasonable 
propositions. And it would be hateful to clutter CKAN up with a 
hundred-and-one variations on a theme.
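
For instance, a minimal sketch of the kind of recent-changes listener 
you describe might look like the following. Everything specific here is 
illustrative, not the actual API: the feed location, the resource paths, 
and the assumption that each feed item's title names a package are all 
mine.

    import json
    import urllib.request
    import xml.etree.ElementTree as ET

    CKAN = "http://ckan.example.org"           # hypothetical CKAN instance
    FEED = CKAN + "/feeds/recent-changes.rss"  # hypothetical feed location
    API_KEY = "my-api-key"                     # placeholder

    def get_package(name):
        # Read a package dict through the (illustrative) REST API.
        url = "%s/api/rest/package/%s" % (CKAN, name)
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    def put_package(pkg):
        # Write a package dict back, authorising with the API key.
        url = "%s/api/rest/package/%s" % (CKAN, pkg["name"])
        request = urllib.request.Request(
            url,
            data=json.dumps(pkg).encode("utf-8"),
            headers={"Authorization": API_KEY},
            method="PUT")
        urllib.request.urlopen(request).close()

    def main():
        xml = urllib.request.urlopen(FEED).read()
        for item in ET.fromstring(xml).iter("item"):
            name = item.findtext("title")  # assumes titles name packages
            pkg = get_package(name)
            if "rdf" not in pkg.get("tags", []):
                continue
            # Here one would fetch the dataset's voiD description,
            # derive any new subset packages, and write them back.
            put_package(pkg)

    if __name__ == "__main__":
        main()

The point being that such a process needs nothing from CKAN except the 
API, so it can be written in any language and live anywhere.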

At the same time, CKAN could have standards-based harvesting 
capabilities without anybody getting too out of shape about it.

The story of how CKAN got its standards-based harvesting capabilities 
(including why the harvesting sub-domain isn't already wrapped up inside 
a separate software application/service), and how that story could be 
continued, is below.


> All of these sorts of things are possible now of course. What I am
> saying is that we should not be encouraging the CKAN codebase to grow
> (in fact I think it should shrink) and when we implement new functions
> it should be done in this sort of way and extensions of the models or
> APIs only contemplated when it appears impossible or very inconvenient
> to accomplish something this way.
> 

I'm curious about your measurement of more and less (through growing or 
shrinking a codebase). I'm also curious about your threshold for when a 
codebase would be too big.

I feel the most important thing is not size, but rather mobility. A 
relatively large codebase that is clean and well tested can be very 
workable. A relatively small codebase that is not clean and is not well 
tested can be totally unworkable.

Story-wise, the harvesting sub-domain model is currently incorporated 
into the CKAN domain model because:

- there is only one CKAN API

- the OKF didn't have an empty harvesting application

- the data.gov.uk Drupal front-end uses CKAN for its catalogue service

- hence, the data.gov.uk Drupal front-end uses the CKAN API

- now, the UKLII (UK Location Information Infrastructure) is being developed into data.gov.uk

- the UKLP (UK Location Programme) is not the simplest multi-lateral programme ever established

- the data.gov.uk site is not the simplest Web site ever established

- the UKLII is required to support distributed metadata

- the UKLII mandates support for harvesting GEMINI 2 documents only

- the UKLII mandates support for harvesting from CSW and WAF sources

- the UKLII mandates support for harvesting on demand



Therefore, the data.gov.uk Drupal front-end needed to put harvest 
sources and harvesting jobs into the CKAN API.
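
That is, the front-end needed to be able to do something like the 
following sketch. The resource path and field names are illustrative 
assumptions on my part, not the actual API:

    import json
    import urllib.request

    # Register a new harvest source through the CKAN API, so that
    # harvesting can be run "on demand" from outside CKAN.
    source = {"url": "http://example.gov.uk/csw", "type": "CSW"}
    request = urllib.request.Request(
        "http://ckan.example.org/api/rest/harvestsource",  # illustrative path
        data=json.dumps(source).encode("utf-8"),
        headers={"Authorization": "my-api-key"})           # placeholder key
    urllib.request.urlopen(request)  # the POST creates the source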

Therefore, the harvester is behind the CKAN API.

Therefore, the harvester has access to the CKAN presentation layer.

Therefore, the harvester doesn't need to use the CKAN API (converting 
its CKAN Package dicts to strings, passing them to the API with an API 
key, and then converting them back to CKAN Package dicts is totally 
unnecessary).
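
In outline, the redundancy looks like this (save_package below is a 
stand-in for CKAN's internal package-save logic, not its actual name):

    import json

    def save_package(pkg):
        # Stand-in for CKAN's internal package-save logic.
        print("saving", pkg["name"])

    package = {"name": "example-dataset", "tags": ["rdf"]}

    # Out of process, the harvester would have to round-trip every dict:
    wire = json.dumps(package)      # dict -> string, plus an API key and HTTP
    save_package(json.loads(wire))  # string -> dict again on the other side

    # In process, behind the API, it can simply hand the dict over:
    save_package(package)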

Therefore, the harvester uses the CKAN presentation layer.


Of course, from this position, we could establish separate harvester 
services: create a new software application, register it on PyPI, set 
up a new repository and a new virtual environment, configure Apache, 
fix it up with an API key, and so on. We could facade it into the CKAN 
API, so that the data.gov.uk Drupal front-end wouldn't need to change.

The harvester is well factored. It has hardly any dependencies on the 
rest of the CKAN codebase. So, by design, separating the harvester into 
a new component would require minimal effort, relative to a more 
coupled implementation.

Now, we could have first created a new and empty software application, 
and then started to build the harvester. But what do you think the 
burndown chart would have looked like? There would have been a 
significant period of time in which no functionality useful to the 
customer was created; for a while, all effort would have gone into 
overhead. So the first achievement would have been to undermine the 
customer's confidence, and therefore to increase the probability that 
the project would be cancelled before it had "properly" started.

Anyway, that's not the best way to develop software professionally. :-)

Best wishes,

John.


> Cheers,
> -w
> 
> [1] https://github.com/markbirbeck/xlwrap 
> [2] whereas DCat is for describing catalogues and datasets, voiD can be
> thought of as a subclass for describing RDF datasets.




