[ckan-dev] Fwd: harvester boilerplate

Fri Oct 10 15:12:15 UTC 2014

I've been trying out some ways to make it easier to write harvesters.
See the conversation below with Adrià. Others who have written
harvesters, do chip in with your thoughts about what is good and bad
with the current system.

Dave

---------- Forwarded message ----------
From: Adrià Mercader <adria.mercader at okfn.org>
Date: 10 October 2014 16:00
Subject: Re: harvester boilerplate
To: David Read <david.read at hackneyworkshop.com>

Hi David,

Thanks for this, it looks really close to what I've been thinking the
harvesters need, so great news!

See comments below

On 9 October 2014 16:27, David Read <david.read at hackneyworkshop.com> wrote:

> 1. There is a lot of 'boilerplate' which every harvester needs
> copy-pasted, which it would be good to factor-out. For example moving
> the 'current' flag from the previously harvested HarvestObject to this
> one. Or harvest_object.package_id = package_dict['id']. This is really
> part of the harvest machinery and just serves to confuse new harvester
> writers.
Totally agree. I've made some comments on the ideas repo about
harvesters being difficult to write because of the low-level stuff
that you have to take into account. I can't find them now, I guess they
were on this issue:

https://github.com/ckan/ideas-and-roadmap/issues/80

Anyway, I agree in principle. In the newest harvester [1] I've
written recently I've been trying to factor out internal stuff into
private methods. All these could be moved to HarvesterBase at some
point so people no longer need to care about them.

I think that as you mentioned below, the import stage could end up
being just a matter of mapping the remote document to a CKAN Dataset
dict and deciding whether you want to create/update it. I think
extension points would be a really good pattern for this, in fact the
spatial harvesters already offer a couple of them to extend the base
ones:

http://docs.ckan.org/projects/ckanext-spatial/en/latest/harvesters.html#customizing-the-harvesters

>
> 2. There are emerging patterns in the writing of harvesters which I
> think are worth factoring out and encouraging people to use:
> a) harvest_object_extra('status') being 'added', 'changed' or 'deleted'.
This is the safest way I've found to decide whether a dataset needs
creation/update/deletion but it might be overkill for someone willing
to write a simple harvester that just updates every time, doesn't
handle deletions...
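
To make the status pattern concrete, here is a minimal sketch, assuming
the 'status' extra has already been read off the harvest object as a
plain string. The helper name action_for_status is hypothetical and not
part of ckanext-harvest; only the three status values and the CKAN
action names are taken as given.

```python
# Hypothetical helper for illustration; not part of ckanext-harvest.
def action_for_status(status):
    """Map the 'status' extra of a harvest object to a CKAN action name."""
    actions = {
        'added': 'package_create',
        'changed': 'package_update',
        'deleted': 'package_delete',
    }
    try:
        return actions[status]
    except KeyError:
        # Fail loudly rather than silently skipping an unexpected status.
        raise ValueError('Unknown harvest status: %r' % (status,))
```

A simple harvester that only ever updates could skip the lookup and
always use the update action, which is the trade-off mentioned above.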

> b) harvest_source configuration affects tags, groups, orgs policy,
> extras, private.
Agree, the CKAN harvester has lots of config options that would be
helpful to other harvesters. It would be good to find a way to apply them
to all the ones inheriting from HarvesterBase while keeping the ability for
users to tweak their dataset dicts.
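
One way this could look, as a sketch under assumptions rather than the
actual ckanext-harvest code: a helper that applies source-level defaults
to a dataset dict without overriding what the harvester itself set. The
function name apply_source_config and the 'private' key are hypothetical;
'default_tags' and 'default_extras' are modelled on the CKAN harvester's
options, with their exact shape assumed here (a list of tag names and a
plain dict).

```python
import copy


# Hypothetical helper for illustration; not part of ckanext-harvest.
def apply_source_config(package_dict, config):
    """Return a copy of package_dict with harvest source defaults applied."""
    result = copy.deepcopy(package_dict)

    # Add default tags that the dataset does not already have.
    existing = set(tag['name'] for tag in result.get('tags', []))
    for name in config.get('default_tags', []):
        if name not in existing:
            result.setdefault('tags', []).append({'name': name})
            existing.add(name)

    # Default extras never overwrite a key the harvester already set.
    extras = result.setdefault('extras', [])
    keys = set(extra['key'] for extra in extras)
    for key, value in sorted(config.get('default_extras', {}).items()):
        if key not in keys:
            extras.append({'key': key, 'value': value})

    # Only force visibility when the source config says so (assumed key).
    if 'private' in config:
        result['private'] = bool(config['private'])
    return result
```

Because the helper works on a copy and only fills gaps, a harvester can
still tweak its own dataset dict before or after calling it.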

> c) I've found it useful to record in package extras basic details to
> identify it as a harvested dataset, source etc. Standardizing this
> info is really useful now that CKANs are starting to be federated a lot.
Sounds good. There are three extras added by default on package_show
(harvest_object_id, harvest_source_id and harvest_source_title), if we
can add to these, great.

> So I've put this together in our fork of ckanext-harvest in
> HarvesterBase. And once all this reusable stuff is factored out, a
> harvester's import stage doesn't need to do much at all - essentially
> just mapping the harvested content into a package_dict. So the way
> I've done it is to put all the boiler-plate in import_stage() in the
> base class, that calls get_package_dict() which you define for your
> actual harvester. Anyway, take a look at the import_stage and
> get_package_dict:
>
> https://github.com/datagovuk/ckanext-harvest/compare/ckan:master...1508-update#diff-5b0c56e24c391eb219cad921b2906b00R267

There's a lot of stuff there :) I perhaps find the extension point
approach cleaner than people having to extend a base class, but the
principle is the same.
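
For readers who have not opened the diff, the base-class idea can be
sketched roughly as follows. This is not the code from the datagovuk
fork: the class names are hypothetical, the harvest object is a plain
dict instead of a model object, the list `saved` stands in for the CKAN
database, and falling back to the guid as the package id is an
assumption made only to keep the example self-contained.

```python
class SketchHarvesterBase(object):
    """Stand-in for a base class that owns the import boilerplate."""

    def __init__(self):
        self.saved = []  # stands in for writing to CKAN

    def get_package_dict(self, content):
        # The only method a concrete harvester has to provide.
        raise NotImplementedError('Define get_package_dict() in your harvester')

    def import_stage(self, harvest_object):
        package_dict = self.get_package_dict(harvest_object['content'])
        if not package_dict:
            return False
        # Boilerplate that every harvester would otherwise copy-paste.
        package_dict.setdefault('id', harvest_object['guid'])  # assumed fallback
        harvest_object['package_id'] = package_dict['id']
        harvest_object['current'] = True
        self.saved.append(package_dict)
        return True


class TitleOnlyHarvester(SketchHarvesterBase):
    """Example harvester: only maps remote content to a dataset dict."""

    def get_package_dict(self, content):
        title = content['title']
        return {'name': title.lower().replace(' ', '-'), 'title': title}
```

With this split, TitleOnlyHarvester().import_stage(obj) runs the shared
bookkeeping and the subclass only contains the mapping. An extension
point approach would keep the same division of labour but call hooks
instead of relying on inheritance.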

> NB there's lots of other changes in our fork which you can probably
> ignore for now - it's not a PR yet because of all this extra guff. I'm
> hoping you'll consider taking our HarvesterBase, or at least some of
> the concepts. But let me know what you think.
As I said, I also want the harvesters to move in this direction, so
I'm happy to discuss it more and merge anything backwards compatible into
the main ckanext-harvest.

BTW, are you happy to share this on ckan-dev in case someone is interested?

Cheers,

Adrià

[1] https://github.com/okfn/ckanext-sweden/blob/master/ckanext/sweden/dcat/harvester.py