[ckan-discuss] Data Registry Aggregator Experiment

David Read david.read at okfn.org
Wed Mar 30 12:30:51 BST 2011


On 30 March 2011 10:42, William Waites <ww at styx.org> wrote:
> * [2011-03-30 10:06:45 +0100] David Read <david.read at okfn.org> writes:
>
> ] Great to see another way of getting and serving package RDF. I agree
> ] that using Go or other lower level language suits this particular use
> ] case when optimising for speed.
>
> Not sure I agree with Go being a lower-level language. Having now
> written something "real" with it, I would put it in the higher-level
> language category with Python or Ruby, except that it doesn't have
> the zillions of libraries.

We must all have a ... Go with it!

> ] Are your customers asking you to put the aggregated RDF in a store
> ] and provide a SPARQL service?
>
> But the users of the RDF will often want it in a SPARQL store as
> well. This is easily accomplished just by telling ckand to produce a
> dump (SIGUSR1) and then importing that dump into the store.

Cool. Just wondered about the scope/plans you had in mind.
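
In case it's useful to anyone scripting that, it would go roughly
like this - the pidfile and dump locations and the (Fuseki-style)
graph store endpoint are guesses on my part, not ckand's actual
defaults:

    import os
    import signal
    import time
    import urllib2

    CKAND_PIDFILE = "/var/run/ckand.pid"   # assumed location
    DUMP_FILE = "/var/lib/ckand/dump.nt"   # assumed dump path
    # assumed SPARQL 1.1 graph store endpoint (default graph)
    STORE_URL = "http://localhost:3030/ds/data?default"

    # Ask ckand to write its dump; as above, it does this on SIGUSR1.
    pid = int(open(CKAND_PIDFILE).read().strip())
    os.kill(pid, signal.SIGUSR1)
    time.sleep(10)   # crude: give ckand time to finish writing

    # Load the dump into the store (N-Triples served as text/plain).
    request = urllib2.Request(STORE_URL, data=open(DUMP_FILE).read(),
                              headers={"Content-Type": "text/plain"})
    urllib2.urlopen(request)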

> ] How does this fit with the rest of the team's existing efforts in
> ] CKAN data aggregation? I'm thinking of the dcat stuff, repo syncing,
> ] and aggregated search.
>
> Now this is an interesting question. It obviously occupies some of
> the same space as the first two, and though it doesn't have any
> particular search facilities, it would provide a logical place from
> which an index could be built.

My point was - let's try not to duplicate effort here. We've got three
sorts of CKAN aggregators already - it sounds like this needs a
discussion before too many more are written!

> With respect to repo syncing, at the moment it is just a read-only
> aggregator. If write functions were added, any record that is
> written to would still have a "home" server and would get written
> there; everywhere else would just have a read-only copy. So if the
> network is such that things get complicated, there is a choice to be
> made between eventual consistency and, effectively, a cache-poisoning
> problem. But we would be guaranteed that every record on the path
> between the client doing the update and the home server is up to
> date, so somebody doing a write-read cycle sees up-to-date data. This
> means we don't have to address the very complex question of multiple
> writes in multiple places, but it also means that in the event of a
> network partition a write can fail, while a read will still work,
> serving last-known-good data. This all basically comes for free and
> just falls out of the design.
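
Just to check I follow, the behaviour you describe would look
something like this sketch - not your actual code, of course; the
/api/rest/package path is CKAN's REST API and everything else here is
made up:

    import json
    import urllib2

    local_copies = {}   # package name -> last-known-good data
    home_servers = {}   # package name -> base URL of the record's home

    def read(name):
        # Reads always come from the local copy, so they keep working
        # even during a network partition.
        return local_copies.get(name)

    def write(name, data):
        # Writes go to the record's home server; if it is unreachable
        # the write fails, but read() still serves last-known-good data.
        url = "%s/api/rest/package/%s" % (home_servers[name], name)
        request = urllib2.Request(url, data=json.dumps(data),
                                  headers={"Content-Type": "application/json"})
        urllib2.urlopen(request)
        # On success, refresh our copy so a write-read cycle through
        # this server sees up-to-date data.
        local_copies[name] = data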
>
> What would need some thought is how to handle authentication. Do the
> user credentials, e.g. the X-CKAN-API-Key header, get passed along
> (meaning a user would have to have an account on, and know the API
> key for, a potentially large number of "home" servers), or do we let
> the edge closest to the user be where they have their account,
> authenticate them there in some way, and then use a set of API keys
> for server-to-server use?
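
For the sake of the discussion, the second option might look roughly
like this - edge_authenticate and the key table are entirely
hypothetical:

    import json
    import urllib2

    # One server-to-server key per home server, held by the edge.
    SERVER_KEYS = {"http://ckan.example.org": "s2s-key-for-that-server"}

    def edge_authenticate(user):
        # Hypothetical stand-in for however the edge checks its own
        # account database.
        return user in ("dread", "ww")

    def forward_write(home_url, name, data, user):
        # The edge authenticates the user against its own accounts...
        if not edge_authenticate(user):
            raise Exception("not authorised")
        # ...then talks to the home server with its own key, so users
        # never need accounts on every home server.
        url = "%s/api/rest/package/%s" % (home_url, name)
        request = urllib2.Request(url, data=json.dumps(data),
                                  headers={"X-CKAN-API-Key": SERVER_KEYS[home_url],
                                           "Content-Type": "application/json"})
        urllib2.urlopen(request)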
>
> ] And looking at it from the reverse direction (since we know data
> ] communities love their independent identity too), could the RDF
> ] function be added to a CKAN extension, say?
>
> A small CKAN extension that properly handled content-type
> negotiation would actually be quite useful for this and other
> things. It could have, e.g., a small 303 redirect controller and
> hook into the routes in certain cases. This would work well with
> ckand or with the existing semantic.ckan.net installation (and the
> former catalogue.data.gov.uk), which is just a directory tree of
> flat files.

We already have a 303 redirect in the CKAN core, with ckan.net set up
for http://semantic.ckan.net/package/ . I'm wondering whether it would
not be better to have the RDF produced in a CKAN extension and
therefore served by CKAN as an alternative format for the package,
alongside JSON. That would solve the problems of staying up to date,
handling purged and deleted packages, permissions, etc., which I guess
you have with semantic.ckan.net. These are not currently major issues,
but they are the sort of things that crop up when you do the RDF
creation outside the CKAN framework. Or is there serious value in
keeping it separate?
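
To make the negotiation idea concrete, here is the sort of thing I
mean as bare WSGI middleware - CKAN's real routing hooks would do
this more cleanly, and the redirect target just follows the
semantic.ckan.net pattern above:

    def rdf_negotiation(app):
        # Wrap a WSGI app: RDF-capable clients asking for a package
        # page get a 303 to the RDF representation instead.
        def middleware(environ, start_response):
            path = environ.get("PATH_INFO", "")
            accept = environ.get("HTTP_ACCEPT", "")
            if path.startswith("/package/") and "application/rdf+xml" in accept:
                name = path[len("/package/"):]
                start_response("303 See Other",
                               [("Location",
                                 "http://semantic.ckan.net/package/" + name)])
                return []
            return app(environ, start_response)
        return middleware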

Dave


