[ckan-discuss] Data Registry Aggregator Experiment

Wed Mar 30 14:19:30 BST 2011

* [2011-03-30 14:01:01 +0200] Friedrich Lindenberg <friedrich.lindenberg at okfn.org> wrote:

] I think we're duplicating effort on a number of projects right now and
] maybe it would be good for us to have an out-of-schedule meetup
] somewhere where James and Rufus explain the strategy behind much of
] the development to us a bit and we try to align the efforts we do in
] other projects, such as LOD2.

This sounds like a very good idea, though I think it should be framed
as "we will agree on the broad strokes of development strategy"
rather than as having it explained to us.

] (This last one, in particular, would be
] a good place for us to really work on productizing some long-term
] features such as storage, RDF support and non-INSPIRE-mandated
] geo-features. This will not work, though, if it's out of sync with the
] larger project.)

Certainly, I agree.

] An aggregated SPARQL store is definitely something we need, and we even
] have the server for it (us3.okfn.org). The important thing here is
] that we're no longer just aggregating CKANs but also other catalogues
] and that whatever solution we come up with needs to support this.

Right. This design supports that in the core. As explained in the
post, the core entity is the repository, and a repository can have
different implementations. Right now there are two: MemRepo, a
simple in-memory one, and JsonRepo, which speaks the CKAN JSON
protocol. The server works by taking several source repositories and
merging them into a destination repository.
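
As an illustration of the repository idea described above, here is a
minimal sketch in Python. It is not ckand's code or API: only the name
MemRepo comes from the post, while `Repository`, `packages`, `put` and
`aggregate` are hypothetical names, and the rule that a later source
overwrites an earlier one is an assumption made to keep the example short.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, Tuple

Package = Dict[str, object]  # a package is treated as a plain metadata dict


class Repository(ABC):
    """Hypothetical repository interface: list packages and store them."""

    @abstractmethod
    def packages(self) -> Iterator[Tuple[str, Package]]:
        ...

    @abstractmethod
    def put(self, name: str, package: Package) -> None:
        ...


class MemRepo(Repository):
    """Simple in-memory implementation backed by a dict."""

    def __init__(self) -> None:
        self._store: Dict[str, Package] = {}

    def packages(self) -> Iterator[Tuple[str, Package]]:
        # Iterate over a snapshot so callers may write while reading.
        return iter(list(self._store.items()))

    def put(self, name: str, package: Package) -> None:
        self._store[name] = dict(package)


def aggregate(sources: Iterable[Repository], dest: Repository) -> int:
    """Copy every package from each source into dest.

    Assumption: on a name clash the later source simply overwrites.
    Returns the number of packages written.
    """
    written = 0
    for source in sources:
        for name, package in source.packages():
            dest.put(name, package)
            written += 1
    return written
```

A source that speaks a remote protocol (as JsonRepo does for CKAN) would
implement the same two methods, so the aggregation loop would not need to
know which kind of repository it is reading from.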

] The current plan was to integrate with the new harvesting framework as
] soon as that has been prototyped by Adrià and to port all the current
] data catalogue harvesters (including CKAN harvesting) to this
] framework ASAP. Some/most of those would still generate RDF and thus
] could be used to load a SPARQL store directly as soon as we have
] working bindings to load e.g. Virtuoso via HTTP from Python.

The HTTP bindings for Python have been working and in production for
some time now.

Perhaps the major design difference between ckand and ckan is that
ckand has harvesting as a core concept. In any event, the ckan
harvesting framework is developing well and I am working with Adrià on
this in the context of UKLII.

] I just don't see demand for another component on this at the moment,

Well, this was a research experiment to try out some ideas and wasn't
done in response to any specific demand, except perhaps to fix the
bit-rot on http://semantic.ckan.net/. I'm sure you'll agree that trying
out ideas is something we should all be doing, and that it benefits us
all.

] particularly one that requires a new programming language

A new programming language has a relatively small learning curve, but
it is true that this is a drawback. However, in my opinion this
drawback is outweighed by the productivity gains of static typing, the
architectural gains of built-in concurrency and the operational gains
in terms of speed and memory footprint of a compiled language. This
is a matter of opinion, of course.

] and likely also a new runtime to be installed.

For development you need to install the compiler, of course. There
are Debian packages. For deployment it is a single binary and, because
it uses raptor for serialisation, one shared library. It is far
easier to deploy than a Python app.

] But we *do* have to deal with these problems and we'll have to support
] multi-placed editing eventually (this is explicitly part of LOD2 and I
] think there is a very real use case with publicdata.eu as well as in
] those places where there is both a government-run and a
] community-supported catalogue (UK, France and Norway at last count)).
] We also want it since much of the data that we will harvest from
] light-weight catalogues (think data.suomi.fi) is incomplete, needs to
] be extended and we still want to be able to pull in updates later on.
]
] If you want to tackle an interesting problem, Will, do this, please!

Let's take this one to the LOD2 list to get an idea of the consensus
on what the requirement actually is. The place to implement merging
is obvious and merging could easily be added there, but the merge
algorithm and its configuration are not trivial.
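
To make the configuration question concrete, here is one possible merge
policy sketched in Python. It is purely illustrative and not an agreed
design: `merge_records`, the `precedence` list and the per-field
`overrides` mapping are all hypothetical, and "first non-empty value in
precedence order wins" is just one of many rules that could be chosen.

```python
from typing import Dict, List, Mapping, Optional

Record = Dict[str, object]


def merge_records(
    versions: Mapping[str, Record],
    precedence: List[str],
    overrides: Optional[Mapping[str, List[str]]] = None,
) -> Record:
    """Merge several versions of one record, field by field.

    versions:   source name -> that source's copy of the record
    precedence: default order in which sources are trusted
    overrides:  optional per-field precedence lists (hypothetical config)
    """
    overrides = overrides or {}
    fields = sorted({field for record in versions.values() for field in record})
    merged: Record = {}
    for field in fields:
        for source in overrides.get(field, precedence):
            value = versions.get(source, {}).get(field)
            # Treat None, empty strings and empty lists as "missing".
            if value not in (None, "", []):
                merged[field] = value
                break
    return merged
```

For example, with a government-run and a community-supported copy of the
same record:

```python
versions = {
    "gov": {"title": "Budget", "notes": ""},
    "community": {"title": "UK Budget 2011", "notes": "Cleaned", "tags": ["finance"]},
}
merge_records(versions, ["gov", "community"])
# title comes from "gov"; notes and tags fall through to "community"
```

Even this small rule leaves open questions such as how to merge list
fields or detect conflicting edits, which is where the real difficulty
lies.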

This could be interesting to work on. One restriction I will want to
place on the system, at least in the short term, is that the graph
of aggregators must contain no cycles. Otherwise we have to start
doing things like running Dijkstra's algorithm over it, and this would
introduce protocol complications.
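
The no-cycles restriction itself is cheap to check. The following Python
sketch uses a depth-first search (not Dijkstra's algorithm) over a
hypothetical mapping from each aggregator to the sources it harvests
from; the function name `find_cycle` and the graph representation are
assumptions made for illustration.

```python
from typing import Dict, List, Mapping


def find_cycle(sources: Mapping[str, List[str]]) -> List[str]:
    """Return one cycle as a list of nodes, or [] if the graph is acyclic.

    sources maps an aggregator name to the names it harvests from.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> List[str]:
        colour[node] = GREY
        path.append(node)
        for nxt in sources.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: a cycle is closed.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return []

    for node in sources:
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return []
```

An aggregator configuration could be rejected whenever `find_cycle`
returns a non-empty list. The sketch is recursive, which is fine for the
handful of catalogues under discussion but would need an explicit stack
for very deep graphs.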

] I fully agree on this one (although we can haggle with Will over
] whether we want to do flat files or just generate on the fly, our
] estimate of the cost seemed different the last time we discussed
] that). We need content auto-negotiation and we should support pushing
] out RDF from within CKAN (or from a plugin, to be more precise). The
] right time to tackle this, though, is probably after the dictization
] has been done.

Agreed. Flat files versus on-the-fly generation is an implementation
detail. Accepting augmented metadata from multiple sources is an
architectural one. How we handle this could be related to the above
considerations about merging and federation.
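
Whichever way the RDF is produced, the content negotiation mentioned
above comes down to picking a serialisation from the HTTP Accept header.
A simplified Python sketch follows; the `SUPPORTED` list and the
`negotiate` function are hypothetical, not part of CKAN, and the sketch
deliberately ignores the specificity rules of full HTTP content
negotiation, ordering only by q-value and position.

```python
from typing import List, Optional

# Hypothetical list of representations, in server preference order.
SUPPORTED = ["text/html", "application/json", "application/rdf+xml", "text/turtle"]


def negotiate(accept: str, supported: Optional[List[str]] = None) -> Optional[str]:
    """Pick a media type for an Accept header, or None if nothing matches."""
    supported = SUPPORTED if supported is None else supported
    prefs = []
    for position, item in enumerate(accept.split(",")):
        parts = [part.strip() for part in item.split(";")]
        media = parts[0].lower()
        if not media:
            continue
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # an unparsable q-value is treated as "not acceptable"
        prefs.append((-q, position, media))

    # Highest q first; ties are broken by order of appearance.
    for neg_q, _, media in sorted(prefs):
        if neg_q == 0:
            continue  # q=0 means the client refuses this type
        if media == "*/*":
            return supported[0] if supported else None
        if media.endswith("/*"):
            prefix = media[:-1]  # e.g. "text/"
            for candidate in supported:
                if candidate.startswith(prefix):
                    return candidate
        elif media in supported:
            return media
    return None
```

A plugin could call something like this once per request and dispatch to
either a flat file or an on-the-fly serialiser, which keeps that choice
an implementation detail as described above.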

Cheers,
-w

-- 
William Waites                <mailto:ww at styx.org>
http://river.styx.org/ww/        <sip:ww at styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45