[okfn-discuss] Metadata registries too domain specific or too general - how to fill in the gaps?

Rufus Pollock rufus.pollock at okfn.org
Fri Oct 19 11:46:15 UTC 2007


Jo Walsh wrote:
[snip]

> In the GIS software world there is an overfocus on standards. Given
> several different domain models + XML serialisations, etc,
> all differingly mandatory in regional government data policies, what
> is an implementor to do? The GN answer is to use XML templates (or XML
> documents as prototype templates) and to store per-package metadata in
> BLOBs of XML in a database - extracting a few key properties for indexing.

Interesting. One would imagine there must be quite a bit of common ground
too, plus some consensus on what the absolute minimum is (a bit like
Dublin Core for document metadata).

> But it becomes harder to re-use, and get a 'network effect' from the
> re-use of, descriptions of people or organisations; the *internal
> structure* of these data sets can't often be expressed by the
> prevalent standards (ISO 19115, FGDC, etc). The information models,
> domain models for geographic data have a lot of specific details in them 
> that I don't necessarily want to fill out, or see on the screen.
> 
> I wrote a bit back when about the extrapolation of a "core model"
> from the current standards prevalent in geo-metadata; and also about
> what i perceive as structural flaws in some of the information models 
> designs that are byproducts of a "metadata first, data later" view.
> http://wiki.osgeo.org/index.php/Why_DCLite4G
> http://frot.org/terradue/minimal_metadata.html

Ah! You're already there so ignore my previous comment :) I note for 
others that the actual spec (which looks nice and concise) is at:

http://wiki.osgeo.org/index.php/DCLite4G

> 
> But implementation-wise this needs more than a generic CKAN, though.
> Foremost is ability to add spatial properties of data sets - typically
> an envelope described by two X,Y points showing what area of space the
> data set covers. This bleeds into wanting more things:
> 
> - The ability to store and query geometry objects in the backend
>   (for the postgres database this is supplied by PostGIS)
> - The ability to spatially filter search results from the frontend
> - "Semi-automatic" grabbing of a data set and extracting the spatial 
>   extents from the metadata in the file (usually possible)
>   and post-facto inserting that into the metadata record.

Okay let's break this down into:

   1. Additional metadata: possible ways to do this in CKAN include
     a) 'machine tags' (as proposed by Aaron), which just involves
        layering on the existing tag infrastructure
     b) arbitrary additional per-package metadata (should be easy, but
        then hard to specify grouping a la dc4g)
     c) metadata 'plugins' that allow addition of metadata sets (a la dc4g)

   2. Querying based on metadata. One nice thing about plugins is that
they could provide, in addition to the basic metadata spec,
metadata-specific code for the user interface, for querying, etc.

   3. Machine-automated retrieval of data. This is the reason (or at
least one of the major reasons) for building the system. It would be the
beginning of componentization and automated builds for knowledge
packages, similar to the way we do software.
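
To make options 1a and 2 a bit more concrete, here is a rough sketch of
how a bounding box could ride on plain tags and then be used to filter
search results. None of this is CKAN code: the "geo:bbox=..." tag
syntax, the BBox type and all the function names are assumptions made
up for illustration, and a real backend would more likely push the
spatial test down into PostGIS.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class BBox(NamedTuple):
    """Envelope given by two X,Y corner points (hypothetical type)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other: "BBox") -> bool:
        # Two envelopes overlap when they overlap on both axes.
        return (self.min_x <= other.max_x and other.min_x <= self.max_x
                and self.min_y <= other.max_y and other.min_y <= self.max_y)


def parse_machine_tag(tag: str) -> Optional[Tuple[str, str, str]]:
    """Split 'namespace:predicate=value'; return None for ordinary tags."""
    head, sep, value = tag.partition("=")
    if not sep:
        return None
    namespace, sep, predicate = head.partition(":")
    if not sep or not namespace or not predicate:
        return None
    return namespace, predicate, value


def bbox_from_tags(tags: List[str]) -> Optional[BBox]:
    """Return the first well-formed geo:bbox machine tag, if any."""
    for tag in tags:
        parsed = parse_machine_tag(tag)
        if parsed is None or parsed[:2] != ("geo", "bbox"):
            continue
        try:
            parts = [float(p) for p in parsed[2].split(",")]
        except ValueError:
            continue  # malformed numbers: skip this tag
        if len(parts) == 4:
            return BBox(*parts)
    return None


def spatial_filter(packages: Dict[str, List[str]], query: BBox) -> List[str]:
    """Names of packages whose tagged envelope overlaps the query box."""
    hits = []
    for name, tags in packages.items():
        box = bbox_from_tags(tags)
        if box is not None and box.intersects(query):
            hits.append(name)
    return hits
```

Packages without a geo:bbox tag are simply left out of spatial results,
which keeps the ordinary tag search untouched for everything else.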

> Last year it made sense to DIY with some homegrown geo extensions to 
> sqlobject. Then GeoDjango came along and rendered all that obsolete,
> and I ported across in a couple of hours. 
> 
> But now development-wise I feel i am stuck between two stools.
> GeoNetwork is doing a great job on both search UI and on support for
> standard "harvesting" protocols like old-school OAI-PMH and
> new-fangled OpenSearch. 
> 
> One interesting byproject there is the MEF data package format.
> This is a structured zipfile containing metadata about the data,
> metadata about the contents of the package, potentially accompanying
> screenshots and thumbnails, potentially the data itself.
> http://frot.org/terradue/explore_mef.html is an excerpt of the GN
> manual that describes MEF. Again it is geospatially specific in 
> some assumptions about the detail, but this could be a useful way to
> think of delivering data from CKAN in something more approaching the 
> "apt-get install london" dream. MEF originated as part of an
> *interchange protocol between GN nodes* e.g. was a mechanism for
> registry/repositories to share data amongst one another.

This sounds very interesting. I'd been thinking of stuff along the lines
of python package metadata or good old apt, on the basis that it was
already "there" and could potentially be easily reused. I even started
on a datapkg tool similar to python's easy_install back in May.
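
For the sake of discussion, the "structured zipfile" idea can be
sketched in a few lines. This is neither the MEF layout nor anything
datapkg does: the metadata.json name, the data/ prefix and the function
names are all assumptions, chosen only to show metadata and data
travelling together in one archive.

```python
import io
import json
import zipfile
from typing import Dict, List


def build_package(metadata: dict, files: Dict[str, bytes]) -> bytes:
    """Bundle a metadata record and data files into one zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Hypothetical layout: one metadata file at the top level ...
        zf.writestr("metadata.json",
                    json.dumps(metadata, indent=2, sort_keys=True))
        # ... and the data itself under a data/ prefix.
        for name, content in files.items():
            zf.writestr("data/" + name, content)
    return buf.getvalue()


def read_metadata(package: bytes) -> dict:
    """Read the metadata record back without unpacking the data."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return json.loads(zf.read("metadata.json"))


def list_data_files(package: bytes) -> List[str]:
    """List the data files carried by the package."""
    prefix = "data/"
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return sorted(n[len(prefix):] for n in zf.namelist()
                      if n.startswith(prefix))
```

A registry could then index a package by calling read_metadata() alone,
and only fetch or unpack the data when someone actually installs it.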

> Now client-side software is getting into the act, e.g. the gvSIG
> graphical view and analysis program is getting a plugin that will
> generate stubs of MEFs, extract the spatial properties, i *assume*
> walk through filling in the more useful/mandatory fields, and POST
> the resulting metadata/data package off to a GeoNetwork instance.

Nice.

> This seems to make it less and less worthwhile to replicate what GN
> does directly, and more worthwhile to replicate its most successful
> interfaces. I start a project to do some of the above in python.
> Then i look at CKAN and think about how I'd like to add new query
> interfaces to it and contribute directly; being able to "scratch my
> own itch" with CKAN would maximise the chance that i commit something ;)

Absolutely. Am I understanding correctly that one could add a new query
interface to it that goes off and talks to other repositories (such as
GN)? If so, this sounds great and I'd happily help you out in coding
this up.

> Right now adding data sets to it feels like a drag because there lacks the
> capability to import new structured metadata records from an existing
> repository - something that MEF-likes could help to facilitate - or to
> easily dump out records for consumption by a different repository.

Completely agree. Earlier this week I put in support for purging
revisions (to deal with the occasional spam we're now getting), but the
next item is to provide a good machine API. In fact the system is already
designed around a quasi-RESTful interface so that a straight POST to:

http://ckan.net/package/create/

and

http://ckan.net/package/update/

with the right variables should work. However, something should probably
be done to make this more completely RESTful, and it needs some
documentation. The relevant current code is at:

http://knowledgeforge.net/ckan/svn/ckan/trunk/ckan/controllers/package.py
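
As a rough illustration of what such a client call might look like,
here is a sketch that only builds the POST request and does not send
it. The two URLs are the ones above; the form field names ("name",
"title") and the helper function are guesses for illustration, so check
package.py for the variables the controller really expects.

```python
from urllib.parse import urlencode
from urllib.request import Request

CKAN_BASE = "http://ckan.net"


def build_package_request(action: str, fields: dict) -> Request:
    """Build (but do not send) a form-encoded POST for a package.

    `action` is "create" or "update"; `fields` holds the form
    variables, whose names here are assumptions, not a documented API.
    """
    if action not in ("create", "update"):
        raise ValueError("action must be 'create' or 'update'")
    body = urlencode(fields).encode("utf-8")
    return Request(
        "%s/package/%s/" % (CKAN_BASE, action),
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical field names, for illustration only.
req = build_package_request("create", {"name": "london",
                                       "title": "London data"})
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing the request to
urllib.request.urlopen(), presumably with whatever authentication the
site ends up requiring.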

> It probably makes more sense to help CKAN to do this, rather than work
> on a near-clone specifically because it currently can't. 

I agree :)

> But a geo-specific near-clone has some of the constraints earmarked above
> which leave me in a position of:
> 
> - wanting to be able to "plug in" an extended domain model to a 
>   given ckan instance (it would be enough, though not ideal, if
>   all records in a given repository had to use the same model)
> 
> - wanting to "plug in" a query/display protocol to a core
> 
> - wanting an easy way to add 'post-create-hooks' to different 
>   classes of packages
>  
> - wanting to contrib useful stuff to the core, not just extensions
> 
> I scared myself off DIY frameworks a bit after the experiments with
> "nodel". But where that went wrong was the replicating-Django-in-RDF
> part of it, rather than the useful bits consisting of "application
> packages" of domain models + python modules defining HTTP/XMLish interfaces.

I think one always wants to use a framework but also remember that it 
only does 10% of the work ...

CKAN uses Pylons, which is the other major Python web framework and very
similar to Django, so I don't think there should be any problems using it
if you've used Django.

> Please tell me if I'm succumbing to frameworkitis again. 
> Plus, I am still addicted to GeoDjango to the extent that if i were to
> work on a custom distribution of CKAN then i would really want to port
> it from pylons to Django first. I know OKFN is platform-neutral, and i
> know pylons is ORM-neutral and there has been some PostGIS
> integration work with SQLAlchemy, but i consider it likely Django will
> provide richer "network effects" in terms of related work.
> I wish we didn't still have to have this conversation, either.

Now you're asking. This would be quite a rewrite, since CKAN also uses the
versioned domain model code, which is written against sqlobject (and,
almost, elixir). I note that bycycle.org, for one, uses pylons and
postgis, see e.g.:

http://bycycle.org/2007/01/29/using-postgis-with-sqlalchemy/

As this is rather technical, perhaps it is something we should discuss
further on okfn-help.

~rufus



