[okfn-discuss] Metadata registries too domain specific or too general - how to fill in the gaps?
Rufus Pollock
rufus.pollock at okfn.org
Fri Oct 19 11:46:15 UTC 2007
Jo Walsh wrote:
[snip]
> In the GIS software world there is an overfocus on standards. Given
> several different domain models + XML serialisations, etc,
> all differingly mandatory in regional government data policies, what
> is an implementor to do? The GN answer is to use XML templates (or XML
> documents as prototype templates) and to store per-package metadata in
> BLOBs of XML in a database - extracting a few key properties for indexing.
Interesting. One would imagine there must be quite a bit of common stuff
too, plus some consensus on what's an absolute minimum (a bit like Dublin
Core for document metadata).
> But it becomes harder to re-use, and get a 'network effect' from the
> re-use of, descriptions of people or organisations; the *internal
> structure* of these data sets can't often be expressed by the
> prevalent standards (ISO 19115, FGDC, etc). The information models,
> domain models for geographic data have a lot of specific details in them
> that I don't necessarily want to fill out, or see on the screen.
>
> I wrote a bit back when about the extrapolation of a "core model"
> from the current standards prevalent in geo-metadata; and also about
> what I perceive as structural flaws in some of the information model
> designs that are byproducts of a "metadata first, data later" view.
> http://wiki.osgeo.org/index.php/Why_DCLite4G
> http://frot.org/terradue/minimal_metadata.html
Ah! You're already there so ignore my previous comment :) I note for
others that the actual spec (which looks nice and concise) is at:
http://wiki.osgeo.org/index.php/DCLite4G
>
> But implementation-wise this needs more than a generic CKAN, though.
> Foremost is ability to add spatial properties of data sets - typically
> an envelope described by two X,Y points showing what area of space the
> data set covers. This bleeds into wanting more things:
>
> - The ability to store and query geometry objects in the backend
> (for the postgres database this is supplied by PostGIS)
> - The ability to spatially filter search results from the frontend
> - "Semi-automatic" grabbing of a data set and extracting the spatial
> extents from the metadata in the file (usually possible)
> and post-facto inserting that into the metadata record.
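To make the envelope idea above concrete, here is a minimal sketch in plain Python, assuming an envelope is simply a bounding box given by two X,Y corner points. It is not code from CKAN or GeoNetwork: the `Envelope` class, `filter_by_envelope` helper, package names, and coordinates are all made up for illustration, and a real backend would delegate the test to PostGIS rather than do it in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Axis-aligned bounding box described by two corner points (hypothetical)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other: "Envelope") -> bool:
        # Two boxes overlap when their ranges overlap on both axes.
        return (self.min_x <= other.max_x and other.min_x <= self.max_x
                and self.min_y <= other.max_y and other.min_y <= self.max_y)


def filter_by_envelope(packages, query):
    """Return the names of packages whose envelope overlaps the query box."""
    return sorted(name for name, env in packages.items() if env.intersects(query))


# Illustrative records only; names and coordinates are invented.
packages = {
    "london-streets": Envelope(-0.51, 51.28, 0.33, 51.69),
    "paris-streets": Envelope(2.22, 48.81, 2.47, 48.90),
}
query = Envelope(-1.0, 51.0, 1.0, 52.0)
print(filter_by_envelope(packages, query))
```

The same overlap test is what a frontend spatial filter would ask the database to perform for it.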
Okay let's break this down into:
1. Additional metadata: possible ways to do this in CKAN include
   a) 'machine tags' (as proposed by Aaron), which just involves
      layering on the existing tag infrastructure
   b) arbitrary additional per-package metadata (should be easy, but
      then it is hard to specify groupings a la dc4g)
   c) metadata 'plugins' that allow addition of metadata sets (a la dc4g)
2. Querying based on metadata. One nice thing about plugins is that
   they could provide, in addition to the basic metadata spec,
   metadata-specific code for the user interface, for querying, etc.
3. Machine-automated retrieval of data. This is the reason (or at least
   one of the major reasons) for building the system. It would be the
   beginning of componentization and automated builds for knowledge
   packages, similar to the way we do software.
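As a rough sketch of option 1a, machine tags can be layered on an ordinary tag list by parsing the `namespace:predicate=value` convention. This is an illustration only, not CKAN code: the `split_tags` helper and the `geo:` and `dc:` tag names below are hypothetical.

```python
import re
from collections import defaultdict

# namespace:predicate=value, e.g. "geo:min_x=-0.51"
MACHINE_TAG = re.compile(
    r"^(?P<namespace>[A-Za-z][A-Za-z0-9_]*):"
    r"(?P<predicate>[A-Za-z][A-Za-z0-9_]*)="
    r"(?P<value>.+)$"
)


def split_tags(tags):
    """Separate plain tags from machine tags, grouping the latter by namespace."""
    plain = []
    metadata = defaultdict(dict)
    for tag in tags:
        match = MACHINE_TAG.match(tag)
        if match:
            metadata[match["namespace"]][match["predicate"]] = match["value"]
        else:
            plain.append(tag)
    return plain, dict(metadata)


# Hypothetical tags on one package.
tags = ["transport", "geo:min_x=-0.51", "geo:min_y=51.28", "dc:creator=Example Org"]
plain, metadata = split_tags(tags)
print(plain)
print(metadata)
```

Grouping by namespace gives something close to a per-plugin metadata set, though values stay untyped strings, which is one of the limits of this approach.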
> Last year it made sense to DIY with some homegrown geo extensions to
> sqlobject. Then GeoDjango came along and rendered all that obsolete,
> and I ported across in a couple of hours.
>
> But now development-wise I feel i am stuck between two stools.
> GeoNetwork is doing a great job on both search UI and on support for
> standard "harvesting" protocols like old-school OAI-PMH and
> new-fangled OpenSearch.
>
> One interesting byproject there is the MEF data package format.
> This is a structured zipfile containing metadata about the data,
> metadata about the contents of the package, potentially accompanying
> screenshots and thumbnails, potentially the data itself.
> http://frot.org/terradue/explore_mef.html is an excerpt of the GN
> manual that describes MEF. Again it is geospatially specific in
> some assumptions about the detail, but this could be a useful way to
> think of delivering data from CKAN in something more approaching the
> "apt-get install london" dream. MEF originated as part of an
> *interchange protocol between GN nodes* e.g. was a mechanism for
> registry/repositories to share data amongst one another.
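A "structured zipfile" of this kind is easy to sketch with the Python standard library. The layout below (a `metadata.xml`, an `info.xml` describing the package contents, and a `data/` directory) is an assumption made for illustration and should not be read as the MEF specification; `build_package` and `list_package` are hypothetical helper names.

```python
import io
import zipfile


def build_package(metadata_xml, info_xml, data_files):
    """Bundle metadata, package info and data files into one zip (in memory)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("metadata.xml", metadata_xml)  # metadata about the data
        archive.writestr("info.xml", info_xml)          # metadata about the package
        for name, payload in data_files.items():
            archive.writestr("data/" + name, payload)   # optionally, the data itself
    return buffer.getvalue()


def list_package(raw):
    """List the member names of a package built by build_package."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return archive.namelist()


# Placeholder content only.
raw = build_package(
    "<metadata><title>London streets</title></metadata>",
    "<info><files>1</files></info>",
    {"streets.csv": "id,name\n1,Example Road\n"},
)
print(list_package(raw))
```

A repository that can both emit and ingest such a bundle has the start of an interchange format, which is the role MEF plays between GN nodes.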
This sounds very interesting. I'd been thinking of stuff along the lines
of Python package metadata or good old apt, on the basis that it was
already "there" and could potentially be easily reused. I even started
on a datapkg tool, similar to Python's easy_install, back in May.
> Now client-side software is getting into the act, e.g. the gvSIG
> graphical view and analysis program is getting a plugin that will
> generate stubs of MEFs, extract the spatial properties, i *assume*
> walk through filling in the more useful/mandatory fields, and POST
> the resulting metadata/data package off to a GeoNetwork instance.
Nice.
> This seems to make it less and less worthwhile to replicate what GN
> does directly, and more worthwhile to replicate its most successful
> interfaces. I start a project to do some of the above in python.
> Then i look at CKAN and think about how I'd like to add new query
> interfaces to it and contribute directly; being able to "scratch my
> own itch" with CKAN would maximise the chance that i commit something ;)
Absolutely. Am I understanding correctly that one could add a new query
interface to it that goes off and talks to other repositories (such as
GN)? If so this sounds great and I'd happily help you out in coding
this up.
> Right now adding data sets to it feels like a drag because it lacks the
> capability to import new structured metadata records from an existing
> repository - something that MEF-likes could help to facilitate - or to
> easily dump out records for consumption by a different repository.
Completely agree. Earlier this week I put in support for purging
revisions (to deal with the occasional spam we're now getting) but the
next item is to provide a good machine API. In fact the system is already
designed around a quasi-RESTful interface so that a straight POST to:
http://ckan.net/package/create/
and
http://ckan.net/package/update/
with the right variables should work. However something should probably
be done to make this more completely RESTful. Plus some documentation.
The relevant current code is at:
http://knowledgeforge.net/ckan/svn/ckan/trunk/ckan/controllers/package.py
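For illustration, a client-side POST to the create URL above might be prepared as follows. The form field names (`name`, `title`) are assumptions, since the "right variables" are not listed here; check the controller code for the real ones. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

CREATE_URL = "http://ckan.net/package/create/"


def build_create_request(fields):
    """Build (but do not send) a form-encoded POST for a new package."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        CREATE_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Field names here are hypothetical.
request = build_create_request({"name": "london-streets", "title": "London streets"})
print(request.get_method(), request.full_url)
print(request.data)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`, presumably with whatever authentication the site requires.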
> It probably makes more sense to help CKAN to do this, rather than work
> on a near-clone specifically because it currently can't.
I agree :)
> But a geo-specific near-clone has some of the constraints earmarked above
> which leave me in a position of:
>
> - wanting to be able to "plug in" an extended domain model to a
> given ckan instance (it would be enough, though not ideal, if
> all records in a given repository had to use the same model)
>
> - wanting to "plug in" a query/display protocol to a core
>
> - wanting an easy way to add 'post-create-hooks' to different
> classes of packages
>
> - wanting to contrib useful stuff to the core, not just extensions
>
> I scared myself off DIY frameworks a bit after the experiments with
> "nodel". But where that went wrong was the replicating-Django-in-RDF
> part of it, rather than the useful bits consisting of "application
> packages" of domain models + python modules defining HTTP/XMLish interfaces.
I think one always wants to use a framework, but one should also remember
that it only does 10% of the work ...
CKAN uses Pylons, which is the other major Python framework and very
similar to Django, so I don't think there should be any problems using it
if you've used Django.
> Please tell me if I'm succumbing to frameworkitis again.
> Plus, I am still addicted to GeoDjango to the extent that if i were to
> work on a custom distribution of CKAN then i would really want to port
> it from pylons to Django first. I know OKFN is platform-neutral, and i
> know pylons is ORM-neutral and there has been some PostGIS
> integration work with SQLAlchemy, but i consider it likely Django will
> provide richer "network effects" in terms of related work.
> I wish we didn't still have to have this conversation, either.
Now you're asking. This is quite a rewrite since CKAN also uses the
versioned domain model code, which is written against sqlobject (and,
almost, elixir). I note that bycycle.org, for example, uses Pylons and
PostGIS; see:
http://bycycle.org/2007/01/29/using-postgis-with-sqlalchemy/
As this is rather technical perhaps this is something we should discuss
further on okfn-help.
~rufus