[okfn-discuss] Metadata registries too domain specific or too general - how to fill in the gaps?

Jo Walsh jo at frot.org
Wed Oct 17 21:39:32 UTC 2007


(Didn't those use to be known as Content Management Systems, back in
the day? What changed?)

Rufus suggested i start braindumping *somewhere*, about *something*
related to data repository / collective metadata management.
I have been suffering from a bit of intellectual constipation recently.
I once spent the years 2003-2005 rewriting the same graph-annotating
application over and over and over. Eventually i tired of that and
began to write variants on it, one of which was an early attempt at a
backend for CKAN, the knowledge package archive project. http://www.ckan.net/

As so often happens, the software stalled half-finished and mutated into something else.
Later, when i picked up the task of making metadata libraries for geodata, 
i was much aware of the conceptual overlap, and hoped to reintegrate later.
Meanwhile the Java OSGeo people would nudge me - "Why don't you just use 
FAO GeoNetwork?" Eventually I learned to overcome some anti-java prejudice.

In the GIS software world there is an overfocus on standards. Given
several different domain models, XML serialisations, etc., each
mandated by different regional government data policies, what is an
implementor to do? The GN answer is to use XML templates (or XML
documents as prototype templates) and to store per-package metadata in
BLOBs of XML in a database - extracting a few key properties for indexing.
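The blob-plus-index approach described above can be sketched in a few lines of Python. This assumes nothing about GeoNetwork's actual schema or field names; it just shows the pattern of keeping the full record opaque while extracting a handful of properties for the search index.

```python
# Sketch of the store-as-blob, index-a-few-properties pattern.
# Field names here are illustrative, not GeoNetwork's real schema.
import xml.etree.ElementTree as ET

blob = """<metadata>
  <title>London boundaries</title>
  <abstract>Administrative boundaries for Greater London.</abstract>
  <keyword>boundaries</keyword>
</metadata>"""

def index_properties(xml_blob, fields=("title", "keyword")):
    """Extract only the indexable fields; the blob itself stays opaque."""
    root = ET.fromstring(xml_blob)
    return {f: root.findtext(f) for f in fields}

# The database row holds the full XML plus the extracted index columns.
row = {"blob": blob, "index": index_properties(blob)}
print(row["index"])  # {'title': 'London boundaries', 'keyword': 'boundaries'}
```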

But this makes it harder to re-use descriptions of people or
organisations, and to get a 'network effect' from that re-use; the
*internal structure* of these data sets often can't be expressed by
the prevalent standards (ISO 19115, FGDC, etc). The information models,
domain models for geographic data have a lot of specific details in them 
that I don't necessarily want to fill out, or see on the screen.

I wrote a while back about the extrapolation of a "core model"
from the current standards prevalent in geo-metadata; and also about
what i perceive as structural flaws in some of the information model
designs that are byproducts of a "metadata first, data later" view.
http://wiki.osgeo.org/index.php/Why_DCLite4G
http://frot.org/terradue/minimal_metadata.html

Implementation-wise, though, this needs more than a generic CKAN.
Foremost is the ability to add spatial properties of data sets - typically
an envelope described by two X,Y points showing what area of space the
data set covers. This bleeds into wanting more things:

- The ability to store and query geometry objects in the backend
  (for the postgres database this is supplied by PostGIS)
- The ability to spatially filter search results from the frontend
- "Semi-automatic" grabbing of a data set and extracting the spatial 
  extents from the metadata in the file (usually possible)
  and post-facto inserting that into the metadata record.
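The first two items above can be sketched without any GIS stack at all: an envelope is just two X,Y points, and spatial filtering is envelope intersection. (In practice PostGIS does this in the database; the helper names below are my own, for illustration.)

```python
# Minimal sketch: a data set's spatial extent as an envelope of two
# X,Y points, plus envelope-intersection filtering of a catalogue.
# Helper names are illustrative, not any real CKAN/PostGIS API.

def envelope(points):
    """Return ((min_x, min_y), (max_x, max_y)) covering all points."""
    xs = [x for x, y in points]
    ys = [y for x, y in points]
    return ((min(xs), min(ys)), (max(xs), max(ys)))

def intersects(a, b):
    """True if two envelopes overlap."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

# Toy catalogue: package name -> extent extracted from its files.
catalogue = {
    "london-boundaries": envelope([(-0.5, 51.3), (0.3, 51.7)]),
    "scotland-rivers": envelope([(-6.0, 54.6), (-1.7, 58.7)]),
}

# Spatial filter: which packages cover a query window around London?
query = ((-0.2, 51.4), (0.1, 51.6))
hits = [name for name, env in catalogue.items() if intersects(env, query)]
print(hits)  # ['london-boundaries']
```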

Last year it made sense to DIY with some homegrown geo extensions to 
sqlobject. Then GeoDjango came along and rendered all that obsolete,
and I ported across in a couple of hours. 

But now development-wise I feel i am stuck between two stools.
GeoNetwork is doing a great job on both search UI and on support for
standard "harvesting" protocols like old-school OAI-PMH and
new-fangled OpenSearch. 

One interesting side project there is the MEF data package format.
This is a structured zipfile containing metadata about the data,
metadata about the contents of the package, potentially accompanying
screenshots and thumbnails, potentially the data itself.
http://frot.org/terradue/explore_mef.html is an excerpt of the GN
manual that describes MEF. Again it is geospatially specific in 
some assumptions about the detail, but this could be a useful way to
think of delivering data from CKAN in something more approaching the 
"apt-get install london" dream. MEF originated as part of an
*interchange protocol between GN nodes*, i.e. a mechanism for
registries/repositories to share data amongst one another.
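Since MEF is just a structured zipfile, the shape of such a package can be sketched with the Python stdlib. The entry names below are illustrative of the layers described above (metadata about the data, metadata about the package contents, the data itself), not the exact MEF layout - see the excerpt linked above for that.

```python
# Rough sketch of building and reading a MEF-like package.
# Entry names are illustrative, not the real MEF specification.
import io
import zipfile

def build_package(metadata_xml, manifest_xml, data_files):
    """Return the bytes of a MEF-style zip package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("metadata.xml", metadata_xml)  # metadata about the data
        zf.writestr("info.xml", manifest_xml)      # metadata about the package contents
        for name, payload in data_files.items():
            zf.writestr("data/" + name, payload)   # optionally, the data itself
    return buf.getvalue()

pkg = build_package(
    "<metadata><title>London</title></metadata>",
    "<info><format>MEF-like</format></info>",
    {"boundaries.gml": b"<gml/>"},
)

# Reading it back, as a harvesting repository might:
with zipfile.ZipFile(io.BytesIO(pkg)) as zf:
    print(sorted(zf.namelist()))
    # ['data/boundaries.gml', 'info.xml', 'metadata.xml']
```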

Now client-side software is getting into the act, e.g. the gvSIG
graphical view and analysis program is getting a plugin that will
generate stubs of MEFs, extract the spatial properties, i *assume*
walk through filling in the more useful/mandatory fields, and POST
the resulting metadata/data package off to a GeoNetwork instance.

This seems to make it less and less worthwhile to replicate what GN
does directly, and more worthwhile to replicate its most successful
interfaces. I start a project to do some of the above in python.
Then i look at CKAN and think about how I'd like to add new query
interfaces to it and contribute directly; being able to "scratch my
own itch" with CKAN would maximise the chance that i commit something ;)

Right now adding data sets to it feels like a drag because it lacks the
capability to import structured metadata records from an existing
repository - something that MEF-like packages could help facilitate - or to
easily dump out records for consumption by a different repository.
It probably makes more sense to help CKAN do this, rather than work
on a near-clone, specifically because it currently can't.
But a geo-specific near-clone has some of the constraints outlined
above, which leaves me in a position of:

- wanting to be able to "plug in" an extended domain model to a 
  given ckan instance (it would be enough, though not ideal, if
  all records in a given repository had to use the same model)

- wanting to "plug in" a query/display protocol to a core

- wanting an easy way to add 'post-create-hooks' to different 
  classes of packages
 
- wanting to contrib useful stuff to the core, not just extensions
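The third wish above - 'post-create-hooks' per class of package - is the easiest to sketch concretely. This is a toy registry under names i've made up; none of it is real CKAN API, it just illustrates what "plugging in" behaviour for a package class might look like.

```python
# Toy sketch of per-package-class post-create hooks.
# All names are hypothetical, not CKAN's actual API.

hooks = {}  # package class -> list of post-create callbacks

def post_create_hook(pkg_class):
    """Decorator registering a callback for a given package class."""
    def register(fn):
        hooks.setdefault(pkg_class, []).append(fn)
        return fn
    return register

@post_create_hook("geodata")
def extract_extents(record):
    # e.g. semi-automatically derive the spatial envelope on creation
    record["extent"] = "computed-envelope"
    return record

def create_package(pkg_class, record):
    """Create a record, then run any hooks registered for its class."""
    for fn in hooks.get(pkg_class, []):
        record = fn(record)
    return record

created = create_package("geodata", {"name": "london"})
print(created)  # {'name': 'london', 'extent': 'computed-envelope'}
```

An extended domain model could plug in the same way: a registry mapping a package class to extra fields and the hooks that populate them.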

I scared myself off DIY frameworks a bit after the experiments with
"nodel". But where that went wrong was the replicating-Django-in-RDF
part of it, rather than the useful bits consisting of "application
packages" of domain models + python modules defining HTTP/XMLish interfaces.

Please tell me if I'm succumbing to frameworkitis again. 
Plus, I am still addicted to GeoDjango to the extent that if i were to
work on a custom distribution of CKAN then i would really want to port
it from pylons to Django first. I know OKFN is platform-neutral, and i
know pylons is ORM-neutral and there has been some PostGIS
integration work with SQLAlchemy, but i consider it likely Django will
provide richer "network effects" in terms of related work.
I wish we didn't still have to have this conversation, either.

cheers,


jo
