[okfn-discuss] Collaborative Development of Data
rufus.pollock at okfn.org
Thu Feb 15 13:03:22 GMT 2007
Prompted by our recent emails I've finally cleaned up a draft on the
Collaborative Development of Data which I first began work on early last
year. In it I raise various questions about the appropriateness of wikis
when working with information that is not structured text (unless one
interprets wikis simply to be a synonym for any system that has
versioning and a web interface -- in which case pretty much any system
one develops for collaborative development would count as a wiki).
Collaborative Development of Data
$ 2007-02-15 (First version 2006-05-24) $
We already have some fairly good working processes for collaborative
development of unstructured text: the two most prominent examples being
source code of computer programs and wikis for general purpose content
(encyclopedias etc). However these tools perform poorly (or not at all)
when we come to structured data.
The purpose of this document is to pose the question: how do we
collaboratively develop knowledge when it is in the form of structured
data (as opposed to unstructured text)?
There are two aspects of structured data that distinguish it from plain
text:
1. Referential integrity (objects point to other objects)
2. Labelling to enable machine processing (and the addition of new
machine-readable attributes)
To illustrate what I mean consider the following use case which comes
from our own [public domain works project][pdw]. Here we are storing
data about cultural works. In the simplest possible setup we have two
types of object: a work and a creator. A given work may have many
creators (authors) and a given creator may have created many works.
Furthermore, each work and each creator has various attributes. For the
purposes of this discussion let us focus on only two:
1. name (creator) and title (work)
2. date of death (creator) and date of creation (work)
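This domain model can be sketched directly in code. The following is a minimal illustration only: the class and field names (`Creator`, `Work`, `date_of_death`, `date_of_creation`) follow the attributes above, and the `link` helper is a hypothetical convenience for maintaining the many-to-many relation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Creator:
    # Field names mirror the attributes discussed above
    name: str
    date_of_death: Optional[int] = None      # year; None if unknown or living
    works: List["Work"] = field(default_factory=list)

@dataclass
class Work:
    title: str
    date_of_creation: Optional[int] = None   # year; None if unknown
    creators: List[Creator] = field(default_factory=list)

def link(work: Work, creator: Creator) -> None:
    # Maintain both sides of the many-to-many relation
    work.creators.append(creator)
    creator.works.append(work)

beethoven = Creator("Ludwig van Beethoven", date_of_death=1827)
ninth = Work("Symphony No. 9", date_of_creation=1824)
link(ninth, beethoven)
```

Note that nothing here is versioned yet; this is just the plain domain model that the rest of the discussion takes as its starting point.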
If we were to adopt a wiki setup ([a la wikipedia][wiki_beethoven]) we
would create a web page for each creator and each work. There would be a
url pointing to any associated objects with some kind of
human-processable (but likely not machine-processable) indicator of the
nature of the link. Attributes would also be included as plain text,
perhaps with some simple markup to indicate their nature, perhaps not.
The unique identifier for a given object would come in the form of a url.
This is a not unattractive approach as it is very easy to implement --
at least initially -- because wikis for plain text are so well developed
(and in fact it is the approach taken by the current v0.1 of public
domain works). The problems arise once one goes beyond simple data
entry. For example:
1. Searching, particularly structured searching (e.g. find all creators
who died more than seventy years ago and whose works are more than 100
years old), is slow and cumbersome compared to working with a database.
Referential integrity isn't enforced and neither are the unique
identifiers (url names)
2. Programmatic insertion and querying of the data is very limited. For
example suppose we obtain a library catalogue and wish to merge it into
the existing data. To do this we need to query the existing db
repeatedly to try and identify matches between existing objects and
objects in the catalogue.
3. No support for ACID, in particular no way:
1. To have (and enforce) referential integrity in your data
2. To do atomic commits which preserve referential integrity (even in
a simple wiki this is a problem in that renaming a page and changing
references to it have to be separate operations rather than one atomic
operation)
4. 'Data loss'/No data structure: when data structure isn't "enforced"
it may be extremely difficult (or impossible) to extract relevant
information (e.g. the date of death in the above example). In such
circumstances, at least from a
programmer's point of view, the data is now 'lost'. It also makes it
much harder to enforce data constraints when data is entered or to check
data validity once entered.
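To make point 1 concrete, here is what the structured search above looks like once the data sits in a database rather than in wiki pages. The schema and table names are invented for illustration; the point is that the query is a single declarative statement and the join table enforces the many-to-many structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE creator (id INTEGER PRIMARY KEY, name TEXT, death_year INTEGER);
CREATE TABLE work (id INTEGER PRIMARY KEY, title TEXT, creation_year INTEGER);
CREATE TABLE work_creator (
    work_id INTEGER REFERENCES work(id),
    creator_id INTEGER REFERENCES creator(id)
);
""")
conn.executemany("INSERT INTO creator VALUES (?, ?, ?)",
    [(1, "Ludwig van Beethoven", 1827), (2, "Igor Stravinsky", 1971)])
conn.executemany("INSERT INTO work VALUES (?, ?, ?)",
    [(1, "Symphony No. 9", 1824), (2, "The Rite of Spring", 1913)])
conn.executemany("INSERT INTO work_creator VALUES (?, ?)", [(1, 1), (2, 2)])

# Creators who died more than 70 years ago (relative to 2007) and whose
# works are more than 100 years old -- the example search from point 1
rows = conn.execute("""
    SELECT DISTINCT c.name
    FROM creator c
    JOIN work_creator wc ON wc.creator_id = c.id
    JOIN work w ON w.id = wc.work_id
    WHERE c.death_year < 2007 - 70 AND w.creation_year < 2007 - 100
""").fetchall()
```

In a wiki the same question requires crawling pages and parsing free text; here it is one query, and the `REFERENCES` clauses give us the referential integrity that wiki links cannot enforce.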
Thus we really want an approach that supports:
1. Versioning at the *model* level (i.e. not just of individual
objects)
2. Data types other than plain text
3. Associated tools:
* Tools to version structured data (none exist off the shelf)
* Tools for visualization, e.g. showing diffs (none off the shelf
either)[^1]
* Web interface to provide for direct editing (and integration of
associated tools such as diffs, changelogs etc)
* Programmatic API to access data
The obvious way to proceed with this is to develop 'versioned domain
models'. That is to develop traditional software-based or database-based
'domain model' which can then be versioned. This would be very similar
to the way that subversion models a filesystem and then adds versioning
of that filesystem[^3][^4].
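The subversion analogy can be sketched in a few lines. This is a toy illustration under stated assumptions, not a design: the model is reduced to a dictionary of objects keyed by identifier, and the `VersionedModel` class, its method names, and the `creator/1` keys are all invented for the example. What it shows is the essential idea: every commit is atomic and produces a new global revision of the *whole* model, and any past revision remains queryable.

```python
import copy

class VersionedModel:
    """Toy versioned domain model: each commit yields a new global
    revision of the whole model, like subversion's versioned filesystem."""

    def __init__(self):
        self.revisions = [{}]          # revision 0 is the empty model

    @property
    def head(self):
        return len(self.revisions) - 1

    def commit(self, changes):
        # Atomic commit: all changes land together in one new revision
        snapshot = copy.deepcopy(self.revisions[-1])
        snapshot.update(changes)
        self.revisions.append(snapshot)
        return self.head

    def get(self, key, revision=None):
        # Query any revision, not just the latest
        rev = self.head if revision is None else revision
        return self.revisions[rev].get(key)

m = VersionedModel()
m.commit({"creator/1": {"name": "Beethoven", "death": 1827}})
m.commit({"creator/1": {"name": "Ludwig van Beethoven", "death": 1827}})
```

A real implementation would store deltas rather than deep copies and would enforce referential integrity at commit time, but the interface (commit, head, revision-addressed reads) is the part that matters.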
[^wiktionaryz]: The WiktionaryZ project (now renamed OmegaWiki) has
been working on integrating referential integrity into a wiki-like
system.
[^1]: there are a bunch of (pre 1.0 AFAICT) tools for doing diffs on xml
data. See e.g. <http://www.logilab.org/projects/xmldiff>
[^2]: see <http://www.martinfowler.com/ap2/timeNarrative.html> for
software patterns for objects that change with time. There is also a
*very* extensive book on time-oriented db applications in sql by the
father of the temporal parts of sql3.
[^3]: the subversion model can best be gleaned from its API. A pythonic
version of that API can be seen in:
[^4]: MusicBrainz (http://www.musicbrainz.org/) already goes some way
towards having a versioned domain model in relation to music and its
creators.