[okfn-discuss] Collaborative Development of Data

Rufus Pollock rufus.pollock at okfn.org
Thu Feb 15 13:03:22 UTC 2007

Dear Erik,

Prompted by our recent emails I've finally cleaned up a draft on the 
Collaborative Development of Data which I first began work on early last 
year. In it I raise various questions about the appropriateness of wikis 
when working with information that is not structured text (unless one 
interprets wikis simply to be a synonym for any system that has 
versioning and a web interface -- in which case pretty much any system 
one develops for collaborative development would count as a wiki).



Collaborative Development of Data

$ 2007-02-15 (First version 2006-05-24) $

We already have some fairly good working processes for collaborative 
development of unstructured text: the two most prominent examples being 
source code of computer programs and wikis for general purpose content 
(encyclopedias etc). However these tools perform poorly (or not at all) 
when we come to structured data.

The purpose of this document is to pose the question: how do we 
collaboratively develop knowledge when it is in the form of structured 
data (as opposed to unstructured text)?

There are two aspects of structured data that distinguish it from plain 
text:

   1. Referential integrity (objects point to other objects)
   2. Labelling to enable machine processing (and the addition of 
machine-processable metadata)

To illustrate what I mean consider the following use case which comes 
from our own [public domain works project][pdw]. Here we are storing 
data about cultural works. In the simplest possible setup we have two 
types of object: a work and a creator. A given work may have many 
creators (authors) and a given creator may have created many works. 
Furthermore, each work and each creator has various attributes. For the 
purposes of this discussion let us focus on only two:

   1. name (creator) and title (work)
   2. date of death (creator) and date of creation (work)
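
This simple domain model can be sketched in code. The following is purely illustrative (the class and field names are my invention, not the actual public domain works schema); it just shows the many-to-many relation between works and creators with the two attributes above:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Creator:
    name: str
    death_date: Optional[str] = None  # e.g. "1827-03-26"
    works: List["Work"] = field(default_factory=list)

@dataclass
class Work:
    title: str
    creation_date: Optional[str] = None
    creators: List["Creator"] = field(default_factory=list)

def link(work: Work, creator: Creator) -> None:
    # Maintain both sides of the many-to-many relation.
    work.creators.append(creator)
    creator.works.append(work)

beethoven = Creator("Ludwig van Beethoven", death_date="1827-03-26")
ninth = Work("Symphony No. 9", creation_date="1824")
link(ninth, beethoven)
```
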

If we were to adopt a wiki setup ([a la wikipedia][wiki_beethoven]) we 
would create a web page for each creator and each work. Each page would 
link to its associated objects, with some human-processable (but likely 
not machine-processable) indication of the nature of the link. 
Attributes would also be included as plain text, perhaps with some 
simple markup to indicate their nature, perhaps not. The unique 
identifier for a given object would come in the form of a url.

[wiki_beethoven]: http://en.wikipedia.org/wiki/Ludwig_van_Beethoven

This is a not unattractive approach as it is very easy to implement -- 
at least initially -- because wikis for plain text are so well developed 
(and in fact it is the approach taken by the current v0.1 of public 
domain works). The problems arise once one goes beyond simple data 
entry. For example:

1. Searching, particularly structured searching (e.g. find all creators 
who died more than seventy years ago and whose works are more than 100 
years old), is slow and cumbersome compared to working with a database. 
Referential integrity isn't enforced, and the unique identifiers (url 
names) aren't either.

2. Programmatic insertion and querying of the data is very limited. For 
example suppose we obtain a library catalogue and wish to merge it into 
the existing data. To do this we need to query the existing db 
repeatedly to try and identify matches between existing objects and 
objects in the catalogue.

3. No support for ACID, in particular no way:
   1. To have (and enforce) referential integrity in your data 
structures [^wiktionaryz]
   2. To do atomic commits which preserve referential integrity (even in 
   a simple wiki this is a problem in that renaming a page and changing 
references to it have to be separate operations rather than one atomic 
operation)

4. 'Data loss'/No data structure: when data structure isn't "enforced" 
it may be extremely hard (or impossible) to extract relevant information 
(e.g. 
date of death in above example). In such circumstances, at least from a 
programmer's point of view, the data is now 'lost'. It also makes it 
much harder to enforce data constraints when data is entered or to check 
data validity once entered.
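
To make point 1 concrete: once the data sits in a relational store, that structured search becomes a single declarative query. A minimal sqlite sketch (the schema and sample rows are invented for illustration, and I use years rather than full dates to keep it short):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE creator (id INTEGER PRIMARY KEY, name TEXT, death_year INTEGER);
CREATE TABLE work (id INTEGER PRIMARY KEY, title TEXT, creation_year INTEGER,
                   creator_id INTEGER REFERENCES creator(id));
""")
conn.executemany("INSERT INTO creator VALUES (?, ?, ?)",
                 [(1, "Ludwig van Beethoven", 1827),
                  (2, "Igor Stravinsky", 1971)])
conn.executemany("INSERT INTO work VALUES (?, ?, ?, ?)",
                 [(1, "Symphony No. 9", 1824, 1),
                  (2, "The Rite of Spring", 1913, 2)])

# Creators who died more than seventy years ago (relative to 2007)
# and whose works are more than a hundred years old.
rows = conn.execute("""
    SELECT DISTINCT c.name
    FROM creator c JOIN work w ON w.creator_id = c.id
    WHERE c.death_year < 2007 - 70 AND w.creation_year < 2007 - 100
""").fetchall()
```

On a wiki the same question means scraping and parsing free-form pages; here it is one query, and the `REFERENCES` clause at least declares the integrity constraint the wiki cannot express at all.
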

Thus we really want an approach that supports:

   1. Versioning at the *model* level (i.e. not just of individual 
pages or files)
   2. Other data types than plain text
   3. Associated tools:
      * Tools to version the data (no off-the-shelf tools do this)
      * Tools for visualization, e.g. showing diffs (again, nothing 
off-the-shelf)[^1]
     * Web interface to provide for direct editing (and integration of 
associated tools such as diffs, changelogs etc)
     * Programmatic API to access data

The obvious way to proceed with this is to develop 'versioned domain 
models'. That is to develop traditional software-based or database-based 
'domain model' which can then be versioned. This would be very similar 
to the way that subversion models a filesystem and then adds versioning 
of that filesystem[^3][^4].
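
A toy sketch of that idea (entirely hypothetical code, not subversion's actual API): keep a global revision counter, record every attribute change as a snapshot under a revision, and make commits atomic so a set of related changes is never seen half-applied:

```python
class VersionedModel:
    """Toy versioned domain model: a global revision counter plus,
    per object, a history of attribute snapshots."""

    def __init__(self):
        self.revision = 0
        self.history = {}  # object id -> list of (revision, attributes)

    def commit(self, changes):
        # Atomic commit: all changes enter under one new revision number,
        # so cross-references updated together stay consistent.
        self.revision += 1
        for obj_id, attrs in changes.items():
            self.history.setdefault(obj_id, []).append(
                (self.revision, dict(attrs)))
        return self.revision

    def get(self, obj_id, revision=None):
        # Attributes of obj_id as of `revision` (latest if None).
        for rev, attrs in reversed(self.history.get(obj_id, [])):
            if revision is None or rev <= revision:
                return attrs
        return None

model = VersionedModel()
r1 = model.commit({"creator/1": {"name": "Beethoven"}})
r2 = model.commit({"creator/1": {"name": "Ludwig van Beethoven"},
                   "work/1": {"title": "Symphony No. 9"}})
```

The point of the sketch is only the shape of the thing: the domain model is ordinary objects and attributes, and versioning is layered on top of it, exactly as subversion layers versioning on top of a filesystem model.
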

[pdw]: http://www.publicdomainworks.net/

[^wiktionaryz]: The WiktionaryZ project (now renamed OmegaWiki) has 
been working on integrating referential integrity into a wiki-like 
system.

[^1]: There are a bunch of (pre 1.0, AFAICT) tools for doing diffs on xml 
data. See e.g. <http://www.logilab.org/projects/xmldiff>

[^2]: See <http://www.martinfowler.com/ap2/timeNarrative.html> for 
software patterns for objects that change with time. There is also a 
*very* extensive book on the topic of time-oriented db applications in 
sql by the father of the temporal parts of sql3: 

[^3]: the subversion model can best be gleaned from its API. A pythonic 
version of that API can be seen in:


[^4]: http://www.musicbrainz.org/ already goes some way towards having a 
versioned domain model in relation to music and its creators.
