[okfn-discuss] Collaborative Development of Data

Mon Feb 19 16:03:27 UTC 2007

Benj. Mako Hill wrote:
> <quote who="Rufus Pollock" date="Thu, Feb 15, 2007 at 01:03:22PM +0000">
>> We already have some fairly good working processes for collaborative 
>> development of unstructured text: the two most prominent examples being 
>> source code of computer programs and wikis for general purpose content 
>> (encyclopedias etc). However these tools perform poorly (or not at all) 
>> when we come to structured data.
> 
> If you know how to massage the system, both Mediawiki (especially with a
> series of plugins) and MoinMoin are pretty good at this as well.

Hmmm. It depends on how you are using them. Of course, /you can/ start 
embedding SQL queries or forms into them and backing them with a 
database, but then you are essentially using them as a 'web framework' 
to help with theming, user management etc. (and they are no different 
from, and perhaps worse than, all the other web frameworks out there 
such as Ruby on Rails, Django, Pylons, TurboGears ...).

Once you start having structured data you want to start from a system 
and an underlying domain model that respects that -- you don't want to 
start from something that is designed to handle 'content' (that is 
unstructured human-readable text).

The key point that lay behind my original post is that:

The crucial feature for collaborative development of *anything* is 
versioning.

I then discussed how one would support this feature in relation to 
structured data. One possibility is just to store the data as plain text 
(perhaps with some agreed formatting) in a wiki. This idea comes to mind 
because versioning is a major feature of wikis and is part of what makes 
them good. However, versioning is also implemented (and implementable) 
in lots of other areas in which wikis would not be a good idea. I then 
suggested that if you want to version data you would indeed want 
something different from a wiki.

Perhaps a summary 'table' will help:

Wiki:

   * Content type: for processing by a human (usually text)
   * Versioning: simple (per page) with associated features (recent 
changes, history)
   * Web interface: yes (often only interface)
   * Write access: open (often anyone can write)

Code:

   * Content type: human readable *and* machine readable
   * Versioning: sophisticated (atomic, centralized and decentralized)
   * Web interface: maybe (most development is done locally and then 
committed)
   * Write access: closed (limited commit access as code is fragile)

Structured Data:

   * Content type: typed data with possibility of references
   * Versioning: sophisticated would be nice
   * Web interface: would be nice but not essential
   * Write access: ?
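
As a rough illustration of what 'sophisticated' versioning of typed data 
with references might look like, here is a minimal Python sketch. All 
names in it (VersionedStore, commit, the "pkg:1"/"lic:1" ids) are 
hypothetical and are not taken from vdm or ckan. It assumes the simplest 
possible approach, a full snapshot of every record per revision, which 
is wasteful but makes a change to several records atomic:

```python
import copy


class VersionedStore:
    """Hypothetical in-memory store. Each commit creates one new revision
    holding a snapshot of all records, so multi-record changes are atomic."""

    def __init__(self):
        self._revisions = [{}]            # revision 0 is the empty state
        self._log = ["initial empty state"]

    @property
    def head(self):
        """Number of the latest revision."""
        return len(self._revisions) - 1

    def commit(self, changes, message):
        """Apply `changes` (record id -> dict, or None to delete) as a
        single new revision and return its number."""
        state = copy.deepcopy(self._revisions[-1])
        for record_id, value in changes.items():
            if value is None:
                state.pop(record_id, None)
            else:
                state[record_id] = copy.deepcopy(value)
        self._revisions.append(state)
        self._log.append(message)
        return self.head

    def get(self, record_id, revision=None):
        """Return a record as it was at `revision` (default: latest)."""
        rev = self.head if revision is None else revision
        return copy.deepcopy(self._revisions[rev].get(record_id))

    def history(self):
        """List of (revision number, commit message) pairs."""
        return list(enumerate(self._log))


store = VersionedStore()
# One atomic revision adds a package and the license record it refers to.
r1 = store.commit(
    {"pkg:1": {"name": "geodata", "license": "lic:1"},
     "lic:1": {"title": "Public Domain"}},
    "add package and its license",
)
r2 = store.commit(
    {"pkg:1": {"name": "geodata-uk", "license": "lic:1"}},
    "rename package",
)
old = store.get("pkg:1", revision=r1)   # the record before the rename
new = store.get("pkg:1")                # the record at the latest revision
```

The point of the sketch is only that revisions apply to the whole set of 
records rather than to one page at a time, which is closer to how code 
is versioned than to per-page wiki history.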

> My biggest problem is that the syntax necessary to show where the data
> is (labeling in your description) is not something people always get
> right -- in fact, they very frequently get it wrong.

Exactly. This is the validation point. There is a reason that most 
development of systems that handle structured data (and hey, this is 
most of enterprise software) starts from a domain model which implements 
this kind of core logic (this is an integer, that is a string which 
should be a valid email address etc.).
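
To make the validation point concrete, here is a minimal Python sketch 
of a domain object that enforces 'this is an integer, that is a string 
which should be a valid email address'. The Person class and its fields 
are hypothetical, and the email pattern is deliberately loose and for 
illustration only:

```python
import re
from dataclasses import dataclass

# Illustrative pattern only: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class Person:
    """Hypothetical domain object: type and format checks live in the
    model itself rather than in whatever front-end collects the data."""
    age: int
    email: str

    def __post_init__(self):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(self.age, int) or isinstance(self.age, bool):
            raise TypeError("age must be an integer")
        if not isinstance(self.email, str) or not EMAIL_RE.match(self.email):
            raise ValueError("email must look like a valid email address")


ok = Person(age=30, email="someone@example.org")   # accepted

try:
    Person(age=30, email="not an email")
except ValueError as exc:
    rejected = str(exc)                            # rejected at creation
```

Because the check sits in the model, bad data is rejected whether it 
arrives through a web form, a wiki-like page or a script.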

> Now, there is syntax in wikis for documents too of course. But when you
> get it wrong, it's not usually so bad. While getting syntax wrong in a
> document may make your document a bit ugly, it's frequently noticeable
> and only infrequently impacts your meaning.

Absolutely. Wikis were designed to handle text that would be processed 
by humans, where it wouldn't matter if it was a bit wrong (this also 
explains why most people don't develop software in 'wikis': errors in 
the code lead to stuff not compiling/running).

> But when you have a wide-open text box for data, screw-ups can be both
> much more difficult to detect (both for the computer, and for a human
> reading the page) and the impact is often that the data is unseen in
> other parts of the system.

With structured data, bugs of both the basic and the more sophisticated 
kind become much more costly ... That is why you want to start by 
developing a proper domain model.
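
A small Python sketch of the failure mode described above, assuming a 
made-up 'label: value' page format with hypothetical labels (title, 
license, url). The loose parser behaves like a wide-open text box: a 
misspelt label is silently dropped, so the data goes unseen by the rest 
of the system. The strict parser treats the same slip as an error:

```python
KNOWN_LABELS = {"title", "license", "url"}   # hypothetical schema


def parse_loose(text):
    """Keep lines whose label is recognised; silently ignore the rest."""
    record = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        if sep and key in KNOWN_LABELS:
            record[key] = value.strip()
    return record


def parse_strict(text):
    """Same format, but any unrecognised or malformed line is an error."""
    record = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue                          # blank lines are allowed
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        if not sep or key not in KNOWN_LABELS:
            raise ValueError(
                f"line {lineno}: unrecognised label {label.strip()!r}"
            )
        record[key] = value.strip()
    return record


# 'licence' is a plausible human spelling, but not the expected label.
page = (
    "title: UK postcodes\n"
    "licence: public domain\n"
    "url: http://example.org/data"
)

loose = parse_loose(page)    # has 'title' and 'url' but no 'license'

try:
    parse_strict(page)
except ValueError as exc:
    problem = str(exc)       # the slip is reported, with its line number
```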

> The comparison between Microsoft Word and Microsoft Access with its form
> wizards is a useful analogy perhaps. Wikis, as they exist currently, do
> a pretty good job of addressing the first class of problems but do a
> pretty poor job (as of right now and as I understand it) of addressing
> the types of problems that Access does.
> 
> What's great about Access is the interface is flexible and, once set up,
> you can make it difficult for users to add bad or invalid data by
> accident. It does most of this, of course, through interface rather than
> validation -- this is the reason people find such systems so usable. I've
> always been sad that I've never seen a great piece of free software that
> did that same thing as well. Of course, this piece of free software
> should be collaborative as well, and that introduces lots of other
> problems.

I think this is a nice analogy but I think we should be careful here. 
The core thing is what the underlying domain model supports, not what 
the front-end looks like. Most web applications are backed by a proper 
domain model (and behind that a database) but might have very simple and 
intuitive interfaces (e.g. del.icio.us). In fact you can implement a 
proper versioned domain model so that its interface looks 'like' a wiki. 
You can also start putting proper object oriented structures into a wiki 
(as I think OmegaWiki are doing) but in so doing you aren't really a 
wiki any more (or if you are, what one means by a wiki has become so 
broad as to encompass pretty much anything).

> Perhaps, I'll write one some day. I think that such a project could
> learn a whole lot from wikis. However, I think there's a danger that we
> could "learn" a bit too much and not diverge in ways that will be
> essential to the success of such a project.

I completely agree. What I personally take from wikis is that:

   * open write access (so very low barrier to entry) can work really well
   * you can successfully port existing ideas such as versioning to 
other areas (in this case from code to content)

> Maybe such tools exist and I just don't know about them. That would be
> very exciting indeed!

I too would love to hear about them. As a complement to this discussion 
I'm sketching out a generic versioned domain model implementation for 
use in ckan (http://www.ckan.net). You can find some demo code at:

http://www.rufuspollock.org/code/vdm/
http://www.rufuspollock.org/code/vdm/README.txt

Regards,

Rufus