[okfn-discuss] We Need Distributed Revision/Version Control for Data

Peter Murray-Rust pm286 at cam.ac.uk
Mon Jul 12 18:35:07 UTC 2010


On Mon, Jul 12, 2010 at 7:08 PM, Rufus Pollock <rufus.pollock at okfn.org> wrote:

> Today I wrote a post on distributed revision/version control for data:
>
>
> http://blog.okfn.org/2010/07/12/we-need-distributed-revisionversion-control-for-data/
>
> I'd be very interested to hear any comments people have, or any useful
> pointers to existing technology.
>
> Regards,
>
> Rufus
>
I think this is really important, but I think you are right that it
requires domain-specific tools (and indeed I think that any data
repositories will require domain-specific management - diffs are important
but not critical).

I have used a normal SCM to store some of my data. My problem is that
updates often take a lot of time even if very little has changed. This is
partly because the data can differ in insignificant ways which still require
formal diffs. For example, if a program recalculates data, the new output may
differ in insignificant digits, but these mandate that the whole data set is
replaced with a new version.
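
As a minimal sketch of the kind of domain-specific comparison this would
need (the function name and the tolerance values are hypothetical, not taken
from any existing SCM or tool), a tolerance-aware check in Python could flag
only the values that changed beyond an insignificance threshold:

```python
import math


def significant_changes(old, new, rel_tol=1e-9, abs_tol=0.0):
    """Return the indices where old and new differ by more than the tolerance.

    rel_tol and abs_tol are illustrative defaults; a real data set would
    need thresholds chosen for its own precision.
    """
    if len(old) != len(new):
        raise ValueError("data sets must have the same length")
    return [
        i
        for i, (a, b) in enumerate(zip(old, new))
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# A recalculated data set: the first value differs only in the last digit,
# the third value has genuinely changed.
old = [1.234567890123, 2.5, 100.0]
new = [1.234567890124, 2.5, 100.1]
changed = significant_changes(old, new)
print(changed)
```

If the returned list is empty, the recalculated output could be treated as
unchanged and no new version would need to be committed.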

The main current benefit of SCM for data is that the data are actually
stored at least once! (Many scientists store their data zero times.)

P.


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069