[okfn-help] Access control and permissions on a tahoe grid

Julian Todd julian at goatchurch.org.uk
Thu Jun 25 01:50:05 BST 2009


==Tahoe grid use cases for OKFN==

I have several distinct examples of data which would be better stored
in the grid than where they are now, in the kforge SVN system.


(1) Large quantity of webscraped United Nations documents in PDF form
(both text and scanned):
http://knowledgeforge.net/ukparse/svn/trunk/undata/pdf/

(2) pdf2xml versions that have been corrected for typos and other
problems: http://knowledgeforge.net/ukparse/svn/trunk/undata/pdfxml/
 (We are not interested in the intermediate versions -- just the
original (which can be obtained by calling pdf2xml again) and the
final edit that parses.)

(3) Large volumes of webscraped HTML from the UK Parliament, saved as
successive versions (-a, -b, etc) that all need to stay accessible
  http://knowledgeforge.net/ukparse/svn/trunk/parldata/cmpages/

(4) Scanned hand-written maps, field notes and pages from logbooks
  http://knowledgeforge.net/sesame/club/mmmmc/Ireby%20Fell%20Cavern/rawscans/


==Why SVN is inappropriate==

SVN was made for code-bases.  But it's been used on kforge for hosting
files of the types listed above, because it manages (a) the backups,
and (b) synchronizing the data into other people's directories.

However, the versioning and diff features are entirely
counterproductive for such files, and costly, because every working
copy stores an unnecessary duplicate of each file that has been
checked out.

It's easy to forget what a special application coding is, and why
versioning works for it, but doesn't for documents of the listed sort.

With code, you (and other coders) change batches of files together.
It's considered wrong to check in code that doesn't compile -- ie
all the links between the files have to change consistently.  That
works because all the links are internal to the project.

With data, the links are coming from the outside.  So you cannot
change a document whilst keeping its identity (name) the same.  For
legal documents, changing one in place would break the incoming
references, so you need to publish amendments or revisions and keep
all versions available.

With scans of paper evidence, you could in principle revise them by
rescanning at a better resolution, but in practice you can't: anything
that refers to excerpts of these scans by their pixel coordinates
would be broken.  So these also can't be versioned like code.
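
The rule argued for above (a published name never changes its
contents; a revision gets a new name alongside the old one) could be
sketched roughly as follows.  This is only an illustration in Python:
the function names and the "<stem>-<letter><ext>" pattern are
assumptions modelled loosely on the -a, -b suffixes mentioned for the
parliament pages, not the actual parldata naming code.

```python
import string
from pathlib import Path


def next_version_name(existing, stem, ext):
    """Return the first unused name of the form '<stem>-<letter><ext>'.

    Hypothetical naming scheme, for illustration only.
    """
    taken = set(existing)
    for letter in string.ascii_lowercase:
        candidate = f"{stem}-{letter}{ext}"
        if candidate not in taken:
            return candidate
    raise ValueError(f"no free version suffix left for {stem}")


def publish_revision(directory, stem, ext, data):
    """Write data under a fresh versioned name; never overwrite an old one."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    name = next_version_name((p.name for p in directory.iterdir()), stem, ext)
    # Mode "xb" refuses to open a file that already exists, so an earlier
    # version cannot be clobbered even if the name check above is stale.
    with open(directory / name, "xb") as f:
        f.write(data)
    return name
```

Calling publish_revision twice with the same stem leaves both files in
place, so anything that linked to the first name still resolves.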


==Use cases for Tahoe grid==

With Tahoe we get the same two features: backing up, and synchronising
the new stuff (not file diffs) to a different computer.


(case 1) The whole repository of a particular type is copied down
onto the server.  Tahoe is used to back up the server's data and to
move it between machines.  Everything is available locally, so
processing can run across all of the data.  The server can add new
files to the repository.
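
A minimal sketch of the "new stuff, not file diffs" synchronisation
that case 1 relies on, written in Python.  Two local directories stand
in for the grid and the server here; it does not call any Tahoe API,
and the function name sync_new_files is made up for the example.

```python
import shutil
from pathlib import Path


def sync_new_files(source, dest):
    """Copy files that exist under source but not yet under dest.

    Existing files are never compared or overwritten: a file's name is
    its identity, so only names that are new need to travel.
    Returns the relative paths that were copied.
    """
    source, dest = Path(source), Path(dest)
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue  # directories are created on demand below
        rel = src.relative_to(source)
        target = dest / rel
        if target.exists():
            continue  # already have this name; no diffing, no update
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)
        copied.append(rel.as_posix())
    return copied
```

Running it in the other direction (server directory as source) would
cover the server adding new files to the repository.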

This case applies to undemocracy and parlparse because all the data is
needed in order to present users with statistical figures (eg
attendance rates in votes).


(case 2) Files are served directly out of Tahoe through a server.  The
full repository is not copied down.  The server merely caches some
files.  This is more useful for delivering the PDF documents or images,
where statistical analysis is not always wanted.
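
The caching server in case 2 might look something like this Python
sketch.  The class name FileCache and the fetch callable are
hypothetical; fetch stands for whatever actually retrieves a file from
the grid, which is deliberately left out.  Because a name never changes
its contents, the cache never has to check whether an entry is stale.

```python
import hashlib
from pathlib import Path


class FileCache:
    """Fetch-on-miss cache for files whose contents never change."""

    def __init__(self, cache_dir, fetch):
        # fetch: callable taking a file name and returning its bytes
        # (assumed to talk to the grid; not implemented here).
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch

    def _path_for(self, name):
        # Hash the name so that any file name maps to a safe local path.
        digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
        return self.cache_dir / digest

    def get(self, name):
        path = self._path_for(name)
        if path.exists():
            return path.read_bytes()  # hit: serve the local copy
        data = self.fetch(name)       # miss: go to the grid once
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)             # rename so readers never see a partial file
        return data
```

Only the files that people actually ask for end up on the server's
disk, which is the point of not copying the full repository down.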

JT


