[okfn-discuss] Distributed Storage: Suggestions?

Luis Villa luis.villa at gmail.com
Thu Apr 23 14:48:01 UTC 2009

On Thu, Apr 23, 2009 at 10:29 AM,  <julian at goatchurch.org.uk> wrote:
> Quoting Lukasz Szybalski <szybalski at gmail.com>:
>> I guess the question would be: Could you describe the
>> type of data you currently have. (percentage of space,
>> downloads, changes)
> This is the directory that has broken the system (watch
> out-- it may break your browser):
> http://ukparse.kforge.net/svn/undata/pdf/
> It's several thousand large PDFs of UN documents.  The
> same would apply to scanned images, archived pages from
> Hansard, etc.
> At the moment I'm storing it in SVN as a means of
> distribution, but it unnecessarily doubles the disk
> usage, and some of the SVN clients are very unhappy
> with the size of the directory.
> SVN is entirely inappropriate for these large binary
> files (there are no versions), but it's convenient only
> because the code that handles these binary files is in
> SVN (where it belongs), and the fewer means of
> distribution the better.  But it's not scaling any more.
> We need a better answer for parking the data for these
> projects, where we'd keep the scraping/parsing code in
> SVN on kforge (SVN is designed for code), and handle
> these large sets of large non-versioned files some other
> way.

The traditional way to handle data sets like this is a combination of
http/ftp mirroring; you might ask the Fedora people if their
MirrorManager code is available. That is very complicated, though: it
relies on either active user participation ('select the mirror closest
to you') or a variety of other tricks ('we'll try to guess the mirror
closest to you') to select mirrors, and it requires a combination of
software and human screening to monitor whether each mirror is
actually active, uncorrupted, etc. That said, it has worked well for
15 years for Linux distros.

I think the more forward-looking approach is to use bittorrent plus
some sort of simple script that encourages mirrors to add new files as
they are created (i.e., cron + RSS + a command-line torrent client).
BT is wildly inefficient for files of this size, but it is the only
widely available and widely understood p2p tool, it handles
automagically all the hard parts of ftp/http mirroring (except
regularly adding new files), and it is, I think, more ideologically
appropriate for anyone interested in creating a real knowledge commons
than a centralized tool like ftp/http.
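To make the cron + RSS + torrent-client idea concrete, here is a
minimal sketch. It assumes the data project publishes an RSS feed
whose items link to .torrent files, and that mirrors have some
command-line client installed; the feed URL, the `transmission-cli`
client, and the download directory are all illustrative assumptions,
not anything the project actually provides.

```python
# Sketch: poll an RSS feed of newly published data files and hand any
# .torrent links to a command-line client. Cron would run this hourly.
import subprocess
import xml.etree.ElementTree as ET


def torrent_links(rss_xml):
    """Return the .torrent URLs found in the <link> elements of an RSS feed."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link", default="")
            for item in root.iter("item")
            if item.findtext("link", default="").endswith(".torrent")]


def fetch_new(rss_xml, download_dir="/srv/mirror"):
    """Hand each advertised torrent to a client (here: transmission-cli,
    an assumption -- substitute whatever client the mirror runs)."""
    for url in torrent_links(rss_xml):
        subprocess.run(["transmission-cli", "-w", download_dir, url])
```

A mirror operator would wrap this in a few lines that fetch the feed
over http and call `fetch_new`, and the seeding client takes care of
the rest.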

(I long for the day when my home network regularly serves up several
gigs of purely legal torrented files every day, reducing the load on
community projects I care about. And I wouldn't mind being the first
one to have my cable company try to shut me off for it. That'd be all
kinds of fun. :)


More information about the okfn-discuss mailing list