[okfn-discuss] Distributed Storage: Suggestions?

Rufus Pollock rufus.pollock at okfn.org
Thu Apr 23 17:50:08 UTC 2009


2009/4/23 Luis Villa <luis.villa at gmail.com>:
> On Thu, Apr 23, 2009 at 10:29 AM,  <julian at goatchurch.org.uk> wrote:
>> Quoting Lukasz Szybalski <szybalski at gmail.com>:
>>> I guess the question would be: Could you describe the type of
>>> data you currently have? (percentage of space, downloads, changes)
>>
>> This is the directory that has broken the system (watch
>> out-- it may break your browser):
>>
>> http://ukparse.kforge.net/svn/undata/pdf/
>>
>> It's several thousand large PDFs of UN documents.  The
>> same would apply to scanned images, archived pages from
>> Hansard, etc.

[snip]

> The traditional way to handle data sets like this is a combination of
> http/ftp mirroring; you might ask the fedora people if their
> mirrormanager code is available. That is very complicated, though: it
> relies on either active user participation ('select the mirror closest
> to you') or a variety of other tricks ('we'll try to guess the mirror
> closest to you') to select mirrors, and requires a combination of
> software and human screening to monitor whether or not the mirror is
> actually active, uncorrupted, etc. That said it has worked well for 15
> years for Linux distros.

This seems more oriented towards solving the download bandwidth
problem. While that may become an issue at some point, I think the
first problem is one of storage.

> I think the more forward-thinking way to go is to use bittorrent +
> some sort of simple script to encourage mirrors to add new files as
> they are created (i.e., cron + rss + command-line torrent client.) BT
> is wildly inefficient for files of this size but is the only
> widely-available/widely-understood p2p tool, handles automagically all
> the hard parts of ftp/http mirroring (except regularly adding new
> files) and is, I think, more ideologically appropriate for anyone
> interested in creating a real knowledge commons than a centralized
> tool like ftp/http.
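
Just to make that concrete, the cron + rss + command-line client
combination you describe could presumably be little more than a script
along these lines (the feed URL, directory and client command are all
invented; any CLI torrent client would do in place of transmission-cli):

#!/usr/bin/env python
"""Poll an RSS feed of .torrent links and hand new ones to a CLI client."""
import os
import subprocess
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://example.org/okfn-data/torrents.rss"   # hypothetical feed
TORRENT_DIR = os.path.expanduser("~/mirror/torrents")    # local .torrent cache
CLIENT = ["transmission-cli"]                            # any CLI torrent client

def torrent_links(feed_xml):
    # Pull <link> URLs ending in .torrent out of a plain RSS 2.0 feed.
    for item in ET.fromstring(feed_xml).iter("item"):
        link = item.findtext("link", default="")
        if link.endswith(".torrent"):
            yield link

def main():
    os.makedirs(TORRENT_DIR, exist_ok=True)
    feed = urllib.request.urlopen(FEED_URL).read()
    for url in torrent_links(feed):
        path = os.path.join(TORRENT_DIR, url.rsplit("/", 1)[-1])
        if os.path.exists(path):           # already fetched, so already seeding
            continue
        with open(path, "wb") as f:
            f.write(urllib.request.urlopen(url).read())
        subprocess.Popen(CLIENT + [path])  # leave it running so the node seeds

if __name__ == "__main__":
    main()

Run from cron every hour or so, that would keep a volunteer node
seeding whatever the feed announces.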

Like you, I thought about BT a lot when this problem first came up. The
trouble with BT is that it is oriented towards solving the bandwidth
problem rather than acting as a distributed file store. In particular:

* No way to do chunking (BT chunks a file automatically when
uploading/downloading it, but there is no way to have a node keep only
part of a file)

* No way to allocate chunks to nodes (each client decides for itself
which files it will hold; see the toy sketch below)

* (Relatedly) replication of chunks is not built in

* Poor node persistence (BT is oriented to swarms where users enter
and exit rapidly, rather than to systems where nodes are persistent).

That said, it might well be possible to build some of this
infrastructure on top of BT, but as it stands that looks like quite a
task. Does anyone know of someone who has built a distributed storage
system on top of BT?
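
To make the allocation and replication points above a bit more
concrete, this is the sort of bookkeeping I have in mind, which BT
simply doesn't do for you: a toy placement table mapping chunks to
nodes (node names, chunk size and replication factor are all made up):

import hashlib

def assign_chunks(file_size, chunk_size, nodes, replication=3):
    """Map chunk index -> list of nodes holding a copy of that chunk.

    Placement is a simple hash-offset round robin; a real store would
    also track capacity, liveness and re-replication when a node dies.
    """
    n_chunks = (file_size + chunk_size - 1) // chunk_size
    placement = {}
    for i in range(n_chunks):
        # Stable starting node for chunk i, derived from its index.
        start = int(hashlib.sha1(str(i).encode()).hexdigest(), 16) % len(nodes)
        placement[i] = [nodes[(start + r) % len(nodes)]
                        for r in range(replication)]
    return placement

if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
    # e.g. a 2 GB PDF dump split into 64 MB chunks, three copies of each
    plan = assign_chunks(2 * 1024**3, 64 * 1024**2, nodes)
    for idx in sorted(plan)[:4]:
        print(idx, plan[idx])

BT gives you the wire protocol for moving those chunks around very
nicely, but nothing like this table, and nothing that maintains the
replication level when nodes vanish.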

> (I long for the day when my home network regularly serves up several
> gigs of purely legal torrented files every day, reducing the load on
> community projects I care about. And I wouldn't mind being the first
> one to have my cable company try to shut me off for it. That'd be all
> kinds of fun. :)

Great. That means if we get something together we've already got one
volunteer node :)

Rufus



