[okfn-discuss] Distributed Storage: Suggestions?
Luis Villa
luis.villa at gmail.com
Thu Apr 23 17:54:04 UTC 2009
On Thu, Apr 23, 2009 at 1:50 PM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> 2009/4/23 Luis Villa <luis.villa at gmail.com>:
>> On Thu, Apr 23, 2009 at 10:29 AM, <julian at goatchurch.org.uk> wrote:
>>> Quoting Lukasz Szybalski <szybalski at gmail.com>:
>>>> I guess the question would be: Could you describe the type of data you
>>>> currently have. (percentage of space, downloads, changes)
>>>
>>> This is the directory that has broken the system (watch
>>> out-- it may break your browser):
>>>
>>> http://ukparse.kforge.net/svn/undata/pdf/
>>>
>>> It's several thousand large PDFs of UN documents. The
>>> same would apply to scanned images, archived pages from
>>> Hansard, etc.
>
> [snip]
>
>> The traditional way to handle data sets like this is a combination of
>> http/ftp mirroring; you might ask the fedora people if their
>> mirrormanager code is available. That is very complicated, though- it
>> relies on either active user participation ('select the mirror closest
>> to you') or a variety of other tricks ('we'll try to guess the mirror
>> closest to you') to select mirrors, and requires a combination of
>> software and human screening to monitor whether or not the mirror is
>> actually active, uncorrupted, etc. That said it has worked well for 15
>> years for Linux distros.
>
> This seems to be more oriented to solving the download bandwidth
> problem. While this might become an issue at some point I think the
> first problem is a storage one.
>
>> I think the more forward-thinking way to go is to use bittorrent +
>> some sort of simple script to encourage mirrors to add new files as
>> they are created (i.e., cron + rss + command-line torrent client.) BT
>> is wildly inefficient for files of this size but is the only
>> widely-available/widely-understood p2p tool, handles automagically all
>> the hard parts of ftp/http mirroring (except regularly adding new
>> files) and is, I think, more ideologically appropriate for anyone
>> interested in creating a real knowledge commons than a centralized
>> tool like ftp/http.
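
A minimal sketch of the "rss" step in the cron + rss + command-line torrent client idea above, assuming the archive publishes an RSS 2.0 feed whose items carry .torrent enclosures. The feed contents, the example.org URLs and the function name are all hypothetical; handing the new URLs to an actual torrent client is left out because it depends on which client a mirror runs.

```python
import xml.etree.ElementTree as ET


def new_torrent_urls(feed_xml, seen):
    """Return .torrent enclosure URLs from an RSS 2.0 feed that are not in `seen`."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        for enclosure in item.findall("enclosure"):
            url = enclosure.get("url", "")
            # Skip non-torrent enclosures, known torrents, and duplicates in the feed.
            if url.endswith(".torrent") and url not in seen and url not in fresh:
                fresh.append(url)
    return fresh


# Hypothetical feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>hypothetical archive feed</title>
<item><title>doc-a</title>
<enclosure url="http://example.org/a.torrent" type="application/x-bittorrent"/></item>
<item><title>doc-b</title>
<enclosure url="http://example.org/b.torrent" type="application/x-bittorrent"/></item>
</channel></rss>"""

if __name__ == "__main__":
    already_seeding = {"http://example.org/a.torrent"}
    for url in new_torrent_urls(SAMPLE_FEED, already_seeding):
        print(url)
```

A cron job on each mirror could run something like this periodically, keep the set of already-seen URLs on disk, and pass whatever comes back to its torrent client.
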
>
> Like you I thought about BT a lot when this problem first came up. The
> problem with BT is that it is oriented to solving the b/w problem and
> it isn't a distributed file store. In particular:
>
> * No way to do chunking (BT automatically chunks a file when
> downloading/uploading, but there is no way to get a node to keep only
> part of a file)
>
> * No way to allocate chunks to nodes (each client decides what files
> it is going to hold)
>
> * (Relatedly) Replication of chunks is not built in
>
> * Poor node persistence (BT is oriented to systems where users enter
> and exit rapidly rather than one where nodes are persistent).
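
To make the first three points concrete, here is a rough sketch of what chunking, allocating chunks to nodes and replication could look like in a store that did provide them. This is not something BT does. It uses rendezvous (highest-random-weight) hashing as one possible allocation rule, and the node names, chunk size and function names are all assumptions made for illustration.

```python
import hashlib


def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def assign_chunk(chunk_id, nodes, replicas):
    """Pick `replicas` nodes for a chunk by rendezvous hashing.

    Each node gets a score derived from hashing (node, chunk_id); the
    highest-scoring nodes hold the chunk. No central table is needed.
    """
    def score(node):
        return hashlib.sha256(f"{node}:{chunk_id}".encode()).hexdigest()

    return sorted(nodes, key=score, reverse=True)[:replicas]


def placement(data, chunk_size, nodes, replicas):
    """Map each chunk's content hash to the nodes that should store it."""
    plan = {}
    for chunk in split_into_chunks(data, chunk_size):
        # Chunks are identified by content hash, so identical chunks share an entry.
        chunk_id = hashlib.sha256(chunk).hexdigest()
        plan[chunk_id] = assign_chunk(chunk_id, nodes, replicas)
    return plan


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical volunteer nodes
    for chunk_id, holders in placement(b"abcdefghij", 4, nodes, replicas=2).items():
        print(chunk_id[:8], holders)
```

With this rule, a chunk only has to move when one of the nodes holding it leaves. The fourth point, node persistence, is a property of the participants and is not something the sketch addresses.
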
>
> That said it might well be possible to build some of this
> infrastructure on top of BT but as it stands it would seem to be quite
> a task. Does anyone know of anyone who has built a distributed storage
> system on top of BT?
Ah, I see how I misunderstood the question. I'm afraid I don't have
any constructive suggestions, sorry.
>> (I long for the day when my home network regularly serves up several
>> gigs of purely legal torrented files every day, reducing the load on
>> community projects I care about. And I wouldn't mind being the first
>> one to have my cable company try to shut me off for it. That'd be all
>> kinds of fun. :)
>
> Great. That means if we get something together we've already got one
> volunteer node :)
When I'm back from bar exam/wedding/honeymoon (i.e., probably
October/November), absolutely. ;)
Luis