[open-science] BioTorrents

Rufus Pollock rufus.pollock at okfn.org
Fri Apr 16 20:41:38 UTC 2010


On 16 April 2010 20:03, Jonathan Gray <jonathan.gray at okfn.org> wrote:
> Piece about BioTorrents on Nature blog:
>
>  http://blogs.nature.com/news/thegreatbeyond/2010/04/improving_the_portability_of_d_1.html

There used to be something like this for geodata (geotorrents.org) but
it has now disappeared. We've thought about torrent stuff quite a lot
before for data distribution [1] [2]. The problem with bittorrent (at
least in the small-time experiments we did) is it provides no
mechanism to do the storage-allocation you need for a real data
"grid", specifically:

a) How do you deal with large files (GBs) which individual peers may
not want to be responsible for holding and sharing in their entirety?
The obvious answer is sharding, but bittorrent has no way of assigning
different parts of a given file to different peers (so you need some
mechanism above bittorrent to do this).

b) (the biggie) How do you do file/load allocation and rebalancing to
ensure you don't lose data as peers enter and leave the network? Even
with lots of participants, how do you decide what files to allocate to
whom, and how do you coordinate changes over time so you don't end up
with everyone hosting the same 1 file!
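
To make (a) and (b) concrete, here is a minimal sketch of one possible
layer above the transport. It is an illustration only, not something
bittorrent or Tahoe provides in this form. It assumes fixed-size
shards, a known list of peers, and rendezvous (highest-random-weight)
hashing to pick holders. All function and peer names are hypothetical.

```python
import hashlib


def split_into_shards(data: bytes, shard_size: int) -> list[bytes]:
    """Cut data into fixed-size shards (the last one may be shorter)."""
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]


def shard_id(shard: bytes) -> str:
    """Identify a shard by the SHA-256 hash of its content."""
    return hashlib.sha256(shard).hexdigest()


def _score(peer: str, sid: str) -> int:
    """Deterministic pseudo-random weight for a (peer, shard) pair."""
    digest = hashlib.sha256(f"{peer}:{sid}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def holders(sid: str, peers: list[str], replicas: int) -> list[str]:
    """Rendezvous hashing: the highest-scoring peers hold the shard."""
    ranked = sorted(peers, key=lambda p: _score(p, sid), reverse=True)
    return ranked[:replicas]


def allocate(shard_ids: list[str], peers: list[str], replicas: int) -> dict[str, list[str]]:
    """Map every shard id to the peers that should store a copy."""
    return {sid: holders(sid, peers, replicas) for sid in shard_ids}
```

Every peer can compute the same allocation independently from the peer
list, so no central coordinator hands out files. When a peer leaves,
only the shards it held get a new holder; the other assignments stay
put. What this sketch does not address is the hard part: agreeing on
the peer list and actually moving the data when it changes.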

What you really need here is a proper wide-area distributed storage
solution. We tried to build something along these lines last year,
running a Tahoe grid: <http://grid.okfn.org/> (more info in [3],
requirements in [2]).

I would still say this is a (long) way from success due to the various
social and technical issues involved:

* you need your grid software to be (very) easy to install
* you may impose significant storage and bandwidth demands on users
* (most significant) you really need *big* scale to provide
reliability and availability -- unlike, say, distributed processing
projects, where activity can happen any old time and people can
allocate their processing whenever they want. With a storage grid you
really need either a) massive scale or b) strong commitment to
participation from peers, so that groups of users going offline or
leaving the grid don't compromise availability.
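
The scale point can be put in rough numbers. The sketch below is a
back-of-envelope model, not a measurement: it assumes a file is stored
as n shares of which any k are enough to rebuild it (plain replication
is k = 1), and that each share-holding peer is online independently
with probability p. The function name and parameters are illustrative.

```python
from math import comb


def availability(n: int, k: int, p: float) -> float:
    """Probability that at least k of n independent peers are online."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

For example, availability(3, 1, 0.5) is the chance that at least one
of three replicas is reachable when peers are online half the time.
The independence assumption is optimistic: if groups of users go
offline together, real availability will be lower than this estimate.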

[1]: http://wiki.okfn.org/p/Data_Distribution
[2]: http://wiki.okfn.org/p/Distributed_Storage
[3]: http://wiki.okfn.org/p/Distributed_Storage/Plan

> Would be interesting to liaise with them re: registry of open data
> (i.e. CKAN and suchlike...).

It would be no problem to create CKAN packages linking to the relevant
torrents. Does anyone have contacts with the BioTorrents people so we
could have a chat?

Rufus



