[okfn-discuss] OKF grid, Tahoe-LAFS, Cassandra, MongoDB
Zooko O'Whielacronx
zooko at zooko.com
Wed Oct 6 06:40:50 UTC 2010
Dear Rufus Pollock and other OKF folks:
I'm one of the maintainers of Tahoe-LAFS. I'm joining this
conversation in order to learn more about use cases that Tahoe-LAFS
might serve now or in the future and in order to contribute some of my
expertise to OKF in your search for a storage solution. In addition to
contributing to Tahoe-LAFS, I also happen to have recent experience
using Cassandra at SimpleGeo.com, so of all people in the whole world
who have experience with both Tahoe-LAFS and Cassandra, I'm definitely
one of them. ;-)
I've reviewed the discussion that Rufus started on the tahoe-dev
mailing list nine months ago [1]. Back then I thought that what Rufus
was asking for sounded reasonable enough, and much of it seemed
definitely doable, but for some of it I wasn't really sure of the
details: what specifically was required, whether it was a reasonable
thing to want, and whether it was even possible to implement it all.
I'm still not entirely sure today, and I'm interested in seeing how
other tools such as MongoDB provide for OKF's needs. If they can, then
that example can show me how Tahoe-LAFS can be used likewise. If they
can't, then this gives me increased confidence that the original
desiderata for the OKF grid were too strong.
In this note I'll talk about first encryption and then space accounting.
Let's tackle the issue of encryption, because I think it is kind of a
red herring and I hope to get it out of the way and concentrate on the
really hard issues. Tahoe-LAFS's encryption can be understood as:
1. Create a unique symmetric encryption key for every file, and
encrypt that file with it.
2. Embed that encryption key into the file handle for that file.
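As a rough sketch of those two steps (not Tahoe-LAFS's actual code or
handle format), the idea can be shown in a few lines of Python. The
names store_file, fetch_file and keystream_xor, the "TOY:" handle
layout, and the hash-based toy cipher are all assumptions made up for
this illustration; the toy cipher only stands in for a real one and is
not secure.

```python
import hashlib
import secrets


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with a SHA-256 counter keystream.

    Illustration only. This is NOT secure and NOT what Tahoe-LAFS uses.
    Applying it twice with the same key returns the original data.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))


def store_file(storage: dict, plaintext: bytes) -> str:
    """Step 1: fresh key per file. Step 2: embed the key in the handle."""
    key = secrets.token_bytes(16)            # unique symmetric key for this file
    ciphertext = keystream_xor(key, plaintext)
    index = hashlib.sha256(ciphertext).hexdigest()
    storage[index] = ciphertext              # the store only ever sees ciphertext
    return f"TOY:{key.hex()}:{index}"        # hypothetical handle layout


def fetch_file(storage: dict, handle: str) -> bytes:
    """Anyone holding the handle can locate and decrypt the file."""
    _prefix, key_hex, index = handle.split(":")
    return keystream_xor(bytes.fromhex(key_hex), storage[index])
```

In this sketch, publishing the handle is all it takes to make a file
public: whoever has the handle has the key.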
Now I understand that for OKF's purposes all files are supposed to be
public. This is a perfectly good policy and it is a use case that
Tahoe-LAFS is intended to support. If you want a set of files to be
public, you simply make them accessible, such as is done on
Tahoe-LAFS's public demo directory, here:
http://pubgrid.tahoe-lafs.org/uri/URI%3ADIR2%3Actmtx2awdo4xt77x5xxaz6nyxm%3An5t546ddvd6xlv4v6se6sjympbdbvo7orwizuzl42urm73sxazqa/
I argue that the difference between Tahoe-LAFS and any other
distributed storage system is one of degree, not of kind. Tahoe-LAFS
makes it easy to make your files private (while hopefully also making
it similarly easy to make your files public). Other distributed
filesystems make it easy to make your files public, and don't make it
easy to make them private. However, no distributed filesystem can make
it *impossible* for you to encrypt your files before storing them. So,
while I admit that it could be a problem that Tahoe-LAFS makes it
*easy* to do so (for example, people might do so accidentally), any
other distributed system could face a similar problem if users were to
do so deliberately.
In other words, I think of it as more a potential usability issue than
a security issue. Usability issues are important and I don't mean to
belittle them, but in practice I'm not sure that it would be a big
problem. I would want to wait to get empirical evidence in the form of
usage reports from the field to learn what sorts of usability issues
crop up in practice.
Next, let's talk about the "space accounting" issue. This one I
definitely understand as being a reasonable thing to want and a thing
that could be feasibly implemented. Let's distinguish between two
goals:
Goal 1: I want to allow users to read (download) files without thereby
allowing them to write (upload) them.
Goal 2: I want to allow server operators to contribute space on their
storage server without thereby allowing them to consume space on other
storage servers.
Goal 1 is already possible using an HTTP proxy in front of the
Tahoe-LAFS gateway. This is already done in practice, as recently
discussed on the tahoe-dev list [2].
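A minimal sketch of the filtering rule such a proxy would apply,
assuming that downloads arrive as GET or HEAD requests and that every
upload or modification uses some other HTTP method. The function name
filter_request, the forward callback, and the choice of status 405 are
assumptions for illustration, not a description of the setup discussed
in [2].

```python
from typing import Callable, Tuple

# Methods treated as read-only in this sketch (an assumption).
READ_ONLY_METHODS = frozenset({"GET", "HEAD"})

Response = Tuple[int, bytes]


def filter_request(
    method: str,
    path: str,
    forward: Callable[[str, str], Response],
) -> Response:
    """Pass reads through to the gateway; refuse everything else."""
    method = method.upper()
    if method not in READ_ONLY_METHODS:
        # Writes never reach the gateway behind the proxy.
        return 405, b"read-only proxy: uploads are not permitted\n"
    return forward(method, path)


def fake_gateway(method: str, path: str) -> Response:
    """Stand-in for the real gateway, used only to exercise the filter."""
    return 200, b"contents of " + path.encode()
```

With this rule, filter_request("GET", "/uri/some-cap", fake_gateway)
is passed through to the gateway, while a PUT or POST to the same
proxy is refused before the gateway ever sees it.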
Goal 2 is much trickier. As has been mentioned on this thread,
Tahoe-LAFS developers plan to add strong distributed space accounting
in the future, but we haven't made much progress on that plan in the
last nine months.
What interests me for the OKF grid is: what are the alternatives? From
my experience using Cassandra I'm pretty sure that it is even less
capable than Tahoe-LAFS is at goal 2, and that for goal 1 it can be
served up behind an HTTP proxy just as well as Tahoe-LAFS can. I would
assume (without knowing much) that the same goes for MongoDB and
CouchDB and every other system on the planet. :-)
So in sum, Tahoe-LAFS already allows goal 1 and is actually used that
way in practice, and Tahoe-LAFS might in the future (especially if
someone else pitches in and helps) achieve goal 2, which no other
current system to my knowledge can offer either.
Oh, we should really think about another goal which wasn't explicitly
mentioned before but which is probably actually very important:
Goal 3: I want to allow server operators to contribute space on their
storage server without thereby allowing them to overwrite or delete
files on other storage servers.
Tahoe-LAFS already offers goal 3, and I'm pretty sure that it is the
only system that offers goal 3 and the only one that is likely to in
the near future. (I would love to be proven wrong.)
Okay, so now that I've sat down and written this letter, it sounds to
me like maybe Tahoe-LAFS is a reasonable tool for OKF to move forward
with after all. Or at least, it isn't that much more unreasonable than
any alternative that I know of. ;-)
I'm sorry that I didn't figure this out and write this letter nine
months ago when you first asked, but honestly, I was uncertain. In the
time that has passed since then I've learned a lot and gotten familiar
with Cassandra. It wasn't until I actually wrote this letter that I
thought things through in these terms.
Regards,
Zooko
[1] http://tahoe-lafs.org/pipermail/tahoe-dev/2009-June/001985.html
[2] http://tahoe-lafs.org/pipermail/tahoe-dev/2010-October/005336.html