[okfn-discuss] Distributed Storage: Suggestions?

Lukasz Szybalski szybalski at gmail.com
Thu Apr 23 19:57:10 UTC 2009


On Thu, Apr 23, 2009 at 9:29 AM,  <julian at goatchurch.org.uk> wrote:
> Quoting Lukasz Szybalski <szybalski at gmail.com>:
>
>> I guess the question would be: Could you describe the type of data
>> you currently have. (percentage of space, downloads, changes)
>>
>
> This is the directory that has broken the system (watch
> out-- it may break your browser):
>
> http://ukparse.kforge.net/svn/undata/pdf/
>
> It's several thousand large PDFs of UN documents.  The
> same would apply to scanned images, archived pages from
> Hansard, etc.
>
>
> At the moment I'm storing it in SVN as a means of
> distribution, but it unnecessarily doubles the disk
> usage, and some of the SVN clients are very unhappy
> with the size of the directory.

So for this particular case, is size the problem?

1. I wonder if using a distributed version control system (bzr, git,
hg) would help. (Not sure if that will help with size.) The problem
with revision control is that binary files don't diff well, so every
commit of a changed file can store what is nearly a full new copy
instead of a small diff. (I think that is the case.)

2. If you want to bypass the repository, you could use a file system
like zfs (instead of ext3), which has a few extra features such as
snapshots. (Not sure if that would do it.) You could look at the
history of snapshots to see previous versions.
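
To make the snapshot idea a bit more concrete, here is a rough Python
sketch that only builds the zfs commands and does not run them. The
dataset name "tank/undata", its mountpoint and the file path are made
up for illustration. It assumes the data lives on its own zfs dataset.

```python
from datetime import date


def snapshot_command(dataset, label):
    """Command that takes a read-only snapshot named dataset@label."""
    return ["zfs", "snapshot", f"{dataset}@{label}"]


def list_snapshots_command(dataset):
    """Command that lists the snapshots of a dataset and its children."""
    return ["zfs", "list", "-t", "snapshot", "-r", dataset]


def snapshot_path(mountpoint, label, relative_file):
    """Path of a file as it looked in a snapshot (via the .zfs directory)."""
    return f"{mountpoint.rstrip('/')}/.zfs/snapshot/{label}/{relative_file}"


if __name__ == "__main__":
    # "tank/undata" and "pdf/example.pdf" are hypothetical names.
    label = date.today().isoformat()
    print(" ".join(snapshot_command("tank/undata", label)))
    print(" ".join(list_snapshots_command("tank/undata")))
    print(snapshot_path("/tank/undata", label, "pdf/example.pdf"))
```

A snapshot only gives you history on that one machine, so this would
not replace a way of distributing the files.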

As far as a redundant file system with multiple nodes goes, the only
one I know of right now is the Google File System, but there is no
source available for that.


So a set of http/ftp mirrors will have to do, with rsync from the main
server. 1 TB of space is not that expensive, unless you guys host this
in some kind of "server hosting" environment.
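
A minimal sketch of what pushing to the mirrors with rsync could look
like, written in Python so it can sit next to the scraping code. The
mirror host names and the /srv/undata/pdf paths are invented for the
example, and the script only prints the commands rather than running
them.

```python
import shlex


def mirror_commands(source_dir, mirrors, delete=True):
    """Build one rsync command (as an argument list) per mirror."""
    # A trailing slash makes rsync copy the directory's contents
    # rather than the directory itself.
    source = source_dir.rstrip("/") + "/"
    commands = []
    for dest in mirrors:
        cmd = ["rsync", "-av"]
        if delete:
            # Remove files on the mirror that are gone from the main server.
            cmd.append("--delete")
        cmd += [source, dest]
        commands.append(cmd)
    return commands


# Hypothetical mirrors; replace with real hosts and paths.
MIRRORS = [
    "mirror1.example.org:/srv/undata/pdf/",
    "mirror2.example.org:/srv/undata/pdf/",
]

if __name__ == "__main__":
    for cmd in mirror_commands("/srv/undata/pdf", MIRRORS):
        # Print only; to actually run one, pass the list to
        # subprocess.run(cmd, check=True).
        print(" ".join(shlex.quote(part) for part in cmd))
```

Since rsync only transfers files that are new or changed, a directory
of PDFs that never change should be cheap to keep in sync after the
first copy.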

Lucas








>
>
> SVN is entirely inappropriate for these large binary
> files (there are no versions), but it's convenient only
> because the code that handles these binary files is in
> SVN (where they belong), and the fewer means of
> distribution the better.  But it's not scaling any more.

When you say it's not scaling, what exactly do you mean?



>
>
> We need a better answer for parking the data for these
> projects, where we'd keep the scraping/parsing code in
> SVN on kforge (SVN is designed for code), and handle
> these large sets of large non-versioned files some other
> way.
>
>
>
> Julian.
>
>
> _______________________________________________
> okfn-discuss mailing list
> okfn-discuss at lists.okfn.org
> http://lists.okfn.org/cgi-bin/mailman/listinfo/okfn-discuss
>



-- 
How to create python package?
http://lucasmanual.com/mywiki/PythonPaste
DataHub - create a package that gets, parses, loads, visualizes data
http://lucasmanual.com/mywiki/DataHub



