[open-bibliography] [ANN] OFS - a python bitstream storage API w/ S3/pairtree/archive.org backends

Ben O'Steen bosteen at gmail.com
Thu Sep 9 14:06:34 UTC 2010


http://openbiblio.net/2010/09/09/introducing-ofs-a-python-bucketobject-storage-library/

Many of the members of this list likely share the same problem of
storing bitstreams in an object- or URI-orientated manner, so forgive me
for posting about this general API for storage here.

Blog post text follows:

-----------------------------------------------------------------

Many internally distributed storage systems – such as Amazon’s S3
service or Riak’s key-value architecture – have similarities in the
manner in which data is labelled and subsequently retrieved. This is
often because the systems themselves use a distributed hash table or a
similar distribution algorithm to disperse and then later find the data
they store.

OFS is a Python library that seeks to capitalise on their similarities –
providing a single, general API to put and get files from one of these
services while hiding the specifics of the implementation from the user.
This allows for local testing and development before transitioning to
one of the cloud services, which typically cost real money and slow
down testing because every request has to travel over an internet
connection.
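
To make the idea of one API over swappable backends concrete, here is a
minimal sketch. The names Store, MemoryStore and archive_report are
hypothetical and chosen for illustration only; they are not taken from
the OFS codebase, and the in-memory class merely stands in for a local
backend.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Store(ABC):
    """Hypothetical common interface; not the actual OFS class names."""

    @abstractmethod
    def put(self, bucket: str, label: str, data: bytes) -> None:
        """Store data under bucket/label."""

    @abstractmethod
    def get(self, bucket: str, label: str) -> bytes:
        """Return the data stored under bucket/label."""

    @abstractmethod
    def list_labels(self, bucket: str) -> List[str]:
        """Return the labels known in a bucket."""


class MemoryStore(Store):
    """Stand-in for a local backend, handy for tests (assumption)."""

    def __init__(self) -> None:
        self._buckets: Dict[str, Dict[str, bytes]] = {}

    def put(self, bucket: str, label: str, data: bytes) -> None:
        self._buckets.setdefault(bucket, {})[label] = data

    def get(self, bucket: str, label: str) -> bytes:
        return self._buckets[bucket][label]

    def list_labels(self, bucket: str) -> List[str]:
        return sorted(self._buckets.get(bucket, {}))


def archive_report(store: Store, text: str) -> None:
    # Application code depends only on the interface, so a remote
    # backend could be passed in without changing this function.
    store.put("reports", "2010-09.txt", text.encode("utf-8"))
```

Code written against the interface can be exercised locally with the
in-memory class and later handed a remote implementation of the same
three methods.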

Characteristics of OFS:

      * Uses a ‘bucket/label’ mechanism to identify individual files
      * Provides a list of content in a given bucket (as best as the
        service can provide)
      * Provides per-file metadata insofar as the service can provide it
        (key-value or JSON-encodable data)
      * Current backend plugins:
              * Local storage – based on the pairtree specification that
                optimises file-distribution across a native file-system
                to handle large quantities of files. Uses JSON to encode
                arbitrary metadata about the files in a given bucket.
              * Remote storage (S3 and Archive plugins written
                by Friedrich Lindenberg (pudo) who has also made large
                contributions to the codebase):
                      * Amazon S3
                      * Archive.org
                      * Riak (in progress)
              * Also in progress – a REST Client by Friedrich Lindenberg
                (pudo)
              * One key desire is to provide opaque sharding – breaking
                up very large files to spread across buckets or even
                systems to improve performance and the range of services
                or backend systems OFS can make use of.
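
The pairtree layout mentioned for local storage spreads files across
directories by cutting an identifier into short segments. The sketch
below shows only that splitting step, assuming two-character segments;
the character-escaping rules of the pairtree specification are left
out, and the function name pairtree_path is hypothetical rather than
part of OFS.

```python
def pairtree_path(identifier: str, segment_length: int = 2) -> str:
    """Split an identifier into fixed-size segments joined by '/'.

    Simplified illustration: no character escaping is applied, so the
    identifier is assumed to contain only filesystem-safe characters.
    """
    segments = [
        identifier[i:i + segment_length]
        for i in range(0, len(identifier), segment_length)
    ]
    return "/".join(segments)


print(pairtree_path("abcde"))  # ab/cd/e
```

Because each directory level holds only a small slice of the
identifier, no single directory ends up with an unmanageable number of
entries.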

It is plain that being able to write storage code in a common way, yet
make use of local as well as remote ‘cloud’ storage, is of great
benefit. It encourages file storage to be codified in a distributable
manner so that scaling later on is easier.

This is a work in progress, but the local implementation is intended to
be both a reference implementation and a useful testing or even
production backend for storage. Other backends may have less
comprehensive metadata support for individual files, but these
‘limits’ will be reported as optional warnings or exceptions once we
have a handle on what they are.
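
One way such limits could be reported is sketched below: a backend
without metadata support either warns or raises, depending on a flag.
This is an assumption about how the behaviour might look, not a
description of OFS itself; LimitedBackend, MetadataNotSupported and
the strict parameter are all hypothetical names.

```python
import warnings


class MetadataNotSupported(Exception):
    """Hypothetical exception for a backend that cannot store metadata."""


class LimitedBackend:
    """Hypothetical backend that stores data but not per-file metadata."""

    def __init__(self, strict: bool = False) -> None:
        self.strict = strict
        self._data = {}

    def put(self, bucket, label, data, metadata=None):
        if metadata:
            message = "this backend does not store per-file metadata"
            if self.strict:
                # Strict mode: refuse the call outright.
                raise MetadataNotSupported(message)
            # Lenient mode: store the data, but tell the caller that
            # the metadata was dropped.
            warnings.warn(message, UserWarning, stacklevel=2)
        self._data[(bucket, label)] = data
```

A caller who cares about metadata would construct the backend with
strict=True; one who only needs the bitstream can leave the default
and filter the warning.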

Please comment or give feedback on this library. Also, we would welcome
any patches for other backend support to the library!

http://bitbucket.org/okfn/ofs





