[ckan-dev] Started work on the 'webstore' (a datastore with web API)

Rufus Pollock rufus.pollock at okfn.org
Mon Jul 4 18:42:35 UTC 2011


On 4 July 2011 19:39, Rufus Pollock <rufus.pollock at okfn.org> wrote:
[...]
> I started out, following Francis' instructions, by extracting the code
> from scraperwiki. I'm still using the datalib (which talks to sqlite)
> but have replaced the web server frontend with one built on cyclone
> (probably slightly breaking the existing client API in the process).
> Julian has given me some very useful documentation which I will forward here.

And here's Julian's excellent mail ...

---------- Forwarded message ----------
From: Julian Todd <julian at goatchurch.org.uk>
Date: 29 June 2011 22:33
Subject: scraperwiki's datastore
To: Rufus Pollock <rufus.pollock at okfn.org>
Cc: Nicola Hughes <nicola at scraperwiki.com>, Francis Irving
<francis at flourish.org>

Rufus,

There doesn't seem to be any time to catch you in enough detail to
go through the scraperwiki datastore and see how it could be
reused/shared.  So I'll just explain a little of the background
while I can, and you can forward this message to anyone who needs to
look at it.

The implementation is only about 400 lines, which took two days to
write.  There's no point getting hung up about it.  The decision was
to base it on sqlite and to design it completely within the grain of
that system.

The real breakthrough was realizing that the original scraperwiki
datastore save function could be implemented with exceptional ease
using the index tables and the alter table function.
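
To make that concrete, here is a rough sketch of the trick (not the
actual datalib code, and the identifiers are assumed to be trusted):
look at which columns the table already has, and widen it with ALTER
TABLE whenever a record carries new keys.

    import sqlite3

    def ensure_columns(conn, table, record):
        # Assumes the table already exists; collect the columns it has now.
        existing = set(row[1] for row in
                       conn.execute("PRAGMA table_info(%s)" % table))
        # Widen the table for any record keys it hasn't seen before.
        for key in record:
            if key not in existing:
                conn.execute("ALTER TABLE %s ADD COLUMN %s" % (table, key))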

The second important feature was that the columns of a table have an
affinity, not a type, so it was almost as flexible as a key-value
store (if you weren't storing completely random crud in it).
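
For instance (a small illustration, not taken from datalib), a column
declared INTEGER will still accept and store text, because the
declaration is only an affinity:

    import sqlite3
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, value INTEGER)")
    # The declared type is only an affinity, so non-numeric text goes in as-is.
    conn.execute("INSERT INTO t VALUES (1, 'not a number')")
    print(conn.execute("SELECT value, typeof(value) FROM t").fetchone())
    # -> ('not a number', 'text')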

The third feature was you could trivially attach databases in situ,
thus avoiding the need ever to load tables from one place into another
database just to work with them.
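
In sqlite that is a single statement; the attached file is queried in
place next to the main database (the table and column names here are
purely illustrative):

    import sqlite3
    conn = sqlite3.connect("mydata/defaultdb.sqlite")
    # Pull in another dataset's file in situ and join across it directly.
    conn.execute("ATTACH DATABASE 'otherdata/defaultdb.sqlite' AS other")
    rows = conn.execute("SELECT a.id, b.name FROM swdata a"
                        " JOIN other.swdata b ON a.id = b.id").fetchall()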

The fourth feature was that it was based on files, which are
manageable, emailable, and backupable in an entirely understandable
manner; this opens up all kinds of administration options that apply
to files and not to databases.  Just compare the triviality of backing
up and restoring a file with backing up a database.  There are so many
more safe options.

The only disadvantage of this solution is scalability.  (You're not
going to download the whole of open street map into it.)  However,
once you get wise to the fact that almost all distinct datasets are
relatively modest, it turns out to be an adequate solution for the
mid-range.  Large datasets tend to have their own needs that are
different in their own way, so there is never going to be one answer
that satisfies all of them.  This is why discussions about making such
a thing tend to go round in circles forever and never settle down.
But if your dataset is less than 10MB in size, then I can't see why
doing it this way isn't always going to be satisfactory.  And I mean
that.

The code is this file:
    https://bitbucket.org/ScraperWiki/scraperwiki/src/1a6bc666693c/uml/dataproxy/datalib.py
The interface is this file:
    http://scraperwiki.com/docs/python/python_help_documentation/

The change I intend to make to this in the near future is to make it
stateless, which means that the commit function has to go, and every
function call must come with the full attach list so that it could
potentially be handled by another process.
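
As a rough illustration of what stateless means here (the field names
are hypothetical, not the datalib API), every request would carry the
writable database and the full attach list itself:

    # Hypothetical request shape: nothing is remembered between calls,
    # so any worker process could service it.
    request = {
        "database": "mydata",                      # the one writable database
        "attach": ["otherdata/defaultdb.sqlite"],  # read-only attaches, sent every time
        "sql": "SELECT count(*) FROM swdata",
        "params": [],
    }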


The full expression of the design is as follows.

1) Databases are indexed by a name, and the file for each database is
"name/defaultdb.sqlite"

2) A process connecting to the system can nominate a single database
it has write access to, but all databases that it attaches to are
read-only.  There are no other restrictions on what sqlite commands
can be called.  This has been implemented by those authorizer
functions at the top.
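
A minimal sketch of that kind of authorizer with Python's sqlite3
module, using the "name/defaultdb.sqlite" layout from point 1 (the
real datalib code differs in detail):

    import sqlite3

    WRITABLE_DB = "main"   # the nominated database; everything attached is read-only

    def authorizer(action, arg1, arg2, dbname, trigger):
        # Allow reads everywhere, but deny writes outside the nominated database.
        if action in (sqlite3.SQLITE_INSERT, sqlite3.SQLITE_UPDATE,
                      sqlite3.SQLITE_DELETE, sqlite3.SQLITE_CREATE_TABLE,
                      sqlite3.SQLITE_DROP_TABLE):
            return sqlite3.SQLITE_OK if dbname == WRITABLE_DB else sqlite3.SQLITE_DENY
        return sqlite3.SQLITE_OK

    conn = sqlite3.connect("mydata/defaultdb.sqlite")
    conn.set_authorizer(authorizer)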

3) The save_sqlite function (which is everybody's preferred way to
avoid having to write database schemas at all) is implemented in the
last 150 lines.  It works by inspecting the table in the database and
altering it if necessary.
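
Its shape is roughly this (a sketch only, not the real datalib code;
ensure_columns is the helper sketched earlier in this mail, and the
table/column names are illustrative):

    def save_sqlite(conn, unique_keys, data, table_name="swdata"):
        # Create the table on first use (untyped columns are fine in sqlite),
        # widen it if this record has new keys, keep a unique index on the
        # caller's key columns, then upsert the row.
        conn.execute("CREATE TABLE IF NOT EXISTS %s (%s)"
                     % (table_name, ", ".join(data.keys())))
        ensure_columns(conn, table_name, data)
        conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS %s_key ON %s (%s)"
                     % (table_name, table_name, ", ".join(unique_keys)))
        conn.execute("INSERT OR REPLACE INTO %s (%s) VALUES (%s)"
                     % (table_name, ", ".join(data.keys()),
                        ", ".join("?" for _ in data)),
                     list(data.values()))
        conn.commit()

    save_sqlite(conn, ["id"], {"id": 1, "name": "example"})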

4) We haven't addressed the issue of file locking (cannot read from a
database that is being written to).  There are a number of ways to
solve this, such as holding the read request back until the write is
done, or creating a duplicate file to attach to.  Can't tell which is
best; might implement both.
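
For example, "holding the read request back" could be as simple as a
retry loop on sqlite's locking error (a sketch, not something datalib
does yet):

    import sqlite3, time

    def read_with_retry(path, sql, attempts=10):
        # If a writer holds the lock, wait briefly and try again.
        conn = sqlite3.connect(path, timeout=5)   # sqlite's own busy wait
        for _ in range(attempts):
            try:
                return conn.execute(sql).fetchall()
            except sqlite3.OperationalError as e:
                if "locked" not in str(e):
                    raise
                time.sleep(0.5)
        raise RuntimeError("database still locked after %d attempts" % attempts)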

5) We need to build in authentication and privacy, so that certain
datasets are restricted to certain users.  This is a hard one to get
right; permissions systems are often buggered up.  But basically it
will come down to a function that resides somewhere of the form:
  is_this_user_allowed_to_access_this_database(username, databasename)
response: "no", "for reading only", "for writing and reading"



