[okfn-discuss] Fwd: freetable.org first draft specification

Jonathan Gray jonathan.gray at okfn.org
Sun Jan 10 01:53:33 UTC 2010


Thought this might be of interest. Text file pasted below.

Jonathan

---------- Forwarded message ----------

Attached is the first draft specification for freetable.org.

I am sending it to you because you indicated via the freetable.org
website an interest in helping freetable as a developer.

Feedback on the spec is welcomed.

thanks,
gordon

[This is all very preliminary, alterations welcome.]

freetable.org specification
***************************

Version 0.1

We provide a hosted real time repository for shared data.

Organizational concerns
=======================

Organizational structure
------------------------

At least initially, leadership in the operation of the organization
will rest with me.  In general though I hope most decisions will be
made cooperatively, and I will try to use any final decision making
authority wisely and benevolently.

Ambitious
---------

This is an ambitious project.  This means there is a high risk of
failure.  What I have learned from past startup experience is that
what you start off doing is often not what you end up doing, so in
that sense it could still be a success.  The migration of data to the
net space seems ripe for innovation.

It is important to be ahead of the curve, so that even though
something might not always make sense today, it will in the future
when computers/networks are bigger/faster/cheaper.  YouTube is a good
example of this.  When they started, it was crazy to host and stream
video.  Now, as a result of getting an early start, they host and
stream video for the entire planet.

Data licensing
--------------

Copyright in data will reside with the submitter or original copyright
owner.  Licensing has two parts, the license we seek from data
submitters, and the license we grant to third parties using the data.

The license we seek should be as broad as possible, so that we can
relicense under various terms in the future without having to contact
every submitter.  We want to avoid the situation Wikipedia and Linux
find themselves in of being unable to change licensing terms in order
to adapt to the future.

The license we grant should be equally broad so people may make the
maximum use of our data possible.  The two areas where we might impose
restrictions are to prevent freeloading of some datasets, and
requiring attribution to publicize the role we are playing.

For datasets that don't originate with us, we may need to alter the
licensing terms, but all datasets should be free.

Code licensing
--------------

We will use the GNU Affero GPL for the code we develop.  The Affero
GPL is like the regular GPL but closes the software as a service
loophole, in which people running GPLed software as a service can
hoard their changes.

We need to consider whether there is a risk of an existing for-profit
data sharing site taking the code and using it, to the detriment of
our getting started and building brand-name recognition, and whether
we want to do anything about this.

Illegal content
---------------

A reporting system will be set up for reporting illegal content,
finding the party that uploaded it, and blocking that account from our
service.

Bootstrapping
-------------

There is nothing worse than building a project and then finding people
aren't interested in it.  To avoid this, market research will be
performed from an early stage to gauge interest, find collaborators,
and gather email addresses so we can announce the product when it
launches.  The project will be abandoned if it doesn't find sufficient
interest.

Since the value of many of the envisioned database tables is
proportional to the number of people using them, it could, as with
Wikipedia, take a period of years to reach critical mass.  This needs
to be planned for, and argues for getting something rough out as soon
as possible, and then working to improve it while waiting to reach
critical mass.  To accelerate adoption we will seed the database with
currently available free data.

We will initially attempt to support simple, in-demand data sets for
particular problem domains, e.g. helping programmers deal with
problems in the country/country code and zipcode/city domains.  From
this we will be in a position to judge project traction, and the
issues to be resolved as we move forward.

Revenue
-------

The data we host will be free as in speech, but not necessarily as in
beer.

Revenue is critical to us.  We don't have the luxury of being VC
funded.  This means we will have to charge from early on, but we must
do so in a way that doesn't alienate users.  Targeting high volume
users seems a good way to go.

Revenue may come from:
  - charges for transactions above a certain limit (below the limit
    will be free)
  - premium charges for using particular tables (particularly where
    new content has to be reviewed)
  - premium service and support (e.g. aws.amazon.com)
  - service level agreements which provide a guaranteed uptime
    (e.g. geonames.org)

Costs
-----

Costs would appear to go through three stages.

We are just starting out.  Hardware / bandwidth costs and operations
costs are minimal and can be supported by our initial funds.

We are meeting with moderate success.  Hardware / bandwidth costs are
ramping up, but it is operations costs that dominate.  The importance
people are placing on our service means we have to hire one or more
operations staff that can be on call should anything go down.  This is
probably the most financially challenging time for the organization:
too big to support from our initial funds, yet still struggling to
make ends meet from charges.

We are highly successful.  Hardware / bandwidth costs dominate, and in
relative terms operations costs are small.  Charges are able to cover
our expenses and we are operating at a social surplus.

Marketing
=========

Market research
---------------

We conducted a small $100 Google AdWords marketing campaign prior to
commencing development.  The purpose of this campaign was threefold:
  - gauge interest in the project, and abandon it if there is little interest
  - identify potential collaborators
  - start building a buzz around the project for when we launch

Marketing
---------

We have a new product, not something customers will be searching for.
Therefore our marketing efforts have to be focused on reaching out to
potential customers, not waiting for them to contact us.  It is also
important that we try to generate a buzz around our product.

Reaching out is difficult because there don't appear to be any generic
mailing lists dealing with web development.  Instead there are lists
on more specific topics such as PHP, where our publicity would be less
appropriate.  We might have to try and reach out to programmer
bloggers, but how to do this isn't clear, especially when we are at
such an early stage.

Since marketing is a challenge, this argues for creating the simplest
possible implementation of the product and trying it out, rather than
creating the be-all and end-all of implementations.

Branding
--------

At least until we are well known, we will request that people using
our data reference us, and possibly create a little "Data by
freetable.org" logo people can display.

What would be Internet scale?
-----------------------------

quantcast.com gives the following data for top websites (these are all
of the top 50 websites for which page views per day are reported):

      site        rank (US people/month)    page views/day
    weather.com         12                        30m
    answers.com         16                        10m
    wordpress.com       25                        25m
    hulu.com            35                        20m
    whitepages.com      37                         8m
    huffingtonpost.com  42                        10m
    evite.com           47                         7m
    tmz.com             49                         5m

These numbers are surprisingly small, i.e. what we are planning seems
surprisingly feasible.

Suppose we want to build a system that can support 10 of the top 50
websites (we don't, but this is the sort of scale we would like to
aspire to), and that each site does 20m page views per day.  Assume
that for each page view one lookup is performed on our server.  Then
we would need to be able to handle 2,000 lookups per second.

Another way of coming at the problem is to very loosely estimate the
Internet as comprising 500m users who each perform 20 page views per
day.  Suppose we want to build a system that can scale to 2% of the
total Internet.  Then, once again, we need to be able to support 2,000
lookups per second.
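
As a rough check, the arithmetic behind both estimates can be written
out in a few lines of Python (a sketch; the traffic figures are the
assumptions stated above, and both round to roughly 2,000 lookups per
second):

    SECONDS_PER_DAY = 24 * 60 * 60

    # Estimate 1: 10 of the top 50 sites, each doing 20m page views/day,
    # one lookup per page view.
    print(10 * 20e6 / SECONDS_PER_DAY)          # ~2,300 lookups/second

    # Estimate 2: 500m Internet users doing 20 page views/day, of which
    # we serve 2%.
    print(500e6 * 20 * 0.02 / SECONDS_PER_DAY)  # ~2,300 lookups/second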

Now, what does it take to support such a system...

Network connections.  Amazon's smallest machine instance type
(m1.small) is able to handle 6,000 TCP connections per second, and
this is with Python code with the CPU less than half busy.

Network bandwidth.  Assume an average of 10 x 1kB records of data per
lookup.  2,000 lookups per second is 20MB/s of data.  For comparison,
Amazon's small instance is reported to max out at around 50MB/s.

Network latency.  Ping times from Amazon range from 2ms (facebook.com,
hulu.com, huffingtonpost.com, evite.com), 3ms (google.com), 9ms
(answers.com), 40ms (wordpress.com), to 70ms (whitepages.com).
Looking up data in the database will add at least one round trip to
the time it takes to load the page.  For the well connected sites just
listed this shouldn't be an issue, but for the poorly connected sites
this could be an issue, although perhaps they should be improving
their connectivity.

RAM.  Amazon rents machines with from 2 to 64GB of RAM.  It is simple
to partition datasets based on table name.  Assuming a 1k record size
and no indexing overheads this means we could support tables of up to
64 million records.  For most datasets this is plenty.

Disk.  Imagine a table that doesn't fit in RAM and has to be stored on
disk.  Assume 2,000 lookups per second and a disk latency of 20ms.
Then we would need 40 spindles, plus fault tolerance.  This is doable,
but far more difficult than the table-in-RAM case.  It probably makes
sense to treat it as a separate problem to be architected.

RAM bandwidth.  RAM bandwidth could be estimated assuming a worst case
index outcome that requires us to step through and compare almost
every record in each of two indexes.  Amazon's small instance is
reported to have around 5GB/s of main memory bandwidth.  With 2,000
lookups per second, this translates into 2.5MB per lookup, or assuming
16 bytes per index entry, 80,000 records individually stepped through
and compared per query.  This is inadequate for the multi-million
record datasets we would like to be able to support.  It suggests that
at least initially we either reduce the size of the datasets we
support to the 100,000 record size range, try to avoid pathological
indexing, or we develop a multi-instance architecture from the
beginning.

MySQL performance.  If we use MySQL as part of an initial
implementation we have two performance issues to grapple with.
Queries per second and rows scanned per second.  An m1.small Amazon
instance can perform 2,200 queries per second with a persistent
database connection (InnoDB was slightly faster than MyISAM), and scan
around 350k rows per second.  The queries per second performance puts
us at around "Internet scale".  In the time taken by one query we
would only be able to scan around 160 rows, so avoiding row scanning
queries and/or billing for queries that scan many rows is absolutely
key.
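
The per-lookup budgets implied by the RAM bandwidth and MySQL figures
above work out as follows (a sketch; all of the numbers are the
estimates quoted above):

    LOOKUPS_PER_SECOND = 2000

    # RAM bandwidth: ~5GB/s on an m1.small, with each comparison touching
    # one 16 byte entry in each of two indexes.
    ram_per_lookup = 5e9 / LOOKUPS_PER_SECOND   # 2.5MB of memory traffic
    print(ram_per_lookup / (2 * 16))            # ~80,000 comparisons/query

    # MySQL: ~2,200 queries/second and ~350k rows scanned/second.
    print(350e3 / 2200)                         # ~160 rows in the time of one query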

What are Internet scale costs and prices?
-----------------------------------------

Assume 2% of Internet scale: 2,000 lookups per second.

Assume 10 x 1kB records returned per lookup.

Network bandwidth.  Amazon charges up to $0.17/GB for outbound
bandwidth.  We have about 50TB/month.  So, total bandwidth costs would
be around $9k/month.

CPU utilization.  Suppose due to RAM space and bandwidth requirements
we are running 10 small instances with 10 small instances as backups
and 10 small instances for development and testing.  Amazon charges
$0.085/instance/hour.  So total instance costs would be around
$2k/month.

Operations.  Uptime is critical.  When operating at Internet scale it
probably isn't sufficient to have someone on call, we probably need to
have someone monitoring and able to react in real time 24x7.  This
level of coverage requires a team of at least 5 people.  Cost for that
might be $1m/yr, or around $100k/month.  This cost swamps our other
costs.  This argues for charging for service contracts and SLAs, not
for data usage.

Per lookup cost.  Total costs are around $100k/month and we receive
around 5 billion lookups/month.  So cost per lookup is around
$0.00002, or $0.20 per 10,000 lookups.

Reasonableness of price.  Imagine an ad supported website.  This is
probably the worst financial case user.  Typical click through rates
range from 0.1% to 1.0%.  Assume the low end of this range, and assume
5 ads per page.  Now a click might bring in $0.50 of revenue.  So
revenue per page view is around $0.0025.  Using our service may cost
$0.00002, which is 1% of per page view revenue.  Despite our
conservative assumptions, this would seem affordable.

We estimate non-click through ads take in around $0.30 per 1,000 views
at the low end, or $0.0003, so with 5 ads on the page this is $0.0015,
which is in line with our click through estimates.

Price for a large website.  Imagine a top 1,000 website using our
service once per page view:

      site        rank (US people/month)    page views/day
    cafemom.com        496                         2.0m
    maxpreps.com       499                         1.2m
    primarygames.com   502                         3.0m
    4shared.com        503                        45m
    aboutus.org        507                         0.4m

Ignoring 4shared.com (which appears to be a file sharing site), a
typical top 1,000 website is going to do around 2m page views per day.
So a typical top 1,000 website is going to have to pay around $1,200
per month to use our service.  It probably makes sense to make our
service free below a certain usage limit, so the top 1,000 sites are
going to have to help carry this cost, and a more reasonable price
might be around $2,000/month.  This might seem a lot, but it isn't for
a site that might very roughly be taking in $100k - $1m per month.
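
Working the pricing chain through end to end (a sketch using the
assumptions above; all figures are rough):

    SECONDS_PER_DAY = 24 * 60 * 60

    # Cost side: ~$100k/month, dominated by operations, spread over lookups.
    monthly_lookups = 2000 * SECONDS_PER_DAY * 30   # ~5.2 billion
    cost_per_lookup = 100e3 / monthly_lookups       # ~$0.00002

    # Customer side: ad supported site, 5 ads/page, 0.1% click through,
    # $0.50 per click.
    revenue_per_view = 5 * 0.001 * 0.50             # ~$0.0025/page view
    print(cost_per_lookup / revenue_per_view)       # ~1% of page view revenue

    # A typical top 1,000 site: 2m page views/day, one lookup per view.
    print(cost_per_lookup * 2e6 * 30)               # ~$1,200/month at cost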

In summary, our costs seem reasonable, and we could reduce them by a
factor of 10 if we could find a way to eliminate the operations costs.
These costs seem manageable, even without charging premium rates for
access to any tables.

Sample data
-----------

We will initially focus on the web software domain and attempt to
provide databases dealing with:
  - countries/country codes
  - zip codes/city names
  - and if possible IP address/geographic location

We may establish a small fund to buy data that users want and add it
to the database.  Care would have to be taken to respect copyright and
licensing terms.

Architectural parameters
------------------------

2% of Internet scale: 2,000 lookups per second.

Reads are far more common than writes to the database, say 20 to 1.

Initial focus is on small, less than 1GB, high use tables, capable of
being stored in RAM.  (If you have to go to disk, especially for
random queries you are dead performance wise.  And RAM prices will
continue to drop, so what constitutes a small table will increase in
the future.)

Architectural decisions
=======================

Implementations
---------------

This is a large and challenging application.  Fortunately it appears
implementable in stages.  We will make one or more prototype releases
in order to learn about the system we are developing and obtain
feedback.  Much of the code from prototype releases will likely be
thrown away.

Prototype releases should not be used for building production
applications.

High Availability
-----------------

Fault tolerance, and maintaining system uptime, is key to everything
we do.  Others can only build on top of us if we are totally reliable.

Platform
--------

Initially we will use Amazon EC2.  Compared to purchasing hardware,
one of the big advantages of EC2 for a startup is that it costs
relatively little money upfront, and unneeded capacity can be readily
shed.  Avoiding the front-loading of expenses fits well with the cost
of capital / expected rate of return for a startup.

Google AppEngine is another possibility.  Unfortunately the AppEngine
datastore is unable to handle dynamically named tables, and requires
all table names to be known at app load time.  This is a deal breaker
for us.

Seeking hardware donations is the third possibility.  Since we are not
legally a non-profit, it isn't clear this would be successful, and we
would still have to pay for a cage, power, and network bandwidth.  It
would also force us into dealing with a mixture of incompatible
systems.

We will run Ubuntu Linux.

Languages
---------

Absent any other reason Python will be used (it has the momentum of
Google behind it, but its global interpreter lock prevents truly
parallel threads).  Other options are Java (harder to develop in,
supports threads), Ruby (harder to maintain, doesn't support threads),
and PHP (harder to maintain, doesn't support threads).  Threads are a
big deal because we will be storing large tables in RAM and would then
like to allow multiple processors to be able to access the tablespace.

Database choice
---------------

It probably makes sense to create two database implementations: memdb
for datasets that fit entirely in memory, and diskdb for datasets that
don't.  This would be much easier than designing one system that works
well for both.  memdb is needed initially, but the development of
diskdb could be postponed for a significant period of time.

memdb might have multiple read instances and a single write instance.
Depending on whether we need threads, Java might be the best
implementation language for memdb.

For diskdb providing fault tolerance would likely be the most
challenging aspect.

Initially for memdb we will use MySQL with the MyISAM (which only
caches indexes, not data, within the MySQL process) or InnoDB storage
engines, depending upon performance for an in-memory workload.  The
Memory storage engine appears ill suited because it stores varchars at
their maximum declared size.  If we use MySQL it will need to be
tuned, since the default configuration is for a minimal system.

Longer term, at least for large datasets, we may use Hadoop HBase.
However this has a steep learning curve, would require mapping our
database query language to HBase calls, and also comes with a large
amount of infrastructure we might need to maintain.

Amazon's SimpleDB is not a long term option because it is limited to
10GB per table.

Past the initial prototype stage it is likely we will use memcached on
specific tables to improve performance.

Initial architecture
--------------------

        load
      balancer                         slave
          |                              :
    ------+--> front end / memdbd ---> slave
    query |              :
          |    front end / memdbd      master (tables A1, A2, ...)
          |        |
          |    centraldb
          |        |                   slave
          |     website                  :
          |                            slave
          |
          |                            master (tables B1, B2, ...)

The tables in the system are grouped and assigned to specific clusters
of machines.  Each cluster serves a specific group of tables.

Queries come in to any one of the front end machines via a load
balancer.  The front end checks the username/password associated with
the query and ensures the user is not over quota by checking with the
centraldb machine and caching this information for say 10 seconds.

The front end pre-parses the query to figure out if it is a select or
an insert/update/delete and the name of the table being operated on.

The front end consults the centraldb machine to find out which cluster
of machines has the requested table.  The result is cached by the
front end machine for say 10 seconds.
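
The 10 second caching of centraldb answers in the front end could be
as simple as the following (a sketch; the fetch functions named in the
usage comment are hypothetical, not part of this spec):

    import time

    CACHE_TTL = 10           # seconds, per the text above
    _cache = {}              # key -> (value, time fetched)

    def cached(key, fetch):
        """Return a centraldb answer, refetching it if older than CACHE_TTL."""
        value, fetched_at = _cache.get(key, (None, 0))
        if time.time() - fetched_at > CACHE_TTL:
            value = fetch()                  # e.g. a query against centraldb
            _cache[key] = (value, time.time())
        return value

    # Usage in the front end (fetch_cluster_for and fetch_quota_ok are
    # hypothetical helpers that query centraldb):
    #   cluster = cached(('cluster', table_name), lambda: fetch_cluster_for(table_name))
    #   ok      = cached(('quota', username), lambda: fetch_quota_ok(username))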

centraldb won't have any custom developed software, and will simply be
running a MySQL instance.  When a master starts up it will send a list
of the tables it serves (from the f_table_info table), and an
identifier for the cluster (from the f_cluster_info table).  When any
master or slave machine starts up it will send its role (master or
slave), the cluster identifier, and the current time to the centraldb
machine.  This timestamp will be updated every 10 seconds or so by the
master or slave machine while the machine is running, and can be used
to detect if a master or slave machine goes down.
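
A minimal sketch of the startup registration and heartbeat just
described (it assumes the MySQL-python client and an f_node_status
table keyed on host name, which is an illustrative assumption rather
than part of the meta-data defined later):

    import socket
    import time
    import MySQLdb   # the MySQL-python client library

    HEARTBEAT_INTERVAL = 10   # seconds, per the text above

    def heartbeat(centraldb_host, cluster_id, role):
        """Report this node's role, cluster and current time to centraldb."""
        conn = MySQLdb.connect(host=centraldb_host, user="freetable",
                               db="centraldb")
        cur = conn.cursor()
        host = socket.gethostname()
        while True:
            # f_node_status is assumed unique on f_host, so REPLACE keeps
            # exactly one row per node with a fresh timestamp.
            cur.execute(
                "REPLACE INTO f_node_status (f_host, f_role, f_cluster_id, f_time) "
                "VALUES (%s, %s, %s, NOW())",
                (host, role, cluster_id))
            conn.commit()
            time.sleep(HEARTBEAT_INTERVAL)

centraldb (or a monitor reading it) can then treat any node whose
f_time is more than a couple of intervals old as down.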

Assuming we only have a single type of database, the front end will
pass the request to memdbd, the memdb driver, which is co-resident.
Depending upon performance (an Amazon m1.small instance can perform
800 database connect/disconnects per second), the memdbds may pool
connections to each of the master and slave machines, although this
becomes non-scalable at some point.  If a slave machine is being
contacted, it will be chosen at random.

memdb parses the query and, if it is valid, executes it.  The executed
query might be quite different from the original query, particularly
for insert/delete/update queries, where permissions need to be
checked.

The slaves replicate the master MySQL database.  For each cluster
there is a single master and potentially multiple slaves.  The
multiple slaves are present to handle the CPU / row scanning load.
The tables they are operating on are limited so as to fit entirely
within RAM.

The billing/quota database is part of centraldb and records system
usage information.  The billing database load could be extremely high.
To mitigate this the billing database will only be contacted
periodically (perhaps after the user has used 1 CPU second, or after
10 seconds of the user having used that particular front end); the
contact will update that user's billing/quota information and compute
whether that user should be blocked from service for overuse.

During initial deployment all tables may reside on a single cluster,
and master and slave may be the same machine.  During development it
must be possible for everything to reside on a single machine.

Thought must be given to operational concerns.  In particular how to
back up data (EBS snapshots to S3), how to monitor load, how to switch
tables between clusters (remove the table from the centraldb write
list, dump/load the table, drop the old table; this scheme involves
downtime for writes) and how to add/remove slaves (EBS snapshot,
instantiate volume, MySQL replication).

Data format
-----------

Tabular data is easy for programmers to manipulate.  Markup languages
and JSON allow far richer data values.  We therefore propose a
compromise.  Data values are stored internally as JSON hash objects
(in the "f_json" column of the table), but this JSON is stored inside
a table that also contains indexes for some of the JSON data values.
Data is communicated to the programmer in tabular form, with only the
richer data sub-elements returned as JSON.

One of the advantages to storing data values in JSON is it allows what
Google terms the Expando model, in which new fields can be dynamically
added to data values.

BigTable allows fields that can take on multiple values, and they are
useful for expressing things such as:

    <likes>chocolate</likes>
    <likes>ice-cream</likes>

Such fields can be expressed as JSON arrays, or as hash objects
mapping each value to true.

Initially, to ease implementation, we may create data columns for all
fields.
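
For example, a logical record might be stored with its full value in
f_json and only the indexed fields broken out as columns (a sketch;
the field names other than f_json are illustrative):

    import json

    # The logical value submitted by the user.
    record = {
        "name": "Alice",
        "age": 21,
        "likes": {"chocolate": True, "ice-cream": True},  # multi-valued field
    }

    # What is stored in the table: the full value as JSON plus indexed columns.
    row = {
        "f_json": json.dumps(record),
        "name": record["name"],    # indexed column
        "age": record["age"],      # indexed column
    }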

Website
=======

Website
-------

A website will be maintained at

    http://freetable.org/

In addition to information about the project, it will contain a
page for each table in the database describing in detail the contents
and format of that table.

The website shall possess both a schema editor, and a data value
editor.

Tables
======

Table creation, modification, and deletion
------------------------------------------

At least initially tables shall be created and modified by hand, based
on the agreed table definition.  Default values shall be supplied for
all fields.

Experimental namespace
----------------------

Every table will initially be in an experimental namespace, indicating
it should not be relied on and is subject to incompatible changes.
Such tables will be named "exp_<username>_<tablename>".  When a table is
mature it will be moved to the main namespace.  The "rename table"
command can be used for this, but there is a problem handling queries
for the old namespace.  Since we have to parse commands anyway to make
sure they are valid, mapping table names shouldn't result in too much
additional difficulty.

Allowing edits without losing data
----------------------------------

To prevent accidental or malicious loss of table data it is necessary
to deploy a change approval system.  There are two possible
approaches.  Store unapproved changes in a backing table, or store
unapproved changes in the production table and flag them as
unapproved.  Both approaches have merit.  We have chosen the second
approach, but could revert to the first approach if we run into
implementation difficulties.

Reserved names
--------------

Names beginning "f_" are reserved for freetable meta-data.

When creating user tables and columns care will have to be taken to
avoid using names that are reserved words in MySQL.  One way around
this would be to internally prefix every table and column name with a
fixed string.

Names beginning "u_" are reserved for user data.

Meta-data
=========

Per physical record meta-data
-----------------------------

Each table shall have a number of meta-data columns.

Only the f_id and f_license_id fields should be used by normal
applications.  The other fields are not considered stable and may
change from release to release.

f_id.  This field uniquely identifies a logical record in the table.
Multiple physical rows in a table may share the same identity so as to
represent and store different revisions of the same logical record.
This column is also needed to be able to discuss errors in the
dataset.

f_pid.  This field uniquely identifies a physical record in the table.

f_current.  This boolean is used to indicate the current
approved value of a logical record.  A given logical record will have
at most one current record.  A physical record may not be current
either because it has been superseded by more recent data, or the
table is curated and the record has yet to be approved.

f_deleted.  This boolean is used to indicate in a physical
record that a logical record should be deleted.  Only when f_current
and f_deleted are both true is the logical record considered to be
actually deleted.

f_new.  This boolean is used in curated tables to indicate that a new
physical record has been submitted for review and has not yet been
approved/rejected.

f_owner.  This field indicates the owner username of each physical
record.  This is the party that would be contacted to discuss the
record, or if the record appeared to be in error.  A database command
will be provided for the table owner to change the f_owner field.  (Do
we need to limit this so the table owner can only change the f_owner
of the current physical record to themselves?  Probably not.)

f_submitter.  This field indicates the submitter username of each
physical record.  Unlike the f_owner field it is immutable.

f_submitter_ip.  The IP address used by the submitter to submit the
physical record.  The value of this field may need to be kept
confidential.

f_license_id.  A brief textual identifier indicating the licensing
terms under which the physical record is provided by the submitter.

f_time.  This field is used to indicate the time the physical record
was added.  It can be used to facilitate the physical deletion of old
data from certain tables.

f_json.  The entire original logical value encoded as JSON.

<indexed_columns>.  Any fields in the original logical value that are
used to construct indexes.  Column names beginning "f_" are reserved.
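
Putting the fields above together, a single physical record might look
like the following (a sketch; the values are purely illustrative):

    physical_record = {
        # Stable fields, usable by normal applications.
        "f_id": 42,                 # logical record identity, shared by revisions
        "f_license_id": "pd",       # licensing terms granted by the submitter

        # Internal fields, subject to change between releases.
        "f_pid": 1071,              # physical record identity
        "f_current": True,          # the approved, current revision of f_id 42
        "f_deleted": False,
        "f_new": False,
        "f_owner": "alice",
        "f_submitter": "alice",
        "f_submitter_ip": "192.0.2.1",
        "f_time": "2010-01-10 01:53:33",
        "f_json": '{"name": "Alice", "age": 21}',
        # ...plus any indexed columns drawn from the logical value.
    }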

Per column meta-data
--------------------

The f_column_info table shall record per-column meta data.

The definition of this table is not considered stable and may change
from release to release.

f_table_name.  The name of the table.

f_name.  The name of the column.  Used by the table browser.

f_type.  The type of the column.  Used by the table browser.

f_default.  The default value for the column.  Used by the table
browser.

f_nullable.  Boolean indicating whether the column is nullable.  Used
by the table browser.

f_prev.  The name of the previous column in the table.  Null for the
first column.  Used by the table browser.

f_description.  A textual description of the column.  Used by the table
browser.

Per index meta-data
-------------------

The f_index_info table shall record per-index meta data.

The definition of this table is not considered stable and may change
from release to release.

f_table_name.  The name of the table.

f_name.  The name of the index.  Used by the table browser.

f_type.  The type of the index.  'normal' or 'fulltext'.  Used by the
table browser.

f_columns.  The columns indexed by the index, in order, comma
separated.  Used by the table browser.

We do not allow unique indexes since they are incompatible with our
storing multiple physical records per logical record.

Per table meta-data
-------------------

The f_table_info table shall record per table meta-data.

The definition of this table is not considered stable and may change
from release to release.

f_table_name.  The name of the table.

f_alias_for.  The name of any alias for this table.  This is used when
migrating a table from the experimental namespace to the stable
namespace.

f_id_curr.  This field stores the value used for the id field of the
most recently created logical record in the table.  It is incremented
for each new logical record.

f_ro.  This boolean field stores whether the table is read/write or
read only.  Read only is used while the table is being migrated.

f_owner.  The username of the owner of the table.

f_curated.  This boolean indicates whether the table is curated, in
which case changes made to it by anyone other than the table owner
need to be approved by the table owner, or whether changes take effect
immediately.  A database command will be provided for the table owner
of a curated table to approve (and make current) or reject a physical
record associated with a logical record.

f_perm.  Permissions associated with the table, as described in the
permissions section.

f_index_columns.  The columns indexed by all indexes on the table,
comma separated.

f_description.  A textual description of the table.  Used by the table
browser.

f_license_id.  A brief textual identifier indicating the licensing
terms under which data in the table is made available to others.

Per cluster information
-----------------------

The f_cluster_info table shall record per cluster information.  It
shall possess a single row.

f_cluster_id.  The cluster identifier.

License information
-------------------

The f_license table contains information regarding licenses; these may
be either licenses of data to us, or licenses of data from us.

f_license_id.  A brief textual identifier uniquely identifying the
license.

f_title.  A one line title for the license.  Used by the table
browser.

f_text.  The license text, free from any markup.  Used by the table
browser.

The f_license_permitted table contains information regarding which
licenses we permit for data received by us on a per table basis.

f_table_name.  The name of the table.

f_license_id.  A brief textual identifier uniquely identifying a
permitted license for new physical records in that table.

Permissions
-----------

A permission system is needed to prevent someone deleting all the
data.

A username is a variable length character string representing a user
of our service.  It may denote either an individual or a company.

The username "root" is reserved for future use.

Each table shall have a permission set indicating what the table owner
can do as far as logical records are concerned ('create',
'modify/delete'), a permission set indicating what the current logical
record owner can do as far as logical records are concerned
('modify/delete'), and a permission set indicating what an arbitrary
user can do as far as logical records are concerned ('create',
'modify/delete').  Only the table owner shall be able to modify
permissions.  Permissions shall be encoded as strings 'cmcmcm' to
'------' for all to none, in the order table owner, record owner, and
other.

Database commands shall be defined to allow the table owner to modify
permissions.
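
A sketch of checking an operation against the permission string
described above (the six characters are create and modify/delete for
table owner, record owner, and other, in that order):

    def allowed(perm, actor, op):
        """Check an operation against a 'cmcmcm'-style permission string.

        perm  -- six characters, 'c'/'m' or '-', in the order table owner,
                 record owner, other (create then modify/delete for each)
        actor -- 'table_owner', 'record_owner' or 'other'
        op    -- 'create' or 'modify' (modify also covers delete)
        """
        offset = {"table_owner": 0, "record_owner": 2, "other": 4}[actor]
        index = offset + (0 if op == "create" else 1)
        return perm[index] != "-"

    # allowed("cm--cm", "other", "create")  -> True
    # allowed("cmcm--", "other", "modify")  -> False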

In addition to username/password access to tables and rows, a simple
possibly hashed password may be used to allow delegation by open
source apps to users not in our system.

The permission system may not be implemented for the first prototype
release, but is essential for when people use the database for real
apps, since without it anyone can delete data.

The permission system described here is only rudimentary; a more
complete permissions system would allow groups of users, and the
delegation of permissions to other users.

Email addresses
---------------

Email addresses stored in the database must not be exposed lest they
be harvested by spammers.

A field named "x" of type "email" may be written to, and is stored in
field "f_email_x", which can't be accessed externally.

Reading "x" yields the opaque value:

    "user-" + encrypt(<user_performing_read>-<email_address>) + "@freetable.org"

which routes through freetable.org, where a remailer forwards the
email to the actual recipient.  The incorporation of the user
performing the read into the email address is intentional.  It allows
us to readily identify and meter which user is responsible for
publicizing the email address.  An IP address that sends email to more
than a few addresses in a 24 hour period will be blocked.
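
A sketch of constructing the opaque value (encrypt stands in for
whatever reversible cipher we share with the remailer; it must be
reversible so the remailer can recover both the reader and the
recipient):

    def opaque_address(reader, email, encrypt):
        """Return the address handed out when an email field is read."""
        # Encrypt "<user_performing_read>-<email_address>" as in the text above.
        token = encrypt("%s-%s" % (reader, email))
        return "user-" + token + "@freetable.org"

    # opaque_address("alice", "bob@example.com", my_cipher)
    #   -> "user-<ciphertext>@freetable.org"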

Email addresses will not be implemented for the first prototype
release.

Keyword search
--------------

It is common to want to search an SQL text field for keywords that
match a given keyword string.  For instance searching free form
classified ads for "car red".

We will provide an index type and query term that supports this.  This
will probably be implemented using the MyISAM FULLTEXT index type, the
Apache Lucene (Java) or Zend Lucene (PHP) libraries, or Sphinx (which
doesn't allow incremental additions to the index).

This is a substantial undertaking.  Keyword search will not be
implemented for the first prototype release.

Database interfaces
===================

Data export
-----------

We will create tools to export data in CSV, JSON, and XML formats, in
SQL, PHP, Python, ... programmatic representations, and via an
HTTP/AJAX interface to the data.  In addition to the entire data set
we shall construct a query language that allows retrieval of the data
that has changed or been added in the last n seconds.

This shall not be available in the first prototype release.

Command line interface
----------------------

A rudimentary database CLI shall be written, hopefully supporting
readline, and pretty printing of select output.

The CLI shall not be available for the first prototype release.

REST
----

A REST interface to the database shall be created that accepts queries
directed at db.freetable.org and returns JSON.
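
From Python, calling such an interface might look like this (a sketch;
the URL layout and parameter names are illustrative, only the
db.freetable.org host and JSON responses are specified above):

    import json
    import urllib
    import urllib2   # Python 2 standard library

    def rest_select(query, username=None, password=None):
        """Send a query to the REST endpoint and decode the JSON reply."""
        params = {"q": query}
        if username is not None:
            params["user"] = username
            params["password"] = password
        url = "http://db.freetable.org/?" + urllib.urlencode(params)
        return json.loads(urllib2.urlopen(url).read())

    # rows = rest_select("select code, name from country where code = 'NZ'")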

DBI
---

Database interface libraries shall be written initially for PHP and
Python, and then for other languages, that conform to their respective
language DBI specifications.  These interface libraries shall connect
to the database converter running on db.freetable.org.

The advantage of DBI over REST is persistent connections (really an
implementation detail); the disadvantage is difficulty of
implementation.

For the non-prototype release we will endeavor to have these DBI
interfaces incorporated into their respective language distributions,
but the road to doing this and having it show up on the user's system
is a long one.

Security
--------

For both the REST based interface and DBI a password has to be
transmitted.  We will leave it to the user to decide if they want to
send it in the clear or encrypted using TLS/SSL.  The disadvantage of
encryption is it adds at least one round trip to the connection
latency.

Query language
--------------

The query language will be based on a very limited form of SQL.  These
limitations are designed to ensure scalability and portability to
different database implementations, such as BigTable, and may be
relaxed in the future.

There is a difficult balancing act between expressive power of the
query language and computational efficiency.  We might want to expand
the query language allowed for small datasets.

Select: must select named fields only, no wildcard select.  From:
single table, no joins.  Where: a sequence of field_name op value
clauses ANDed together, where op is =, !=, <, <=, >, >=; the fields
must be fields on which an index has been defined.  Later we may allow
joins to unique columns, and similarly possibly views.

Insert: named field insert only.

Delete: similar constraints on where clause as for select.

Selects would not require the user be authenticated, while inserts and
deletes would.  This would allow use of the database by anonymous
apps, but if they wish to contribute data they must authenticate.

The above constraints are only approximate.
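
To illustrate, a few queries that would and would not be accepted
under these constraints (the table and field names are made up):

    # Accepted: named fields, a single table, indexed fields in the where clause.
    ok_1 = "select name, dial_code from country where code = 'NZ'"
    ok_2 = "select city from zipcode where zip >= '90000' and zip <= '90299'"

    # Rejected: wildcard select, joins, non-indexed where fields.
    bad_1 = "select * from country"
    bad_2 = "select c.name from country c, zipcode z where z.country = c.code"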

Abuse prevention
================

Signup
------

Signup is only required for write access to the database or for read
access above any free quota.

Increasingly sophisticated forms of signup to prevent abuse will be
required.

Resource usage
--------------

Quotas and billing are needed for number of requests, bytes
downloaded, and bytes stored.  It isn't clear how to bill for data
stored, since this is often done as a public good.

CPU usage is a difficult issue.  Unless we are careful the database
could become overloaded by poorly running queries.  MySQL provides no
way to limit CPU usage.  Nor does it provide any way to measure CPU
usage ("show processlist" can show current elapsed time, but this
disappears once the thread finishes, and elapsed time is an
inappropriate metric for billing).  Instead we will kill threads that
have exceeded a given elapsed time.  We can do this globally using the
"show processlist" / "kill" commands, or by monitoring from memdb.
The latter alternative has the advantage of allowing a future
extension to permit longer running queries for users that have
explicitly asked for them.  The explain command is another alternative
for detecting queries that are likely to be long running.  Billing
elapsed time / (1 + load avg) x ECUs or RAM bandwidth is one
possibility, but it would be difficult to aggregate across nodes if we
are using something like HBase.  Whatever we choose, we should expose
estimated CPU time through the _cpu query parameter.
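
A sketch of the global "show processlist" / "kill" approach (it
assumes the MySQL-python client; the 30 second limit is illustrative):

    import MySQLdb   # the MySQL-python client library

    MAX_QUERY_SECONDS = 30   # illustrative limit

    def kill_long_queries(conn):
        """Kill any query that has run longer than MAX_QUERY_SECONDS."""
        cur = conn.cursor()
        cur.execute("SHOW FULL PROCESSLIST")
        # Columns: Id, User, Host, db, Command, Time, State, Info
        for pid, user, host, db, command, elapsed, state, info in cur.fetchall():
            if command == "Query" and elapsed > MAX_QUERY_SECONDS:
                cur.execute("KILL %d" % pid)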

Disk I/O usage is another difficult issue.  Amazon EBS volumes are
striped, but there is very little data available on the resulting
performance they offer.  This needs further study.  We might want to
bill each insert/update/delete as the equivalent of 20ms of CPU time,
reasoning that eventually it has to hit the disk.

Quotas will not be implemented for the first prototype release.

Competitive landscape
=====================

Swivel
------

http://www.swivel.com/

Swivel is "YouTube for Data"; it allows users to upload and graph data.

What we are doing is different than Swivel:
   - the data we are dealing with is frequently non-numeric (containing
text fields)
   - the data sets we can handle can be too large to be retrieved in
their entirety (queries performed on server)
   - multiple users can contribute to the data set
   - Swivel displays the data graphically to the end user
   - we provide programmatic access to the data

Google Fusion Tables
--------------------

http://tables.googlelabs.com/public/tour/tour1.html

Google Fusion Tables is a shared editor for tabular data (similar to a
spreadsheet).

Google recently announced an API for accessing Fusion Tables:

http://code.google.com/apis/fusiontables/docs/developers_guide.html

The API provides a simplified SQL like interface to the data that is
communicated using REST.

Google Fusion Tables is quite close to what we are doing.
Optimistically you might say it validates our plans.

What we are doing is different than Google Fusion Tables:
   - Google is a for profit company
   - Google Fusion Tables lacks an infochimps.org like index to the
publicly available datasets
   - we may be operating at a larger scale than envisioned by Google
Fusion Tables (currently 100MB limit)
   - "Applications using the Google Fusion Tables API can send a
maximum of 5 requests per second to the Google Fusion Tables server"
     this would prevent its use in websites that are hoping to scale
   - Google Fusion Tables provides an editor to access the data
   - Google Fusion Tables doesn't allow non-authorized users to
suggest table modifications

Freebase
--------

http://freebase.com/

Freebase is a Creative Commons licensed, community-edited data collection.

Freebase is produced by Metaweb http://www.metaweb.com/ .  They own
datacommons.org, although they presently aren't doing anything with
it.

What we are doing is different than Freebase:
   - Freebase's creators operate on a for profit basis
   - Freebase today doesn't contain the sorts of programmatic data we
are considering
   - Freebase uses a non-conventional non-tabular graph data model
   - Freebase uses MQL, a JSON-based query language, rather than
something more SQL-like
   - Freebase doesn't support any operators other than equality, e.g. no age >= 21
   - Freebase doesn't support a null data value
   - As such Freebase is better for storing knowledge than storing
programmatic data

What we learn from Freebase:
   - A project similar to what we envision can be successful ($42m in
Series B funding)

Internet Archive
----------------

http://www.archive.org/

Internet Archive is an archive of a lot of Internet-related data.

What we are doing is different than Internet Archive:
   - we provide programmatically structured and accessible data
   - we allow programmers to insert/update/delete data

Wikipedia
---------

http://www.wikipedia.org/

Wikipedia is a community-edited online encyclopedia.

What we are doing is different than Wikipedia:
   - we provide programmatic access to programmatically structured data

What we learn from Wikipedia:
   - the power of user contributed data

Google base
-----------

http://www.google.com/base/ http://code.google.com/apis/base/

What we are doing is different than Google base:
   - Google operates on a for profit basis
   - Google base is primarily designed for submitting items to Google to index
   - Items in Google base appear to need to have an associated URL
   - Google base doesn't contain the sort of curated programmatic
content we are considering

Datapedia
---------

http://ideas.repec.org/p/pra/mprapa/9012.html

What we are doing is different than datapedia:
   - Datapedia is an idea rather than an implementation
   - Datapedia is more concerned with data revision history

Amazon public datasets
----------------------

http://aws.amazon.com/publicdatasets/

Amazon public datasets are large datasets that others can instantiate
on EC2 instances.  Datasets exist for genetics, census, and freebase
facts.

What we are doing is different:
  - Amazon only deals with large 1G-1T datasets
  - Amazon requires you run an EC2 instance
  - Amazon doesn't allow users to contribute data to datasets

infochimps.org
--------------

http://infochimps.org/

Infochimps maintains a large database of datasets.

Infochimps is quite close to what we are doing.  Optimistically you
might say it validates our plans.

What we are doing is different:
  - Infochimps' licensing terms are specified per dataset; not all
datasets are free
  - Infochimps makes money charging a commission for access to
non-free data sets
  - Not all datasets are hosted at infochimps
  - In infochimps each dataset has its own format
  - Infochimps doesn't provide programmatic access
  - Infochimps is as much a directory of databases we could use as a competitor

What we learn from infochimps:
  - The importance of making it easy for users to contribute datasets

theinfo.org
-----------

http://theinfo.org/

Theinfo.org is a website for people dealing with large datasets.

What we are doing is different:
  - theinfo.org only provides a mailing list for discussing large datasets

CKAN
----

http://www.ckan.net/

The Comprehensive Knowledge Archive Network is a database of open
datasets.

What we are doing is different:
   - CKAN only provides a directory of datasets

CKAN is a project of the Open Knowledge Foundation.  The okfn-discuss
mailing list might be a good place for publicizing our project.

DBpedia
-------

http://dbpedia.org/

DBpedia is a semantic web style database of facts extracted from
Wikipedia, that can be queried using a somewhat SQL like query
language.

What we are doing is different than DBpedia:
   - DBpedia only contains data from Wikipedia
   - DBpedia uses a non-conventional non-tabular RDF tuple data model
   - DBpedia's query language isn't very SQL like

Open issues
===========

More consideration of the following issues is warranted:

  - geonames.org provides a REST HTTP interface to postcode and other
    related data.  They are able to handle lat/long -> country code.
    This couldn't be implemented using tables and select.  Should we
    implement a functional interface to data rather than select?  This
    would be more flexible, but might seem alien to programmers
    (although less so if we used REST).  From a marketing perspective
    though, if we claim to be a database then we have to act like a
    database.  In addition to greater flexibility, REST has the
    advantage of allowing higher performance implementations, at the
    expense of potentially harder to parse XML (or perhaps not harder
    if we use JSON).



