[ckan-dev] FAQ/RFC on sql data store.

David Raznick kindly at gmail.com
Mon Jul 23 08:26:52 UTC 2012


Hello All

I decided to do this RFC in a FAQ-like format to make it easier to read.
 There is a copy of this here
https://docs.google.com/document/d/1LW8Xx_wegpewGgcKejYquH9E4K9ckxe8lJ8IP5L6PlQ/edit
 where you can also add comments.

Where does the data go?
Into a postgres database configured by the ckan.data_write_url config
option, which is a sqlalchemy url.

Why the write in data_write_url?
You will also be forced to specify a data_read_url if you want to let
anyone run sql queries.  This url should use a database user that has
only select privileges on the data.

Will people be able to run any sql statement they want? Is that safe or sane?
Well, people will be able to run any *select* statement they want, and
that restriction will be enforced by the database itself.  See the
question below about DOS attacks if you care about this.

What's the api like then?
We will just implement it as logic functions, like the rest of CKAN,
and it will be part of core.  After that we may add some nicer api
functions that use them, but that is a secondary concern.

Why not a service oriented architecture?
This is the best way for us to make sure it is tested, safe, gets
implemented quickly and is well supported.  It also means that it will
have some consistency with the rest of the software.  It will be
optional, of course.  We should not really touch core except for
permissions, so if we decide to make this a service at some point it
should not be that hard.

So what are the logic functions then?
data_create
data_delete
data_update
data_show
data_sql

What is the JSON input format for data_create?
To begin with it can have the following keys.  It is fairly consistent
with Max Ogden's gut, except that it adds resource_id.
{
    resource_id: resource_id # the data is going to be stored against.
    fields: a list of dictionaries of fields/columns and their extra metadata.
    records: a list of dictionaries of the data, e.g. [{"dob": "2005",
"some_stuff": ["a", "b"]}, ...]
}
Notes:
    * The first row will be used to guess any types not given in
fields, and the guessed types will be added to the fields permanently.
Subsequent rows have to conform to the field definitions.
    * records can be empty, so that you can just set the fields.
    * fields is optional, but needed if you want to do type hinting,
add extra information for certain columns, or explicitly define
ordering.
         e.g. [{"id": "dob", "label": "Date of Birth", "type":
"timestamp", "concept": "day"}, {"id": "some_stuff", ...}].   A
field's values cannot be changed after it has been defined, nor can
the ordering of fields be changed.  They can be extended, though.
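As a sketch, a data_create body might look like the following (the
resource_id and field names are invented for illustration, assuming
the key names described above):

```python
import json

# Hypothetical data_create payload -- "some-resource-id" and the field
# names are illustrative, not real identifiers.
payload = {
    "resource_id": "some-resource-id",
    "fields": [
        {"id": "dob", "label": "Date of Birth", "type": "timestamp"},
        {"id": "some_stuff", "type": "json"},
    ],
    "records": [
        {"dob": "2005", "some_stuff": ["a", "b"]},
    ],
}

# The request body is just the payload serialized as JSON.
body = json.dumps(payload)
```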

What json does data_update take?
{
    resource_id: resource_id # the data is going to be stored against.
    records: a list of dictionaries of the data, as in the create case.
    match_fields: a list of fields that the data will be deduped
against.  All records must have these fields, e.g. ["key1", "key2"].
Optional, as the _id field can be used instead.
    upsert: true or false.  If the keys are not there then do an insert instead.
}
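A sketch of an upsert-style data_update body, assuming the records key
works as in the create case (field names are invented):

```python
import json

# Hypothetical data_update payload: dedupe against key1/key2 and, with
# upsert true, insert any record that matches nothing.
payload = {
    "resource_id": "some-resource-id",
    "records": [
        {"key1": "a", "key2": "b", "value": 1},
    ],
    "match_fields": ["key1", "key2"],
    "upsert": True,
}

body = json.dumps(payload)
```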

What json does data_delete take?
{
    resource_id: resource_id # the data is going to be deleted from.
    filters: dictionary of matching conditions to delete,
       e.g. {"key1": "a", "key2": "b"}.  This is equivalent to
"delete from table where key1 = 'a' and key2 = 'b'".
       If the dictionary is empty ({}) the table will be truncated.
If filters is not supplied the table will be dropped.
}
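The filters dictionary is just a conjunction of equality tests.  A
rough sketch of a delete body and of how the server might translate
the filters into a where clause (naive illustration only -- no quoting
or escaping of real user input):

```python
import json

# Hypothetical data_delete payload: remove every row where
# key1 = 'a' and key2 = 'b'. resource_id is invented for illustration.
payload = {
    "resource_id": "some-resource-id",
    "filters": {"key1": "a", "key2": "b"},
}

def filters_to_where(filters):
    # Each filter key becomes an equality test, joined with "and".
    return " and ".join(
        "%s = '%s'" % (key, value) for key, value in sorted(filters.items())
    )

body = json.dumps(payload)
```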

What json does data_show take?
{
    resource_id: resource_id # the data is going to be selected from.
    filters: dictionary of matching conditions to select,
       e.g. {"key1": "a", "key2": "b"}.  This is equivalent to
"select * from table where key1 = 'a' and key2 = 'b'".
    q: full text query
    size: limit on the number of rows returned, default 100
    offset: offset into the matching rows
    fields: list of fields to return, in that order; defaults to all
fields in field order.
    sort: comma separated field names with ordering, e.g. "fieldname1,
fieldname2 desc"
}

Note: There will be an option to add _f_fieldname=value to this
dictionary, and fieldname: value will be added to the filters
dictionary.  This is to aid building GET urls, i.e.
/api/action/data_show?resource_id=fdsfs&_f_name=bob.
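The proposed _f_ shortcut can be sketched as a small fold from request
parameters into the filters dictionary (the function name and
parameter dict are invented for illustration):

```python
# Fold any _f_<fieldname>=value request parameter into the filters
# dictionary, leaving other parameters (resource_id etc.) untouched.
def fold_filter_params(params):
    filters = dict(params.get("filters", {}))
    for key, value in params.items():
        if key.startswith("_f_"):
            filters[key[len("_f_"):]] = value
    return filters

# e.g. /api/action/data_show?resource_id=fdsfs&_f_name=bob
params = {"resource_id": "fdsfs", "_f_name": "bob"}
print(fold_filter_params(params))  # {'name': 'bob'}
```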


What does it return?
{
    fields: same as what data_create takes.
    total: # total number of matching records, ignoring size and offset
    records: [
        # list of matching results, in the same format data_create takes
    ]
}
On error it will return:

{
    __error__: ...
}

What json does data_sql take?
**This requires data_read_url in your config and some special postgres
config settings described below.**
{
    sql: a single sql select statement.
}
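A minimal data_sql body, then, is just one select statement (the table
and column names here are invented):

```python
import json

# Hypothetical data_sql payload: a single select statement, to be run
# under the read-only data_read_url user.
payload = {
    "sql": "select key1, count(*) from some_table group by key1",
}

body = json.dumps(payload)
```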

What types are allowed?
The aim is to support as many postgres/postgis types that have string
representations as possible, plus 2 special types:
json: when supplying data, the record values contain nested json, i.e.
the field contains json.loads-style data.
json_string: the values in the records are strings (not nested
objects), but the content of each string is json, i.e. the field
contains json.dumps output.
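The difference between the two special types, illustrated with
Python's json module:

```python
import json

nested = {"a": [1, 2]}

# For a "json" field the record holds the nested object itself
# (json.loads-style data); for a "json_string" field it holds the
# serialized string (json.dumps output).
json_field_value = nested
json_string_field_value = json.dumps(nested)

# Round-tripping the string form gives back the nested form.
assert json.loads(json_string_field_value) == json_field_value
```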

How about ids?
Each row in a table will be given an _id column, which holds an id
generated by us that you can use in queries.

Any other great features?
Each row will store a _full_text index of all the data in the row.
At some later point there will most likely be a way to index fields
and add constraints.

Will allowing sql open us up to DOS attacks?
There will be an option to stop long-running queries against the
database, ckan.data_statement_timeout (defaulting to 100).  We will
not accept any multiple-statement queries.  To make it possible to
detect this, the following options need to be changed in the postgres
config:
standard_conforming_strings=on
backslash_quote=off
If possible we will throw an exception if these are not set correctly
in the database.
With the above options we can easily parse the input to see if a
statement contains multiple queries.
These options have been considered at length, and this approach should
be safer than any python-managed timeout.
We will also have a ckan.data_query_length parameter (defaulting to
1000), which means the queries themselves cannot be too long.
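To illustrate why those two postgres settings matter: with
standard_conforming_strings=on and backslash_quote=off, a quote inside
a string literal can only be written as '', so a scanner that merely
tracks whether it is inside a string can spot statement-separating
semicolons.  A rough sketch, not the actual implementation (it ignores
dollar-quoted strings and comments, which a real check would also have
to handle):

```python
# Detect a second statement in what should be a single select.
def contains_multiple_statements(sql):
    in_string = False
    # A single trailing semicolon is harmless, so strip it first.
    for ch in sql.rstrip().rstrip(";"):
        if ch == "'":
            # With the settings above, '' inside a literal toggles
            # twice, so the net state is unchanged.
            in_string = not in_string
        elif ch == ";" and not in_string:
            return True
    return False
```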


Thanks

David



