[ckan-dev] FAQ/RFC on sql data store.
Toby Dacre
toby.okfn at gmail.com
Mon Jul 23 10:49:32 UTC 2012
On 23 July 2012 09:26, David Raznick <kindly at gmail.com> wrote:
> Hello All
>
> I decided to write this RFC in an FAQ-like format, to make it easier
> to read. There is a copy of it here,
>
> https://docs.google.com/document/d/1LW8Xx_wegpewGgcKejYquH9E4K9ckxe8lJ8IP5L6PlQ/edit
> where you can also add comments.
>
> Where does the data go?
> Into a postgres database configured by the ckan.data_write_url
> config option, which takes a sqlalchemy url.
>
> Why the write in the data_write_url?
> You will also be required to specify a data_read_url if you want to
> let anyone run sql queries. That url should use a database user that
> has only select privileges on the data.
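>
> For example, the two options might look something like this in the
> CKAN config file (the connection details are illustrative):
>
>     ckan.data_write_url = postgresql://writer:pass@localhost/datastore
>     ckan.data_read_url = postgresql://reader:pass@localhost/datastore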
>
Could these be ckan.datastore_write_url etc.? "data" on its own is
very ambiguous.
As an aside, we should have a process for introducing sensibly named
config options (and routes).
>
> People will be able to run any sql statement they want? Is that safe
> or sane?
> Well, people will be able to run any *select* statement they want,
> and that restriction will be enforced by the database. See the
> question about DOS attacks below if you care about this.
>
> What's the api like then?
> We will implement it as logic functions, like the rest of CKAN, and
> it will be part of core. After that we may add some nicer api
> functions that use these, but that is a secondary concern.
>
> Why not a service-oriented architecture?
> This is the best way for us to make sure it is tested, safe, gets
> implemented quickly and is well supported. It also means that it will
> have some consistency with the rest of the software. It will be
> optional, of course. We should not really need to touch core except
> for permissions, so if we decide to make this a service at some point
> it should not be that hard.
>
> So what are the logic functions then?
> data_create
> data_delete
> data_update
> data_show
> data_sql
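>
> As with other CKAN logic functions, these would be callable from
> Python via the usual get_action lookup as well as over the action
> API. A minimal sketch of a call, with illustrative values:
>
>     from ckan.logic import get_action
>
>     # context is the usual CKAN context dict
>     # create a datastore table for a resource and insert one row
>     get_action('data_create')(context, {
>         'resource_id': 'fd1e8f40-1234-5678-9abc-def012345678',
>         'fields': [{'id': 'dob', 'type': 'timestamp'}],
>         'records': [{'dob': '2005-01-01'}],
>     })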
>
Again, these could be datastore_create etc.
>
> What is the JSON input format for data_create?
> To begin with it can have the following keys. It is fairly consistent
> with Max Ogden's gut, except that it adds resource_id.
> {
>   resource_id: the resource id the data is going to be stored against.
>   fields: a list of dictionaries of fields/columns and their extra
>     metadata.
>   records: a list of dictionaries of the data, e.g. [{"dob": "2005",
>     "some_stuff": ["a", "b"]}, ...]
> }
> Notes:
> * The first row will be used to guess any types not given in the
>   fields, and the guessed types will be added to the field
>   definitions permanently. Subsequent rows have to conform to the
>   field definitions.
> * records can be empty, so that you can just set the fields.
> * fields are optional, but needed if you want to do type hinting,
>   add extra information for certain columns, or explicitly define
>   their ordering, e.g. [{"id": "dob", "label": "Date of Birth",
>   "type": "timestamp", "concept": "day"}, {"id": "some_stuff", ...}].
>   A field's values cannot be changed once it has been defined, nor
>   can the ordering of fields be changed; they can be extended,
>   though. A full payload is sketched below.
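>
> Putting that together, a complete data_create payload might look
> something like this (the resource id and field values are
> illustrative):
>
>     {
>         "resource_id": "fd1e8f40-1234-5678-9abc-def012345678",
>         "fields": [
>             {"id": "dob", "label": "Date of Birth",
>              "type": "timestamp", "concept": "day"},
>             {"id": "some_stuff", "type": "json"}
>         ],
>         "records": [
>             {"dob": "2005-01-01", "some_stuff": ["a", "b"]},
>             {"dob": "2006-02-02", "some_stuff": ["c"]}
>         ]
>     }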
>
> What JSON does data_update take?
> {
>   resource_id: the resource id the data is stored against.
>   records: a list of dictionaries of the data, as in the create case.
>   match_fields: a list of fields that the data will be deduplicated
>     against, e.g. ["key1", "key2"]. All records must have these
>     fields. Optional, as the _id field can be used instead.
>   upsert: true or false. If no existing row matches the match fields,
>     do an insert instead.
> }
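>
> For example, an upsert keyed on dob might look like this (values are
> illustrative):
>
>     {
>         "resource_id": "fd1e8f40-1234-5678-9abc-def012345678",
>         "records": [{"dob": "2005-01-01", "some_stuff": ["x"]}],
>         "match_fields": ["dob"],
>         "upsert": true
>     }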
>
>
David, this seems to have changed radically from our chat last week,
where it sounded like the data was write once, never update. Is that
because I didn't pay any attention to what you were saying, you were
talking about something entirely different, or you've decided on a
large change in functionality?
cheers
Toby
> What JSON does data_delete take?
> {
>   resource_id: the resource id whose data is going to be deleted.
>   filters: dictionary of matching conditions for the delete,
>     e.g. {"key1": "a", "key2": "b"}, which is equivalent to
>     "delete from table where key1 = 'a' and key2 = 'b'".
>     If the dictionary is empty ({}) the table will be truncated; if
>     filters is not supplied at all, the table will be dropped.
> }
>
> What JSON does data_show take?
> {
>   resource_id: the resource id the data is going to be selected from.
>   filters: dictionary of matching conditions for the select,
>     e.g. {"key1": "a", "key2": "b"}, which is equivalent to
>     "select * from table where key1 = 'a' and key2 = 'b'".
>   q: full text query.
>   size: limit the number of rows returned, default 100.
>   offset: offset into the matching rows.
>   fields: list of fields to return, in that order. Defaults to all
>     fields, in field order.
>   sort: comma separated field names with ordering, e.g. "fieldname1,
>     fieldname2 desc".
> }
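>
> For example, a query might look like this (values are illustrative):
>
>     {
>         "resource_id": "fd1e8f40-1234-5678-9abc-def012345678",
>         "filters": {"key1": "a"},
>         "size": 10,
>         "offset": 0,
>         "fields": ["dob", "some_stuff"],
>         "sort": "dob desc"
>     }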
>
> Note: There will be an option to add _f_fieldname=value to this
> dictionary, and the fieldname:value pair will be added to the filters
> dictionary. This is to make GET requests by url easier, e.g.
> /api/action/data_show?resource_id=fdsfs&_f_name=bob.
>
>
> What does it return?
> {
>   fields: same as what data_create takes.
>   total: the total number of matching records, ignoring size and
>     offset.
>   records: a list of the matching results, in the same form that
>     data_create takes.
> }
> On error it will return:
>
> {
> __error__: ...
> }
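>
> So a successful data_show call might return something like this
> (illustrative):
>
>     {
>         "fields": [{"id": "dob", "type": "timestamp"}],
>         "total": 1,
>         "records": [{"dob": "2005-01-01"}]
>     }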
>
> What JSON does data_sql take?
> **This requires data_read_url in your config and some special
> postgres config settings, described below.**
> {
>   sql: a single sql select statement.
> }
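>
> For example (the table name here is illustrative; in practice it
> would be derived from the resource id):
>
>     {
>         "sql": "select dob, count(*) from \"fd1e8f40\" group by dob"
>     }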
>
> What types are allowed?
> The aim is to support as many postgres/postgis types as have string
> representations, plus 2 special types:
> json: when supplying data, the row values contain nested json, i.e.
>   the field contains json.loads output.
> json_string: the row values are strings (not nested objects) whose
>   contents are json, i.e. the field contains json.dumps output.
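>
> To illustrate the difference between the two (sketch):
>
>     {"some_stuff": ["a", "b"]}          # json: a nested object
>     {"some_stuff": "[\"a\", \"b\"]"}    # json_string: a string of json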
>
> How about ids?
> Each row in a table will be given an _id column, containing an id
> generated by us, which you can use in queries.
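>
> For example, a data_show call could filter on it directly
> (illustrative):
>
>     {"resource_id": "fd1e8f40-1234-5678-9abc-def012345678",
>      "filters": {"_id": 7}}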
>
> Any other great features?
> Each row will store a _full_text index of all the data in the row.
> At some later point there will most likely be a way to index fields
> and add constraints.
>
> Will we get DOSed by allowing sql?
> There will be an option to stop long-running queries against the
> database, ckan.data_statement_timeout (defaulting to 100). We will
> not accept any multiple-statement queries. To make it possible to
> detect these, the following options need to be changed in the
> postgres config:
> standard_conforming_strings = on
> backslash_quote = off
> If possible we will throw an exception if these are not set correctly
> in the database.
> With the above options we can easily parse the input to see if a
> statement contains multiple queries.
> These options have been considered at length and they should be safer
> than any python-managed timeout.
> We will also have a ckan.data_query_length parameter (defaulting to
> 1000), which means the queries themselves cannot be too long.
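>
> With those two settings in place, a minimal sketch of the
> multiple-statement check becomes feasible (this ignores E'...' and
> dollar-quoted strings, which the real check would also have to
> reject or handle):
>
>     import re
>
>     def is_single_statement(sql):
>         # with standard_conforming_strings=on and backslash_quote=off,
>         # '' is the only way to escape a quote inside a string literal
>         stripped = re.sub(r"'(?:[^']|'')*'", "", sql)
>         # any semicolon left over, other than a trailing one,
>         # means more than one statement
>         return ";" not in stripped.rstrip().rstrip(";")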
>
>
> Thanks
>
> David
>
> _______________________________________________
> ckan-dev mailing list
> ckan-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/ckan-dev
>