[ckan-dev] FAQ/RFC on sql data store.
Toby Dacre
toby.okfn at gmail.com
Mon Jul 23 10:49:32 UTC 2012
On 23 July 2012 09:26, David Raznick <kindly at gmail.com> wrote:
> Hello All
>
> I decided to write this RFC in an FAQ-like format, to make it easier
> to read. There is a copy of it here,
>
> https://docs.google.com/document/d/1LW8Xx_wegpewGgcKejYquH9E4K9ckxe8lJ8IP5L6PlQ/edit
> where you can also add comments.
>
> Where does the data go?
> Into a postgres database configured by the ckan.data_write_url
> config option, which takes a sqlalchemy url.
>
> Why the write in the data_write_url?
> You will also be required to specify a data_read_url if you want to
> let anyone run sql queries. That url should use a database user that
> has only select privileges on the data.
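>
> For example, the two options might look something like this in the
> CKAN config file (the connection details are illustrative):
>
>     ckan.data_write_url = postgresql://writer:pass@localhost/datastore
>     ckan.data_read_url = postgresql://reader:pass@localhost/datastore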
>
Could these be ckan.datastore_write_url etc.? "data" on its own is
very ambiguous.
As an aside, we should have a process for introducing sensibly named
config options (and routes).
>
> People will be able to run any sql statement they want? Is that safe
> or sane?
> Well, people will be able to run any *select* statement they want,
> and that restriction will be enforced by the database. See the
> question about DOS attacks below if you care about this.
>
> What's the api like then?
> We will implement it as logic functions, like the rest of CKAN, and
> it will be part of core. After that we may add some nicer api
> functions that use these, but that is a secondary concern.
>
> Why not a service-oriented architecture?
> This is the best way for us to make sure it is tested, safe, gets
> implemented quickly and is well supported. It also means that it will
> have some consistency with the rest of the software. It will be
> optional, of course. We should not really need to touch core except
> for permissions, so if we decide to make this a service at some point
> it should not be that hard.
>
> So what are the logic functions then?
> data_create
> data_delete
> data_update
> data_show
> data_sql
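>
> As with other CKAN logic functions, these would be callable from
> Python via the usual get_action lookup as well as over the action
> API. A minimal sketch of a call, with illustrative values:
>
>     from ckan.logic import get_action
>
>     # context is the usual CKAN context dict
>     # create a datastore table for a resource and insert one row
>     get_action('data_create')(context, {
>         'resource_id': 'fd1e8f40-1234-5678-9abc-def012345678',
>         'fields': [{'id': 'dob', 'type': 'timestamp'}],
>         'records': [{'dob': '2005-01-01'}],
>     })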
>
Again, these could be datastore_create etc.
>
> What is the JSON input format for data_create?
> To begin with it can have the following keys. It is fairly consistent
> with Max Ogden's gut, except that it adds resource_id.
> {
>   resource_id: the resource id the data is going to be stored against.
>   fields: a list of dictionaries of fields/columns and their extra
>     metadata.
>   records: a list of dictionaries of the data, e.g. [{"dob": "2005",
>     "some_stuff": ["a", "b"]}, ...]
> }
> Notes:
> * The first row will be used to guess any types not given in the
>   fields, and the guessed types will be added to the field
>   definitions permanently. Subsequent rows have to conform to the
>   field definitions.
> * records can be empty, so that you can just set the fields.
> * fields are optional, but needed if you want to do type hinting,
>   add extra information for certain columns, or explicitly define
>   their ordering, e.g. [{"id": "dob", "label": "Date of Birth",
>   "type": "timestamp", "concept": "day"}, {"id": "some_stuff", ...}].
>   A field's values cannot be changed once it has been defined, nor
>   can the ordering of fields be changed; they can be extended,
>   though. A full payload is sketched below.
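>
> Putting that together, a complete data_create payload might look
> something like this (the resource id and field values are
> illustrative):
>
>     {
>         "resource_id": "fd1e8f40-1234-5678-9abc-def012345678",
>         "fields": [
>             {"id": "dob", "label": "Date of Birth",
>              "type": "timestamp", "concept": "day"},
>             {"id": "some_stuff", "type": "json"}
>         ],
>         "records": [
>             {"dob": "2005-01-01", "some_stuff": ["a", "b"]},
>             {"dob": "2006-02-02", "some_stuff": ["c"]}
>         ]
>     }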
>
> What JSON does data_update take?
> {
>   resource_id: the resource id the data is stored against.
>   records: a list of dictionaries of the data, as in the create case.
>   match_fields: a list of fields that the data will be deduplicated
>     against, e.g. ["key1", "key2"]. All records must have these
>     fields. Optional, as the _id field can be used instead.
>   upsert: true or false. If no existing row matches the match fields,
>     do an insert instead.
> }
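>
> For example, an upsert keyed on dob might look like this (values are
> illustrative):
>
>     {
>         "resource_id": "fd1e8f40-1234-5678-9abc-def012345678",
>         "records": [{"dob": "2005-01-01", "some_stuff": ["x"]}],
>         "match_fields": ["dob"],
>         "upsert": true
>     }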
>
>
David, this seems to have changed radically from our chat last week,
where it sounded like the data was write once, never update. Is that
because I didn't pay any attention to what you were saying, you were
talking about something entirely different, or you've decided on a
large change in functionality?
cheers
Toby
> What JSON does data_delete take?
> {
>   resource_id: the resource id whose data is going to be deleted.
>   filters: dictionary of matching conditions for the delete,
>     e.g. {"key1": "a", "key2": "b"}, which is equivalent to
>     "delete from table where key1 = 'a' and key2 = 'b'".
>     If the dictionary is empty ({}) the table will be truncated; if
>     filters is not supplied at all, the table will be dropped.
> }
>
> What JSON does data_show take?
> {
>   resource_id: the resource id the data is going to be selected from.
>   filters: dictionary of matching conditions for the select,
>     e.g. {"key1": "a", "key2": "b"}, which is equivalent to
>     "select * from table where key1 = 'a' and key2 = 'b'".
>   q: full text query.
>   size: limit the number of rows returned, default 100.
>   offset: offset into the matching rows.
>   fields: list of fields to return, in that order. Defaults to all
>     fields, in field order.
>   sort: comma separated field names with ordering, e.g. "fieldname1,
>     fieldname2 desc".
> }
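>
> For example, a query might look like this (values are illustrative):
>
>     {
>         "resource_id": "fd1e8f40-1234-5678-9abc-def012345678",
>         "filters": {"key1": "a"},
>         "size": 10,
>         "offset": 0,
>         "fields": ["dob", "some_stuff"],
>         "sort": "dob desc"
>     }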
>
> Note: There will be an option to add _f_fieldname=value to this
> dictionary, and the fieldname:value pair will be added to the filters
> dictionary. This is to make GET requests by url easier, e.g.
> /api/action/data_show?resource_id=fdsfs&_f_name=bob.
>
>
> What does it return?
> {
>   fields: same as what data_create takes.
>   total: the total number of matching records, ignoring size and
>     offset.
>   records: a list of the matching results, in the same form that
>     data_create takes.
> }
> On error it will return:
>
> {
> __error__: ...
> }
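>
> So a successful data_show call might return something like this
> (illustrative):
>
>     {
>         "fields": [{"id": "dob", "type": "timestamp"}],
>         "total": 1,
>         "records": [{"dob": "2005-01-01"}]
>     }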
>
> What JSON does data_sql take?
> **This requires data_read_url in your config and some special
> postgres config settings, described below.**
> {
>   sql: a single sql select statement.
> }
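>
> For example (the table name here is illustrative; in practice it
> would be derived from the resource id):
>
>     {
>         "sql": "select dob, count(*) from \"fd1e8f40\" group by dob"
>     }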
>
> What types are allowed?
> The aim is to support as many postgres/postgis types as have string
> representations, plus 2 special types:
> json: when supplying data, the row values contain nested json, i.e.
>   the field contains json.loads output.
> json_string: the row values are strings (not nested objects) whose
>   contents are json, i.e. the field contains json.dumps output.
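>
> To illustrate the difference between the two (sketch):
>
>     {"some_stuff": ["a", "b"]}          # json: a nested object
>     {"some_stuff": "[\"a\", \"b\"]"}    # json_string: a string of json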
>
> How about ids?
> Each row in a table will be given an _id column, containing an id
> generated by us, which you can use in queries.
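>
> For example, a data_show call could filter on it directly
> (illustrative):
>
>     {"resource_id": "fd1e8f40-1234-5678-9abc-def012345678",
>      "filters": {"_id": 7}}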
>
> Any other great features?
> Each row will store a _full_text index of all the data in the row.
> At some later point there will most likely be a way to index fields
> and add constraints.
>
> Will we get DOSed by allowing sql?
> There will be an option to stop long-running queries against the
> database, ckan.data_statement_timeout (defaulting to 100). We will
> not accept any multiple-statement queries. To make it possible to
> detect these, the following options need to be changed in the
> postgres config:
> standard_conforming_strings = on
> backslash_quote = off
> If possible we will throw an exception if these are not set correctly
> in the database.
> With the above options we can easily parse the input to see if a
> statement contains multiple queries.
> These options have been considered at length and they should be safer
> than any python-managed timeout.
> We will also have a ckan.data_query_length parameter (defaulting to
> 1000), which means the queries themselves cannot be too long.
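>
> With those two settings in place, a minimal sketch of the
> multiple-statement check becomes feasible (this ignores E'...' and
> dollar-quoted strings, which the real check would also have to
> reject or handle):
>
>     import re
>
>     def is_single_statement(sql):
>         # with standard_conforming_strings=on and backslash_quote=off,
>         # '' is the only way to escape a quote inside a string literal
>         stripped = re.sub(r"'(?:[^']|'')*'", "", sql)
>         # any semicolon left over, other than a trailing one,
>         # means more than one statement
>         return ";" not in stripped.rstrip().rstrip(";")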
>
>
> Thanks
>
> David
>
> _______________________________________________
> ckan-dev mailing list
> ckan-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/ckan-dev
>