[ckan-dev] Fixing the DataStore vs FileStore duality confusion

Joshua Tauberer tauberer+consulting at govtrack.us
Thu Mar 14 12:21:51 UTC 2013


On 03/14/2013 06:30 AM, Sean Hammond wrote:
> Just to update the dev list on this, there's a wiki page here outlining
> the problems and suggested solutions:
>

Hi, Sean.

Some of these ideas look good for clearing up confusion.

But they don't address any of the major architectural issues I've run 
into in actually trying to deploy the DataStore:

a) There's no error reporting. When something goes wrong, an error is 
logged in a place only a sysadmin can find. When an invalid value makes 
its way into Postgres --- an invalid column name or an invalid data 
value (depending on the column type) --- the datastorer and the 
datastore API don't propagate the problem to the caller. Some of this 
can be addressed by better validation up-front. But in any 
paster/celery/cron-based loading process, it is hard to propagate errors 
to the data owner.
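To make this concrete, here's a rough sketch of what I mean by propagating errors to the caller instead of only logging them. This is not CKAN's actual API --- `load_rows` and the `column_types` mapping are made-up names --- just an illustration of validating up-front and returning the errors so an API response could show them to the data owner:

```python
import logging

def load_rows(rows, column_types):
    """Validate rows up-front and return errors to the caller, rather
    than only logging them where just a sysadmin can see them.
    column_types maps column name -> Python type (hypothetical)."""
    errors = []
    for i, row in enumerate(rows):
        for col, value in row.items():
            if col not in column_types:
                errors.append({"row": i, "error": "unknown column %r" % col})
                continue
            try:
                column_types[col](value)
            except (TypeError, ValueError):
                errors.append({"row": i,
                               "error": "bad value %r for column %r" % (value, col)})
    if errors:
        # still log for the sysadmin, but don't stop there
        logging.warning("load failed with %d errors", len(errors))
    return errors  # the caller (e.g. an API response) can surface these

errs = load_rows([{"year": "2013"}, {"yr": "x"}], {"year": int})
print(errs)  # one error: unknown column 'yr'
```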

b) Automatic type guessing only works if you're lucky. The messytables 
type guessing code uses the first ~100 rows to guess types. Imagine a 
sorted table with all NULLs at the top. Or sorted integers that range 
from 0 to a really high number, so you can't see whether you need a 
32-bit or 64-bit data type until the bottom. And even if you look at the 
entire table, it may be impossible to make the right guess between 
integer and float --- or, is "00005" an integer or a string? --- without 
domain knowledge. Similarly, should empty strings, "NA", or "N/A" be 
treated as null values? The user will often need to be able to set a 
schema explicitly.
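A toy example of the failure mode (this is a naive stand-in for messytables' sampling, not its actual code):

```python
def guess_type(values, sample_size=100):
    """Naively guess a column type from the first sample_size values,
    mimicking the kind of sampling a type-guesser does."""
    sample = [v for v in values[:sample_size] if v not in ("", "NA", "N/A")]
    if not sample:
        return "text"  # the sample was all nulls: no information at all
    try:
        for v in sample:
            int(v)
        return "integer"
    except ValueError:
        pass
    try:
        for v in sample:
            float(v)
        return "float"
    except ValueError:
        return "text"

# A sorted column whose first 100 entries are blank: guessed "text",
# even though the real data below the sample is integer.
col = [""] * 100 + ["1", "2", "3"]
print(guess_type(col))  # "text"

# "00005" parses as an integer but is probably an identifier --
# no sampling strategy can resolve that without domain knowledge.
print(guess_type(["00005"]))  # "integer"
```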

c) In order to accommodate existing data files, it may also be necessary 
for the user to explicitly set file loading options: the data format, 
the delimiter, etc.
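What I have in mind is something like a user-supplied options dict that overrides auto-detection (the option names here are hypothetical):

```python
import csv
import io

# Hypothetical user-supplied loading options, instead of sniffing
options = {"delimiter": ";", "quotechar": '"', "encoding": "utf-8"}

raw = b"id;name\n1;Alice\n2;Bob\n"  # a semicolon-delimited file
text = raw.decode(options["encoding"])
reader = csv.reader(io.StringIO(text),
                    delimiter=options["delimiter"],
                    quotechar=options["quotechar"])
rows = list(reader)
print(rows[0])  # ['id', 'name'] -- parsed correctly only because the
                # user told us the delimiter was ";"
```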

d) Are tables indexed? Without indexes on the right columns, query 
performance could easily be so slow that an API call would create a 
denial-of-service problem. Any column that can be filtered (currently 
all columns) should be indexed. But the ability to take in raw SQL 
queries (especially ones that join tables) worries me: someone could 
inadvertently or deliberately craft a query that runs effectively 
forever, regardless of how the tables are indexed, as long as there is 
at least one large table in the database.
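Two mitigations I'd suggest, sketched below (the table/column names are made up, and this just builds the SQL strings --- it doesn't talk to a database): index every filterable column, and set a per-session Postgres `statement_timeout` so a runaway raw-SQL query gets killed rather than running forever.

```python
def index_statements(table, columns):
    """Generate one CREATE INDEX statement per filterable column of a
    DataStore table (names here are hypothetical)."""
    return ['CREATE INDEX "%s_%s_idx" ON "%s" ("%s")' % (table, col, table, col)
            for col in columns]

# Postgres will abort any statement in the session that runs longer
# than this, which bounds the damage a crafted join can do.
TIMEOUT_SQL = "SET statement_timeout = '5000ms'"

stmts = index_statements("resource_abc", ["year", "value"])
print(stmts[0])
print(TIMEOUT_SQL)
```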

e) When uploading a large table in batches (e.g. 1000 rows at a time), 
the datastore may be in an incomplete state. It would be helpful to be 
able to deactivate public access to the table while it is being uploaded.
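The pattern I'm picturing is a visibility flag that stays off until the last batch lands --- something like this sketch (a made-up API, not CKAN's):

```python
class TableUpload:
    """Upload-in-batches with a visibility flag, so readers never see
    a half-loaded table."""

    def __init__(self):
        self.rows = []
        self.public = False  # hidden while loading

    def add_batch(self, batch):
        self.rows.extend(batch)

    def publish(self):
        # flip visibility only once every batch is in
        self.public = True

up = TableUpload()
for batch in ([{"id": i} for i in range(0, 1000)],
              [{"id": i} for i in range(1000, 2000)]):
    up.add_batch(batch)   # table is still private here
up.publish()
print(len(up.rows), up.public)  # 2000 True
```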

I've been working on (a)-(c) in a new datastore loader script:

https://github.com/tauberer/datastore-loader

Thanks,

-- 
- Joshua Tauberer
- http://razor.occams.info

