[ckan-dev] Fixing the DataStore vs FileStore duality confusion
Joshua Tauberer
tauberer+consulting at govtrack.us
Thu Mar 14 12:21:51 UTC 2013
On 03/14/2013 06:30 AM, Sean Hammond wrote:
> Just to update the dev list on this, there's a wiki page here outlining
> the problems and suggested solutions:
>
Hi, Sean.
Some of these ideas look good for clearing up the confusion.
But they don't address the major architectural issues I've run
into while actually trying to deploy the DataStore:
a) There's no error reporting. When something goes wrong, the error is
logged in a place only a sysadmin can find. When an invalid value makes
its way into Postgres --- an invalid column name, or a data value that
doesn't match the column type --- neither the datastorer nor the
datastore API propagates the problem to the caller. Some of this
can be addressed by better validation up front, but in any
paster/celery/cron-based loading process it is hard to propagate errors
back to the data owner.
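To make (a) concrete, here's a minimal sketch (names and structure are my
own illustration, not the current datastorer code): rather than logging
failures where only a sysadmin sees them, a loader could collect per-row
errors and return them to the caller, so a paster/celery task can attach
them to the resource for the data owner to see.

```python
# Illustrative sketch: collect per-row load errors instead of swallowing
# them, so they can be surfaced to the data owner rather than a log file.

def load_rows(rows, schema):
    """Try to coerce each row to `schema`; return (good_rows, errors).

    `schema` maps column name -> Python type, standing in for the real
    DataStore column types.
    """
    good, errors = [], []
    for lineno, row in enumerate(rows, start=1):
        try:
            coerced = {col: typ(row[col]) for col, typ in schema.items()}
            good.append(coerced)
        except (KeyError, ValueError) as e:
            # Record the failure instead of silently dropping the row.
            errors.append({"line": lineno, "row": row, "error": repr(e)})
    return good, errors
```

The point is the return value: whatever drives the upload (cron, celery,
whatever) gets the error list back and can report it, instead of the
failure dying inside the worker.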
b) Automatic type guessing only works if you're lucky. The messytables
type-guessing code uses the first ~100 rows to guess types. Imagine a
sorted table with all NULLs at the top. Or sorted integers ranging
from 0 to a very large number, so you can't tell whether you need a
32-bit or 64-bit data type until the bottom. And even looking at the
entire table, it may be impossible to make the right guess between
integer and float --- or, is "00005" an integer or a string? --- without
domain knowledge. Similarly, should empty strings, "NA", or "N/A" be
treated as null values? The user will often need to be able to set a
schema explicitly.
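A toy version of the sampling problem in (b) (this is not the messytables
algorithm, just an illustration of why leading-sample guessing fails):

```python
# Illustrative sketch: guessing a column's type from only the first
# `sample_size` values goes wrong on sorted data.

def guess_type(values, sample_size=100):
    """Naively guess int vs str from a leading sample, skipping blanks."""
    sample = [v for v in values[:sample_size] if v != ""]
    if not sample:
        return None  # the sample was all blanks: no information at all
    try:
        for v in sample:
            int(v)
        return int
    except ValueError:
        return str

# A column sorted with blanks first: the 100-row sample carries no
# information, even though the full column is clearly textual.
col = [""] * 200 + ["apple", "banana"]
```

Here `guess_type(col)` sees only blanks and learns nothing, while scanning
the whole column would give `str`. And note that `int("00005")` succeeds,
so a naive guesser calls that column an integer and silently drops the
leading zeros --- exactly the kind of call only the data owner can make.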
c) In order to accommodate existing data files, it may also be necessary
for the user to explicitly set file loading options: the data format,
the delimiter, etc.
d) Are tables indexed? Without indexes on the right columns, query
performance could easily be so slow that an API call becomes a
denial-of-service problem. Any column that can be filtered (currently
all columns) should be indexed. And the ability to accept raw SQL
queries (especially ones that join tables) worries me: someone could
inadvertently or deliberately craft a query that runs effectively
forever, regardless of how the tables are indexed, as long as there is
at least one large table in the database.
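What I mean by (d), sketched with sqlite3 as a stand-in for Postgres
(table and index names are made up): every filterable column needs an
index so API filters don't force full-table scans.

```python
import sqlite3

# Stand-in for a DataStore resource table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource_data (id INTEGER, city TEXT)")

# Index the column the API lets callers filter on; without this, every
# filtered API call is a sequential scan of the whole table.
conn.execute("CREATE INDEX idx_resource_city ON resource_data (city)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM resource_data WHERE city = ?",
    ("Boston",),
).fetchall()
# The plan should show a search using idx_resource_city, not a scan.
```

Indexing doesn't solve the raw-SQL problem, though; for that, Postgres's
statement_timeout setting would at least put an upper bound on how long
any one query can run.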
e) When uploading a large table in batches (e.g. 1000 rows at a time),
the datastore may be in an incomplete state. It would be helpful to be
able to deactivate public access to the table while it is being uploaded.
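One possible shape for (e), purely as a sketch (the store interface here
is invented, not a real DataStore API): hide the table before the first
batch, push batches, and only re-publish once everything has landed ---
so a failed upload stays hidden rather than half-visible.

```python
class FakeStore:
    """In-memory stand-in for the DataStore (illustrative only)."""
    def __init__(self):
        self.rows, self.public = {}, {}
    def set_visibility(self, table, public):
        self.public[table] = public
    def insert(self, table, rows):
        self.rows.setdefault(table, []).extend(rows)

class TableUpload:
    """Context manager that hides a table during a batched upload."""
    def __init__(self, store, table):
        self.store, self.table = store, table
    def __enter__(self):
        # Deactivate public access before the first batch goes in.
        self.store.set_visibility(self.table, public=False)
        return self
    def push_batch(self, rows):
        self.store.insert(self.table, rows)
    def __exit__(self, exc_type, exc, tb):
        # Re-publish only on success; an interrupted upload stays hidden.
        if exc_type is None:
            self.store.set_visibility(self.table, public=True)
```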
I've been working on (a)-(c) in a new datastore loader script:
https://github.com/tauberer/datastore-loader
Thanks,
--
- Joshua Tauberer
- http://razor.occams.info