[ckan-dev] Alternatives to ckanext-importlib and using something like pandas for bulk updates to the datastore

Colum McCoole colum.mccoole at btinternet.com
Tue Jan 21 22:32:13 UTC 2014


I'm relatively new to this forum and to CKAN, so please excuse the apparent
gaps in my knowledge.

I'm posting this here on David Read's suggestion.  I had been asking David
about ckanext-importlib earlier and he thought it best to poll the
expertise on this forum.

 

Part of what brought me to ckanext-importlib was exploring the
possibilities for automating data upload to CKAN (and the datastore).
Among my data sources I have many large Excel files (multi-sheet) that are
always changing (new data): sometimes expanding down (new rows), sometimes
expanding across (new columns).  As I understand it, CKAN doesn't really
handle this for you out of the box (it just takes the first sheet from an
Excel file, although I believe it's possible to set up a single resource
per sheet).  Reprocessing new versions of the spreadsheet overwrites
previous versions of the same resource, and although I see CKAN generates
a revision_id, I don't think this gives it any temporal capability.  Will
datapusher in version 2.2 handle Excel any differently?
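
To make that concrete, here is the sort of thing I had in mind for the
one-resource-per-sheet route: an untested sketch, assuming a reasonably
recent pandas and the ckanapi package, with the CKAN URL, API key and
dataset name below as placeholders.

import pandas as pd
import ckanapi

ckan = ckanapi.RemoteCKAN('http://my-ckan.example.org', apikey='MY-API-KEY')

# sheet_name=None asks pandas to return every sheet as {name: DataFrame}
sheets = pd.read_excel('source.xlsx', sheet_name=None)

for name, df in sheets.items():
    csv_path = name + '.csv'
    df.to_csv(csv_path, index=False)
    with open(csv_path, 'rb') as f:
        # ckanapi treats file-like arguments as multipart uploads
        ckan.action.resource_create(
            package_id='my-dataset',
            name=name,
            format='CSV',
            upload=f,
        )

That still leaves the versioning problem below untouched, of course.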

 

Ideally, my solution would handle the following: where version 2 (in the
simplified illustration in the screenshot below) is the latest version of
a resource, I would like to capture, when I schedule a refresh, any new
data (shown in red) added since version 1, and also the fact that the
'value' for 2nd January was historically 2 before being revised.  If
anyone has any pointers on how that might best be achieved within the
framework of CKAN (if at all), I would be very receptive to those
suggestions; I have put a rough sketch of one idea below the screenshot.

 

[Screenshot: version 1 and version 2 of a simple date/value table, with
the new and revised values highlighted in red.]
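
The idea (untested; the field names, resource id, CKAN URL and API key
below are only illustrative) is to stop overwriting and instead append
every capture with a timestamp, using the datastore API's primary key and
upsert support, so that revised values sit alongside the values they
replaced:

import datetime
import ckanapi

ckan = ckanapi.RemoteCKAN('http://my-ckan.example.org', apikey='MY-API-KEY')
RESOURCE_ID = 'my-resource-id'

# One-off: create the datastore table with a compound key, so the same
# date can appear once per capture rather than being overwritten.
ckan.action.datastore_create(
    resource_id=RESOURCE_ID,
    fields=[
        {'id': 'date', 'type': 'date'},
        {'id': 'value', 'type': 'numeric'},
        {'id': 'captured_at', 'type': 'timestamp'},
    ],
    primary_key=['date', 'captured_at'],
    force=True,
)

# On each scheduled refresh: stamp the rows and upsert them.  Earlier
# captures stay in the table, giving a crude temporal history.
now = datetime.datetime.utcnow().isoformat()
new_rows = [{'date': '2014-01-02', 'value': 3, 'captured_at': now}]
ckan.action.datastore_upsert(
    resource_id=RESOURCE_ID,
    records=new_rows,
    method='upsert',
    force=True,
)

Getting "the latest version" back out would then be a matter of filtering
on the most recent captured_at per date, which I assume
datastore_search_sql could handle.  But I may well be missing a better way.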

 

David suggested, in lieu of importlib (which he thinks might be too
complicated), something built on messytables that extracts the data
line-by-line into a package dictionary and uses package_create with
ckanapi <https://github.com/wardi/ckanapi> to push to the datastore.
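
My reading of that suggestion, as an untested sketch (the dataset name,
resource name, CKAN URL and API key are placeholders), would be roughly:

import ckanapi
from messytables import (XLSTableSet, headers_guess, headers_processor,
                         offset_processor)

ckan = ckanapi.RemoteCKAN('http://my-ckan.example.org', apikey='MY-API-KEY')

pkg = ckan.action.package_create(name='my-dataset', title='My dataset')
res = ckan.action.resource_create(package_id=pkg['id'], name='Sheet 1',
                                  url='placeholder', format='CSV')

with open('source.xls', 'rb') as fh:
    table_set = XLSTableSet(fh)
    row_set = table_set.tables[0]        # first sheet; loop for the others
    offset, headers = headers_guess(row_set.sample)
    row_set.register_processor(headers_processor(headers))
    row_set.register_processor(offset_processor(offset + 1))
    # one dict per row, keyed by the guessed column headers
    records = [{cell.column: cell.value for cell in row} for row in row_set]

# fields are guessed from the records here; they could also be declared
# explicitly as in the datastore_create call further up this mail
ckan.action.datastore_create(resource_id=res['id'], records=records,
                             force=True)

Is that broadly the right shape, or is there more to importlib that I
would be giving up?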

 

I was wondering if anyone had used modules like pandas
<http://pandas.pydata.org/> in conjunction with CKAN to push dataframes
en masse to Postgres.  This SO thread
<http://stackoverflow.com/questions/9826431/fast-insertion-of-pandas-dataframe-into-postgres-db-using-psycopg2>
suggests one approach with psycopg2, while the bottom of this pandas issue
<https://github.com/pydata/pandas/issues/4163> discusses an implementation
using SQLAlchemy, with coverage of Postgres, which is perhaps more in line
with CKAN.  The bottom of that same thread also has an interesting request
for discussion around pandas versus dataset
<https://github.com/pudo/dataset> (which I believe is written by
members/collaborators of the OKF stable) versus tablib
<https://github.com/kennethreitz/tablib>.  Is it likely that CKAN
Transformations <https://github.com/okfn/ckan/wiki/CKAN-Transformations>,
discussed on this forum a few weeks back, will use one of these three
modules?
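
For what it's worth, the pandas/SQLAlchemy route from those threads would,
as I understand it, look something like this (untested; the connection
string and table name are placeholders, and writing straight into the
datastore database like this bypasses the datastore API, so CKAN's own
bookkeeping, such as the _id column, would presumably need handling
separately):

import pandas as pd
from sqlalchemy import create_engine

# the datastore database, not CKAN's main database
engine = create_engine(
    'postgresql://ckan_default:pass@localhost/datastore_default')

df = pd.read_excel('source.xlsx', sheet_name='Sheet1')

# datastore tables are named after the resource id
df.to_sql('my-resource-id', engine, if_exists='append', index=False)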

 

Sorry, perhaps the last part has become too open-ended, but I just wanted
to throw it out to those more in the know on these matters.

 

Kind regards,

Colum
