[ckan-dev] Some help/guidance with pushing large data volumes directly to datastore tables

Colum McCoole colum.mccoole at btinternet.com
Fri Mar 14 16:02:21 UTC 2014


I had posted the following as an issue under ckanapi
<https://github.com/ckan/ckanapi>, but @wardi <https://github.com/wardi>
suggested I try here for some input. I've been experimenting a bit over the
past week with his excellent ckanapi. I'm hitting some obstacles in handling
large files though, in terms of running out of memory on a 4GB Ubuntu VM, and
I was hoping someone might be able to point me in the right direction to
resolve this. I think the crux of the issue is streaming data into CKAN, but
I have limited Python experience with this.

In order to debug my issues, I've been using an IPython notebook with
pandas and ckanapi, as well as a few other modules. With this approach, I
seem to be able to capture the Python errors more easily. My use case is to
push 500k-plus records (22 fields) to a datastore table (not
necessarily lightning fast, but reliably and consistently). This equates to
about 140MB in CSV form and >300MB in what I think is the equivalent of the
jsonl <http://jsonlines.org/> format that is supported in the ckanapi CLI
(although @wardi has since confirmed that that is only intended for CKAN
metadata, i.e. datasets etc., rather than the datastore data).

I abandoned trying to feed a file upload (either manually or via URL), since
CKAN doesn't even attempt to ingest a file of this size. I also tried pointing
at a URL to the file on S3, but again, datapusher doesn't even try to tackle
this. I'm running CKAN in a docker container (using @kindly
<https://github.com/kindly>'s work), so maybe part of my problem is that I
should be exposing port 8800 for this to work. Has anyone else tried this?

Rather than relying on the url/datapusher approach (since the URL pointing to
my file potentially keeps changing), I'm trying to use the datastore action
API commands to interact with the datastore table directly. At present, in
order to keep the IPython notebook kernel from constantly crashing with
out-of-memory errors, I'm splitting the dataframe of roughly 500k records
into six pieces: a single row goes in with a datastore_create, and the rest
follow in 100k chunks using datastore_upsert. While this kind of works, it
still comes back with a 504 error (traceback at the bottom) and echoes all
the added records back into the notebook's 'Out' cell (I'm not sure if
there's a recommended way to suppress this).

dfprev1 = dfprev[:1]             # single row, used to seed the table via datastore_create
dfprev2 = dfprev[1:100000]       # the rest, pushed with datastore_upsert in ~100k blocks
dfprev3 = dfprev[100000:200000]
dfprev4 = dfprev[200000:300000]
dfprev5 = dfprev[300000:400000]
dfprev6 = dfprev[400000:]
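
(A small loop would probably be tidier than naming each slice by hand; this
is just a sketch using the same 100k block size as above.)

chunk_size = 100000  # same block size as the manual slices above
chunks = [dfprev[i:i + chunk_size] for i in range(0, len(dfprev), chunk_size)]
# chunks[0] can seed datastore_create, the rest go to datastore_upsert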

This is some of the code I'm using to transform the dataframe object in
pandas into something equivalent to jsonl. I referenced this SO thread:
http://stackoverflow.com/questions/20639631/how-to-convert-pandas-dataframe-to-the-desired-json-format
You can probably deduce my Python skills!

import StringIO
import pandas as pd

output = StringIO.StringIO()  # buffer used to hold the JSON output
dfprev6.to_json(path_or_buf=output, date_format='iso', orient='records')  # dfprev6 is a slice of the larger dataframe
contents = output.getvalue()  # pull the JSON string back out of the buffer
records_new = pd.json.loads(contents)  # parse it into a list of record dicts

# mysite is a ckanapi.RemoteCKAN instance created earlier in the notebook
mysite.action.datastore_upsert(
    resource_id='0a8462d3-4c81-474a-bf84-3f2941ac67c0',
    records=records_new,
    force=True,
    primary_key=['ID_BB'])
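
Incidentally, the same transformation can be written without the StringIO
buffer, since to_json returns a string when no path is given, and the
standard-library json module can do the parsing instead of pandas' bundled
one. A minimal sketch:

import json

def frame_to_records(df):
    # orient='records' gives a JSON array of row objects and date_format='iso'
    # takes care of timestamps; json.loads turns that string back into the
    # list of dicts that datastore_upsert expects
    return json.loads(df.to_json(date_format='iso', orient='records'))

records_new = frame_to_records(dfprev6)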

Presumably some sort of streaming method that feeds the large dataframe in
chunks and passes them to ckanapi would work better, but I'm not sure of the
best approach. I was wondering if anyone might have some sample code that
would achieve this.
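
Something along these lines is roughly what I have in mind (the chunk size,
site URL and API key are placeholders, and frame_to_records is the little
helper sketched above):

import ckanapi

mysite = ckanapi.RemoteCKAN('http://172.17.0.2', apikey='MY-API-KEY')  # placeholders
RES_ID = '0a8462d3-4c81-474a-bf84-3f2941ac67c0'
CHUNK = 20000  # smaller batches might stay under the gateway timeout

def push_dataframe(df, resource_id, chunk=CHUNK):
    # first batch goes through datastore_create so the table and primary key
    # exist; the remaining rows are applied in fixed-size blocks via upsert
    mysite.action.datastore_create(
        resource_id=resource_id,
        records=frame_to_records(df[:chunk]),
        primary_key=['ID_BB'],
        force=True)
    for start in range(chunk, len(df), chunk):
        mysite.action.datastore_upsert(
            resource_id=resource_id,
            records=frame_to_records(df[start:start + chunk]),
            force=True,
            primary_key=['ID_BB'])  # passing primary_key here as in my calls above

push_dataframe(dfprev, RES_ID)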

As an alternative route, I also tried the relatively new to_sql feature in
pandas (you need the master branch), which now apparently works with
Postgres. I can get it working with smaller dataframes, but the larger ones
create out-of-memory issues for me too. If I get this alternative approach to
work with datastore tables (not using the CKAN API), do I need to add the
'_id' and '_full_text' fields to the tables myself? Are these necessary?
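
For reference, this is roughly how I'm driving to_sql at the moment (the
connection string and table name are placeholders, and the to_sql signature
on the master branch may well differ from what I show here):

from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@localhost/datastore_default')  # placeholder credentials

# append the dataframe in slices to keep memory use down; if_exists='append'
# assumes the target table already exists with a matching schema
for start in range(0, len(dfprev), 100000):
    dfprev[start:start + 100000].to_sql('my_datastore_table', engine,
                                        if_exists='append', index=False)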

Thanks for any input on this. Colum


This is the 504 error I get back when trying to push 100k records in
increments using datastore_upsert.


CKANAPIError                              Traceback (most recent call last)
<ipython-input> in <module>()
      1 mysite.action.datastore_upsert(resource_id='0a8462d3-4c81-474a-bf84-3f2941ac67c0',
      2     records=records_new,
----> 3     force=True, primary_key=['ID_BB'])

/usr/local/lib/python2.7/dist-packages/ckanapi-3.3_dev-py2.7.egg/ckanapi/common.pyc in action(**kwargs)
     48             data_dict=nonfiles,
     49             files=files)
---> 50         return self.ckan.call_action(name, data_dict=kwargs)
     51     return action
     52

/usr/local/lib/python2.7/dist-packages/ckanapi-3.3_dev-py2.7.egg/ckanapi/remoteckan.pyc in call_action(self, action, data_dict, context, apikey, files)
     80         else:
     81             status, response = self.requestfn(url, data, headers, files)
---> 82         return reverse_apicontroller_action(url, status, response)
     83
     84     def requestfn(self, url, data, headers, files):

/usr/local/lib/python2.7/dist-packages/ckanapi-3.3_dev-py2.7.egg/ckanapi/common.pyc in reverse_apicontroller_action(url, status, response)
    104
    105     # don't recognize the error
--> 106     raise CKANAPIError(repr([url, status, response]))

CKANAPIError: ['http://172.17.0.2/api/action/datastore_upsert', 504, u'... 504 Gateway Time-out ...']
(the response body is just the proxy's "504 Gateway Time-out" HTML page, trimmed here)
