[ckan-discuss] Trouble installing CKAN
James McKinney
james at opennorth.ca
Mon Aug 20 02:29:52 BST 2012
Switching gears a little, I'm now having a look at dataproxy. For the original issue that started this thread, try running: http://jsonpdataproxy.appspot.com/?url=https://commondatastorage.googleapis.com/ckannet-storage/2012-08-18T163227/custom_qc_cmp_details.csv&type=csv You'll get the message "Data transformation failed. Reason: ". When I run the same API call on a local instance of the dataproxy, it works. I deployed the app to Google App Engine, ran the API call again, and got the same error as for jsonpdataproxy.appspot.com. If I check the logs in the App Engine dashboard, I see another message before the error:
"The file at https://commondatastorage.googleapis.com/ckannet-storage/2012-08-18T170006/seao_details.csv has length 378"
That's the wrong length. It's wrong because dataproxy only follows a single redirect, but when getting files on Google Storage from App Engine, there may sometimes be more than one redirect (and indeed, in other contexts too). I've written a pull request to use a redirect limit instead of a simplistic "one redirect" rule: https://github.com/okfn/dataproxy/pull/7
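To illustrate the redirect-limit idea, here is a minimal sketch. It is not the code from the pull request, and it is written for current Python rather than the urllib2-era runtime. The names resolve_url, head, MAX_REDIRECTS and TooManyRedirects are hypothetical; head stands in for whatever function issues a single request without following redirects.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5  # hypothetical limit, not taken from the pull request

REDIRECT_CODES = {301, 302, 303, 307, 308}


class TooManyRedirects(Exception):
    """Raised when the redirect chain is longer than the allowed limit."""


def resolve_url(url, head, max_redirects=MAX_REDIRECTS):
    """Follow redirects until a non-redirect response is seen.

    `head` is a caller-supplied function, head(url) -> (status, headers),
    that must NOT follow redirects itself. Returns (final_url, headers).
    """
    # One initial request plus at most `max_redirects` follow-ups.
    for _ in range(max_redirects + 1):
        status, headers = head(url)
        location = headers.get("location")
        if status not in REDIRECT_CODES or not location:
            return url, headers
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise TooManyRedirects("gave up after %d redirects" % max_redirects)
```

With something like this, the content length would be read from the headers of the final response rather than from an intermediate redirect.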
However, that doesn't solve the problem. Getting the wrong content length isn't what causes the error. I removed the try/except around "result = transformer.transform()" to see what the exception is. It's the line "handle = urllib2.urlopen(self.url)" that throws <class 'google.appengine.api.urlfetch_errors.ResponseTooLargeError'>. The maximum response size is 32MB. See https://developers.google.com/appengine/docs/python/urlfetch/overview#Quotas_and_Limits
I've created a pull request to make the error messages more useful, so that it's clearer when there is an App Engine platform error: https://github.com/okfn/dataproxy/pull/9
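As a rough sketch of what "more useful" could mean (again, not the code in the pull request): when the exception carries no message, fall back to its class name so the reason is never empty. The helpers failure_reason and transform_or_report are hypothetical, and transform is assumed to be any zero-argument callable.

```python
def failure_reason(exc):
    """Build a non-empty reason string from any exception."""
    name = type(exc).__name__
    detail = str(exc).strip()
    # Some platform errors are raised without a message; keep the class name.
    return "%s: %s" % (name, detail) if detail else name


def transform_or_report(transform):
    """Run `transform` and return either its data or a descriptive error."""
    try:
        return {"data": transform()}
    except Exception as exc:
        reason = failure_reason(exc)
        return {"error": "Data transformation failed. Reason: " + reason}
```

An exception raised without a message would then be reported by its class name instead of as an empty reason.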
Not being able to use Data Preview on files larger than 32MB isn't great. Is there a plan to host dataproxy on a server that doesn't impose this arbitrary limit? Or maybe switch to a different strategy for getting data into Recline?
While going through all this, I notice that dataproxy adds a separator argument to brewery's CSVDataSource to support TSV, but brewery already has that with the excel-tab dialect. My pull request concerning that is here: https://github.com/okfn/dataproxy/pull/8
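For reference, this is what the excel-tab dialect looks like in Python's standard csv module. It is only a sketch using the standard library; it does not show brewery's CSVDataSource API, and read_rows is a hypothetical helper.

```python
import csv
import io


def read_rows(text, dialect="excel"):
    """Parse delimited text with a named csv dialect and return all rows."""
    return list(csv.reader(io.StringIO(text), dialect=dialect))


# Tab-separated input needs no custom separator argument:
tsv_rows = read_rows("name\tamount\nAlice\t10\n", dialect="excel-tab")
```

Selecting the dialect by name covers TSV without a separate separator argument.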
I also notice that dataproxy ignores characters that it can't read as UTF-8: https://github.com/okfn/dataproxy/commit/945bd1b0a486df64f254b2fbc7bb8b525d69e36e It'd be better to at least transliterate them so that the data keeps some of its meaning. Unfortunately App Engine doesn't have Iconv, which allows setting both "translit" and "ignore" flags. Given that App Engine also doesn't support large files, maybe we should reconsider hosting dataproxy on App Engine?
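To show the difference in behaviour, here is a small sketch in plain Python. It is not dataproxy's code, the helper names are hypothetical, and the NFKD-based to_ascii is only a rough stand-in for transliteration, not an equivalent of Iconv's "translit" flag.

```python
import unicodedata


def decode_ignore(raw):
    """Decode as UTF-8 and silently drop bytes that are not valid UTF-8."""
    return raw.decode("utf-8", "ignore")


def decode_replace(raw):
    """Decode as UTF-8 and mark invalid bytes with U+FFFD instead."""
    return raw.decode("utf-8", "replace")


def to_ascii(text):
    """Rough transliteration: split accented letters, then drop the accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

For the Latin-1 bytes b"Qu\xe9bec", decode_ignore gives "Qubec", decode_replace keeps a visible marker where the byte was, and to_ascii("Québec") gives "Quebec" once the text has been decoded with the right encoding.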
James
On 2012-08-19, at 1:19 PM, James McKinney wrote:
> I'm going through more documentation today. "ckan.storage.directory" (mentioned in the docs) doesn't seem to be used in the code. Isn't the correct name for this setting "ckan.storage_dir"? Seen here: http://docs.ckan.org/en/ckan-1.7.1/configuration.html Also, yesterday, I had forgotten to set site_url. This pull request adds a note about setting site_url in appropriate parts of the documentation: https://github.com/okfn/ckan/pull/102
>
> It seems that local file storage should work with Data Preview, so I'm now looking for other sources for the error. In the celeryd log, I'm getting URL unobtainable. When I make the error message more helpful (see pull request: https://github.com/okfn/ckanext-archiver/pull/2) I get "URL unobtainable: HTTP 404 on 'http://4m58.localtunnel.com//en/storage/f/2012-08-19T154553/dons_aux_partis_politiques_du_quebec.csv'"
>
> This is due to the double slash. Maybe if I were running under Apache or another web server, the double slashes would be normalized to a single slash and the URL would resolve. Whatever the case, the double slash issue should be resolved. It turns out this regression was introduced on Aug 1 by https://github.com/okfn/ckan/commit/e2073a37ac16acb1233a19f42d6473e0a2065b75 Fixed in pull request: https://github.com/okfn/ckan/pull/104
>
> With that fixed, I now get this error in Data Preview: "Could not load preview: DataStore returned an error (Elastic Search did not return a mapping)". I assume this is because I haven't set up Nginx to proxy to ElasticSearch. Will try that now.
>
> Please merge my pull requests!
>
> James
>
> On 2012-08-19, at 2:28 AM, James McKinney wrote:
>
>> I recently uploaded CSV files greater than 50MB in size to datahub.io (links below), and the Data Preview gives the unhelpful error: "Could not load preview: DataProxy returned an error (Data transformation failed. Reason: )" The Data API also does not work.
>>
>> http://datahub.io/en/dataset/bafc5264-c2b0-44d5-a76d-2215b0e1c9da/resource/415d32d9-aa4f-491c-8556-6e895e7eef01
>> http://datahub.io/en/dataset/96888a16-4be5-4bf1-9dc6-793f6541e94d/resource/15b717a5-4088-4443-8b70-81555dde237c
>> http://thedatahub.org/dataset/registre-qc/resource/9afe589f-2cc6-4d0c-bfa3-8c79c889a8f8
>> http://datahub.io/en/dataset/registre-ca/resource/473efd3b-0d3d-41d2-b136-9d3249220449
>>
>> So, in order to discover (and maybe fix) the error, I installed the latest CKAN locally. I also sent an email about this error via the Datahub contact form. I'm now having trouble setting up the DataStore. The documentation is thin: http://docs.ckan.org/en/latest/datastore.html
>>
>> When I enabled the DataStore without setting up Nginx or adding the Datastorer plugin, I got a JavaScript error: "Uncaught TypeError: Object.keys called on non-object recline.js:3180". I already had ElasticSearch running.
>>
>> I then installed the Datastorer plugin, started the celery daemon and restarted the CKAN server. I added a new resource, and now I get this error message in the preview: "Could not load preview: DataProxy returned an error (Request Error: Backend did not respond after 5 seconds)" and this JavaScript error: "Failed to load resource: the server responded with a status of 500 (Internal Server Error) http://jsonpdataproxy.appspot.com/?callback=...&url=http:////storage/f/2012-08-19T050704/ca_corps_scraper.csv&..."
>>
>> I'm using local storage, which given the above error seems to not work with the Recline Data Explorer. I've submitted a pull request to document this. https://github.com/okfn/ckan/pull/103
>>
>> I installed boto and configured CKAN for S3. After restarting CKAN and celeryd and starting a new resource upload of a 50MB CSV, I see "POST http://my-bucket-name.s3.amazonaws.com/ 400 (Bad Request)" in the Chrome console. I try a smaller CSV (4MB) and it works. I go halfway with a 25MB CSV and it also works. (????) Also, these CSVs are now considered to be binary/octet-stream. Why? When using local file storage and on datahub.io, they were considered text/csv.
>>
>> With S3 hitting a dead end, I then set up Google Storage, where I needed to know that I had to click the "Interoperable Access" button to get to the legacy system of access keys and secrets. Google seems to prefer the use of OAuth 2.0. With Google Storage, however, uploading a CSV of any size causes this error to appear in the console: "Failed to load resource: the server responded with a status of 400 (Bad Request) http://jpmckinney-ckan.commondatastorage.googleapis.com/".
>>
>> Anyway, the main issue is that the links at the start of this post have errors in the Data Preview section. It'd be great to know how to correct the S3 and Google issues, though.
>>
>> James
>>
>>
>>
>>
>>
>