[ckan-dev] CKAN upload custom format tsv on schedule

Christopher Njuguna cnjuguna at Rwanda.cmu.edu
Tue Aug 18 10:11:36 UTC 2015


Hi Denis, Matthew

Sorry I have taken this long to respond. Thanks a lot for your feedback. Interesting insights, and it looks like a lot of work if these ideas are to be implemented in CKAN.

1) Looks like I have no choice but to write my own script to fetch the data, process it, and insert it into CKAN via the API, as you suggest. I want to use CKAN because I need the dataset to be available for visualization and download. May I draw on your expertise once more? I have come across ckanext-scheming (https://github.com/open-data/ckanext-scheming), but I am not sure whether it relates only to metadata or also to file structure.

2) I guess #1 takes care of this, as you suggested.

3) #1 can be run as a cron job, as you have already pointed out.

EDIT:
I have been looking for a data analysis platform, in particular one that supports more advanced interactive analysis, including the creation of workflows and the inclusion of R/Python scripts, something like tangelohub. However, I think CKAN at its core is a data storage and sharing system, so maybe a way to integrate tangelohub and CKAN is worth looking at?

Regards,

Chris
________________________________
From: ckan-dev [ckan-dev-bounces at lists.okfn.org] on behalf of Matthew Fullerton [matthew at smartlane.de]
Sent: Tuesday, August 11, 2015 10:54 AM
To: CKAN Development Discussions
Subject: Re: [ckan-dev] CKAN upload custom format tsv on schedule


Hi Chris,

We are running into this issue quite a lot, and we have two and a half approaches :)


1. Pull the data when needed for visualization, either directly (which rarely works, because the format isn't suitable, as in your case) or with live processing in a backend. That is: do you really need the data in your CKAN, or do you just need to access it?

2. Pre-process the data at regular intervals and write it to the DataStore. There is a nice API for that, and it means you can archive historical values if the external data source only delivers the live values. This is like the cron job option Denis mentioned.
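[Editor's note] Option 2 above can be sketched with CKAN's Action API: the `datastore_upsert` action with `"method": "insert"` appends rows, which is what allows historical values to accumulate. The CKAN URL, API key, and resource ID below are placeholders, and the record fields are illustrative, not CKAN requirements:

```python
import json
import urllib.request

CKAN_URL = "https://demo.ckan.org"   # assumption: your CKAN instance
API_KEY = "your-api-key"             # assumption: a user or sysadmin API key
RESOURCE_ID = "resource-uuid"        # assumption: an existing DataStore resource

def build_upsert_payload(resource_id, records):
    """Build the JSON body for CKAN's datastore_upsert action.

    'method': 'insert' appends rows without checking keys, so historical
    values accumulate even if the source only serves the latest snapshot.
    """
    return {
        "resource_id": resource_id,
        "method": "insert",
        "records": records,
    }

def push_records(records):
    """POST pre-processed records to the DataStore via the Action API."""
    payload = build_upsert_payload(RESOURCE_ID, records)
    req = urllib.request.Request(
        CKAN_URL + "/api/3/action/datastore_upsert",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": API_KEY,
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example call (requires a live CKAN instance):
# push_records([{"station": "aberporth", "yyyy": 2015, "mm": 7, "tmax": 19.1}])
```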

3. Treat the external data as a stream and trigger an update any time the data changes. "Everything's a stream"...

http://www.confluent.io/blog/stream-data-platform-1/ - we are working in this direction and I'd love to talk to others who are interested in seeing this happen. I'm not sure it needs to be part of CKAN but on the other hand some kind of UI for keeping track of everything is important.


Best,

Matt


Projekt SMARTLANE

matthew at smartlane.de
T +49.89.289.28575
F +49.89.289.22333
http://www.smartlane.de/en

EXIST start-up project "Tapestry"
c/o Chair of Traffic Engineering (Lehrstuhl für Verkehrstechnik)
Technische Universität München
Arcisstraße 21
80333 München

Funded by the Federal Ministry of Economics and Technology
on the basis of a decision by the German Bundestag.

________________________________
From: ckan-dev <ckan-dev-bounces at lists.okfn.org> on behalf of Denis Zgonjanin <deniszgonjanin at gmail.com>
Sent: Monday, 10 August 2015 15:55
To: CKAN Development Discussions
Subject: Re: [ckan-dev] CKAN upload custom format tsv on schedule

Hi Christopher,

Something that meets these requirements doesn't exist yet. For #2, you can use datapusher. For #3, you would need a cron job or a scheduled celery task to poll that file periodically and pull it in. For #1, there is nothing, and this part may be hard to build.
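[Editor's note] The cron/Celery polling job for #3 can be as small as: fetch the file, compare a digest with the one from the last run, and only reprocess when something changed. A minimal sketch; the state-file path and the crontab line are assumptions, and the parse-and-push step is left as a stub:

```python
import hashlib
import urllib.request

SOURCE_URL = ("http://www.metoffice.gov.uk/pub/data/weather/uk/"
              "climate/stationdata/aberporthdata.txt")
STATE_FILE = "last_digest.txt"  # assumption: where the last-seen digest is kept

def digest(data: bytes) -> str:
    """Hex digest used to fingerprint one version of the file."""
    return hashlib.sha256(data).hexdigest()

def has_changed(data: bytes, previous_digest: str) -> bool:
    """True if the fetched bytes differ from the last version we saw."""
    return digest(data) != previous_digest

def poll_once():
    """One polling cycle: fetch, compare, reprocess only on change."""
    with urllib.request.urlopen(SOURCE_URL) as resp:
        data = resp.read()
    try:
        previous = open(STATE_FILE).read().strip()
    except FileNotFoundError:
        previous = ""
    if has_changed(data, previous):
        # ... parse the file and push records to the DataStore here ...
        with open(STATE_FILE, "w") as f:
            f.write(digest(data))

# crontab entry, assuming the script is saved as poll_station.py:
#   0 * * * * /usr/bin/python3 /path/to/poll_station.py
```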

The approach that I think you're taking - pulling the file periodically - is a good approach, but sub-optimal because there is a delay between when the data is updated and when your cron job will pull it in.

For your case though, it seems the data is only updated once a month, so it's not that big of a deal. But in general, I think CKAN is still waiting for somebody to build a solution to this type of problem - where data is updated remotely, but you want it to be available and always up-to-date in the datastore.

For example, I have a similar problem with data like this: http://data.ottawa.ca/dataset/recreation-guides/resource/0bed5111-0361-4b1e-8ddc-183428a575ce

That file is updated every 15 minutes with the latest data, and I want it in the datastore. If I had the time, I would probably try building something using Postgres Foreign Data Wrappers (http://multicorn.readthedocs.org/en/latest/).

- Denis

On Mon, Aug 10, 2015 at 8:53 AM, Christopher Njuguna <cnjuguna at rwanda.cmu.edu> wrote:

Hi,


I am trying to upload custom-formatted data files from the UK climate site, e.g. this file <http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/aberporthdata.txt>. There are 5 lines of metadata and 1 header line; the file is tab-delimited with some special column data.


1) Can CKAN preprocess the file according to a format I give it, so that only the data rows are picked up, possibly saving the metadata in the description?

I would prefer a frontend option because I want users to be able to do this themselves.


2) Is it possible to have a dataset uploaded automatically once the URL is entered? I currently have to go to the Manage -> DataStore page and click "Upload to DataStore" to have the data populated.


3) Can the dataset be updated at a regular interval?
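[Editor's note] The preprocessing asked about in #1 can be done in a small script before pushing to the DataStore. A minimal sketch, assuming a tab-delimited file with five metadata lines followed by one header line; the sample content below is illustrative, modelled on the station file's layout, not copied from it:

```python
METADATA_LINES = 5  # per the description above: 5 metadata lines, then 1 header

SAMPLE = """Aberporth
Location: 133 metres amsl
Estimated data is marked with a * after the value.
Missing data is marked by ---.
Sunshine data from an automatic sensor is marked with a #.
yyyy\tmm\ttmax\ttmin
1942\t2\t4.2\t-0.6
1942\t3\t9.7*\t---
"""

def parse_station_file(text):
    """Split off the metadata block, then parse header + data rows.

    The metadata block could go into the resource description.
    Flag characters ('*' estimated, '#' sensor) are stripped from
    values, and '---' (missing) becomes None.
    """
    lines = text.splitlines()
    metadata = "\n".join(lines[:METADATA_LINES])
    header = lines[METADATA_LINES].split("\t")
    records = []
    for line in lines[METADATA_LINES + 1:]:
        if not line.strip():
            continue
        row = {}
        for key, raw in zip(header, line.split("\t")):
            raw = raw.strip().rstrip("*#")
            row[key] = None if raw in ("---", "") else raw
        records.append(row)
    return metadata, records

metadata, records = parse_station_file(SAMPLE)
```

The `records` list of dicts is exactly the shape CKAN's DataStore API expects for its `records` field.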


Thanks,


Chris

_______________________________________________
ckan-dev mailing list
ckan-dev at lists.okfn.org
https://lists.okfn.org/mailman/listinfo/ckan-dev
Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev



