[ckan-dev] Something very cool - data pipelines on CKAN

Denis Zgonjanin deniszgonjanin at gmail.com
Mon Sep 21 11:45:09 UTC 2015


Thanks Florian! These notes are invaluable.

- Denis

On Fri, Sep 18, 2015 at 8:48 PM, Florian May <
florian.wendelin.mayer at gmail.com> wrote:

> Thanks Denis!
>
> I thought long and hard about automating the generation of products from
> data resources. I like the idea of sandboxed IPython notebooks! With
> datacats providing a dockerised CKAN, auto-provisioning Jupyter docker
> containers on demand to develop and run IPython notebooks wouldn't be that
> far away, right?
>
> Actually, we already have a semi-automated solution in place for
> generating products from data; I'll add it to the CKAN-o-Sweave example
> soon.
> Our users each have their own account on an RStudio Server instance, with
> a local, version-controlled copy of our reporting repo (CKAN-o-Sweave's
> precursor).
> For each dataset and resulting figure(s), we have one R script that reads
> the data (CSV) directly from CKAN, generates one or several figures (as
> PDF) from the data (or runs whatever analysis we need), and finally
> uploads itself (as TXT) and the figures back to CKAN, overwriting the
> previous versions of the code TXT and figure PDF(s). The script sources a
> secret (gitignored) file with the respective user's CKAN API key, which is
> required for uploading to CKAN via ckanr::resource_update().
> These scripts are under version control inside our reporting code repo
> (I'll add an example to CKAN-o-Sweave in a similar way).
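>
> Here's a minimal sketch of one such script using ckanr; the resource IDs,
> file names and column names are placeholders:
>
>   library(ckanr)
>
>   # secrets.R is gitignored and defines ckan_url and ckan_api_key
>   source("secrets.R")
>   ckanr_setup(url = ckan_url, key = ckan_api_key)
>
>   # Read the data (CSV) directly from CKAN
>   d <- read.csv(resource_show("my-data-resource-id")$url)
>
>   # Generate a figure (PDF) from the data
>   pdf("figure.pdf")
>   plot(d$date, d$value, type = "l")
>   dev.off()
>
>   # Upload the figure and this script back to CKAN, overwriting the
>   # previous versions
>   resource_update("my-figure-resource-id", path = "figure.pdf")
>   resource_update("my-script-resource-id", path = "make_figure.R")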
>
> That provides us with the following situation:
> - creating these scripts is very easy - we simply take working R code
> (producing a figure from a data.frame), prepend it with the "load CKAN
> resource into R data.frame" code, and append the "upload code and products
> back to CKAN" code.
> - to produce the products from data once data are refreshed, we simply
> "source" the R script. That's one user-friendly click of a button in
> RStudio. Doesn't get any easier than that!
> - my researchers have full access to the actual script. I've found that
> every layer of abstraction adds an order of magnitude more potential bugs
> and confusion for end users.
> - we went with R instead of Python as our end users' skill base is more
> around R than Python. YMMV.
> - scripts can be run on demand, or as part of another
> "refresh_all_the_figures.R" script (sketched after this list). This allows
> any automation we require later on, while keeping things simple and
> on-demand at the same time.
> - every user is authenticated with their own CKAN API key, so their
> actions are logged, and we can kick the right behinds in case of mishaps.
> - the scripts of many authors sit in the same code repo as the reports, so
> they are within reach when needed, and serve as a simple library of
> existing solutions to steal from.
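>
> For completeness, "refresh_all_the_figures.R" can be as simple as this
> sketch (assuming the figure scripts live in a scripts/ folder and are safe
> to re-run in any order):
>
>   # Source every figure script under scripts/
>   for (f in list.files("scripts", pattern = "\\.R$", full.names = TRUE)) {
>     message("Refreshing ", f)
>     source(f)
>   }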
>
> We deliberately separated data processing from report compilation for the
> following reasons:
> - performance: the products are produced once and read often. It makes
> sense to compile them when the data comes in, not every time they are
> read.
> - separation of concerns: the products are used not only for the reports,
> so we don't want to tie these two processes too tightly together.
> - scalability: requirements for both the analyses and the reports evolve
> constantly, so any layer of abstraction and any tight integration creates
> a maintenance footprint. Keeping things light-weight allows us to "move
> fast and break things".
> - QA: checking that a product still makes sense is hard to automate, so we
> keep a human brain between the incoming data (updating the CKAN data
> resource is a manual process anyway) and the published product. My users
> like feeling in charge of the QA step and being able to fiddle with the
> product where and when necessary - especially since their name (as the
> dataset maintainer) and the "last updated on" date sign off their update
> to the CKAN dataset.
>
> So yeah, simple user-sourced R scripts are a pretty low-key solution, but
> for us, in the context of a large reporting scenario, they hit the sweet
> spot between automation, human common sense between components for review
> and QA, scalability, and flexibility.
>
> In contrast to this large use-case (a dozen reports of 1500+ pages using
> 600+ CKAN resources), compiling products at the same time as reports would
> make a ton of sense for smaller reporting projects, e.g. research papers,
> technical appendices, R package vignettes and the like.
>
> Finally, where there are too many possible permutations of input settings
> to produce the desired output (and we can't pre-compile one product to fit
> all needs), it's not too hard to wrap an R script into an RShiny app, such
> as my much-hawked timeseries explorer:
> http://rshiny.yes-we-ckan.org/shiny-timeseries/
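>
> The skeleton of such a wrapper is tiny; this sketch assumes a data.frame
> "timeseries_data" with site, date and value columns already loaded from
> CKAN:
>
>   library(shiny)
>
>   ui <- fluidPage(
>     selectInput("site", "Site", choices = unique(timeseries_data$site)),
>     plotOutput("ts")
>   )
>
>   server <- function(input, output) {
>     # Re-plot the time series whenever the selected site changes
>     output$ts <- renderPlot({
>       d <- subset(timeseries_data, site == input$site)
>       plot(d$date, d$value, type = "l")
>     })
>   }
>
>   shinyApp(ui, server)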
>
> So yeah, an IPython notebook server would be a useful addition to our
> data.wa.gov.au stack. (Similarly, an "open this spatial resource in QGIS
> server" button would tickle many fancies, I guess.)
> I'll have to look into adding https://github.com/jupyter/jupyterhub to
> http://govhack2015.readthedocs.org/en/latest/3_Workbench/ to spawn IPython
> notebook containers. Any help or lessons learned in that space would be
> appreciated!
>
> Cheers,
> Florian
>
>
>
> On Fri, Sep 18, 2015 at 10:39 PM, Denis Zgonjanin <
> deniszgonjanin at gmail.com> wrote:
>
>> Great job Florian! Making all those pieces work end-to-end must have been
>> no small feat.
>>
>> Have you thought about auto-refreshing plots and other output when new
>> data comes in, and pushing those back into CKAN as well?
>>
>> It's something I've been curious about for a while - for example, letting
>> users upload a sandboxed script, or specify an IPython notebook that
>> takes a dataset resource as the data input and generates a
>> ResourceView-compatible output. Then every time the resource is updated,
>> the script would be re-run automatically. Seeing all you've done here,
>> you must have thought about something similar, so I'm curious to hear
>> your thoughts.
>>
>> - Denis
>>
>> On Fri, Sep 18, 2015 at 2:30 AM, Steven De Costa <
>> steven.decosta at linkdigital.com.au> wrote:
>>
>>> I'm sure a range of people will find this very interesting:
>>>
>>> http://ckan.org/2015/09/18/pyramids-pipelines-and-a-can-of-sweave-ckan-asia-pacific-meetup/
>>>
>>> What I like about this is the end product: the PDF report. This is
>>> something technical influencers within an organisation can show their
>>> group managers to push for greater resources around their CKAN data
>>> management system.
>>>
>>> A lot of effort goes into reports, so the ability to render these from
>>> datasets and curated figures is very cool.
>>>
>>> Fantastic work by Florian Mayer!
>>>
>>> Cheers,
>>> Steven
>>>
>>> STEVEN DE COSTA | EXECUTIVE DIRECTOR
>>> www.linkdigital.com.au
>>>
>>>
>>>
>
> _______________________________________________
> ckan-dev mailing list
> ckan-dev at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/ckan-dev
> Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev
>
>