[ckan-dev] Something very cool - data pipelines on CKAN

Florian May florian.wendelin.mayer at gmail.com
Sat Sep 19 00:48:18 UTC 2015


Thanks Denis!

I have thought long and hard about automating the generation of products
from data resources. I like the idea of sandboxed IPython notebooks! With
datacats providing a Dockerised CKAN, auto-provisioning Jupyter Docker
containers on demand to develop and run IPython notebooks wouldn't be that
far away, right?

Actually, we have a semi-automated solution in place for generating
products from data; I'll add it to the CKAN-o-Sweave example soon.
Our users each have access to their own account on an RStudio Server
instance, and each of them keeps a local, version-controlled copy of our
reporting repo (CKAN-o-Sweave's precursor) in their RStudio Server account.
For each dataset and its resulting figure(s), we have one R script that
reads the data (CSV) directly from CKAN, generates one or several figures
(as PDF) from the data (or runs whatever analysis we need), and finally
uploads both itself (as TXT) and the figures back to CKAN, overwriting the
previous versions of the code TXT and figure PDF(s). Each script sources a
secret (gitignored) file containing the respective user's CKAN API key,
which is required for the upload to CKAN via ckanr::resource_update().
These scripts sit under version control inside our reporting code repo
(I'll add an example to CKAN-o-Sweave in a similar way).
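
To make this concrete, here is a minimal sketch of such a script - not our
production code; the CKAN URL, resource IDs, column names and file names
are invented for illustration:

    ## figure_species_counts.R - a minimal sketch of one figure script.
    library(ckanr)

    ## Secret, gitignored file defining ckan_api_key for the current user
    source("secret_api_key.R")
    ckanr_setup(url = "https://data.example.wa.gov.au/", key = ckan_api_key)

    ## 1. Read the data (CSV) directly from CKAN
    d <- read.csv(paste0("https://data.example.wa.gov.au/dataset/",
                         "some-dataset/resource/some-id/download/data.csv"))

    ## 2. Generate the figure(s) as PDF
    pdf("species_counts.pdf", width = 7, height = 5)
    plot(count ~ year, data = d, type = "l", main = "Species counts")
    dev.off()

    ## 3. Upload the figure and this script back to CKAN, overwriting
    ##    the previous versions of both resources
    resource_update("figure-resource-id", path = "species_counts.pdf")
    resource_update("code-resource-id", path = "figure_species_counts.R")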

This setup gives us the following:
- creating these scripts is very easy - we simply take working R code
(producing a figure from a data.frame), prepend the "load CKAN resource
into an R data.frame" code, and append the "upload code and products back
to CKAN" code (as in the sketch above).
- to regenerate the products once the data are refreshed, we simply
"source" the R script. That's one user-friendly click of a button in
RStudio. It doesn't get any easier than that!
- my researchers have full access to the actual script. I've found that
every layer of abstraction introduces an order of magnitude more potential
bugs and confusion for end users.
- we went with R instead of Python as our end users' skill base is more
around R than Python. YMMV.
- scripts can be run on demand, or as part of another
"refresh_all_the_figures.R" driver script (see the sketch after this
list). This allows any automation we require later on, but keeps things
simple and on-demand at the same time.
- every user is authenticated with their own CKAN API key, so their actions
are logged, and we can kick the right behinds in case of mishaps.
- the scripts of many authors sit in the same code repo as the reports, so
they are within reach when needed, and serve as a simple library of
existing solutions to steal from.
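
A hypothetical "refresh_all_the_figures.R" driver can then be as simple as:

    ## refresh_all_the_figures.R - source every figure script in the
    ## repo's (hypothetical) scripts/ folder in turn.
    for (f in list.files("scripts", pattern = "\\.R$", full.names = TRUE)) {
      message("Refreshing ", f)
      source(f)
    }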

We deliberately separated data processing from report compilation for the
following reasons:
- performance: the products are produced once and read often, so it makes
sense to compile them when the data comes in, not each time they are read.
- separation of concerns: the products are used not only for the reports,
so we don't want to tie these two processes too tightly together.
- scalability: requirements for both the analyses and the reports evolve
constantly, so any layer of abstraction and any tight integration creates
a maintenance footprint. Keeping things lightweight allows us to "move
fast and break things".
- qa: having a human brain between the incoming data (updating the CKAN
data resource is a manual process anyway) and the published product is
hard to automate - someone has to check that the product still makes
sense. My users like feeling in charge of the QA step and being able to
fiddle with the product where and when necessary - especially since their
name (as the dataset maintainer) and the "last updated on" date sign off
on their update to the CKAN dataset.

So yeah, simple user-sourced R scripts are a pretty low-key solution, but
for us, in the context of a large reporting scenario, they hit the sweet
spot between automation, human common sense between components for review
and QA, scalability, and flexibility.

In contrast to this large use case (a dozen reports of 1500+ pages using
600+ CKAN resources), compiling products at the same time as the reports
would make a ton of sense for smaller reporting projects, e.g. research
papers, technical appendices, R package vignettes and the like.

Finally, where there are too many possible permutations of input settings
to pre-compile one product that fits all needs, it's not too hard to wrap
an R script into a Shiny app, such as my much-hawked timeseries explorer:
http://rshiny.yes-we-ckan.org/shiny-timeseries/
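
The pattern is the same as above, only with the input settings exposed as
widgets. A toy sketch (again, the resource URL and column names are
invented):

    ## app.R - a toy sketch of wrapping a CKAN-fed plot in a Shiny app.
    library(shiny)

    ## Read the resource once at startup; invented URL and columns
    d <- read.csv(paste0("https://data.example.wa.gov.au/dataset/x/",
                         "resource/y/download/timeseries.csv"))
    d$date <- as.Date(d$date)

    ui <- fluidPage(
      selectInput("site", "Site", choices = unique(d$site)),
      plotOutput("timeseries")
    )

    server <- function(input, output) {
      output$timeseries <- renderPlot({
        plot(value ~ date, data = subset(d, site == input$site), type = "l")
      })
    }

    shinyApp(ui, server)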

So yeah, an IPython notebook server would be a useful addition to our
data.wa.gov.au stack. (Similarly, an "open this spatial resource in QGIS
server" button would tickle many fancies, I guess.)
I'll have to look into adding https://github.com/jupyter/jupyterhub to
http://govhack2015.readthedocs.org/en/latest/3_Workbench/ to spawn IPython
notebook containers. Any help or lessons learned in that space would be
appreciated!

Cheers,
Florian



On Fri, Sep 18, 2015 at 10:39 PM, Denis Zgonjanin <deniszgonjanin at gmail.com>
wrote:

> Great job Florian! Making all those pieces work end-to-end must have been
> no small feat.
>
> Have you thought about auto-refreshing plots and other output when new
> data comes in, pushing those back into CKAN as well?
>
> It's something I've been curious about for a while - for example, letting
> users upload a sandboxed script. Or, specify an ipython notebook which
> takes in a dataset resource as the data input, and generates a ResourceView
> compatible output. Then every time the resource is updated, the script
> would be re-run automatically. Seeing all you've done here, you must have
> thought about something similar, so I'm curious to hear your thoughts.
>
> - Denis
>
> On Fri, Sep 18, 2015 at 2:30 AM, Steven De Costa <
> steven.decosta at linkdigital.com.au> wrote:
>
>> I'm sure a range of people will find this very interesting:
>>
>> http://ckan.org/2015/09/18/pyramids-pipelines-and-a-can-of-sweave-ckan-asia-pacific-meetup/
>>
>> What I like about this is the end product for the PDF report. This is
>> something technical influencers within an organisation can show their group
>> managers to push for greater resources around their CKAN data management
>> system.
>>
>> There is a lot of effort that goes into reports so the ability to render
>> these from datasets and curated figures is very cool.
>>
>> Fantastic work by Florian Mayer!
>>
>> Cheers,
>> Steven
>>
>> STEVEN DE COSTA | EXECUTIVE DIRECTOR
>> www.linkdigital.com.au