[okfn-labs] Frictionless Data Vision and Roadmap

Rufus Pollock rufus.pollock at okfn.org
Tue Jan 28 14:45:13 UTC 2014

On 23 January 2014 17:08, Kev Kirkland <kev at dataunity.org> wrote:

> Hi Rufus,
> Great, I like the vision. The cooking metaphor really works - with data
> we're generally mixing up different ingredients (data) and recipes are
> often made of several smaller recipes that can be reused elsewhere.
> In the next few weeks I'll be trying to create a Semantic Web vocab to
> formalise some of these ideas (I need them for the internals of the project
> I'm working on to store data queries in an implementation independent
> format). At the moment the

You might be interested in this piece of existing work:


It is still at draft stage and was started a little while back, but it
contains a proposal for a structured JSON-based serialization of queries.

> recipes for data processing are often embedded in scripts (like Python or
> SQL) so it's tricky to get reuse out of them. However if we have a
> declarative way of specifying the common operations in a dataflow it should
> make things easier to understand.

I agree - one thing I've been toying with is a simple spec for the
simplest kinds of operations (head, grep, delete, etc.). An immediate
motivation has been work in DataPipes <http://datapipes.okfn.org/>, where
the library interface is starting to look like this:

var dp = require('datapipes');
// load data from inUrl, write to outFile after applying the sequence
// of transformations
dp.transform(inUrl, outFile, [
  {
    operator: 'head',
    options: {
      number: 10 // number of rows
    }
  },
  {
    operator: 'grep',
    options: {
      regex: 'london',
      ignorecase: true
    }
  },
  {
    operator: 'delete',
    options: {
      range: '3,5:10'
    }
  }
]);

See https://github.com/okfn/datapipes/blob/master/docs/dev.md.
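
To make the idea concrete, here is a minimal sketch of how a list of
operator specs like the one above could be applied to rows held in memory.
This is not the DataPipes implementation: the names applySpec and
parseRange are hypothetical, and the reading of the range syntax ('3,5:10'
as row 3 plus rows 5 to 10, zero-based and inclusive) is an assumption made
for illustration only.

```javascript
// Hypothetical in-memory interpreter for a list of operator specs.
// Rows are arrays of strings; none of these names come from DataPipes.

// Parse a range such as '3,5:10' into a set of row indices.
// Assumption: indices are zero-based and 'a:b' is inclusive at both ends.
function parseRange(range) {
  var indices = new Set();
  range.split(',').forEach(function (part) {
    var bounds = part.split(':').map(Number);
    var start = bounds[0];
    var end = bounds.length > 1 ? bounds[1] : start;
    for (var i = start; i <= end; i++) {
      indices.add(i);
    }
  });
  return indices;
}

var operators = {
  // keep the first `number` rows
  head: function (rows, options) {
    return rows.slice(0, options.number);
  },
  // keep rows where at least one cell matches the regex
  grep: function (rows, options) {
    var re = new RegExp(options.regex, options.ignorecase ? 'i' : '');
    return rows.filter(function (row) {
      return row.some(function (cell) { return re.test(cell); });
    });
  },
  // drop rows whose index falls inside the range
  delete: function (rows, options) {
    var drop = parseRange(options.range);
    return rows.filter(function (row, index) { return !drop.has(index); });
  }
};

// Apply each step in order, feeding the output of one step into the next.
function applySpec(rows, spec) {
  return spec.reduce(function (current, step) {
    if (!Object.prototype.hasOwnProperty.call(operators, step.operator)) {
      throw new Error('Unknown operator: ' + step.operator);
    }
    return operators[step.operator](current, step.options || {});
  }, rows);
}

// Example: keep rows mentioning london (any case), then take the first one.
var sampleRows = [['London', '10'], ['Paris', '20'], ['london', '30']];
var result = applySpec(sampleRows, [
  { operator: 'grep', options: { regex: 'london', ignorecase: true } },
  { operator: 'head', options: { number: 1 } }
]);
// result is [['London', '10']]
```

Because the spec is plain data, the same list could in principle be stored,
shared, or run by a different implementation, which is the appeal of the
declarative approach.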

> Workflow diagrams seem to be a natural 'DSL' for data processing so I'm
> focussing on those (Directed Acyclic Graphs), similar to Cascading
> workflows. If anyone else is working in this area, it would be great to
> pool ideas.

This sounds really interesting - can you give more detail? I'm also cc'ing
Stefan Urbanek who I know has been building an ETL framework in Python
http://bubbles.databrewery.org/ and who has done a lot of work on OLAP
(cf. his recent series of posts on the Labs blog, including post 2
<http://okfnlabs.org/blog/2014/01/20/olap-cubes-and-logical-model.html>).

> I'm especially interested in a way to encapsulate the reusable parts of a
> data flow. In the cooking metaphor I guess you'd say that a recipe can be
> the ingredient of another recipe. Using Semantic Web we should have a
> framework for publishing dataflow logic so we can build up libraries of
> common processes that can be strung together. I can see it being useful for
> things like showing how a data set can be cleaned up in an implementation
> independent way.

Well, at one end you just have code itself as the recipe :-) (i.e. my ETL
process is my code, boot that up and run it). Now there are ways we could
make that a bit more repeatable (think of abusing Travis to run ETL jobs or
the way DataExplorer lets you run Javascript repeatedly). But what you're
talking about is abstracting above that - e.g. have a way of speccing
particular components in a mini-language (whether semantic web based or
otherwise). Whilst I'm excited about this, I'm a bit wary: having gone
down this path a bit before, I think you can rapidly end up re-inventing a
programming language :-)
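
As a rough sketch of the "code as the recipe" end of that spectrum: if a
recipe is just a function from rows to rows, then a recipe can already be
an ingredient of another recipe without any separate mini-language. All
names below (recipe, trimCells, dropEmptyRows, firstN) are hypothetical and
only meant to illustrate the composition idea.

```javascript
// Hypothetical sketch: a recipe is a function from rows to rows.
// recipe() chains steps in order; since its result is itself a function
// over rows, it can be used as a step inside a larger recipe.
function recipe() {
  var steps = Array.prototype.slice.call(arguments);
  return function (rows) {
    return steps.reduce(function (current, step) {
      return step(current);
    }, rows);
  };
}

// Small reusable steps.
function trimCells(rows) {
  return rows.map(function (row) {
    return row.map(function (cell) { return cell.trim(); });
  });
}

function dropEmptyRows(rows) {
  return rows.filter(function (row) {
    return row.some(function (cell) { return cell !== ''; });
  });
}

function firstN(n) {
  return function (rows) { return rows.slice(0, n); };
}

// A cleaning recipe built from the small steps...
var cleanUp = recipe(trimCells, dropEmptyRows);
// ...reused as an ingredient of a larger recipe.
var preview = recipe(cleanUp, firstN(2));
```

The trade-off is the one mentioned above: functions compose freely, but
they are tied to one language, whereas a declarative spec is portable but
tends to grow towards a language of its own.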

> Sorry I can't make the Labs Hangout, hope it goes well.

It did go well :-) Hope you can come to the next one and you'd be very
welcome to present (it's on the 3rd Thursday of the month - see the events
calendar <http://okfnlabs.org/events/>)
