[okfn-labs] Fwd: Frictionless Data Vision and Roadmap
Stefan Urbanek
stefan.urbanek at gmail.com
Tue Jan 28 16:43:53 UTC 2014
(This is a copy of my original reply to the okfn-labs, as the first one bounced)
Hi there,
Thanks Rufus for plugging me in,
On 28.1.2014, at 15:45, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> On 23 January 2014 17:08, Kev Kirkland <kev at dataunity.org> wrote:
> Hi Rufus,
>
> Great, I like the vision. The cooking metaphor really works - with data we're generally mixing up different ingredients (data) and recipes are often made of several smaller recipes that can be reused elsewhere.
>
> In the next few weeks I'll be trying to create a Semantic Web vocab to formalise some of these ideas (I need them for the internals of the project I'm working on to store data queries in an implementation independent format). At the moment the
>
> You might be interested in this piece of existing work:
>
> http://dataprotocols.org/data-query-protocol/
>
> It is still at draft stage and was started a little while back but it contained a proposal for a structured JSON-based serialization of queries.
>
> recipes for data processing are often embedded in scripts (like Python or SQL) so it's tricky to get reuse out of them. However if we have a declarative way of specifying the common operations in a dataflow it should make things easier to understand.
>
> I agree - one thing I've been toying with is simple spec for specifying simplest kind of operations (head, grep, delete, etc). An immediate motivation has been work in DataPipes where the library interface is starting to look like this:
> var dp = require('datapipes');
> // load data from inUrl, write to outFile after applying the sequence of transformations
> dp.transform(inUrl, outFile, [
>   {
>     operator: 'head',
>     options: {
>       number: 10 // number of rows
>     }
>   },
>   {
>     operator: 'grep',
>     options: {
>       regex: 'london',
>       ignorecase: true
>     }
>   },
>   {
>     operator: 'delete',
>     options: {
>       range: '3,5:10'
>     }
>   }
> ]);
>
> See https://github.com/okfn/datapipes/blob/master/docs/dev.md.
>
> Workflow diagrams seem to be a natural 'DSL' for data processing so I'm focussing on those (Directed Acyclic Graphs), similar to Cascading workflows. If anyone else is working in this area, it would be great to pool ideas.
>
> This sounds really interesting - can you give more detail? I'm also cc'ing Stefan Urbanek who I know has been building an ETL framework in Python http://bubbles.databrewery.org/ and who has done a lot of work on OLAP (cf his recent series of posts (post 2) on the Labs blog).
>
Many commercial tools use such "workflow diagrams" for ETL, data processing or mining. From open source, see Pentaho Kettle, for example, or the Orange visual data mining tool (in Python). Back in the days when I was in mobile telco we were using Oracle Data Warehouse Builder – worth seeing at least for the concepts. Another nice tool for inspiration is IBM Modeler, formerly known as SPSS Clementine – a data mining tool with many nodes for data preparation. There are lots of them out there, with slightly different capabilities, approaches and objectives.
(Google for images of the above-mentioned tools.)
One of the challenges of such tools is the requirement for metadata – the more you have, the easier it is to describe the process. Moreover, I can't imagine easy reusability of such tools without explicit metadata. It looks like open data is finally reaching the point of requiring metadata (and I don't mean the semantic-web kind)... but that's for another discussion.
Another problem with traditional ETL tools is that they push data around – that can be undesirable with large amounts of data, or sometimes even with smaller amounts scattered around a slow web. One solution is an ETL tool that operates purely on metadata and passes the metadata around, moving data only when required and reusing the capabilities of native data stores – that is the idea behind Bubbles (still at the prototype stage, unfortunately).
I'll write about Bubbles in more detail later, but to give an idea, here are some concrete examples of what the framework does:
* you don't have to care whether two tables are joined by a Python loop or an SQL join – you don't have to care about the nature of the source at all
* if you do some data filtering and the source is an SQL table, then the output will be an SQL statement; if the source is a CSV, then the output will be a Python filter iterator – you don't have to care, you treat any source simply as a table
It is about abstracting the datasets and the operations. The engine decides how an operation is performed depending on the data representation.
Why is this important? So that the data wrangler can focus on the data goal, not on the tool or the SQL or Python language.
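To make that concrete, here is a minimal Python sketch of the idea – the class and function names are made up for illustration, this is not the actual Bubbles API. The same logical "filter" composes an SQL statement when the source is a database query and falls back to a plain Python iterator when the source is a CSV file:

    import csv

    class SQLTable:
        # data object represented only by an SQL statement - no rows are moved
        def __init__(self, statement):
            self.statement = statement

    class CSVTable:
        # data object represented by a row iterator over a local CSV file
        def __init__(self, path):
            self.path = path
        def rows(self):
            with open(self.path, newline="") as f:
                for row in csv.DictReader(f):
                    yield row

    def filter_equal(obj, field, value):
        # same logical operation, different implementation per representation
        if isinstance(obj, SQLTable):
            # compose SQL - only metadata travels, the database does the work
            return SQLTable("SELECT * FROM (%s) t WHERE %s = '%s'"
                            % (obj.statement, field, value))
        elif isinstance(obj, CSVTable):
            # fall back to a Python iterator for file-based sources
            return (row for row in obj.rows() if row[field] == value)
        raise TypeError("unknown data representation")

    # the caller does not care which representation it got
    cities_sql = filter_equal(SQLTable("SELECT * FROM cities"), "country", "UK")
    cities_csv = filter_equal(CSVTable("cities.csv"), "country", "UK")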
> I'm especially interested in a way to encapsulate the reusable parts of a data flow. In the cooking metaphor I guess you'd say that a recipe can be the ingredient of another recipe.
That's the ultimate goal, however I haven't seen many examples of it in (corporate) reality. Maybe that is because not only the data but also the metadata are very context-specific (time, location, user, goal, ...). On the other hand, there are attempts at kind-of "standard metadata models", for example at the industry level: a telco model, a banking model, ... but even those are really just "recommendations" and have to be adapted to the target environment (which is sometimes as much work as reinventing them from scratch).
If we want to get closer to reusability, we would have to create a layer which generates data AND metadata that can be fed into common tools. I'm not sure yet how that would be achievable, given the current maturity level of Open Data – we have just finished the "scraping phase" and are entering the "ETL phase", with great diversity of data, metadata and tools. That's the third challenge of data in the Open Data world. In the corporate world, despite working with multiple data sources, the number of sources is pretty limited. In the Open Data world it is much, much wilder, much more inconsistent, much more raw...
> Using Semantic Web we should have a framework for publishing dataflow logic so we can build up libraries of common processes that can be strung together. I can see it being useful for things like showing how a data set can be cleaned up in an implementation independent way.
>
> Well at one end you just have code itself as the recipe :-) (ie. my ETL process is my code, boot that up and run it).
Exactly, Rufus! And sometimes it is much easier and more *understandable* to write a piece of code than a processing network (even though I am a huge proponent of them).
> Now there are ways we could make that a bit more repeatable (think of abusing Travis to run ETL jobs or the way DataExplorer lets you run Javascript repeatedly). But what you're talking about is abstracting above that - e.g. have a way of speccing particular components in a mini-language (whether semantic web based or otherwise). Whilst I'm excited about this I'm a bit wary as having gone down this path a bit before I think you can rapidly end up re-inventing a programming language :-)
Well... I don't think that it is reinventing a programming language. I see it rather as a layer above the programming language. The question is: what is the most comfortable granularity at the script level, and what is comfortable at the processing-network level? I don't know yet, and looking at the tools around, they don't seem to know either.
Just to give an example: IBM Modeler (SPSS Clementine) is a data mining tool where you have nodes for machine learning, such as neural networks or segmentation nodes, but you also have nodes for simple data filtering or joins. Those simple nodes are meant for light data treatment before mining. However, they are so generic that you can build a whole ETL process with them. Is it worth it? I guess it would be slow and inefficient if used only for ETL purposes – a script producing a table or two would be much more effective.
The advantages of a graphical language for "flow-based programming", as I see them, are:
* focus on the data goal
* better understandability of the process, with immediate visual feedback for the person modelling it
* process abstraction – I trust the underlying engine to produce the most effective code, be it SQL or direct interpretation (a toy sketch of this follows after the list)
Disadvantages:
* you can't use vim to edit such a network in a nice and easy way :-)
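And just to make the "layer above the language" point concrete – this is a purely hypothetical format, not DataPipes, Bubbles or any existing tool – a processing network can be described as plain data and handed to a tiny engine that decides how each node is executed:

    import csv

    # hypothetical declarative description of a small processing network
    nodes = [
        {"op": "source", "options": {"path": "payments.csv"}},
        {"op": "grep",   "options": {"field": "city", "pattern": "london"}},
        {"op": "head",   "options": {"count": 10}},
    ]

    def run(nodes):
        # very small interpreter: each operator maps to one implementation;
        # a smarter engine could compose SQL instead of looping in Python
        data = None
        for node in nodes:
            op, opts = node["op"], node["options"]
            if op == "source":
                with open(opts["path"], newline="") as f:
                    data = list(csv.DictReader(f))
            elif op == "grep":
                data = [r for r in data
                        if opts["pattern"] in r[opts["field"]].lower()]
            elif op == "head":
                data = data[:opts["count"]]
            else:
                raise ValueError("unknown operator: %s" % op)
        return data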
We can talk about it more, if you are interested,
Cheers,
Stefan
Twitter: @Stiivi
Personal: stiivi.com
Data Brewery: databrewery.org