[okfn-labs] Fwd: Frictionless Data Vision and Roadmap

Friedrich Lindenberg friedrich at pudo.org
Tue Feb 4 19:15:09 UTC 2014


Big boys, coming through:
[1]http://www.trifacta.com/company/news/trifacta-launches-data-transformation-platform/





On Tue, Jan 28, 2014, at 08:43 AM, Stefan Urbanek wrote:

(This is a copy of my original reply to the okfn-labs, as the first one
bounced)

Hi there,


Thanks, Rufus, for plugging me in.

On 28.1.2014, at 15:45, Rufus Pollock <[2]rufus.pollock at okfn.org>
wrote:

On 23 January 2014 17:08, Kev Kirkland <[3]kev at dataunity.org> wrote:

Hi Rufus,

Great, I like the vision. The cooking metaphor really works - with data
we're generally mixing up different ingredients (data) and recipes are
often made of several smaller recipes that can be reused elsewhere.

In the next few weeks I'll be trying to create a Semantic Web vocab to
formalise some of these ideas (I need them for the internals of the
project I'm working on, to store data queries in an
implementation-independent format).


You might be interested in this piece of existing work:

[4]http://dataprotocols.org/data-query-protocol/

It is still at the draft stage and was started a little while back, but
it contains a proposal for a structured, JSON-based serialization of
queries.
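
Purely to illustrate the general idea (this is not the draft's actual
schema; the dataset and field names below are made up), a query
serialized as structured JSON could carry the source, the filters, the
selected fields and the ordering as plain data, here sketched in Python:

# A made-up illustration of a JSON-serializable query object; it does NOT
# follow the data-query-protocol draft, whose schema may differ.
import json

query = {
    "from": "gla-spending",                       # dataset to query (hypothetical name)
    "select": ["supplier", "amount"],             # fields to return
    "where": [{"field": "amount", "op": ">=", "value": 500}],
    "order_by": [{"field": "amount", "direction": "desc"}],
    "limit": 20,
}

print(json.dumps(query, indent=2))  # shareable, implementation-independent form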

At the moment the recipes for data processing are often embedded in
scripts (like Python or SQL), so it's tricky to get reuse out of them.
However, if we have a declarative way of specifying the common
operations in a dataflow, it should make things easier to understand.


I agree - one thing I've been toying with is a simple spec for
specifying the simplest kinds of operations (head, grep, delete, etc.).
An immediate motivation has been work on [5]DataPipes, where the library
interface is starting to look like this:
var dp = require('datapipes');

// load data from inUrl, write to outFile after applying the sequence of transformations
dp.transform(inUrl, outFile, [
  {
    operator: 'head',
    options: {
      number: 10 // number of rows
    }
  },
  {
    operator: 'grep',
    options: {
      regex: 'london',
      ignorecase: true
    }
  },
  {
    operator: 'delete',
    options: {
      range: '3,5:10'
    }
  }
]);


See [6]https://github.com/okfn/datapipes/blob/master/docs/dev.md.

Workflow diagrams seem to be a natural 'DSL' for data processing, so I'm
focussing on those (Directed Acyclic Graphs), similar to Cascading
workflows. If anyone else is working in this area, it would be great to
pool ideas.
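
For concreteness, here is a minimal Python sketch (node and operator
names are made up for illustration; the operators simply echo the
DataPipes example above) of what a declarative DAG description of a
dataflow could look like:

# An illustrative sketch (not any existing vocabulary) of a dataflow as a
# directed acyclic graph: nodes are named operations, edges say which node
# feeds which. A tool could validate, visualise or execute such a spec.
workflow = {
    "nodes": {
        "load":   {"operator": "source", "options": {"url": "http://example.org/data.csv"}},
        "filter": {"operator": "grep",   "options": {"regex": "london", "ignorecase": True}},
        "top10":  {"operator": "head",   "options": {"number": 10}},
        "save":   {"operator": "sink",   "options": {"path": "out.csv"}},
    },
    # edges as (from, to) pairs; no cycles allowed
    "edges": [("load", "filter"), ("filter", "top10"), ("top10", "save")],
}

# A topological sort gives one valid execution order of the graph (Python 3.9+).
from graphlib import TopologicalSorter

deps = {}
for src, dst in workflow["edges"]:
    deps.setdefault(dst, set()).add(src)

print(list(TopologicalSorter(deps).static_order()))  # ['load', 'filter', 'top10', 'save']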


This sounds really interesting - can you give more detail? I'm also
cc'ing Stefan Urbanek who I know has been building an ETL framework in
Python [7]http://bubbles.databrewery.org/  and who has done a lot of
work on OLAP (cf his recent series of [8]posts ([9]post 2) on the Labs
blog).


Many commercial tools use workflow diagrams for ETL, data processing or
mining. On the open-source side see Pentaho Kettle, for example, or the
Orange visual data mining tool (in Python). Back in the days when I was
in mobile telco we were using Oracle Warehouse Builder, worth seeing at
least for the concepts. Another nice tool for inspiration is IBM SPSS
Modeler, formerly known as SPSS Clementine – a data mining tool with
many nodes for data preparation. There are lots of them out there, with
slightly different capabilities, approaches and objectives.

(google for images of the above mentioned tools)

One of the challenges of such tools is the requirement for metadata –
the more you have, the easier it is to describe the process. Moreover, I
can't imagine easy reusability of such tools without explicit metadata.
It looks like open data is finally reaching the point of requiring
metadata (and I don't mean the semantic-web kind)... but that's for
another discussion.

Another problem with traditional ETL tools is that they push data
around – that might be undesirable with large amounts of data, or
sometimes even with smaller amounts scattered around a slow web. One
solution is to have an ETL tool that operates purely on metadata and
sends the metadata around, passing data only when required and reusing
the capabilities of the native data stores – that is the idea behind
Bubbles (still at the prototype stage, unfortunately).

I'll write about Bubbles later in more detail, but to give an idea, here
are some concrete examples of what the framework does:

* you don't have to care whether two tables are joined by a Python loop
or an SQL join – you don't have to care about the nature of the source
at all
* if you do some data filtering and the source is an SQL table, then the
output will be an SQL statement; if the source is a CSV, then the output
will be a Python filter iterator – you don't have to care, you treat
any source just as a table

It is about abstracting the datasets and the operations. The engine
decides how an operation is performed depending on the data
representation.

Why is this important? To let the data wrangler focus on the data goal,
not on the tool or on the underlying SQL or Python.
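
To make the idea concrete, here is a minimal hypothetical Python sketch
(not the actual Bubbles API) of how a single generic operation can
dispatch to a different backend depending on the representation of its
input:

# A hypothetical sketch of the dispatch idea described above -- NOT the
# actual Bubbles API. Each "table" carries a representation tag, and a
# generic operation picks a backend-specific implementation.

class SQLTable:
    """A table that lives in a database; operations should stay in SQL."""
    representation = "sql"

    def __init__(self, name):
        self.name = name


class CSVTable:
    """A table backed by CSV rows; operations run as Python iteration."""
    representation = "rows"

    def __init__(self, rows):
        self.rows = rows


def filter_rows(table, column, value):
    """Generic 'filter' operation: the caller does not care how it runs."""
    if table.representation == "sql":
        # Compose an SQL statement instead of pulling the data out of the store.
        return "SELECT * FROM {} WHERE {} = '{}'".format(table.name, column, value)
    elif table.representation == "rows":
        # Fall back to a plain Python iterator over the rows.
        return (row for row in table.rows if row.get(column) == value)
    raise TypeError("unknown representation: " + table.representation)


# The same call works for both kinds of source:
print(filter_rows(SQLTable("cities"), "name", "london"))
print(list(filter_rows(CSVTable([{"name": "london"}, {"name": "berlin"}]), "name", "london")))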


I'm especially interested in a way to encapsulate the reusable parts of
a data flow. In the cooking metaphor I guess you'd say that a recipe
can be the ingredient of another recipe.


That's the ultimate goal, however I haven't seen many examples of it in
(corporate) reality. Maybe that is because not only the data but also
the metadata are very context-specific (time, location, user, goal,
...). On the other hand, there are attempts at "standard metadata
models", for example at the industry level: a telco model, a banking
model, ... but even those are more like "recommendations" and have to be
adapted to the target environment (which is sometimes the same amount of
work as reinventing them from scratch).

If we would like to get closer to reusability, we would have to create a
layer which generates data AND metadata that can be fed into common
tools. I'm not sure yet how that would be achievable, given the current
maturity level of Open Data – we have just finished the "scraping
phase" and are entering the "ETL phase", with great diversity of data,
metadata and tools. That's the third challenge of data in the Open Data
world. In the corporate world, despite working with multiple data
sources, the number of sources is pretty limited. In the Open Data world
it is much, much wilder, much more inconsistent, much more raw...
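
One possible shape for such a layer is a Data Package-style descriptor
that publishes the data together with explicit schema metadata that
common tools can consume. A minimal sketch (the dataset and field names
are made up), written here as a Python dict that would be serialized to
datapackage.json:

# A sketch of a "data plus metadata" descriptor along the lines of a Data
# Package; the dataset and field names are illustrative only.
import json

descriptor = {
    "name": "city-populations",
    "resources": [
        {
            "path": "data/cities.csv",   # where the data lives
            "schema": {                  # metadata a tool can rely on
                "fields": [
                    {"name": "city", "type": "string"},
                    {"name": "population", "type": "integer"},
                ]
            },
        }
    ],
}

print(json.dumps(descriptor, indent=2))  # what would be published alongside the data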

Using the Semantic Web, we should have a framework for publishing
dataflow logic so we can build up libraries of common processes that can
be strung together. I can see it being useful for things like showing
how a data set can be cleaned up in an implementation-independent way.


Well, at one end you just have the code itself as the recipe :-) (i.e.
my ETL process is my code; boot that up and run it).


Exactly, Rufus! And sometimes it is much easier and more
*understandable* to write a piece of code than a processing network
(even though I am a huge proponent of them).

Now there are ways we could make that a bit more repeatable (think of
abusing Travis to run ETL jobs, or the way DataExplorer lets you run
Javascript repeatedly). But what you're talking about is abstracting
above that - e.g. having a way of speccing particular components in a
mini-language (whether semantic-web-based or otherwise). Whilst I'm
excited about this, I'm a bit wary: having gone down this path a bit
before, I think you can rapidly end up re-inventing a programming
language :-)


Well... I don't think that it is reinventing a programming language. I
see it rather as a layer above the programming language. The question
is: what is the most comfortable granularity at the script level, and
what is comfortable at the processing network level? I don't know yet,
and looking at the tools out there, they don't seem to know either.

Just to give an example: IBM SPSS Modeler (formerly SPSS Clementine) is
a data mining tool where you have nodes for machine learning, such as
neural networks or segmentation, but you also have nodes for simple data
filtering or joins. Those simple nodes are meant for light data
treatment before mining. However, they are so generic that you can build
a whole ETL process with them. Is it worth it? I guess it would be slow
and inefficient if used purely for ETL – a script producing a table or
two would be much more effective.

The advantages of a graphical language for "flow-based programming" that
I see are:

* focus on the data goal
* better understandability of the process, with immediate visual
feedback for the person modelling the process
* process abstraction – I trust the underlying engine to produce the
most effective execution (be it generated SQL or direct interpretation)

Disadvantages:

* you can't use vim to edit such a network in a nice and easy way :-)

We can talk about it more, if you are interested,

Cheers,

Stefan

Twitter: @Stiivi
Personal: [10]stiivi.com
Data Brewery: [11]databrewery.org


_______________________________________________

okfn-labs mailing list

[12]okfn-labs at lists.okfn.org

[13]https://lists.okfn.org/mailman/listinfo/okfn-labs

Unsubscribe: [14]https://lists.okfn.org/mailman/options/okfn-labs

References

1. http://www.trifacta.com/company/news/trifacta-launches-data-transformation-platform/
2. mailto:rufus.pollock at okfn.org
3. mailto:kev at dataunity.org
4. http://dataprotocols.org/data-query-protocol/
5. http://datapipes.okfn.org/
6. https://github.com/okfn/datapipes/blob/master/docs/dev.md
7. http://bubbles.databrewery.org/
8. http://okfnlabs.org/blog/2014/01/10/olap-introduction.html
9. http://okfnlabs.org/blog/2014/01/20/olap-cubes-and-logical-model.html
10. http://stiivi.com/
11. http://databrewery.org/
12. mailto:okfn-labs at lists.okfn.org
13. https://lists.okfn.org/mailman/listinfo/okfn-labs
14. https://lists.okfn.org/mailman/options/okfn-labs