[okfn-labs] Frictionless Data Vision and Roadmap

Kev Kirkland kev at dataunity.org
Tue Jan 28 16:43:56 UTC 2014


Hi Rufus and Stefan,

Thanks for the replies - there's some really good stuff there and I want to
continue the discussion. Things are hectic over the next couple of days, but
I'll try to make some time on Thursday for a detailed response.

Thanks,

Kev


On 28 January 2014 15:57, Stefan Urbanek <stefan.urbanek at gmail.com> wrote:

> Hi there,
>
> Thanks Rufus for plugging me in.
>
> On 28.1.2014, at 15:45, Rufus Pollock <rufus.pollock at okfn.org> wrote:
>
> On 23 January 2014 17:08, Kev Kirkland <kev at dataunity.org> wrote:
>
>> Hi Rufus,
>>
>> Great, I like the vision. The cooking metaphor really works - with data
>> we're generally mixing up different ingredients (data) and recipes are
>> often made of several smaller recipes that can be reused elsewhere.
>>
>> In the next few weeks I'll be trying to create a Semantic Web vocab to
>> formalise some of these ideas (I need them for the internals of the project
>> I'm working on to store data queries in an implementation independent
>> format). At the moment the
>>
>
> You might be interested in this piece of existing work:
>
> http://dataprotocols.org/data-query-protocol/
>
> It is still at the draft stage and was started a little while back, but it
> contains a proposal for a structured JSON-based serialization of queries.
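>
> Purely as an illustration - this is not the draft's actual schema, just a
> hypothetical sketch of what a structured, JSON-serializable query could look
> like, built here as a Python dict:
>
> import json
>
> # Hypothetical example only; the field names are illustrative and are not
> # taken from the data-query-protocol draft.
> query = {
>     "resource": "http://example.org/data/spending.csv",
>     "fields": ["date", "supplier", "amount"],
>     "filters": [
>         {"field": "region", "op": "==", "value": "London"},
>         {"field": "amount", "op": ">", "value": 10000}
>     ],
>     "sort": [{"field": "date", "order": "desc"}],
>     "size": 100
> }
>
> print(json.dumps(query, indent=2))  # serialize for storing or sending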
>
>
>> recipes for data processing are often embedded in scripts (like Python or
>> SQL), so it's tricky to get reuse out of them. However, if we have a
>> declarative way of specifying the common operations in a dataflow, it should
>> make things easier to understand.
>>
>
> I agree - one thing I've been toying with is a simple spec for specifying
> the simplest kinds of operations (head, grep, delete, etc.). An immediate
> motivation has been work on DataPipes <http://datapipes.okfn.org/>, where
> the library interface is starting to look like this:
>
> var dp = require('datapipes');
> // load data from inUrl, write to outFile after applying the sequence of transformations
> dp.transform(inUrl, outFile, [
>   {
>     operator: 'head',
>     options: {
>       number: 10 // number of rows
>     }
>   },
>   {
>     operator: 'grep',
>     options: {
>       regex: 'london',
>       ignorecase: true
>     }
>   },
>   {
>     operator: 'delete',
>     options: {
>       range: '3,5:10'
>     }
>   }
> ]);
>
>
> See https://github.com/okfn/datapipes/blob/master/docs/dev.md.
>
>
>> Workflow diagrams seem to be a natural 'DSL' for data processing so I'm
>> focussing on those (Directed Acyclic Graphs), similar to Cascading
>> workflows. If anyone else is working in this area, it would be great to
>> pool ideas.
>>
>
> This sounds really interesting - can you give more detail? I'm also cc'ing
> Stefan Urbanek who I know has been building an ETL framework in Python
> http://bubbles.databrewery.org/ and who has done a lot of work on OLAP
> (cf. his recent series of posts <http://okfnlabs.org/blog/2014/01/10/olap-introduction.html>
> (post 2 <http://okfnlabs.org/blog/2014/01/20/olap-cubes-and-logical-model.html>) on
> the Labs blog).
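>
> Just to sketch how I picture that (a hypothetical toy example, not DataPipes,
> Bubbles or any existing tool's format): a dataflow could be declared as a DAG
> of named operations, which a tiny engine resolves and runs in dependency order:
>
> # Hypothetical sketch: each node names an operation, its options, and the
> # upstream nodes whose output it consumes - together they form the DAG.
> dataflow = {
>     "spend":  {"op": "csv_load", "options": {"url": "spend.csv"}, "inputs": []},
>     "codes":  {"op": "csv_load", "options": {"url": "codes.csv"}, "inputs": []},
>     "joined": {"op": "join",   "options": {"key": "dept_code"}, "inputs": ["spend", "codes"]},
>     "london": {"op": "filter", "options": {"regex": "london"},  "inputs": ["joined"]},
> }
>
> def run(name, flow, done=None):
>     """Depth-first: run a node's inputs first, then the node itself."""
>     done = set() if done is None else done
>     if name in done:
>         return
>     for upstream in flow[name]["inputs"]:
>         run(upstream, flow, done)
>     node = flow[name]
>     print("running", name, "->", node["op"], node["options"])  # a real engine would dispatch here
>     done.add(name)
>
> run("london", dataflow)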
>
>
>
> Many commercial tools use workflow diagrams for ETL, data processing
> or mining. On the open-source side, see Pentaho Kettle, for example, or the
> Orange visual data mining tool (in Python). Back when I was in mobile
> telco we were using Oracle Data Warehouse Builder, which is worth seeing at
> least for the concepts. Another nice tool for inspiration is IBM Modeler,
> formerly known as SPSS Clementine - a data mining tool with many nodes for
> data preparation. There are lots of them out there, with slightly different
> capabilities, approaches and objectives.
>
> (Google for images of the above-mentioned tools.)
>
> One of the challenges of such tools is the requirement for metadata - the more
> you have, the easier it is to describe the process. Moreover, I can't imagine
> easy reusability of such tools without explicit metadata. It looks like
> open data is finally reaching the point of requiring metadata (and I
> don't mean the semantic-web kind)... but that's for another discussion.
>
> Another problem with traditional ETL tools is that they push data
> around - that might be undesirable with large amounts of data, or sometimes
> even with smaller amounts scattered around a slow web. One of the solutions
> is to have an ETL tool that operates purely on metadata and sends
> the metadata around, passing data only when required and reusing the
> capabilities of native data stores - that is the idea behind Bubbles (still
> at the prototype stage, unfortunately).
>
> I'll write about Bubbles in more detail later, but to give an idea,
> here are some concrete examples of what the framework does:
>
> * you don't have to care whether two tables are joined by a Python loop or
> a SQL join - you don't have to care about the nature of the source at all
> * if you do some data filtering and the source is a SQL table, then the
> output will be a SQL statement; if the source is a CSV, then the output will
> be a Python filter iterator - you don't have to care, you treat any source
> just as a table
>
> It is about abstracting the datasets and operations. The engine decides
> how the operation is performed depending on the data representation.
>
> Why is this important? To let the data wrangler focus on the data goal,
> not on the tool or on the SQL or Python language.
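>
> To illustrate the idea (a rough conceptual sketch in Python only - the real
> Bubbles interfaces differ), the same logical "filter" operation can be
> carried out differently depending on how the table is represented:
>
> import csv
>
> # Conceptual sketch, not the actual Bubbles API.
> class SQLTable:
>     """A table backed by a SQL store: operations compose SQL text."""
>     def __init__(self, name):
>         self.statement = "SELECT * FROM %s" % name
>
> class CSVTable:
>     """A table backed by a CSV file: operations compose Python iterators."""
>     def __init__(self, path):
>         self.path = path
>     def rows(self):
>         with open(self.path) as f:
>             for row in csv.DictReader(f):
>                 yield row
>
> def filter_equal(table, field, value):
>     """The same logical operation; the representation decides how it runs."""
>     if isinstance(table, SQLTable):
>         # Output is just another SQL statement - no data gets moved.
>         table.statement += " WHERE %s = '%s'" % (field, value)
>         return table
>     # Output is a lazy Python iterator over the CSV rows.
>     return (row for row in table.rows() if row.get(field) == value)
>
> # filter_equal(SQLTable("payments"), "city", "London")     -> composed SQL
> # filter_equal(CSVTable("payments.csv"), "city", "London") -> Python iterator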
>
>
>> I'm especially interested in a way to encapsulate the reusable parts of a
>> data flow. In the cooking metaphor I guess you'd say that a recipe can be
>> the ingredient of another recipe.
>>
>
> That's the ultimate goal, however I haven't seen many
> examples of it in (corporate) reality. Maybe that is because not only the
> data but also the metadata are very context-specific (time, location, user,
> goal, ...). On the other hand, there are attempts at a kind of "standard
> metadata model", for example at the industry level: a telco model, a banking
> model, ... but even those are only "recommendations" and have to be
> adapted to the target environment (which is sometimes the same amount of
> work as reinventing them from scratch).
>
> If we want to get closer to that kind of reusability, we would have to create
> a layer that generates data AND metadata which can be fed into common
> tools. I'm not sure yet how that would be achievable, given the current
> maturity level of Open Data - we have just finished the "scraping phase" and
> are entering the "ETL phase", with great diversity of data, metadata and
> tools. That's the third challenge of data in the Open Data world. In the
> corporate world, despite working with multiple data sources, the number of
> sources is pretty limited. In the Open Data world it is much, much wilder,
> much more inconsistent, much more raw...
>
>> Using the Semantic Web we should have a framework for publishing dataflow
>> logic so we can build up libraries of common processes that can be strung
>> together. I can see it being useful for things like showing how a data set
>> can be cleaned up in an implementation-independent way.
>>
>
> Well, at one end you just have the code itself as the recipe :-) (i.e. my ETL
> process is my code - boot that up and run it).
>
>
> Exactly, Rufus! And sometimes it is much easier and more *understandable*
> to write a piece of code than a processing network (even though I am a huge
> proponent of them).
>
> Now there are ways we could make that a bit more repeatable (think of
> abusing Travis to run ETL jobs, or the way DataExplorer lets you run
> Javascript repeatedly). But what you're talking about is abstracting above
> that - e.g. having a way of speccing particular components in a mini-language
> (whether semantic-web based or otherwise). Whilst I'm excited about this,
> I'm a bit wary: having gone down this path a bit before, I think you can
> rapidly end up re-inventing a programming language :-)
>
>
> Well... I don't think that it is reinventing a programming language. I see
> it rather as a layer above the programming language. The question is: what
> is the most comfortable granularity on the script level and what is
> comfortable on the processing network level? I don't know yet, and if I
> look at the tools around, they don't either.
>
> Just to give an example: IBM Modeler (SPSS Clementine) is a data mining
> tool where you have nodes for machine learning, such as neural networks or
> segmentation nodes, but you also have nodes for simple data filtering or
> joins. Those simple nodes are meant for light data treatment before
> mining. However, they are so generic that you can build a whole ETL process
> with them. Is it worth it? I guess it would be slow and inefficient if used
> only for ETL purposes - a script producing a table or two would be
> much more effective.
>
> The advantages of a graphical language for "flow-based programming" that I
> see are:
>
> * focus on the data goal
> * better understandability of the process, with immediate visual feedback
> for the person modelling the process
> * process abstraction - I trust the underlying engine to produce the most
> effective execution (be it generating SQL or interpreting the flow directly)
>
> Disadvantages:
>
> * you can't use vim to edit such a network in a nice and easy way :-)
>
> We can talk about it more, if you are interested,
>
> Cheers,
>
> Stefan
>
> *Twitter:* @Stiivi
> *Personal:* stiivi.com
> *Data Brewery:* databrewery.org
>
>


-- 
www.dataunity.org
twitter: @data_unity