[okfn-labs] Frictionless Data Vision and Roadmap
Rufus Pollock
rufus.pollock at okfn.org
Tue Jan 28 14:45:13 UTC 2014
On 23 January 2014 17:08, Kev Kirkland <kev at dataunity.org> wrote:
> Hi Rufus,
>
> Great, I like the vision. The cooking metaphor really works - with data
> we're generally mixing up different ingredients (data) and recipes are
> often made of several smaller recipes that can be reused elsewhere.
>
> In the next few weeks I'll be trying to create a Semantic Web vocab to
> formalise some of these ideas (I need them for the internals of the project
> I'm working on to store data queries in an implementation independent
> format). At the moment the
>
You might be interested in this piece of existing work:
http://dataprotocols.org/data-query-protocol/
It is still at draft stage and was started a little while back but it
contained a proposal for a structured JSON-based serialization of queries.
> recipes for data processing are often embedded in scripts (like Python or
> SQL) so it's tricky to get reuse out of them. However if we have a
> declarative way of specifying the common operations in a dataflow it should
> make things easier to understand.
>
I agree - one thing I've been toying with is a simple spec for specifying
the simplest kinds of operations (head, grep, delete, etc). An immediate
motivation has been work in DataPipes <http://datapipes.okfn.org/> where
the library interface is starting to look like this:
var dp = require('datapipes');

// load data from inUrl, write to outFile after applying the sequence
// of transformations
dp.transform(inUrl, outFile, [
  {
    operator: 'head',
    options: {
      number: 10 // number of rows
    }
  },
  {
    operator: 'grep',
    options: {
      regex: 'london',
      ignorecase: true
    }
  },
  {
    operator: 'delete',
    options: {
      range: '3,5:10'
    }
  }
]);
See https://github.com/okfn/datapipes/blob/master/docs/dev.md.
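To make the idea concrete, here is a rough sketch of how such a declarative operator list might be interpreted over in-memory rows. This is not the DataPipes implementation: `applySpec`, `parseRange` and the `operators` table are hypothetical names, rows are assumed to be plain strings, and the `delete` range is assumed to be zero-based and inclusive.

```javascript
// Hypothetical in-memory interpreter for an operator spec (not DataPipes itself).
// Assumptions: rows are plain strings; 'delete' ranges are zero-based and inclusive.

// Turn a range string such as '3,5:10' into a Set of row indexes.
function parseRange(range) {
  const indexes = new Set();
  for (const part of range.split(',')) {
    const [start, end = start] = part.split(':').map(Number);
    for (let i = start; i <= end; i++) indexes.add(i);
  }
  return indexes;
}

// One function per operator name; each takes rows and options and returns new rows.
const operators = {
  head: (rows, opts) => rows.slice(0, opts.number),
  grep: (rows, opts) => {
    const re = new RegExp(opts.regex, opts.ignorecase ? 'i' : '');
    return rows.filter((row) => re.test(row));
  },
  delete: (rows, opts) => {
    const drop = parseRange(opts.range);
    return rows.filter((_, i) => !drop.has(i));
  },
};

// Apply each step of the spec in order, feeding the output of one into the next.
function applySpec(rows, spec) {
  return spec.reduce((current, step) => {
    const op = operators[step.operator];
    if (!op) throw new Error('Unknown operator: ' + step.operator);
    return op(current, step.options || {});
  }, rows);
}

const rows = ['London,1', 'Paris,2', 'london,3', 'Berlin,4'];
const result = applySpec(rows, [
  { operator: 'grep', options: { regex: 'london', ignorecase: true } },
  { operator: 'delete', options: { range: '0' } },
]);
console.log(result);
```

The point of the sketch is that the spec itself is plain data, so it can be stored, shared or validated independently of whichever code ends up running it.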
> Workflow diagrams seem to be a natural 'DSL' for data processing so I'm
> focussing on those (Directed Acyclic Graphs), similar to Cascading
> workflows. If anyone else is working in this area, it would be great to
> pool ideas.
>
This sounds really interesting - can you give more detail? I'm also cc'ing
Stefan Urbanek who I know has been building an ETL framework in Python
http://bubbles.databrewery.org/ and who has done a lot of work on OLAP (cf
his recent series of posts
<http://okfnlabs.org/blog/2014/01/10/olap-introduction.html> (post 2
<http://okfnlabs.org/blog/2014/01/20/olap-cubes-and-logical-model.html>)
on the Labs blog).
> I'm especially interested in a way to encapsulate the reusable parts of a
> data flow. In the cooking metaphor I guess you'd say that a recipe can be
> the ingredient of another recipe. Using Semantic Web we should have a
> framework for publishing dataflow logic so we can build up libraries of
> common processes that can be strung together. I can see it being useful for
> things like showing how a data set can be cleaned up in an implementation
> independent way.
>
Well, at one end you just have the code itself as the recipe :-) (i.e. my ETL
process is my code, boot that up and run it). Now there are ways we could
make that a bit more repeatable (think of abusing Travis to run ETL jobs or
the way DataExplorer lets you run JavaScript repeatedly). But what you're
talking about is abstracting above that - e.g. having a way of speccing
particular components in a mini-language (whether semantic web based or
otherwise). Whilst I'm excited about this, I'm a bit wary: having gone
down this path a bit before, I think you can rapidly end up re-inventing a
programming language :-)
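For what it's worth, the "recipe as an ingredient of another recipe" idea can be sketched without much machinery if a recipe is just a list of steps and a step may refer to another recipe by name. Everything here is an assumption for illustration: the `recipe` key, the `expand` function and the `library` object are hypothetical, not part of any existing spec.

```javascript
// Hypothetical: a recipe is an array of steps; a step may reference another
// recipe by name via a 'recipe' key. References are expected to be acyclic;
// expand() throws if it detects a cycle or an unknown name.
function expand(recipe, library, seen = []) {
  return recipe.flatMap((step) => {
    if (!step.recipe) return [step]; // ordinary operator step, keep as-is
    if (seen.includes(step.recipe)) {
      throw new Error('Cyclic recipe: ' + step.recipe);
    }
    const sub = library[step.recipe];
    if (!sub) throw new Error('Unknown recipe: ' + step.recipe);
    return expand(sub, library, seen.concat(step.recipe));
  });
}

// A small library of named, reusable recipes.
const library = {
  londonOnly: [
    { operator: 'grep', options: { regex: 'london', ignorecase: true } },
  ],
};

// A recipe that uses another recipe as one of its ingredients.
const flat = expand(
  [{ operator: 'head', options: { number: 10 } }, { recipe: 'londonOnly' }],
  library
);
console.log(flat.map((step) => step.operator));
```

Even this small version needs name lookup and cycle detection, which is a taste of how quickly a mini-language starts to grow language features.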
> Sorry I can't make the Labs Hangout, hope it goes well.
>
It did go well :-) Hope you can come to the next one and you'd be very
welcome to present (it's on the 3rd Thursday of the month - see the events
calendar <http://okfnlabs.org/events/>)
Rufus