[okfn-labs] Datapackage + Bubbles Demo

Stefan Urbanek stefan.urbanek at gmail.com
Sat Feb 22 18:13:37 UTC 2014


On 21 Feb 2014, at 15:55, Rufus Pollock <rufus.pollock at okfn.org> wrote:

> On 20 February 2014 00:58, Stefan Urbanek <stefan.urbanek at gmail.com> wrote:
> Hi,
> 
> Here is a short demo of Bubbles[1] using Data Package collection store:
> 
> 	https://gist.github.com/Stiivi/9104719
> 
> This is fantastic! Not only is this directly useful, but it is also a great example of the tooling integration we want more of.
> 
> I've just created a wiki page here: https://github.com/okfn/data.okfn.org/wiki where we can list examples like this (and then migrate them as merited to data.okfn.org main site - e.g. at http://data.okfn.org/tools)
> 

Added.

> In the Gist you will find the example Python code, a list of the required datasets and their modifications, and stripped example output.
> 
> The example is artificial, but at least shows:
> 
> * how datapackage store is used – how to access datapackage resources as data objects
> * how Pipeline is constructed
> * simple master-detail join
> * aggregation with composite key
> 
> The “Data Package collection store” is a directory with datapackages in it. Data objects are named “PACKAGE.RESOURCE”; if a package has only one resource, then just “PACKAGE”.
> 
> I note that in the spec for dpm (data package manager tool) installed data packages go in a subdirectory `datapackages` - see https://github.com/okfn/dpm/issues/3 for more.

Good to know, I’ll add that.
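
The naming rule for data objects described above can be sketched roughly as follows. This is only an illustration, not the Bubbles implementation: the function name `list_object_names` is hypothetical, and it assumes that each package is a subdirectory containing a `datapackage.json` with a `resources` list, and that the directory name serves as the package name.

```python
import json
import os


def list_object_names(store_path):
    """Return data object names for every data package under store_path."""
    names = []
    for entry in sorted(os.listdir(store_path)):
        descriptor = os.path.join(store_path, entry, "datapackage.json")
        if not os.path.isfile(descriptor):
            continue  # not a data package directory
        with open(descriptor) as f:
            package = json.load(f)
        resources = package.get("resources", [])
        if len(resources) == 1:
            # Single resource: the package name alone identifies it
            names.append(entry)
        else:
            for index, resource in enumerate(resources):
                # Assumption: fall back to the position if a resource has no name
                resource_name = resource.get("name", str(index))
                names.append("%s.%s" % (entry, resource_name))
    return names
```

Called on a store directory, it returns names such as "PACKAGE" for single-resource packages and "PACKAGE.RESOURCE" for the rest.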

>  
> Note that if the same code was run on top of a SQL database source, then SQL queries (or maybe just one in this case) would be composed and executed instead of Python iterators. Transparently.
> 
> So you don't actually load the data package into a DB to do this? That's interesting - it would also be nice to autoload data packages into the relational DB (see relational databases in http://data.okfn.org/roadmap)
>  

In the original example: not at all! See/try this new gist, where I added a new SQL source and copied the data into SQL database tables BEFORE processing them:

	https://gist.github.com/Stiivi/9159092 (pull the latest Bubbles GitHub master before executing it)

Comment out the “create” lines for comparison.

In the Python version, 3 iterations over the data happen:

1. the branched country-codes CSV is iterated when constructing the JOIN iterator (not executed at this time)
2. iteration of the JOIN for the aggregation
3. iteration of the sort-wrapped aggregation for pretty_print

In the SQL version, 2 inserts happen (ignore those, we only need them for this example) and only 1 SELECT statement is executed.
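
The difference between the two execution paths can be sketched with plain Python and the standard sqlite3 module. This is not the code from the gist and does not use the Bubbles API; the table names, column names and sample rows are made up for illustration. The first function makes three passes over the data with iterators, the second performs the two setup inserts and then a single SELECT.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: (code, region) and (code, year, amount)
countries = [("SK", "Europe"), ("FR", "Europe"), ("JP", "Asia")]
amounts = [("SK", 2013, 10), ("FR", 2013, 20), ("JP", 2013, 5), ("SK", 2014, 7)]


def aggregate_with_iterators(countries, amounts):
    # Pass 1: the detail table is read once to build the join lookup
    region_of = dict(countries)
    # Pass 2: joined rows are produced lazily and consumed by the aggregation
    joined = ((region_of[code], year, amount) for code, year, amount in amounts)
    totals = defaultdict(int)
    for region, year, amount in joined:
        totals[(region, year)] += amount  # composite key: (region, year)
    # Pass 3: the aggregated rows are sorted for printing
    return sorted((region, year, total) for (region, year), total in totals.items())


def aggregate_with_sql(countries, amounts):
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE countries (code TEXT, region TEXT)")
    connection.execute("CREATE TABLE amounts (code TEXT, year INTEGER, amount INTEGER)")
    # The two inserts only set up the example
    connection.executemany("INSERT INTO countries VALUES (?, ?)", countries)
    connection.executemany("INSERT INTO amounts VALUES (?, ?, ?)", amounts)
    # Join, aggregation and sort are composed into one SELECT
    rows = connection.execute(
        "SELECT c.region, a.year, SUM(a.amount) "
        "FROM amounts a JOIN countries c ON c.code = a.code "
        "GROUP BY c.region, a.year "
        "ORDER BY c.region, a.year"
    ).fetchall()
    connection.close()
    return rows
```

Both functions return the same rows; only the place where the work happens differs.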

More details are in this spreadsheet:

	https://www.icloud.com/iw/#numbers/BAJSyZKOCKjK5w74_LKBMml-wO1VzNrExO6E/Bubbles_-_Datapackage_Demo

That is the point of bubbles: keep the data in their most natural form, don’t move them around if you do not have to. If we can compose a SQL statement without pulling the data out, we just do it.
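
One way to picture this choice is a small dispatch on what the operands have in common. The sketch below is hypothetical and is not Bubbles' actual dispatch code: the `DataObject` class, the `representations` attribute and the `choose_representation` function are names invented here for illustration.

```python
class DataObject:
    """Hypothetical data object that advertises how its data can be accessed."""

    def __init__(self, name, representations):
        self.name = name
        self.representations = representations  # e.g. ["sql", "rows"]


def choose_representation(objects, preferred=("sql", "rows")):
    """Pick the first representation, in order of preference, shared by all objects."""
    for representation in preferred:
        if all(representation in obj.representations for obj in objects):
            return representation
    raise ValueError("objects share no common representation")
```

If every operand can be expressed as SQL, the operation stays in the database; as soon as one operand is only available as rows, the sketch falls back to iterating in Python.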

> My observation during the development: The Data Package and Simple Data Format are great. They just need a bit of refinement and confrontation with real uses (by tools, not human eyes). They need to focus more on machine-processable and easy-to-use metadata.
> 
> That's good to hear and there are definitely improvements to be made. This kind of usage is exactly what helps improve!
> 

Will try to do more. The standard has big potential.

Cheers,

Stefan

Twitter: @Stiivi
Personal: stiivi.com
Data Brewery: databrewery.org
