[data-protocols] Civic Knowledge Data Bundles Initial Code Release

Eric Busboom eric at clarinova.com
Tue Jul 3 19:30:46 BST 2012


I'm working on a data packaging format called Data Bundles ( http://clarinova.com/bundles ). This package format has some overlap with CKAN Data Packages, but with a different set of requirements. (Our Bundles are designed to be automatically installed in a relational data warehouse.) I think we can harmonize our Bundles with Data Packages, or at least allow generating a Package from our Bundle source code.

(Sorry about the name; I recently noticed that Data Packages also uses the name "Bundles". We'll change our name so it doesn't conflict, once I think of something better.)

The first release of the code is available for preview at: 

	http://pypi.python.org/pypi/databundles/0.0.6

This is really early code and implements only a few of the features, but it will give you a sense of our approach.

You can get the code for some of the bundles we are working on packaging at: 

	https://github.com/clarinova/civicdata

This code includes bundles for the US Census Summary File 1, State GDP from the US Bureau of Economic Analysis, and some test bundles. Here is an example of building the BEA file after unpacking the bundle code from GitHub:

$ cd civicdata/bea.gov/bea.gov-metro_gdp-orig
$ python bundle.py
LOG:  ---- Preparing ----
LOG:  Downloading http://bea.gov/regional/zip/GDPMetro.zip
LOG:  Extracting/Volumes/Storage/proj/github.com/civicdata/bea.gov/bea.gov-metro_gdp-orig/downloads/allgmp.csv from /Volumes/Storage/proj/github.com/civicdata/bea.gov/bea.gov-metro_gdp-orig/downloads/http%3A%2F%2Fbea.gov%2Fregional%2Fzip%2FGDPMetro.zip
LOG:  ---- Done Preparing ----
LOG:  ---- Build ---
LOG:  Extracting/Volumes/Storage/proj/github.com/civicdata/bea.gov/bea.gov-metro_gdp-orig/downloads/allgmp.csv from /Volumes/Storage/proj/github.com/civicdata/bea.gov/bea.gov-metro_gdp-orig/downloads/http%3A%2F%2Fbea.gov%2Fregional%2Fzip%2FGDPMetro.zip
LOG:  ---- Done Building ---
LOG:  ---- Install ---
LOG:  ---- Done Installing ---
LOG:  ---- Skipping Submit ---- 

Now, the bundle is in the build directory. It's a sqlite3 file: 

$ sqlite3 build/bea.gov/metro_gdp-orig-a7d9-r1.db
sqlite> .tables
columns     datasets    metro_gdp   tables    
config      files       partitions
sqlite> select geoname, 2001, 20010 from metro_gdp limit 5;
U.S. Metropolitan Portion|2001|20010
U.S. Metropolitan Portion|2001|20010
U.S. Metropolitan Portion|2001|20010
U.S. Metropolitan Portion|2001|20010
U.S. Metropolitan Portion|2001|20010

The metadata is in the datasets, columns, tables and config tables. There isn't much in this file, but the design allows for a lot of information on each table and column.
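
For anyone who would rather look inside a built bundle from Python than from the sqlite3 shell, here is a minimal sketch that lists every table in the file along with its column names. It uses only the standard sqlite3 module and makes no assumptions about the bundle schema; the function name describe_bundle is mine and is not part of the databundles package.

```python
import sqlite3


def describe_bundle(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        schema = {}
        for table in tables:
            # Quote the identifier so unusual table names do not break the PRAGMA.
            quoted = '"' + table.replace('"', '""') + '"'
            # Each row is (cid, name, type, notnull, dflt_value, pk).
            info = conn.execute("PRAGMA table_info(%s)" % quoted).fetchall()
            schema[table] = [col[1] for col in info]
        return schema
    finally:
        conn.close()


if __name__ == "__main__":
    import sys
    # Usage: python describe_bundle.py path/to/bundle.db
    for table, columns in describe_bundle(sys.argv[1]).items():
        print(table, columns)
```

Pointed at the .db file in the build directory, this should print one line per table, including the metadata tables mentioned above.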

The next releases will add a remote repository, install dependencies, and discovery. We'd also like to explore having our build process generate CKAN Data Packages, which would primarily involve creating a datapackage.json file from our bundle.yaml file, and dumping the sqlite database tables as .csv files into the /data directory of the bundle.
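
To make that idea concrete, here is a rough sketch of what such an export step might look like. None of this exists in the databundles code yet. The function name export_data_package is hypothetical, the metadata is passed in as a plain dict (in practice it would be read from bundle.yaml), and the descriptor keys "resources", "name" and "path" are my assumption about what the Data Package format expects and should be checked against the spec. It is written for Python 3 and uses only the standard library.

```python
import csv
import json
import os
import sqlite3


def export_data_package(db_path, out_dir, metadata, tables):
    """Dump the named tables to out_dir/data/*.csv and write a descriptor.

    metadata is a plain dict standing in for values from bundle.yaml.
    The descriptor layout is an assumption, not a verified spec mapping.
    """
    data_dir = os.path.join(out_dir, "data")
    os.makedirs(data_dir, exist_ok=True)

    conn = sqlite3.connect(db_path)
    resources = []
    try:
        for table in tables:
            quoted = '"' + table.replace('"', '""') + '"'
            cursor = conn.execute("SELECT * FROM %s" % quoted)
            # cursor.description holds one tuple per column; item 0 is the name.
            header = [col[0] for col in cursor.description]
            csv_path = os.path.join(data_dir, table + ".csv")
            with open(csv_path, "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                writer.writerow(header)
                writer.writerows(cursor)
            resources.append({"name": table, "path": "data/" + table + ".csv"})
    finally:
        conn.close()

    descriptor = dict(metadata)
    descriptor["resources"] = resources
    with open(os.path.join(out_dir, "datapackage.json"), "w",
              encoding="utf-8") as f:
        json.dump(descriptor, f, indent=2)
    return descriptor
```

A call such as export_data_package("build/bea.gov/metro_gdp-orig-a7d9-r1.db", "package", {"name": "metro_gdp"}, ["metro_gdp"]) would leave the metadata tables out of the package; whether they belong in the descriptor instead is one of the things I'd like to discuss.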

I'd appreciate hearing your comments, and I'd like to open a discussion about how to work with the CKAN format and the CKAN repository.

eric. 

--------------------------------------------------------------------------------------------------
Eric Busboom, CEO, Clarinova                                               (858) 386-4134













