[data-protocols] Civic Knowledge Data Bundles Initial Code Release
Eric Busboom
eric at clarinova.com
Tue Jul 3 19:30:46 BST 2012
I'm working on a data packaging format called Data Bundles ( http://clarinova.com/bundles ). This package format overlaps somewhat with CKAN Data Packages, but has a different set of requirements. ( Our Bundles are designed to be automatically installed in a relational data warehouse. ) I think we can harmonize our Bundles with Data Packages, or at least allow generating a Package from our Bundle source code.
( Sorry about the name; I only recently noticed that Data Packages also uses the name "Bundles". We'll change our name to avoid the conflict once I think of something better. )
The first release of the code is available for preview at:
http://pypi.python.org/pypi/databundles/0.0.6
This is really early code, and it implements only a few of the features, but it will give you a sense of our approach.
You can get the code for some of the bundles we are working on packaging at:
https://github.com/clarinova/civicdata
This code includes bundles for the US Census Summary File 1, State GDP from the US Bureau of Economic Analysis, and some test bundles. Here is an example of building the BEA bundle, after unpacking the bundle code from GitHub:
$ cd civicdata/bea.gov/bea.gov-metro_gdp-orig
$ python bundle.py
LOG: ---- Preparing ----
LOG: Downloading http://bea.gov/regional/zip/GDPMetro.zip
LOG: Extracting/Volumes/Storage/proj/github.com/civicdata/bea.gov/bea.gov-metro_gdp-orig/downloads/allgmp.csv from /Volumes/Storage/proj/github.com/civicdata/bea.gov/bea.gov-metro_gdp-orig/downloads/http%3A%2F%2Fbea.gov%2Fregional%2Fzip%2FGDPMetro.zip
LOG: ---- Done Preparing ----
LOG: ---- Build ---
LOG: Extracting/Volumes/Storage/proj/github.com/civicdata/bea.gov/bea.gov-metro_gdp-orig/downloads/allgmp.csv from /Volumes/Storage/proj/github.com/civicdata/bea.gov/bea.gov-metro_gdp-orig/downloads/http%3A%2F%2Fbea.gov%2Fregional%2Fzip%2FGDPMetro.zip
LOG: ---- Done Building ---
LOG: ---- Install ---
LOG: ---- Done Installing ---
LOG: ---- Skipping Submit ----
Now, the bundle is in the build directory. It's a sqlite3 file:
$ sqlite3 build/bea.gov/metro_gdp-orig-a7d9-r1.db
sqlite> .tables
columns datasets metro_gdp tables
config files partitions
sqlite> select geoname, 2001, 20010 from metro_gdp limit 5;
U.S. Metropolitan Portion|2001|20010
U.S. Metropolitan Portion|2001|20010
U.S. Metropolitan Portion|2001|20010
U.S. Metropolitan Portion|2001|20010
U.S. Metropolitan Portion|2001|20010
The metadata is in the datasets, columns, tables and config tables. There isn't much in this file yet, but the design allows for a lot of information on each table and column.
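Because the bundle is an ordinary SQLite file, its tables can also be inspected from Python with the standard sqlite3 module. The following is only a minimal sketch: the helper name describe_tables and the stand-in table with its "value" column are hypothetical, and nothing here assumes the actual schema of the metadata tables.

```python
import sqlite3


def describe_tables(conn):
    """Return {table_name: [column_name, ...]} for every table in the database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    result = {}
    for name in names:
        # Quote the identifier so unusual table names do not break the PRAGMA.
        quoted = '"' + name.replace('"', '""') + '"'
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        info = conn.execute("PRAGMA table_info(%s)" % quoted)
        result[name] = [col[1] for col in info]
    return result


if __name__ == "__main__":
    # Stand-in database with a hypothetical table; a real bundle would be
    # opened by passing the path of the .db file to sqlite3.connect().
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metro_gdp (geoname TEXT, value REAL)")
    for table, columns in describe_tables(conn).items():
        print(table, columns)
```

Pointing sqlite3.connect() at the file in the build directory instead of ":memory:" would list the same tables that .tables shows, along with their columns.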
The next releases will add a remote repository, dependency installation, and discovery. We'd also like to explore having our build process generate CKAN Data Packages, which would primarily involve creating a datapackage.json file from our bundle.yaml file, and dumping the sqlite database tables as .csv files into the /data directory of the bundle.
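A rough sketch of what that export step could look like, using only the Python 3 standard library. This is not code from the databundles release: the function name export_package is hypothetical, the metadata is passed in as a plain dict instead of being parsed from bundle.yaml, and the "name", "resources" and "path" keys written to datapackage.json are assumptions about the Data Package layout, not a checked reading of the spec.

```python
import csv
import json
import os
import sqlite3


def export_package(db_path, out_dir, metadata):
    """Dump every table to data/<table>.csv and write datapackage.json.

    `metadata` is a plain dict standing in for values read from bundle.yaml.
    """
    data_dir = os.path.join(out_dir, "data")
    os.makedirs(data_dir, exist_ok=True)

    resources = []
    conn = sqlite3.connect(db_path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        for table in tables:
            quoted = '"' + table.replace('"', '""') + '"'
            cursor = conn.execute("SELECT * FROM %s" % quoted)
            csv_path = os.path.join(data_dir, table + ".csv")
            with open(csv_path, "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                # Header row comes from the column names of the result set.
                writer.writerow([d[0] for d in cursor.description])
                writer.writerows(cursor)
            resources.append({"name": table, "path": "data/%s.csv" % table})
    finally:
        conn.close()

    package = dict(metadata)
    package["resources"] = resources
    with open(os.path.join(out_dir, "datapackage.json"), "w", encoding="utf-8") as f:
        json.dump(package, f, indent=2)
    return package
```

Calling export_package("build/bea.gov/metro_gdp-orig-a7d9-r1.db", "package", {"name": "metro_gdp"}) would write one .csv per table, including the metadata tables; a real implementation would probably want to filter those out.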
I'd appreciate hearing your comments, and opening a discussion about how to work with the CKAN format and the CKAN repository.
eric.
--------------------------------------------------------------------------------------------------
Eric Busboom, CEO, Clarinova (858) 386-4134