[data-protocols] New Data Package Format
Eric Busboom
eric at clarinova.com
Fri Jun 15 17:54:29 BST 2012
Hi All,
Rufus asked me to move this discussion over to this list from the CKAN dev list -- apologies for the repost -- so, I'll repost the original message and then reply with a response to Rufus' reply. Hope that works out ok for this list.
---
I'm working on a related project that would benefit from and can contribute to CKAN, so I'd like to introduce myself and the project.
Short Story: We're building a data warehouse for public data, and part of the project is a data package format that is similar to the CKAN DataPackage. I'd like to swap experience with the two formats.
The project, Civic Knowledge, is creating data warehouses for public datasets, with an early focus on investigative journalists. By "data warehouse" we mean the formal definition of a large database with a structure that is specifically designed for reporting and analysis, in the style of Inmon and Kimball. (We're mostly Kimballites.)
The promise of this project is that journalists can visit the Civic Knowledge website and immediately start issuing queries on a wide range of linked public data sets, quickly answering questions such as "is the rate of nursing home violations correlated with the average income of an area, after controlling for crime rate?"
Users will also be able to get direct access to the database -- they will get the server, username and password for a Postgres database -- and can hook up Tableau, qGIS, Navicat, or any other reporting tool that can connect to a Postgres database.
Here is the corporate-speak overview:
http://www.clarinova.com/civic-knowledge-overview
To support this project we are also creating a data format. Our format has some particular requirements, which include being able to break up a single dataset into multiple partitions. Just one of the 9 US Census datasets is about 80GB, unmanageably large for most users, so the dataset gets partitioned into about 2,000 files, requiring special features to manage.
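As a rough back-of-the-envelope check on those numbers, here is a small sketch. The function name is made up for illustration and is not part of the bundle spec, and it assumes decimal units (1 GB = 1000 MB) and roughly even partitions, which the real partitions may not be.

```python
def average_partition_size_mb(total_gb: float, partitions: int) -> float:
    """Average partition size in MB, assuming decimal units (1 GB = 1000 MB)
    and evenly sized partitions. Hypothetical helper, for illustration only."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    return total_gb * 1000 / partitions


# Approximate figures from the paragraph above: ~80GB split into ~2,000 files.
print(average_partition_size_mb(80, 2000))
```

So each file averages about 40MB, which is a size most users can download and open with ordinary tools.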
The requirements and design documents for the Data Bundles are here:
http://www.clarinova.com/bundles
However, despite the differences in requirements, it would be quite sensible to provide a way to convert our bundles into CKAN packages. This would result in benefits for both our projects:
* It would make available many high value datasets in the CKAN format.
* It would allow users to access Civic Knowledge data via CKAN APIs and search functions.
As we work on the design, I'd like to keep track of developments on the CKAN package spec and post updates of our spec, staying open to places where the two can be harmonized.
I'm very open to comments and suggestions, so please let me know what you think,
thanks,
eric.
--------------------------------------------------------------------------------------------------
Eric Busboom, CEO, Clarinova (858) 386-4134