[ckan-dev] CKAN and Civic Knowledge Data Bundles

Rufus Pollock rufus.pollock at okfn.org
Thu Jun 14 18:22:45 UTC 2012


On 13 June 2012 22:00, Eric Busboom <eric at clarinova.com> wrote:
[...]

> To support this project we are also creating a data format. Our format has some particular requirements, which include being able to break up a single dataset into multiple partitions. Just one of the 9 US Census datasets is about 80GB, unmanageably large for most users, so the dataset gets partitioned into about 2,000 files, requiring special features to manage.
>
> The requirements and design documents for the Data Bundles are here:
>
>        http://www.clarinova.com/bundles
>
> However, despite the differences in requirements, it would be quite sensible to provide a way to convert our bundles into CKAN packages. This would result in benefits for both our projects:

To be clear I think there are 2 distinct things:

* CKAN datasets - datasets (metadata + possibly data) in a CKAN instance
* "Data packages" as per
http://www.dataprotocols.org/en/latest/data-packages.html

While the former could / should be instances of the latter, they need
not be equivalent. I think we should focus discussion on "data
packages".
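
To make the "instance of" relationship concrete, here is a minimal
sketch of mapping a CKAN-style dataset record onto a data-package-style
descriptor. The function name, the example record and the exact set of
keys on both sides are assumptions for illustration only, not something
defined by either spec:

```python
def ckan_to_data_package(dataset):
    """Map a CKAN-style dataset dict onto a data-package-style descriptor.

    The input and output keys used here are illustrative assumptions,
    not a definitive statement of either format.
    """
    resources = []
    for res in dataset.get("resources", []):
        resources.append({
            "name": res.get("name") or "",
            "url": res["url"],
            # Normalise the format label so "CSV" and "csv" compare equal.
            "format": (res.get("format") or "").lower(),
        })
    return {
        "name": dataset["name"],
        "title": dataset.get("title", ""),
        "description": dataset.get("notes", ""),
        "resources": resources,
    }


# Hypothetical record; a partitioned bundle would simply list many resources.
example = {
    "name": "census-sample",
    "title": "Census sample",
    "notes": "Hypothetical dataset for illustration.",
    "resources": [
        {"name": "part-0001",
         "url": "http://example.org/part-0001.csv",
         "format": "CSV"},
    ],
}
package = ckan_to_data_package(example)
```

Anything in the CKAN record that has no obvious home in the package
descriptor is simply dropped here, which is one reason the two need not
be equivalent.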

>        * It would make available many high value datasets in the CKAN format.
>        * It would allow users to access Civic Knowledge data via CKAN APIs and search functions.
>
> As we work on the design, I'd like to keep track of developments on the CKAN package spec and post updates of our spec, keeping open to places where the two can be harmonized.
>
> I'm very open to comments and suggestions, so please let me know what you think,

It seems like there is a huge overlap between your developing spec and
the existing data package spec + data package manager (dpm)
implementation (the implementation currently adds things not in the
spec, like the package database). This is a good thing as it shows
there is real commonality here. It also suggests that we could
possibly combine the two!

Rather than go into detail on this here, some thoughts:

a) I think we should probably switch this discussion to
http://lists.okfn.org/mailman/listinfo/data-protocols - sorry for the
pain of requesting the switch + (possibly) a re-post but I think that
is probably the best location for this slightly more general
discussion of data packages and our presence in both locations can
always ensure that info flows as necessary.

b) Rough consensus and running code - esp running code that does
something immediately useful (tm). We need to ensure we do something
useful fast for people. (dpm has reached this stage for me -- I now
use it almost daily to get datasets off CKAN and onto my machine).

c) We first began work on the data package spec and dpm ~5 years ago.
Looking back I feel we were a bit ahead of our time and ahead of the
ecosystem -- you need certain things to exist before this becomes
useful. We're reaching that point.

Rufus

PS: re your source spec see
https://github.com/okfn/dpm/blob/master/doc/new-plan-2011-nov.rst



