[okfn-help] [get.theinfo] datahub 0.8 is available

Rufus Pollock rufus.pollock at okfn.org
Mon Dec 7 17:44:17 GMT 2009


2009/12/4 Lukasz Szybalski <szybalski at gmail.com>:
[...]
>> Right, that's what we did first in v0.1 of datapkg a couple of
>> years ago. But what happens if you want to handle stuff that *isn't* a
>> Python package? Plus PyPI may not be that happy if you start uploading
>> packages with 100s of MB in them (or even GBs).
>
> I agree on that. I would assume that almost any package can be created
> with just code, with the data residing somewhere else? Do you have
> any examples of data that needs to be included with a package, instead
> of being downloaded from an outside source?

So in your current framework the data always resides elsewhere?

Also I guess you'd like to distinguish "source" versus "compiled". Not
everyone will want to compile from source ...

>> Data is so much larger
>> than code stuff that I think we need a slightly different
>> architecture. Also not everyone wants to plugin to python. Thus while
>> datapkg supports python packages straight up it is also designed so it
>> can consume other stuff easily.
>
> I agree that forcing python is not going to work, so I use a Python
> package as a medium only. All the programs/tools within it are your
> choice: you can use anything from shell scripts to Perl code. I think
> as long as you document what is required, then everything else should
> be automated in process.sh, so that the user only installs the
> required packages/programs and runs process.sh.

Sounds sensible.

> Do you know how many packages exist created by datapkg?

Well I've created about 20+ for openeconomics.net/store/. CKAN has a
lot more but those aren't all (yet) datapkg style packages.
[...]

>> Well imagine a package that points to a massive database. You might
>> not want to install 5TB on your machine but you might want to just
>> talk to the API. So you'd like to support "packages" which just expose
>> APIs.
>
> wouldn't that be done in setup.py under required packages?

But what would represent that API as a package? We will want data
packages that are "virtual" or have virtual payloads, in that the
package doesn't have any data itself but can provide access to some
remote resource (and hence the package "represents" that remote
resource).
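As a rough sketch of what such a virtual package's metadata might look like (the field names here are purely hypothetical, not datapkg's actual schema):

```python
# Hypothetical metadata for a "virtual" data package: it ships no data
# of its own, only a pointer to a remote API serving the resource.
virtual_package = {
    "name": "giant-database",
    "payload": None,  # nothing to download/install locally
    "resources": [
        {
            "kind": "api",  # consumers query this instead of a local copy
            "url": "http://example.org/api/query",
            "format": "json",
        }
    ],
}

def has_local_payload(pkg):
    """True if the package carries data itself, False if it is virtual."""
    return pkg.get("payload") is not None
```

A tool like datapkg could then branch on has_local_payload() and, for virtual packages, hand back an API client rather than a download.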
[...]

>> Did you run from "HEAD" (ie. from mercurial repo)? As I said in
>> previous email "You'll need to install from the mercurial repository
>> to get up to date code ...".
>
> not yet.

Best to run off HEAD at present. I've put up the latest codebase as
v0.4a on PyPI if you want to easy_install.

> Do you know of any datapkg packages that parse public.resource.org?

Not at present.

> Do you have a list of packages that have datapkg package? on
> http://www.ckan.net/package/list?

We're almost planning to go the other way round: start generating
datapkg packages for each entry on CKAN ...

> Do you query data.gov for available datasets?

Not at present but we've thought of doing an automated extract into CKAN ...

> Looking at http://www.ckan.net/package/list, how can one query a list
> of "data sources/packages" that have download-able data, or
> download-able parser?

That's a good question :) You can do this in the user interface, but
how you'd do it via the API I'd need to think about. Basically you'd
want to query on the download_url attribute, which you can do, but I'm
not sure how you'd ask only for packages with a non-null value.
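Once you have the package records in hand, the client-side filter itself is simple; a sketch in Python (the record layout below is illustrative; treat the exact field names as an assumption):

```python
def downloadable(packages):
    """Keep only package records whose download_url is non-empty."""
    return [p for p in packages if p.get("download_url")]

# Illustrative records, not real CKAN entries
packages = [
    {"name": "gold-prices", "download_url": "http://example.org/gold.csv"},
    {"name": "api-only", "download_url": ""},
    {"name": "unset"},
]

print([p["name"] for p in downloadable(packages)])  # ['gold-prices']
```

The harder part, as above, is getting the server to do this filtering for you rather than pulling every record down first.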

> I guess datapkg and datahub, as well as users, would benefit from
> "query for available data", "query for download-able data", and
> "query for parsers of data".

Good point and this suggests we may want different types of
datapkg/datahub packages.

Rufus
-- 
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/
