[okfn-help] [get.theinfo] datahub 0.8 is available

Thu Dec 3 20:13:47 GMT 2009

On Thu, Dec 3, 2009 at 8:27 AM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> 2009/12/2 Lukasz Szybalski <szybalski at gmail.com>:
>> On Wed, Dec 2, 2009 at 2:24 PM, Jonathan Gray <jonathan.gray at okfn.org> wrote:
> [...]
>> Datahub is a tool that will create a new python package with some
>> sample files in it to help you crawl, parse, and load your data.
>>
>> so you start by (http://pypi.python.org/pypi/datahub/0.8.90dev)
>>
>> 1. paster create -t datahub
>>
>> this will create a skeleton of a python project that has 3 main subfolders.
>>
>> myapp/
>> myapp/crawl
>> myapp/parse
>> myapp/load
>
> I like the layout that you have developed here.
>
> This is somewhat similar to what datapkg create <...> will do. However
> we just create a very basic layout (like a simple python package) at
> the moment.  An example of current datapkg package along these lines:
>
> <http://knowledgeforge.net/econ/hg/file/tip/econdata/uk_house_prices>
>
> One with a non-pythonic layout is:
>
> <http://knowledgeforge.net/econ/hg/file/tip/econdata/browser_stats>
>
> We have recently been talking quite a bit about what structure we
> should use: be it none (leave it to users) or something like R or
> Debian or ... We have also been asking whether we need to support
> multiple structures or "just one" (tm). I'd be interested in your
> thoughts here.
>

Well, I initially started with the simplest structure (one folder), the
structure that is created via "paster create". All subsequent changes
are made from there. As for crawl, parse, and load: I derived these
folders by looking at a few packages that actually fetched and parsed
data. They didn't have the folders, but all of their files fit into one
of the three categories (crawl, parse, or load).
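
A minimal sketch of that three-folder layout, for illustration only. The
create_skeleton helper is a hypothetical name and is not how the paster
template is implemented; the real template may generate additional files.

```python
import os
import tempfile

# The three stages every data package in this layout is split into.
STAGES = ("crawl", "parse", "load")


def create_skeleton(root, name):
    """Create name/ under root with one importable subpackage per stage.

    Returns the list of directories that were created or already existed.
    """
    package_dir = os.path.join(root, name)
    directories = [package_dir] + [os.path.join(package_dir, s) for s in STAGES]
    for directory in directories:
        os.makedirs(directory, exist_ok=True)
        # An empty __init__.py makes each folder a python package.
        open(os.path.join(directory, "__init__.py"), "a").close()
    return directories


if __name__ == "__main__":
    # Build the layout in a throwaway directory and show what was made.
    for path in create_skeleton(tempfile.mkdtemp(), "myapp"):
        print(path)
```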

Later on I added hdfs and wiki folders, but I haven't found a reason to
use them myself yet. I was also thinking about web or visualize folders,
but again, everybody has unique needs for the data.

I think the folder structure is less important as long as you have the
main three. The extra tools/code that help you parse or download data
will be the most useful part. I use wget, so as long as you type in the
url you don't have to figure out which wget options to use to fetch it.
If you load csv, then the python csv module is simplest; you don't have
to look up code to figure out how (just type in the column names and the
name of the file). If you want to load it into a database, you don't
have to know how to write sql queries, just type in the column names and
what type they are. If your data is mdb, then mdbtools might be the
easiest way to get the data out; there is a command line tool for
extracting mdb to csv, etc.
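
A rough sketch of the kind of helpers described above, one per stage. The
names wget_command, parse_csv and load_rows are made up for illustration
and are not the functions datahub ships; sqlite3 stands in for whatever
database is actually used, and the sample rows are invented.

```python
import csv
import sqlite3


def wget_command(url, dest_dir="crawl/data"):
    """Crawl: build the wget argument list for one url (does not run it)."""
    # -N skips the download unless the remote file is newer than the local copy;
    # -P sets the directory the file is saved into.
    return ["wget", "-N", "-P", dest_dir, url]


def parse_csv(lines, columns):
    """Parse: turn csv lines into dicts keyed by the given column names."""
    return list(csv.DictReader(lines, fieldnames=columns))


def load_rows(conn, table, column_types, rows):
    """Load: create the table from {column: sql type} and insert the rows."""
    columns = list(column_types)
    definitions = ", ".join('"%s" %s' % (c, column_types[c]) for c in columns)
    conn.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, definitions))
    names = ", ".join('"%s"' % c for c in columns)
    marks = ", ".join("?" for _ in columns)
    conn.executemany(
        'INSERT INTO "%s" (%s) VALUES (%s)' % (table, names, marks),
        [[row[c] for c in columns] for row in rows],
    )
    conn.commit()


if __name__ == "__main__":
    print(" ".join(wget_command("http://example.org/recalls.csv")))
    rows = parse_csv(["2009,Honda,12", "2009,Ford,7"], ["year", "make", "recalls"])
    conn = sqlite3.connect(":memory:")
    types = {"year": "INTEGER", "make": "TEXT", "recalls": "INTEGER"}
    load_rows(conn, "recalls", types, rows)
    print(conn.execute('SELECT make, recalls FROM "recalls"').fetchall())
```

The only things the user supplies are the url, the column names and the
column types; the wget options and the sql are generated.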

> [...]
>
>> So that is the basics of the datahub. At this point there is no way to
>> list other datahub packages, there is no way to query for some
>> keywords, there is no set hosting you need to use.
>>
>> datapkg on the other hand seems to do the latter: query, search and
>> upload/load packages?
>>
>> Let me know what exactly datapkg does at this point?
>
> Yes datapkg allows you to register packages on ckan.net, query
> existing packages on ckan.

datahub by default is a python package, so pypi is the best deployment option:

python setup.py register
python setup.py sdist upload

and you are done. If you just want to bundle the code and data into an
archive, you can run:
python setup.py sdist
A file called yourpackage.tar.gz.... will be created in the dist folder.

What do you use to query ckan? xmlrpc? Or something else?
What do you use to expose ckan info?

> We're just in the process of reworking the
> install support -- this is rather more complex than in the code case
> because of a need to support package payloads which are e.g. apis
> rather than actual chunks of data.

What's "apis/package payload"?

>
> Current datapkg documentation (for trunk):
>
> <http://knowledgeforge.net/ckan/doc/datapkg/>
>
> Instructions for installation are here if you want to give it a whirl:
>
> <http://knowledgeforge.net/ckan/doc/datapkg/install.html>

Installation is similar for datahub:
easy_install datahub

I read over your man page, but I got a little lost in the part about
setting up a repository. Also, this command didn't work:
datapkg --repository=http://ckan.net/api/rest/ list

With pypi you can download a package using:
easy_install -zmaxd . datahub.gov.dot.nhtsa.recall
then
unzip datahub.gov.dot.nhtsa.recall

I've looked at uk_house_prices and browser_stats.
In my case I would split them into crawl, parse, and load.

Is there a reason you use the swiss package instead of tools that
already exist, like pyexcelerator, etc.?

Thanks,
Lucas