[okfn-help] [get.theinfo] datahub 0.8 is available

Lukasz Szybalski szybalski at gmail.com
Wed Dec 2 23:17:31 GMT 2009


On Wed, Dec 2, 2009 at 2:24 PM, Jonathan Gray <jonathan.gray at okfn.org> wrote:
> Dear Lukasz,
>
> I've added you to the okfn-help list and will have to defer the
> technical details of datapkg to those on this list.
>
> All: Lukasz is working on DataHub, which aims to automate "download,
> parse and load" of data. Like CKAN its in Python, and it would be
> great to work with him if there are areas where the two projects could
> talk to each other! He's got some questions about datapkg (see below).
> It seems to me like there is some potential synergy?
>
> Lukasz: pardon my ignorance, but just to clarify, could you give an
> example use case for DataHub to give me a better idea of what it does
> and how it works? We're currently working on improving download URL
> (and file type metadata) in CKAN - I don't know if this is relevant to
> what you're doing...

Datahub is a tool that creates a new Python package with some sample
files in it to help you crawl, parse, and load your data.

So you start with (http://pypi.python.org/pypi/datahub/0.8.90dev):

1. paster create -t datahub

This will create a skeleton of a Python project that has 3 main subfolders.

myapp/
myapp/crawl
myapp/parse
myapp/load

Inside each of them you do your magic.

In the crawl folder you list the URLs of the files you want to
download. Put them in download_list.txt; running the crawl.sh script
will download them.
In the parse folder you do the parsing or extraction. There aren't any
helper files there, as parsing is unique to each package. See the
parse.sh file.
In the load folder you load the data. There is a sample load.py where
you need to define the db structure, the csv columns, and the
user/pass for the database. Assuming you are loading from a csv, as
long as you create the proper fields in the db and list the proper
fields for the csv import, the program will do the rest (set up the
db, drop then create tables, create a session to the db, load data,
save data, select the first 5 rows to see if it worked).
(See http://pypi.python.org/pypi/datahub.gov.dot.nhtsa.recall/0.2dev
for an actual working sample.)
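
To give a rough idea of what the load step amounts to, here is a
minimal sketch. It is not the load.py that datahub generates. It
assumes sqlite3 from the standard library in place of a real database
with user/pass, and the table name, column names and sample rows are
made up purely for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a CSV produced by the parse step.
SAMPLE_CSV = """record_id,make,model,year
1,MAKE_A,MODEL_X,2005
2,MAKE_A,MODEL_Y,2006
3,MAKE_B,MODEL_Z,2004
4,MAKE_B,MODEL_X,2007
5,MAKE_C,MODEL_Y,2003
6,MAKE_C,MODEL_Z,2008
"""


def load_csv(conn, csv_file):
    """Drop and recreate the table, load every CSV row, return the row count."""
    cur = conn.cursor()
    # Drop then create, so that re-running the load starts from a clean table.
    cur.execute("DROP TABLE IF EXISTS recall")
    cur.execute(
        "CREATE TABLE recall ("
        "record_id INTEGER PRIMARY KEY, make TEXT, model TEXT, year INTEGER)"
    )
    # Map the CSV columns onto the table columns, converting the numeric ones.
    rows = [
        (int(row["record_id"]), row["make"], row["model"], int(row["year"]))
        for row in csv.DictReader(csv_file)
    ]
    cur.executemany("INSERT INTO recall VALUES (?, ?, ?, ?)", rows)
    conn.commit()  # save the data
    return len(rows)


def first_rows(conn, limit=5):
    """Select the first few rows as a quick check that the load worked."""
    query = "SELECT * FROM recall ORDER BY record_id LIMIT ?"
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    count = load_csv(conn, io.StringIO(SAMPLE_CSV))
    print(count, "rows loaded")
    for row in first_rows(conn):
        print(row)
```

With a real file you would pass open("yourfile.csv", newline="")
instead of the io.StringIO object, and point the connection at your
own database.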

So those are the basics of datahub. At this point there is no way to
list other datahub packages, no way to query by keyword, and no set
hosting you need to use.

datapkg, on the other hand, seems to do the latter: query, search and
upload/load packages?

Could you let me know what exactly datapkg does at this point?

Thanks,
Lucas



>
> Best wishes,
>
> Jonathan
>
> On Wed, Dec 2, 2009 at 8:12 PM, Lukasz Szybalski <szybalski at gmail.com> wrote:
>> On Wed, Dec 2, 2009 at 12:36 PM, Jonathan Gray <jonathan.gray at okfn.org> wrote:
>>> Hi Lukasz,
>>>
>>> I can't remember where we were up to in our discussion about Datahub
>>> and CKAN - but it would be great to pick this up again if there are
>>> useful ways in which we could ensure they work together!
>>>
>>> Especially regarding automated downloads, storing copies of data, etc.
>>> Two options here that spring to mind are Internet Archive, and Talis
>>> Connected Commons. We're also working on open data grid for
>>> decentralised storage. What are your thoughts here?
>>
>> As far as vision for datahub, I see it as a starting tool where you
>> put your code to download, parse,and load data source etc....There is
>> no specifics on where you store the data. You data could be stored in
>> IA or on your distributed storage. The primary concern is the
>> automating the "download, parse and load" of the data.
>>
>> Here is the first package created using datahub:
>>
>> http://pypi.python.org/pypi/datahub.gov.dot.nhtsa.recall/0.2dev
>>
>> So process is:
>> datahub -> default template -> start project -> automate it all ->
>> publish source.
>> datahub -> few weeks or months -> datahub.gov.dot.nhtsa.recall (now it
>> only takes 1 min to get from download to load.)
>>
>>
>>
>> I've looked at the
>> http://www.knowledgeforge.net/ckan/trac/browser/datapkg/trunk and
>> datapkg seems similar but its already in phase 2, meaning it allows
>> you to list packages that are available. Is that implemented in
>> datapkg? Where do you get a list? what else can you do with it?
>>
>>
>>
>> I haven't really looked at storing the whole package with data on some
>> archive site simply because the size of data I use is small, or data
>> is available somewhere else. Do you have any packages that load and
>> parse the data? or are using datapkg?
>>
>> Thanks,
>> Lucas
>>
>>>
>>> Best wishes,
>>>
>>> Jonathan
>>>
>>> On Wed, Dec 2, 2009 at 5:35 PM, Lukasz Szybalski <szybalski at gmail.com> wrote:
>>>> http://pypi.python.org/pypi/datahub/0.8.90dev
>>>>
>>>>    * Datahub is a tool that allows faster download/crawl, parse,
>>>> load, and visualize of data. It achieves this by allowing you to
>>>> divide each step into its own work folders. In each work folder you
>>>> get a sample files that you can start coding in.
>>>>    * Datahub is for people who found some interesting data source for
>>>> them, they want to download it, parse it, load it into database,
>>>> provide some documentation, and visualize it. Datahub will speed up
>>>> the process by creating folder for each of these actions. You will
>>>> create all the programs from our base default template and move on to
>>>> analyzing the data in no time.
>>>>
>>>>
>>>> If you are doing data conversions from public/private datasets this
>>>> tool is for you.
>>>>
>>>> Few packages that use datahub: (Recall database from NHTSA. Crawl,
>>>> parse, load into db as easy as running "sh process.sh")  coming soon.
>>>>
>>>> Enjoy.
>>>>
>>>> Lucas
>>>>
>>>> --
>>>> [from the http://groups.google.com/group/get-theinfo mailing list]
>>>>
>>>
>>>
>>>
>>> --
>>> Jonathan Gray
>>>
>>> Community Coordinator
>>> The Open Knowledge Foundation
>>> http://www.okfn.org
>>>
>>
>>
>>
>> --
>> Setup CalendarServer for your company.
>> http://lucasmanual.com/mywiki/CalendarServer
>> Automotive Recall Database - See if you vehicle has a recall
>> http://lucasmanual.com/recall
>>
>
>
>
> --
> Jonathan Gray
>
> Community Coordinator
> The Open Knowledge Foundation
> http://www.okfn.org
>



-- 
Setup CalendarServer for your company.
http://lucasmanual.com/mywiki/CalendarServer
Automotive Recall Database - See if you vehicle has a recall
http://lucasmanual.com/recall


