[okfn-discuss] datapkg - haltering steps

Rufus Pollock rufus.pollock at okfn.org
Tue Oct 12 16:11:26 UTC 2010


On 10 October 2010 11:52, Matthew Brett <matthew.brett at gmail.com> wrote:
[...]

> We (nipy.org) are just going over how to deal with data packaging.
> Our first and very simple draft was here:
>
> http://nipy.sourceforge.net/nibabel/data_pkg_design.html

Looking at that document it looks like what you are speccing is what
we could call a 'Distribution' -- i.e. a serialization of a Package
(conceived abstractly as metadata + payload) to a given layout on
disk. See:

<http://packages.python.org/datapkg/distribution.html>

> as you can see this is very crude, especially compared to datapkg.
>
> I'm afraid, that I haven't yet looked in detail at datapkg, for which,
> please forgive me, but I had a few preliminary questions:

No problem. If you want more info, the latest docs are here:
<http://packages.python.org/datapkg/>

You will also have just seen an email announcing the release of
datapkg 0.7b with lots of new features.

> 1) I think our main usecase is being able to do something like this in our code:
>
> my_package_path = None
> try:
>    import some_data_pkg_manager as excelsior
> except ImportError:
>    hint = 'You need "some_data_pkg_manager", see http://a.helpful.url'
> else:
>   version, pth = excelsior.have_local_pkg('my_package', version=0.3)
>   if version >= 0.3: # well, you get what I mean
>      my_package_path = pth
>   else:
>      hint = excelsior.installation_hint('my_package', version=0.3)
> if my_package_path is None:
>    print hint
> else:
>   # Do something with the data
>   pass

Yes. This is the use case I would call:

"Load data from disk (or api)" once installed.

> I hope you see what I mean.  The main point is, we want to be able to
> query the local installations, whether system-wide or in the user
> space, to get where the data is, rather than automatically trying to
> pull the data down.   This is because - I work in Cuba and bandwidth
> there is terrible - and - it seems like it would work better with
> standalone installations.

Right, and there are plenty of other reasons too (e.g. you'd like your
webserver app that uses data to run off local copies not try and
retrieve remote data every time!)
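
To make the "query local installs only" idea concrete, here is a rough
sketch using nothing but the standard library. It is not datapkg's API:
the function names and the <search_dir>/<name>-<version>/ directory
layout are assumptions made up for illustration.

```python
import os


def parse_version(text):
    """Turn '0.3.1' into (0, 3, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_local_pkg(name, min_version, search_dirs):
    """Return (version, path) of the newest matching local copy, or None.

    Assumed (hypothetical) layout: <search_dir>/<name>-<version>/
    No network access is attempted at any point.
    """
    prefix = name + "-"
    wanted = parse_version(min_version)
    best = None
    for base in search_dirs:
        if not os.path.isdir(base):
            continue  # e.g. no system-wide install directory on this machine
        for entry in os.listdir(base):
            if not entry.startswith(prefix):
                continue
            try:
                version = parse_version(entry[len(prefix):])
            except ValueError:
                continue  # not a <name>-<version> directory
            path = os.path.join(base, entry)
            if not os.path.isdir(path):
                continue
            if version >= wanted and (best is None or version > best[0]):
                best = (version, path)
    return best
```

You would pass a system-wide directory and a per-user one as
search_dirs, and print an installation hint when the result is None.
Versions are passed as strings because comparing floats such as 0.10
and 0.3 gives the wrong answer.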

> I'm sure you've covered that - I just couldn't see it at a first glance.

OK, from the command line using some real data (i.e. you should be
able to run this!):

# get an example 'datapkg'
$ datapkg download ckan://gold-prices /tmp/

# create your ~/.datapkgrc file if you haven't already
$ datapkg init config

# register it into your local (db) index - sqlite index at
# ~/.datapkg/repository/index.db
$ datapkg register /tmp/gold-prices

# dump out the data file from that package (called 'data'!)
# NB: this is the only part actually covered by your example
# (you assume you've already got the datapkg on disk)
$ datapkg dump gold-prices data

For this last part you can also do this in code:

<code>
import datapkg
# behind the scenes we are loading the 'default' sqlite-based datapkg index
# you could also do: pkg = datapkg.load_package('ckan://gold-prices')
# but in this case you would have no local data associated
# (you would have package 'resources' - i.e. urls)
pkg = datapkg.load_package('gold-prices')
fo = pkg.stream('data')
print fo.read(100)
</code>

> 2) The second thing was - on my (yes, I'm sorry) Mac, an attempt to do
> 'python setup.py develop' in the repository leads to a nasty set of
> error messages from setuptools, where it appears to be cycling over
> the Paste installation.  It was complicated enough that it wasn't
> clear to me which installation target was causing the problems -
> certainly it seemed to occur with 'urlgrabber' - but I thought I'd let
> you know.

Thanks. This latest release removes the dependency on Paste. Please do
try again and let me know -- it is great to track down install bugs on
different platforms.

> 3) Related to same - one problem that we were trying to avoid with our
> crude setup was needing the data package installed in order to query
> the data.  That is, we were hoping to have minimal run-time
> dependencies.  datapkg has rather heavy dependencies - do you think
> there's any chance of a lightweight local query version, when not all
> the dependencies are met?    We (as a group) have had some bad
> experiences with setuptools in the past.

Yes, who hasn't had bad experiences with setuptools :)

One of the aims here is definitely to keep this tool very lightweight.
One effort in this direction in the latest release is the introduction
of various new 'entry points' where datapkg can be extended. That said,
this pluggability explicitly uses setuptools' support for 'entry
points', so I don't think the setuptools dependency will go away
immediately. However, we could try making the 'reading from disk' part
of things independent of the main part of datapkg -- that way we could
plug it in to datapkg but also use it separately.
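
As a rough idea of what such a standalone 'reading from disk' piece
might look like, here is a sketch with no dependencies beyond the
standard library. The class and method names are hypothetical (they
only loosely echo pkg.stream) and it assumes the payload files sit
directly inside the package directory.

```python
import os


class LocalPackage(object):
    """Read-only view of a package directory already on disk (stdlib only)."""

    def __init__(self, path):
        if not os.path.isdir(path):
            raise IOError("no package directory at %r" % path)
        self.path = os.path.abspath(path)

    def resources(self):
        """List the file names directly inside the package directory."""
        return sorted(
            name for name in os.listdir(self.path)
            if os.path.isfile(os.path.join(self.path, name))
        )

    def stream(self, name):
        """Open one payload file for reading as bytes."""
        target = os.path.abspath(os.path.join(self.path, name))
        # Only direct children are allowed; this also refuses names such
        # as '../x' that would escape the package directory.
        if os.path.dirname(target) != self.path:
            raise ValueError("resource %r is outside the package" % name)
        return open(target, "rb")
```

Code that only needs to read data could use something like this
directly, and datapkg itself could wrap it.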

> 4) Just lastly - I wonder if y'all have had time to look at bento:
>
> http://cournape.github.com/Bento/html/index.html

No, I haven't seen this. I assume you've also seen the new version of
distutils being developed under the auspices of the main python
distutils mailing list:

<http://pypi.python.org/pypi/Distutils2/1.0a3>

> It is our favorite hacker's attempt at a way out of distutils /
> setuptools hell - and he's very responsive to questions and
> suggestions.   You can incorporate an entire distribution of bento as
> one file on your project to do the build for you.   This was 300K at
> last count (David C the author made a bento file for nipy).  This is
> moderately annoying for code, but nothing for a data package I would
> have thought.   The idea (half formed) that I had would be that each
> data package would be responsible for its own installation - in our
> case through a setup.py file - but maybe instead through the bento
> build process.

Right. Currently the two types of Distributions we support are a
Python format and a simple ini-based format.
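
To give a flavour of why an ini-based layout is cheap to read, here is
a sketch that parses ini-style metadata with the standard library
alone. The section name, the keys and the example values are invented
for illustration and are not taken from datapkg's actual format.

```python
import configparser

# Hypothetical metadata for an ini-based layout (not datapkg's real schema).
EXAMPLE_METADATA = """\
[DEFAULT]
name = example-data
version = 0.1
title = An example data package
"""


def read_metadata(text):
    """Parse ini-style metadata text into a plain dict of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # defaults() holds the key/value pairs of the [DEFAULT] section.
    return dict(parser.defaults())


metadata = read_metadata(EXAMPLE_METADATA)
```

All values come back as strings, so a version would still need parsing
before it is compared.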

> Sorry if this is off point or poorly thought out - I should point out
> that it's currently near 4 in the morning.

Not poorly thought out at all and thanks for asking all these good questions.

Rufus
-- 
Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/



