[okfn-dev] data package source
Rufus Pollock
rufus.pollock at okfn.org
Wed Dec 8 20:22:53 UTC 2010
On 7 December 2010 15:26, Matthew Brett <matthew.brett at gmail.com> wrote:
> Hi Rufus and all,
>
> We (nipy folks [1], and neurodebian folks [2], and maybe others) have been
> thinking a little bit about what we wanted from a data package implementation.
>
> First - an apology. I have tried to explore datapkg, but rather superficially.
> What we've done, in the main, is to try and think out what we mean by stuff, and
> what we want, and we're slowly then coming back to what y'all have done.
Have you had a chance to look through the datapkg docs?
<http://packages.python.org/datapkg/>
Just so I know where we are starting from :)
[...]
> Now we're thinking what we really want. The result of various discussions ended
> up in the attached document ``data_pkg_discuss.rst``. As the name suggests,
> it's trying to clarify various ideas we had about what is what.
>
> Now onto something real: use cases...
>
> We have - for example - a smallish package for reading image data - nibabel. We
> want to be able to use optional data packages from within nibabel. In
> particular, we wanted packages of test data of images in various formats, that
> are too large to include in the code repository. Here are some things we wanted:
>
> * No dependency for nibabel on the data packaging code. That is, we wanted to
> be able to *use* installed data packages without having to install - say
> ``datapkg``. This is obviously not essential, but desirable. We're less
> concerned about having to depend on - say - ``datapkg`` for installing the
> data, or modifying the data packages. Having said that, it would surely help
> adoption of a standard packaging system if it was easy to implement a
> packaging protocol outside of the canonical implementation in - say -
> ``datapkg``.
Obviously the package spec can be a standard anyone can implement but
surely it would make sense, at least at the start, to reuse library
code where possible here (especially where the standard/code are
evolving as they stabilise). Otherwise one is just going to
reimplement the same stuff.
At the same time I can see how it would be desirable to have something
very lightweight (a single file perhaps) that can be shipped with a
project so there is no external dependency.
> * Support for data package versions. We expect to have several versions of
> nibabel out in the wild, and maybe several versions of nibabel on a single
> machine. The versions of nibabel may well need different versions of the data
[...]
Got it: versions are a must.
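Version-aware lookup needn't be complicated either. A rough sketch,
assuming a purely hypothetical on-disk layout of <root>/<name>/<version>/
with one directory per installed version:

    import os

    def find_versioned_package(name, min_version, root):
        """Return the path of the newest installed version >= min_version, or None."""
        def as_tuple(version):
            # Naive numeric comparison, e.g. "0.10" sorts after "0.9".
            return tuple(int(p) for p in version.split(".") if p.isdigit())
        pkg_root = os.path.join(root, name)
        if not os.path.isdir(pkg_root):
            return None
        versions = [v for v in os.listdir(pkg_root)
                    if os.path.isdir(os.path.join(pkg_root, v))
                    and as_tuple(v) >= as_tuple(min_version)]
        if not versions:
            return None
        return os.path.join(pkg_root, max(versions, key=as_tuple))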
> * Support for user and system installs of data. As for python package installs,
[...]
I think this is important but initially a lower priority.
> * Not of urgent importance for us, but it would be good to be able to
> sign the packages with a trusted key, as for Debian packages.
>
> For these various reasons we tried to spec out what we thought we would need in
> the attached ``data_pkg_uses.rst``. I've also attached a script referenced in
> that page, ``register_me.py`` - as ``register_me.txt``.
>
> Given my relative ignorance of ``datapkg``, I'll try to say the differences I
> see from the current ``datapkg``:
>
> * I can't see support for data package versioning in ``datapkg`` - but I might
> have missed it.
> * As far as I can see, there isn't a separation of system and user installs, in
> that there seems to be a (by default) sqlite 'repository' (right term?) that
> knows about the packages a user has installed, but I could not find an
> obvious canonical way to pool system and user installation information. Is
> that right?
That's correct. I have to say on a first pass I don't think this is so
important -- but you may convince me I'm wrong. (It also seems relatively
easy to "search" several paths).
> * Because the default repository is sqlite, anyone trying to read the
> installations that ``datapkg`` did, will need sqlite or something similar.
> They'll likely have this if they are using a default python installation, but
> not necessarily if they are using another language or a custom python install.
There is currently support for several 'indexes'. The default is a
'db-based' one using sqlite, but I have also implemented a simple
'fileindex' (i.e. walking a tree), one based on ckan, and a simple
in-memory dict one.
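The fileindex is conceptually no more than walking a tree looking for
metadata files -- roughly this sketch, where the metadata file name is
invented rather than what datapkg actually looks for:

    import os

    def file_index(root):
        """Yield (package_name, package_dir) for each package found under root."""
        for dirpath, dirnames, filenames in os.walk(root):
            if "metadata.txt" in filenames:
                yield os.path.basename(dirpath), dirpath
                # Don't descend further into a package looking for more packages.
                dirnames[:] = []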
> Are these right? Do our use cases make sense to y'all?
These look good, but they are entirely focused on 'loading' locally. Do
you need ways to get material on and off the machine?
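If so, I'd imagine the 'get it onto the machine' half being something
like this sketch -- the archive format, URL layout and install location
are all made up:

    import os
    import tarfile
    import urllib.request

    def install_from_url(url, install_root):
        """Download a .tar.gz data package and unpack it under install_root."""
        if not os.path.isdir(install_root):
            os.makedirs(install_root)
        archive_path = os.path.join(install_root, os.path.basename(url))
        urllib.request.urlretrieve(url, archive_path)
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(install_root)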
> We'd love to work together on stuff if that makes sense to you too...
Absolutely, would be great to work together here.
Rufus