[okfn-dev] data package source

Rufus Pollock rufus.pollock at okfn.org
Wed Dec 8 20:22:53 UTC 2010

On 7 December 2010 15:26, Matthew Brett <matthew.brett at gmail.com> wrote:
> Hi Rufus and all,
> We (nipy folks [1], and neurodebian folks [2], and maybe others) have been
> thinking a little bit about what we wanted from a data package implementation.
> First - an apology.  I have tried to explore datapkg, but rather superficially.
> What we've done, in the main, is to try and think out what we mean by stuff, and
> what we want, and we're slowly then coming back to what y'all have done.

Have you had a chance to look through the datapkg docs? Just so I
know where we are starting from :)


> Now we're thinking what we really want.  The result of various discussions ended
> up in the attached document ``data_pkg_discuss.rst``.  As the name suggests,
> it's trying to clarify various ideas we had about what is what.
> Now onto something real, usecases...
> We have - for example - a smallish package for reading image data - nibabel.  We
> want to be able to use optional data packages from within nibabel.  In
> particular, we wanted packages of test data of images in various formats, that
> are too large to include in the code repository.  Here's some things we wanted:
> * No dependency for nibabel on the data packaging code.  That is, we wanted to
>  be able to *use* installed data packages without having to install - say
>  ``datapkg``.  This is obviously not essential, but desirable.  We're less
>  concerned about having to depend on - say - ``datapkg`` for installing the
>  data, or modifying the data packages.  Having said that, it would surely help
>  adoption of a standard packaging system if it was easy to implement a
>  packaging protocol outside of the canonical implementation in - say -
>  ``datapkg``.

Obviously the package spec can be a standard anyone can implement, but
surely it would make sense, at least at the start, to reuse library
code where possible (especially while the standard and code are still
evolving and stabilising). Otherwise one just ends up reimplementing
the same stuff.

At the same time I can see how it would be desirable to have something
very lightweight (a single file perhaps) that can be shipped with a
project so there is no external dependency.
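To make that concrete, here's the sort of single-file shim I have in mind. Everything here is an assumption for the sake of illustration (the "datapackage.json" filename, the directory layout) -- it is not what datapkg currently does:

```python
import json
import os

# Hypothetical sketch: a single file a project like nibabel could ship,
# so that *using* installed data packages needs no dependency on datapkg.
# Assumes (purely for illustration) that each installed package is a
# directory containing a small JSON metadata file.

def load_metadata(pkg_dir):
    """Read the package metadata from an installed package directory."""
    with open(os.path.join(pkg_dir, "datapackage.json")) as f:
        return json.load(f)

def data_path(pkg_dir, relpath):
    """Return the absolute path to a data file inside the package."""
    return os.path.join(pkg_dir, relpath)
```

A client project would vendor these two functions (or the equivalent) and never import datapkg at runtime; datapkg would only be needed to build and install the packages themselves.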

> * Support for data package versions.  We expect to have several versions of
>  nibabel out in the wild, and maybe several versions of nibabel on a single
>  machine.  The versions of nibabel may well need different versions of the data

Got it: versions are a must.
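For the versioning use case, something as simple as the following might be enough on the consumer side. The "name-version" directory naming and dotted-integer versions are assumptions of mine, not an existing datapkg convention:

```python
# Hypothetical sketch of version selection among installed data packages.
# Assumes installed packages live in directories named "<name>-<version>"
# with dotted-integer versions; the layout is illustrative only.

def parse_version(v):
    """Turn '1.10' into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in v.split("."))

def best_match(installed, name, min_version):
    """Pick the highest installed version of `name` >= min_version."""
    candidates = []
    for dirname in installed:
        pkg, _, ver = dirname.rpartition("-")
        if pkg == name and parse_version(ver) >= parse_version(min_version):
            candidates.append((parse_version(ver), dirname))
    return max(candidates)[1] if candidates else None
```

That would let several versions of nibabel on one machine each pick a data-package version they can work with, side by side.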

> * Support for user and system installs of data. As for python package installs,

I think this is important but initially a lower priority.

> * Not of urgent importance for us, but it would be good to be able
> sign the packages
>  with a trusted key, as for Debian packages.
> For these various reasons we tried to spec out what we thought we would need in
> the attached ``data_pkg_uses.rst``.  I've also attached a script referenced in
> that page, ``register_me.py`` - as ``register_me.txt``.
> Given my relative ignorance of ``datapkg``, I'll try to say the differences I
> see from the current ``datapkg``:
> * I can't see support for data package versioning in ``datapkg`` - but I might
>  have missed it.
> * As far as I can see, there isn't a separation of system and user installs, in
>  that there seems to be a (by default) sqlite 'repository' (right term?) that
>  knows about the packages a user has installed, but I could not find an
>  obvious canonical way to pool system and user installation information.  Is
>  that right?

That's correct. I have to say that on a first pass I don't think this
is so important -- but you may convince me I'm wrong. (It also seems
relatively easy to "search" several paths.)
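By "search several paths" I mean something like the following -- the specific directory names are made up for the example, not paths datapkg defines:

```python
import os

# Sketch of pooling user and system installs by searching an ordered
# list of locations.  The two base paths below are illustrative only.

def find_package(name, search_paths):
    """Return the first directory containing the named package."""
    for base in search_paths:
        candidate = os.path.join(base, name)
        if os.path.isdir(candidate):
            return candidate
    return None

# User installs take precedence over system-wide ones (assumed paths):
user_path = os.path.expanduser("~/.local/share/data-packages")
system_path = "/usr/share/data-packages"
```

So a user install would simply shadow the system one, much like PATH lookup.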

> * Because the default repository is sqlite, anyone trying to read the
>  installations that ``datapkg`` did, will need sqlite or something similar.
>  They'll likely have this if they are using a default python installation, but
>  not necessarily if they are using another language or a custom python install.

There is currently support for several 'indexes'. The default is a
'db-based' one using sqlite, but I have also implemented a simple
'fileindex' (i.e. walking a tree), one based on CKAN, and a simple
in-memory dict one.
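The fileindex idea is roughly this (a rough sketch of the approach, not datapkg's actual implementation -- the metadata filename is an assumption):

```python
import os

# Rough sketch of a 'fileindex'-style index: walk a directory tree and
# treat every directory containing a metadata file as an installed
# package.  Nothing here but the filesystem -- no sqlite needed, so any
# language can read the same layout.

def walk_index(root, metadata_name="datapackage.json"):
    """Yield directories under `root` that look like data packages."""
    for dirpath, dirnames, filenames in os.walk(root):
        if metadata_name in filenames:
            yield dirpath
```

That addresses the sqlite concern directly: a plain on-disk layout is readable from any language with no extra dependencies.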

> Are these right?  Do our use cases make sense to y'all?

These look good, but they are entirely focused on 'loading' locally.
Do you also need to get data on and off the machine?

> We'd love to work together on stuff if that makes sense to you too...

Absolutely, would be great to work together here.


More information about the okfn-labs mailing list