[okfn-dev] data package source

Wed Dec 8 21:40:36 UTC 2010

Hi,

>> First - an apology.  I have tried to explore datapkg, but rather superficially.
>> What we've done, in the main, is to try and think out what we mean by stuff, and
>> what we want, and we're slowly then coming back to what y'all have done.
>
> Have you had a chance to look through the datapkg docs?
>
> <http://packages.python.org/datapkg/>
>
> Just so I know where we are starting from :)

Yes, I might have been a little too apologetic; I did read all the
docs.  I didn't always completely understand them, though.  The
data_pkg_discuss.rst was an attempt to see whether we could formulate
concepts in a compatible way...  You know how it is: it depends where
you start and what you understand.

>> * No dependency for nibabel on the data packaging code.  That is, we wanted to
>>  be able to *use* installed data packages without having to install - say
>>  ``datapkg``.  This is obviously not essential, but desirable.  We're less
>>  concerned about having to depend on - say - ``datapkg`` for installing the
>>  data, or modifying the data packages.  Having said that, it would surely help
>>  adoption of a standard packaging system if it was easy to implement a
>>  packaging protocol outside of the canonical implementation in - say -
>>  ``datapkg``.
>
> Obviously the package spec can be a standard anyone can implement but
> surely it would make sense, at least at the start, to reuse library
> code where possible here (especially where the standard/code are
> evolving as they stabilise). Otherwise one is just going to
> reimplement the same stuff.
>
> At the same time I can see how it would be desirable to have something
> very lightweight (a single file perhaps) that can be shipped with a
> project so there is no external dependency.

I find that we are very cautious with dependencies.  For example, we
copy small things like ConfigObj and argparse into our own tree to
avoid depending on them.  I was hoping, then, that we could get away
with code that was so small and trivial that it would make sense to
carry it in the project.  That statement might make more sense if I
implemented it - would that help?  I have a feeling that, for local
data package use, we can get away with a single 200-line Python file.

What about the argument of making it easier for people not using
python as a programming language - is that sensible?

>> * Support for data package versions.  We expect to have several versions of
>>  nibabel out in the wild, and maybe several versions of nibabel on a single
>>  machine.  The versions of nibabel may well need different versions of the data
> [...]
>
> Got it: versions are a must.

Did you have a plan for versions?   I found it was surprisingly
difficult to come up with something that made sense, in a situation
where you're frequently working on development versions.
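
For illustration only, here is a minimal sketch of one possible
ordering, assuming development versions are written with a trailing
``.dev`` (so ``0.3.dev`` comes before ``0.3``).  The functions
``version_key`` and ``best_version`` and the version syntax are
hypothetical, not something from ``datapkg`` or our spec.

```python
def version_key(version):
    """Sort key in which '0.3.dev' orders just before '0.3'.

    Assumes purely numeric components plus an optional trailing 'dev'.
    """
    parts = version.split(".")
    is_dev = parts[-1] == "dev"
    if is_dev:
        parts = parts[:-1]
    numbers = tuple(int(p) for p in parts)
    # Second element: 0 for a development version, 1 for a release.
    return (numbers, 0 if is_dev else 1)


def best_version(installed, minimum):
    """Return the highest installed version >= minimum, or None."""
    wanted = version_key(minimum)
    candidates = [v for v in installed if version_key(v) >= wanted]
    return max(candidates, key=version_key) if candidates else None
```

Note that this simple key treats ``0.3.0`` as newer than ``0.3``, which
may or may not be what we want.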

>> * Support for user and system installs of data. As for python package installs,
> [...]
>
> I think this is important but initially a lower priority.

It was part of our spec from the beginning, because we had always been
thinking about - say - linux package management.   Also, because data
packages can be large, we wanted to make sure that people were reusing
copies as far as possible.

>> * As far as I can see, there isn't a separation of system and user installs, in
>>  that there seems to be a (by default) sqlite 'repository' (right term?) that
>>  knows about the packages a user has installed, but I could not find an
>>  obvious canonical way to pool system and user installation information.  Is
>>  that right?
>
> That's correct. I have to say on a first pass I don't think this is so
> important -- but you can convince me I'm wrong. (It also seems relatively
> easy to "search" several paths).

Ah - well - but if I'm already writing the code to find the indexes,
then I only have to be able to read a few ini files, and I've done the
entire job of package discovery, no?
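
As a rough sketch of what that discovery could look like, assuming
each install location holds one ini index file with a section per
package: the file name ``datapkg.ini``, the section layout and the
function ``find_packages`` are all made up for illustration and are
not part of ``datapkg``.

```python
import configparser
import os


def find_packages(search_paths, index_name="datapkg.ini"):
    """Return {package name: info dict} from ini indexes on search_paths.

    Earlier paths win, so listing the user directory first lets a user
    install shadow a system install of the same package.
    """
    found = {}
    for root in search_paths:
        index_file = os.path.join(root, index_name)
        if not os.path.isfile(index_file):
            continue  # no index in this location
        parser = configparser.ConfigParser()
        parser.read(index_file)
        for name in parser.sections():
            if name in found:
                continue  # already found on an earlier path
            info = dict(parser[name])
            info["root"] = root  # remember where the package was found
            found[name] = info
    return found
```

Pooling user and system installs is then just a matter of passing both
directories, user first, in ``search_paths``.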

>> * Because the default repository is sqlite, anyone trying to read the
>>  installations that ``datapkg`` did, will need sqlite or something similar.
>>  They'll likely have this if they are using a default python installation, but
>>  not necessarily if they are using another language or a custom python install.
>
> There is currently support for several 'indexes'. The default is a
> 'db-based' one using sqlite but I have also implemented a simple
> 'fileindex' (i.e. walking a tree), one based on ckan and a simple
> in-memory dict one.

Yes, I was heartened by that - but the problem is, of course, that if
someone chooses the sqlite index, then they've made it relatively more
difficult to use the data from another utility or package.  Then we'd
have to say to our users 'use datapkg but be careful not to use the
sqlite option'.  How about something like XML or JSON instead of
sqlite?
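
To make the JSON suggestion concrete, here is a minimal sketch of a
plain-text index that any language with a JSON parser could read.  The
layout (a top-level ``packages`` mapping with ``version`` and ``path``
fields) and the function names are assumptions for illustration, not
an existing ``datapkg`` format.

```python
import json


def write_index(path, packages):
    """Write a {name: info} mapping to a human-readable JSON index."""
    with open(path, "w") as f:
        json.dump({"packages": packages}, f, indent=2, sort_keys=True)


def read_index(path):
    """Read the {name: info} mapping back from a JSON index file."""
    with open(path) as f:
        return json.load(f)["packages"]
```

A reader in another language only needs to parse the file and look up
the ``packages`` key; no database library is involved.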

> These look good but are entirely focused on 'loading' locally. Do you
> need material to get on/off the machine?

Ah - yes indeed - but it looked like you'd covered that pretty well :)

Also, for us, it's not that big a deal to say 'go to this web page to
download the zip file and then unpack it'.   We also felt confident we
could find web storage and permissions enough for our own needs.  Your
system is much nicer though.

>> We'd love to work together on stuff if that makes sense to you too...
>
> Absolutely, would be great to work together here.

At this stage, our cheerleader Eleftherios would say, go go team, and,
I do too...

See you,

Matthew

PS - if I don't answer email, it's because I don't have access - often
away from the internet here in Cuba, and back in the UK on the 18th.