[okfn-dev] Thinking more about datapkg

Thu Mar 10 05:49:16 UTC 2011

Hi,

Sorry - I know this discussion is going super slowly...

I've also posted to the nipy-devel list because we're still trying to
find our way to a good solution to this problem.

On Mon, Feb 14, 2011 at 5:31 AM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> On 12 January 2011 18:28, Matthew Brett <matthew.brett at gmail.com> wrote:
>> Hi,
>>
>> Rufus and I sat down for a while over the new year to think about
>> datapkg design.
>>
>> This email is not really a summary of that discussion, but thoughts
>> that came to me after the discussion.  I think we are hoping for
>> feedback.
>
> We also had some nice scribbled diagrams -- have you got scans of these Matthew?

Sorry - no - I have a phone that is so ugly I forget it has a camera...

>> One thing we discussed was the idea of the set of metadata about the
>> package as a 'catalog entry'.
>>
>> I was playing with the idea of the catalog entry.
>>
>> Maybe a data package can be any collection of bytes, for which the
>> only necessary criterion is: we know how to get the bytes; we know how
>> to get the name.
>
> I think this is a key point: keep things as simple as possible and
> don't assume (as with software) that we are always dealing with files
> (we could have an API).
>
>> Start with an example.
>>
>> I've got some files in an archive named
>>
>> mydata-0.3.tar.gz
>>
>> I know how to get the bytes (because it's a tar.gz file).  The 'name'
>> is 'mydata-0.3'.    In this case, the catalog entry can be compiled by
>> guessing:
>>
>> name = mydata-0.3
>>
>> format = tar.gz
>>
>> Implied are:
>>
>> revision =
>> version =
>>
>> To publish 'mydata-0.3.tar.gz', I can make this trivial catalog entry,
>> or ask datapkg to make it, and then just add where I can get the data
>>
>> name = mydata-0.3
>> format = tar.gz
>> url = http://www.mydomain.org/files/mydata-0.3.tar.gz
>>
>> Now I just have to put this catalog entry somewhere (ckan, etc).
>
> [...]
>
>> That means, that there need be nothing specific about an archive, that
>> makes it a data package, but, of course, I can also make the catalog
>> entry be part of the archive.  That might be using (as now) a standard
>> name - catalog.json or something.
>
> Other points I remember that are important:
>
> * Distinction between a Package and a PackageRevision - first is the
> abstract thing 'PackageX' and the latter is PackageX as some
> version/revision (something I can actually get).
>
> * Use JSON for metadata and catalog file. (I've started work on
> converging on json in datapkg now that 0.8 is out the door [1])
>
> * Simple index file called catalog.json (and talked about relation
> between an index of things that could be installed versus list of
> things that were installed)

I'm sorry - this is only partly addressing your points, and I'm just
thinking aloud.

What I'm trying to get at here is how to do versions and
revisions, how to have the idea of 'user' and 'system' packages, and
what is in the 'index'.

We've got (above) the idea of a Package (abstract thing, a name, like
'the Linux kernel' or 'Ivo of Chartres project texts' or 'X'):

class Package(object):
    def __init__(self, name):
        self.name = name

From time to time, instantiations of 'X' appear.  These instantiations
may be labeled with a version string ('0.5.3-rc1') and they may be
labeled with some identifier we might call a 'RevisionID' which is
meant to uniquely identify this state of the Package.  This would be
like an SVN revision number or a git or hg hash.  It's up to the
author to keep to this contract as they like.

class PackageInstantiation(object):
    def __init__(self, package, version=None, revision_id=None):
        self.package = package
        self.version = version
        self.revision_id = revision_id

Then there's a package 'distribution'.  I think this is what you are
calling a PackageRevision.  I prefer your previous term 'distribution'
because 'PackageRevision' clashes with the 'revision_id' idea above,
although, as I think we were saying before, it's a little confusing
because it makes you think of something like a Debian distribution,
that is, a collection of packages.  But anyway.  A distribution is some
concrete you-can-get-bytes-from-me thing, which can come in various
formats.  Let's say one is a .zip file:

class Distribution(object):
    format = None

    def __init__(self, instantiation, *args, **kwargs):
        self.instantiation = instantiation

    def register_to(self, where):
        pass # do something - to be decided

    def get_meta(self):
        pass # get the metadata from the distribution contents somehow

class ZipDistribution(Distribution):
    format = 'zip'
    def __init__(self, instantiation, zipfilename):
        self.instantiation = instantiation
        self.zipfilename = zipfilename

Another might be a directory on disk

class PathDistribution(Distribution):
    format = 'path'
    def __init__(self, instantiation, path):
        self.instantiation = instantiation
        self.path = path
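
To make get_meta a bit more concrete for the zip case, here is a rough
sketch.  It assumes the archive carries its catalog entry as a top-level
'catalog.json' (the standard name floated earlier in the thread); the
function name and the empty-dict fallback are only my assumptions, and
nothing here is decided.  I've written it as a standalone function so it
runs on its own:

```python
import json
import zipfile


def get_zip_meta(zipfilename, catalog_name='catalog.json'):
    """Return the catalog entry stored inside a zip archive as a dict.

    Hypothetical helper: assumes the metadata lives in a JSON file called
    `catalog_name` at the top level of the archive.  Returns {} when the
    archive has no such file, so unstructured zipfiles still work.
    """
    with zipfile.ZipFile(zipfilename) as zf:
        if catalog_name not in zf.namelist():
            return {}
        with zf.open(catalog_name) as fobj:
            return json.loads(fobj.read().decode('utf-8'))
```

ZipDistribution.get_meta could then simply return
get_zip_meta(self.zipfilename), and PathDistribution could do the same
thing with a plain file read.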

So far the system knows nothing about anything.  Now:

>>> import datapkg
>>> datapkg.discover_packages(['user', 'system']) # look for all packages in local locations
[]

OK.  Let's say I have some unstructured zipfile somewhere, and I want
to tell 'the system' that it's a distribution.

inst = PackageInstantiation(Package('my-package'), version='0.1',
                            revision_id='231')
zipdist = ZipDistribution(inst, 'my-zipfile.zip')
zipdist.register_to('user')

Now:

>>> datapkg.discover_packages(['user', 'system']) # look for all packages in local locations
[ZipDistribution('my-package', '0.1', '231', 'my-zipfile.zip')]

What might have happened in this case is that there is now a
file ``.datapkg/user.packages`` of the form:

[my-package]
version = 0.1
revision_id = 231
format = zip
zipfile = /path/to/my-zipfile.zip

So - I believe this is _not_ the metadata as datapkg currently has it.
My feeling is that the metadata should be in one place only, that is,
with the package.  Thus, we might do a 'search':

>>> datapkg.searchfor('some-string', sources=['user', 'system'])

at which point we'd fetch the metadata from the distributions and search it:

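def searchfor(target_string, sources):
    # Collect every (distribution, key, value) whose metadata mentions the string
    results = []
    for dist in datapkg.discover_packages(sources):
        meta = dist.get_meta()  # assumed to return a dict of strings
        for key, value in meta.items():
            if target_string in key or target_string in value:
                results.append((dist, key, value))
    return results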

Of course, for non-local sources, we could cache these results
somehow.  In that case, it might seem as if the metadata is in one
place only, but in fact, transparently, it's been written to a cache
in order to speed up queries.  I think that cache is what you
currently have in your datapkg sql database?
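
Just to show how little the cache needs to be, here is a minimal sketch.
It is not how datapkg does it now; the function name, the JSON cache
file and the 'fetch_meta' callable are all hypothetical, and a real
cache would need some way to go stale (keying on revision_id might be
enough):

```python
import json
import os


def get_meta_cached(dist_key, fetch_meta, cache_path):
    """Return metadata for `dist_key`, fetching it at most once.

    `fetch_meta` is any zero-argument callable that does the slow work
    (for example a download); `cache_path` is a JSON file on disk.
    All three names are hypothetical.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fobj:
            cache = json.load(fobj)
    if dist_key not in cache:
        # Cache miss: do the slow fetch and write the result through to disk
        cache[dist_key] = fetch_meta()
        with open(cache_path, 'w') as fobj:
            json.dump(cache, fobj)
    return cache[dist_key]
```

The caller still sees the metadata as coming from the distribution; the
file on disk is only there to make the second query fast.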

So, that would allow

1) versioning (if the distributor wanted it),
2) revision ids like commit hashes etc (if the distributor wanted it),
3) local user, local system, and remote places that distributions could be
known from.  For example, source 'user' might by default be read from the
contents of '.datapkg/*.packages', 'system' from
'/etc/datapkg/*.packages' and 'ckan' from
'http://www.okfn.org/pacakges/main.packages',
4) very simple rules for other software to follow to register
itself with the datapkg system ('write an entry in the
user.packages file'), and to use it.
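
To give a feel for how simple point 4 could be, here is a sketch of
registering and discovering entries using only the standard library.
It assumes the '*.packages' files are INI-style, as in the example
above; the two function names are hypothetical, and it only handles
local directories, not remote sources:

```python
import configparser
import glob
import os


def register_entry(packages_file, name, **fields):
    """Add or replace the section for `name` in an INI-style packages file."""
    parser = configparser.ConfigParser()
    parser.read(packages_file)  # a missing file is silently skipped
    parser[name] = {key: str(value) for key, value in fields.items()}
    with open(packages_file, 'w') as fobj:
        parser.write(fobj)


def discover_entries(source_dir):
    """Return {package name: fields} from every *.packages file in a directory."""
    parser = configparser.ConfigParser()
    parser.read(sorted(glob.glob(os.path.join(source_dir, '*.packages'))))
    return {name: dict(parser[name]) for name in parser.sections()}
```

Other software would only need the equivalent of register_entry to make
itself known, and datapkg would only need something like
discover_entries on each source to build its list.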

Does that increase or decrease enthusiasm or insight?  Sorry for the
long delay in posting.

See you,

Matthew