[ckan-dev] "Data Package" specification page

Matthew Brett matthew.brett at gmail.com
Mon Jul 4 19:15:32 UTC 2011


Hi,

On Mon, Jul 4, 2011 at 6:19 PM, Friedrich Lindenberg
<friedrich.lindenberg at okfn.org> wrote:
> Hi Matthew,
>
> On Mon, Jul 4, 2011 at 6:45 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
>> I noticed this statement:
>>
>> "Data packages are nothing but metadata"
>>
>> Could you clarify what you mean?
>
> This is probably an oversimplification but part of a larger
> discussion: when we're talking about a data package - do we mean the
> sum of all the referenced data or just the references?

Our own case is that we need to distribute a series of medical images
in packages.

As a matter of interest - what is your use-case?

> There are
> several ways in which this could be answered:
>
> 1) Include all data, even if it's TBs of stuff (e.g. scientific data) -
> most useful model but also raises issues such as: when a referenced
> resource changes, how do we know about this?

Sub-packages if the data is likely to change in particular known patterns?

> 2) Consider just core metadata or core metadata and
> processing/provenance/status/quality metadata.

You mean what is now meant to be in the datapackage.json file?
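To make sure we're talking about the same thing - a minimal datapackage.json as I currently understand it might look something like the following. The field names here are my reading of the draft, not settled spec:

```json
{
  "name": "medical-images-example",
  "title": "Example medical imaging dataset",
  "resources": [
    {
      "url": "http://example.org/data/scan-001.nii.gz",
      "format": "nii.gz"
    }
  ]
}
```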

> 3) Distinguish between reasonably and unreasonably sized resources,
> mark them. This is then analogous to the various
> pass-by-reference/pass-by-value discussions we have in computing
> generally.

What does the 'reference' resolve to when you want the data?
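To make the question concrete, here's a rough Python sketch of how a client might resolve one resource entry - inlined data, a local path, or a remote URL. The "data"/"path"/"url" field names are my assumption for the sake of the sketch, not something the draft spec fixes:

```python
import os
import urllib.request


def resolve_resource(resource, base_dir="."):
    """Return the bytes for one resource entry from a datapackage.json.

    Field names ("data", "path", "url") are assumed for this sketch.
    """
    if "data" in resource:
        # Inlined small resource: the value *is* the data (option 4).
        return resource["data"].encode("utf-8")
    if "path" in resource:
        # Pass-by-value: data shipped alongside the metadata file.
        with open(os.path.join(base_dir, resource["path"]), "rb") as f:
            return f.read()
    if "url" in resource:
        # Pass-by-reference: fetch from wherever the reference points,
        # with no guarantee the remote resource hasn't changed.
        with urllib.request.urlopen(resource["url"]) as f:
            return f.read()
    raise ValueError("resource has no data, path or url")
```

The awkward case is the last branch: a bare URL gives you no way to know whether what you fetched is what the package author described.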

> 4) Inline smaller resources into the metadata catalogue (repository, then)

Yes, could do, but no need if we are already storing data apart from
the metadata file.

> I'm not really sure here, but coming from a practical point of view
> I'd like to have as much management of my resources as I can get
> without having to switch all my other tools and practices. In other
> words: CKAN should help me describe what I do, not dictate how I do
> it.

The API should surely correspond to common use-cases, but it does need
to manage the data somehow, I would say.

>> You are proposing (I think) a DVCS frontend to data, as a CLI, where
>> the history is stored in an upstream server.  Would this differ from
>> standardizing to SVN?  I'm not proposing that, I just wanted to get an
>> idea of where you differ...
>
> You don't really need SVN if you're talking about the metadata only -
> it's just a single object which can be managed more specifically and
> through a nicer, RESTful interface. Once you do start to include the
> data, you want VCS - and I've been spending quite a bit of time
> wondering if we shouldn't make CKAN package pages into HG repos the
> same way e.g. bitbucket.org pages are (this is trivial for HG, git
> should also be possible).

Yes, various people have thought about git as a frontend to remote-local
storage with large files.  As I understand it, hg has almost the same
data model, so the implementations might well be similar.

https://github.com/apenwarr/bup
http://code.google.com/p/gittorrent/wiki/MirrorSync
http://git-annex.branchable.com/

Cheers,

Matthew
