[ckan-dev] Proposals for simplifying dpm

Rufus Pollock rufus.pollock at okfn.org
Sat Dec 10 12:55:46 UTC 2011


To give some context for others:

There's been discussion about data package manager (dpm) having more
support for what we term 'source' packages in addition to normal (more
binary-like) packages.

See:

* [super] Functionality for working with data in "source" form -
https://github.com/okfn/dpm/issues/20

* https://github.com/okfn/dpm/blob/master/doc/new-plan-2011-nov.rst

On Friday, 25 November 2011, Daniel Graziotin wrote:
>
> Hi Rufus and all dpm developers,

Sorry for the very slow reply, Daniel! BTW: I would encourage you (and
anyone else wanting to work on dpm), if you have something you'd like
to do, just to dive in and steam ahead -- I don't want people to get
held up having to wait on me :-)

> I was thinking about the ease of use (and development effort) of dpm. I
> see contributing to datasets like a Wikipedia page contribution - plus
> the resources - not a continuous bouncing of patches between
> developers. I think that making dpm acting like git is a very high

This is a really important (and ongoing) question: is data work more
like code (more heavily structured, more intensive) or more like
Wikipedia (large numbers of small contributions, less structured and
more extensive). I've generally been of the opinion that is, in
general, more like code though I think there are a bunch of
circumstances

>
> target. What about a package manager with some subversion/rsync
> capabilities?

To be clear, source package stuff would still be different from git /
hg -- we wouldn't support full-on versioning of data -- but we would
have commands that looked more like those found in source code. Rather
than repeat the details here, I refer you to the new doc I've just
posted (but wrote about a month ago!):

<https://github.com/okfn/dpm/blob/master/doc/new-plan-2011-nov.rst>

Regarding your specific questions of rsync and subversion (which I
think are somewhat orthogonal):

* rsync: this would be more suited for "blob" data e.g. that in CKAN
storage. The problem there is that for rsync to work we need an rsync
daemon on the storage system, and we regularly use S3 and Google
Storage, which don't support the rsync protocol (furthermore, we don't
allow any overwriting at all of individual files atm, though that
could be changed).

* subversion: again we need to install something server side and I'm
not sure why we'd use this rather than, say, git / hg itself.

More generally: if we really want efficient, incrementally updatable
server-side storage I'd go for something like bup:
<https://github.com/apenwarr/bup>

However, the problem with all of these is that they require something
server-side (impossible with S3 and GS ...). Moreover, I think the real
benefit of syncing is with structured storage, not blob storage (e.g.
webstore or couchdb).

> Example:
>
> - dpm add resource.ext
>   (Adds a local resource to datapackage.json - i.e., package.resources).
> To add external resources, something beautiful would be
>
> dpm add http://blah.gov.ext/study/resource.csv (just links the resource
> URL and name to datapackage.json).

Completely agree -- see existing ticket <https://github.com/okfn/dpm/issues/12>
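
To make the idea concrete, here is a rough sketch of what such an add
step could do. This is not dpm's actual code: the function name
add_resource and the entry keys "name", "url" and "path" are
assumptions made for illustration; only the datapackage.json file and
its resources list come from the discussion above.

```python
import json
import os
from urllib.parse import urlparse


def add_resource(package_dir, ref):
    """Append an entry for `ref` (local path or URL) to datapackage.json.

    Hypothetical helper: the keys "name", "url" and "path" are assumed,
    not taken from dpm itself.
    """
    meta_path = os.path.join(package_dir, "datapackage.json")
    with open(meta_path) as f:
        package = json.load(f)

    is_url = urlparse(ref).scheme in ("http", "https")
    # For a URL the name is the last path segment; for a file, its basename.
    name = os.path.basename(urlparse(ref).path if is_url else ref)
    entry = {"name": name, "url" if is_url else "path": ref}

    resources = package.setdefault("resources", [])
    # Skip the append if a resource with this name is already listed.
    if all(r.get("name") != name for r in resources):
        resources.append(entry)

    with open(meta_path, "w") as f:
        json.dump(package, f, indent=2)
    return entry
```

An external resource is only linked (URL and name are recorded, nothing
is downloaded), which matches the behaviour Daniel suggests.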

> - dpm delete resource.ext
>   (Deletes the resource file and the relative entry in datapackage.json).

Yes. (I wonder about deleting the actual resource as well -- or do we
just unlink it from the dataset?)

> - dpm commit
>   Uploads the new resources, the modified resources (by looking at
>   the hash value) & the updated dataset to CKAN.

Yes, cf. the recent doc. This is dpm push.
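
The hash comparison Daniel describes could be sketched roughly as
follows. Everything here is an assumption for illustration rather than
dpm's implementation: the helper names file_hash and changed_resources,
the use of SHA-256, and the "path" and "hash" keys on each resource
entry.

```python
import hashlib
import json
import os


def file_hash(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_resources(package_dir):
    """List local resources whose content differs from the recorded hash.

    Assumes (hypothetically) that each local resource entry has a "path"
    and, once pushed, a "hash" key in datapackage.json.
    """
    with open(os.path.join(package_dir, "datapackage.json")) as f:
        package = json.load(f)

    changed = []
    for res in package.get("resources", []):
        path = res.get("path")
        if not path:
            continue  # external (URL-only) resource: no local file to hash
        full = os.path.join(package_dir, path)
        if not os.path.exists(full):
            continue  # file missing locally: nothing to upload
        # A resource with no recorded hash counts as new, hence changed.
        if file_hash(full) != res.get("hash"):
            changed.append(res.get("name", path))
    return changed
```

A push would then upload only the resources this returns, and the same
list is what a status-style command would print.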

> - dpm update
>   Fetches the dataset files from CKAN & the resources if they changed.

This is dpm pull.

Clearly we are thinking along the same lines :-)

> In case of conflict, overwrite them or at least abort the operation and warn the
> user. External resources - that are missing a hash value - will be
> (re)downloaded if the user confirms it.

> - dpm status
>   Explains which resources would be uploaded in case of a dpm commit.
>
> dpm reset (with confirmation)
> Destroys the local package; downloads the dataset and the
> resources again.

Why bother? Just dpm clone again :-)

> It does not seem efficient and requires that users are aware of the
> behaviors, but a well written documentation would help. Most
> importantly, it ensures that everybody has the latest version of
> everything.
>
> I am not belittling the current plans. By lowering the target, I think
> that a beautiful working tool may appear in a shorter time. I am quite
> sure that a typical dpm user will be satisfied with these
> functionalities.

I think it sounds like we agree and that you took the talk of git/hg
too seriously ;-). It is more about the similarity of commands (and
the fact we are dealing with 'source' data) rather than versioning (at
least, at the moment -- there are plans afoot dependent on a decent
syncing protocol where the incremental pushing / pulling could happen
at least with webstore / other structured data backend).

> This is just a stream of consciousness, there may be many things left
> out and is just a start point for launching further ideas.

This is great stuff, Daniel, and sorry for the slow reply -- it has been
great to have your contribution so far and I hope we can do more!

rufus



