[ckan-discuss] FW: Re: Provenance of datasets on CKAN

William Waites ww at styx.org
Sat Feb 19 12:26:51 GMT 2011


* [2011-02-18 12:37:49 +0100] Stefano Costa <stefano.costa at okfn.org> wrote:

] Not sure I understand correctly, but doesn't the current "hash" provide
] a lo-fi version of this same approach? Just wondering if it could be
] extended rather than started from scratch.

Not sure which hash you're talking about, but yes, if there
were a hash for a resource it would help. This is the strategy
taken by BSD ports/pkgsrc and Gentoo Portage, where anything
to be downloaded is checked against a hash maintained
by the package maintainers. That said, those systems are a
little more closed, with package maintainers and changes
to packages being vetted to a greater or lesser extent. On
the bright side, we do have metadata history, so if some
anonymous person changes something it can be tracked down
and, if necessary, rolled back.
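
As a rough illustration of the ports-style check, a downloaded resource
could be compared against a recorded hash along these lines. This is only
a sketch: the function names are hypothetical, SHA-256 is an assumption,
and none of it reflects how CKAN actually stores or checks hashes.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large data files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_resource(path, expected_hex):
    """Return True if the file matches the hash kept by the maintainer."""
    return sha256_of(path) == expected_hex.lower()
```

A caller would download the resource, then refuse to use it if
verify_resource() returns False.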

This doesn't address changes and derived data, though.
Continuing the BSD ports analogy, the installed version
of some software is derived from the source (checked
against the hash) by a known process that is encoded in
Makefiles. It might be worthwhile considering this kind of
approach rather than using Python snippets or ScraperWiki
fragments directly. Python is great, but there are many
ways to transform and process data and not everybody likes
to program in Python. Makefiles are designed to describe
this type of process in a general way without presupposing
what tools are used to do the work.
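
To make the idea concrete, the core of a Makefile rule (a target, its
sources, and an arbitrary command) can be sketched as below. It is written
in Python only for illustration; the names are hypothetical, and the point
is that the command can invoke any tool at all.

```python
import os
import subprocess


def is_stale(target, sources):
    """A target is stale if it is missing or older than any source."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(s) > target_mtime for s in sources)


def build(target, sources, command):
    """Run `command` (any shell command) only when `target` is stale.

    Returns True if the command was run, False if the target was
    already up to date.
    """
    if is_stale(target, sources):
        subprocess.run(command, shell=True, check=True)
        return True
    return False
```

The rule itself says nothing about whether the command is awk, R, a
Python script or anything else, which is the property being argued for.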

Also, it has been my experience that while it can take longer
to build a system in this way, it ends up being more
coherent, with less extraneous cruft, than the packaged-binary
approach. The two are not necessarily exclusive:
installing something from ports builds a packaged binary
and installs it, but it is the accessibility of the build
process that is important. This kind of coherence is
probably a desirable characteristic for data processing
as well: set up a pipeline of dependencies, reconstruct
the intermediate parts, and produce output with some
confidence in the integrity of the process, particularly
if something like hashes of the intermediate data are
also known.
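
A minimal sketch of such a pipeline, recording a hash for the source and
for each intermediate result, might look like the following. The step
names and transformations are made up for illustration, and SHA-256 is
again an assumption.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def run_pipeline(source: bytes, steps):
    """Apply each (name, function) step in order.

    Returns the final output and a manifest listing the hash of the
    source and of every intermediate result.
    """
    manifest = [("source", digest(source))]
    data = source
    for name, step in steps:
        data = step(data)
        manifest.append((name, digest(data)))
    return data, manifest


# Hypothetical steps, standing in for real data transformations.
steps = [
    ("strip_blank_lines",
     lambda b: b"\n".join(line for line in b.splitlines() if line.strip())),
    ("uppercase", lambda b: b.upper()),
]
```

Anyone who reruns the same steps on the same source can compare their
manifest against a published one and see at which stage, if any, the
results diverge.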

Cheers,
-w


-- 
William Waites                <mailto:ww at styx.org>
http://river.styx.org/ww/        <sip:ww at styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45


