[ckan-dev] publisher model

Tue Apr 26 14:19:41 UTC 2011

Hi,

On 26 April 2011 10:02, Friedrich Lindenberg
<friedrich.lindenberg at okfn.org> wrote:
> Hi Seb,
>
> On Fri, Apr 22, 2011 at 11:39 AM, Seb Bacon <seb.bacon at okfn.org> wrote:
>> Something I meant to follow up with you was your dream about sorting
>> the publisher model.
>>
>> I'm not entirely clear what you mean by this.  Could you explain?
>
> By now, I'm actually a proponent of "going Github": have a single
> domain entity "Publisher" from which both users and institutional
> publishers are derived and an n-1 relation between datasets and
> publishers

So publisher has many datasets, and a dataset has only one publisher.
I presume by "going Github" you mean to preserve the provenance of a
dataset through maintaining a graph of publishers or datasets?

<snip>
> I think at the moment, CKAN embodies a false notion of shared
> ownership which is neither true for institutional environments, nor
> for data wranglers (listen to what we actually say: "I have this
> dataset that I worked on"). We want CKAN to become a public data
> workbench, but at the same time workbenches are things that are very
> specific to their owners (everything else is an assembly line).

Yes.  I suspect we may have to change the terminology here:

(1) "Publisher" has several different and specific meanings depending
on the metadata standard or other context.  Perhaps we could invent
our own term to disambiguate, e.g. "Foundry" or "Workshop" (following
your workbench analogy).  (We also need to preserve the original
author somehow; which term do we currently use for this?  Author?)

(2) To me, "Dataset" has some implication of a package of resources
that were originally released together, somehow -- it implies intent
by the original author.  I understand that you are talking about a
collection of resources for the purposes of data wrangling;
personally, I like "Workbench" for this.

So we could have something like:

- Resource: a CSV file or TXT file or similar
   - e.g. lat/lon of fire incidents in England
- Workbench: a collection of Resources which a user has gathered
together to answer some data question
   - example question: "what are the top five administrative areas in
the UK for fire incidents?"
   - example resources:
      - lat/lon of fire incidents in England / Wales / Scotland /
Northern Ireland
      - UK local administrative boundary shapefiles
      - UK local administrative area names
- Workshop: corresponds to a user account or an institutional account
   - e.g. "UK Cabinet Office" or "Joe Smith"
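
To make the three terms concrete, here is a rough sketch of how they
might relate. The class and field names are my own invention for
illustration (not CKAN's current schema), and the example.org URLs are
placeholders. A Workbench has exactly one owning Workshop, and a
Resource stands on its own so that two Workbenches can share it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Resource:
    """A single file, e.g. a CSV or TXT (hypothetical fields)."""
    name: str
    url: str


@dataclass(frozen=True)
class Workshop:
    """A user account or an institutional account."""
    name: str


@dataclass
class Workbench:
    """Resources gathered by one Workshop to answer a data question."""
    question: str
    owner: Workshop  # n-1: many Workbenches, exactly one owner each
    resources: List[Resource] = field(default_factory=list)


# Placeholder resources; the URLs are not real.
fires = Resource("England fire incidents", "http://example.org/fires.csv")
areas = Resource("UK admin area names", "http://example.org/areas.csv")

joe = Workshop("Joe Smith")
cabinet = Workshop("UK Cabinet Office")

top_five = Workbench(
    question="Top five administrative areas in the UK for fire incidents?",
    owner=joe,
    resources=[fires, areas],
)
# A second Workbench, owned by someone else, reusing the same Resource.
release = Workbench(question="Primary release", owner=cabinet,
                    resources=[fires])
```

Because Resource is independent of Workbench, the same `fires` object
appears in both collections without being copied.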

Question: for an institutional user, would a primary source release
(e.g. http://data.gov.uk/dataset/financial-transactions-data-whittington-nhs-trust)
still be a Workbench, albeit a specially flagged one?

> (this starts to make sense with resources that are
> independent of datasets, so my dataset and your dataset may share a
> resource; plus it makes authz a lot simpler).

I believe there's general agreement that this is the right direction.

Exactly how does it make authz simpler?  Something like: a Resource
and a Workbench would only have one owner (Workshop), and people would
fork Workbenches or make brand new ones if they wanted to edit them?
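
If that reading is right, the edit check collapses to a single
ownership comparison, and forking is what keeps the provenance link.
A minimal sketch under that assumption (the names `can_edit`, `fork`
and `forked_from` are hypothetical, not existing CKAN functions):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Workbench:
    name: str
    owner: str  # name of the single owning Workshop
    resources: List[str] = field(default_factory=list)
    forked_from: Optional["Workbench"] = None  # provenance link


def can_edit(workshop: str, bench: Workbench) -> bool:
    """Only the owning Workshop may edit; there are no shared owners."""
    return workshop == bench.owner


def fork(bench: Workbench, new_owner: str) -> Workbench:
    """Copy a Workbench for a new owner, remembering where it came from."""
    return Workbench(
        name=bench.name,
        owner=new_owner,
        resources=list(bench.resources),  # copy, so edits stay separate
        forked_from=bench,
    )
```

Someone who cannot edit the original forks it and edits their own
copy; following `forked_from` back gives the chain of provenance.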

We also, of course, need to preserve some notions of authz groups,
etc.  For example, institutional environments I've worked with want to
be able to assert some of the following statements:

 - The official originator of this data is Foo Department
 - Only Sue and Fred of Foo Department can change this data
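
Those two statements could be recorded separately from ownership, for
instance as an originator plus an explicit list of editors. This is
only a sketch; `AuthzPolicy` and its fields are made up for
illustration and do not describe CKAN's existing authz groups.

```python
from dataclasses import dataclass, field
from typing import FrozenSet


@dataclass(frozen=True)
class AuthzPolicy:
    originator: str  # "The official originator of this data is ..."
    editors: FrozenSet[str] = field(default_factory=frozenset)

    def can_edit(self, user: str) -> bool:
        """Only the named editors can change the data."""
        return user in self.editors


policy = AuthzPolicy("Foo Department", frozenset({"Sue", "Fred"}))
```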

If we moved to a workshop / workbench type model, which are
collections of Resources as first-class citizens

> Hope this makes some sense,

I think so -- does my (re)interpretation above match your sense?

> [OT]
> re data wrangling:
>
> https://bitbucket.org/pudo/iati/src
> https://bitbucket.org/okfn/ukgov-25k-spending/src

These are very good use cases for the kinds of data wrangling we want
users to be able to do easily, I think.

Seb