[ckan-discuss] CKAN for chemistry

Rufus Pollock rufus.pollock at okfn.org
Tue Feb 22 20:23:45 GMT 2011


On 21 February 2011 13:29, Egon Willighagen <egon.willighagen at gmail.com> wrote:

> On Mon, Feb 21, 2011 at 2:10 PM, Jonathan Gray <jonathan.gray at okfn.org>
> wrote:
> > Egon: would be great to have any further input from you on what
> > changes you'd suggest on the basis of your experiences as a user!
>

Really great to have your input and feedback Egon.


> I would first like to get some consensus on how CKAN should be used. A
> clear definition or description of what should go into one record
> would be important at this moment.


This is a very good question and should definitely become an FAQ item,
perhaps "What is a CKAN Package?"!

The current heterogeneity in the use of CKAN 'packages' [1] is to some
extent a reflection of the heterogeneity of the current world of data.
Here's my view on how the system currently works and where it should go:

[1]: The main menu link says 'Add a dataset' on http://ckan.net/ but this is
a very recent, and experimental, user interface change on http://ckan.net
and the default install still says (the, perhaps, more accurate) 'Add a
Package'.

### What we have now

* Resource = a single file/API/etc. This is the fundamental building
block.

* Package = a ('coherent') collection of Resources -- 'a bunch of stuff you
want to get together or keep together'. This can cover (at least) 2 distinct
cases: a) one 'Dataset' split into lots of files (e.g. a wikipedia dump with
the dump split by page number) b) related resources e.g. the 'same' data in
different formats or in different forms.
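
To make the two cases concrete, here is a rough sketch in Python. It is not
CKAN's actual model code: the class names mirror the terms above, but the
fields, package names and example.org URLs are assumptions made up for
illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """A single file or API endpoint: the fundamental building block."""
    url: str
    format: str = ""


@dataclass
class Package:
    """A coherent collection of Resources that belong together."""
    name: str
    resources: List[Resource] = field(default_factory=list)


# Case (a): one dataset split across several files (hypothetical URLs).
wikipedia_dump = Package(
    name="wikipedia-dump",
    resources=[
        Resource(url=f"http://example.org/dump/part-{i}.xml", format="xml")
        for i in range(1, 4)
    ],
)

# Case (b): the 'same' data offered in different formats.
same_data = Package(
    name="example-dataset",
    resources=[
        Resource(url="http://example.org/data.csv", format="csv"),
        Resource(url="http://example.org/data.rdf", format="rdf"),
    ],
)
```

In both cases the Package is only a container; everything you can actually
download or query lives in a Resource.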

There is also some clear abuse of 'Packages' to represent 'groups' of
datasets / packages that are only very loosely associated.

In your article you mention LODD, but I have personally contributed to this
category too :) by registering some listings as packages -- see the tags
<http://ckan.net/tag/package-type.listing> and
<http://ckan.net/tag/package-type-listing> (though strictly ckan is a real
dataset in its own right!).

Going forward this kind of 'abuse' should stop and these cases should be
handled by "groups".

### Some background

The original idea of a 'Package' is that a data 'package' would be like a
software package -- a set of material that together did something
useful. Originally we did not have Resources because a Package could only
have one associated file ('download_url' attribute) and hence was like a
resource.

### The future

Here are some proposals for changes going forward. Please say what you
think!

0. Produce a guide on how this should work on http://wiki.ckan.net/

1. Make resources (more) first class entities.

 * Resources should get many of the attributes associated with 'Packages',
perhaps most importantly a license.
 * Resources should be directly registrable, uploadable and searchable.

2. Re-Naming.

While I like the idea and term 'Package' I'm concerned it doesn't mean much
to the average user :) The same could also be said of Resource but it is not
as bad. I therefore suggest renaming Package to Resource Collection.
(An alternative would be to rename Package to Dataset and have a Dataset
contain multiple Resources).

[optional]: Rename Resource to Dataset. (I'm not sure about this as Dataset
is an ambiguous term that often means something with multiple resources a la
wikipedia example above).

[alternative]: You mention Record or CatalogRecord (from dcat). This is also
a possibility though not one I particularly like atm.

3. Add an extra level between Resources and Packages. It is often useful to
collect Resources together within Packages (e.g. the different formats of a
given Resource). These would be termed resource groups. For more on this see
this proposal from earlier in the month:
<http://lists.okfn.org/pipermail/ckan-discuss/2011-February/000892.html>
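
As a rough illustration of what a resource group could look like, here is a
minimal Python sketch. It assumes each Resource carries a free-text `group`
label; that field, the function name and the example URLs are hypothetical
and are not taken from the linked proposal.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Resource:
    url: str
    format: str
    group: str  # hypothetical label naming the resource group


def group_resources(resources: List[Resource]) -> Dict[str, List[Resource]]:
    """Bucket a package's resources by group label, keeping input order."""
    groups: Dict[str, List[Resource]] = defaultdict(list)
    for res in resources:
        groups[res.group].append(res)
    return dict(groups)


# One package whose resources are the same data per year in two formats.
resources = [
    Resource("http://example.org/2010.csv", "csv", "year-2010"),
    Resource("http://example.org/2010.rdf", "rdf", "year-2010"),
    Resource("http://example.org/2011.csv", "csv", "year-2011"),
]
grouped = group_resources(resources)
```

Here `grouped["year-2010"]` holds the CSV and RDF versions of the same
Resource, which is the 'different formats' case mentioned above.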


> - how should mere dataset aggregations be handled? (Bio2RDF versus ChEMBL)
>

Depends what it is. If it is actually a new 'consolidated' dataset/package
then this should be a new package. Otherwise let's use groups.


> - how should alternative APIs be handled? (ChEMBL @ EBI versus a
> SPARQL end point)
>

Depends who the maintainer is. If one has an RDF dataset plus a SPARQL API, I'd
suggest putting these as different resources in the same package.
Otherwise, use different packages.
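
That rule of thumb can be sketched in a few lines of Python. This is only an
illustration of the suggestion above, not anything CKAN does for you; the
function name, maintainer names and URLs are all made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_packages(entries: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each maintainer to the resource URLs that share one package.

    `entries` holds (maintainer, resource_url) pairs for the 'same' data.
    """
    packages: Dict[str, List[str]] = defaultdict(list)
    for maintainer, url in entries:
        packages[maintainer].append(url)
    return dict(packages)


# Hypothetical maintainers and URLs, for illustration only.
entries = [
    ("upstream-team", "http://example.org/data.rdf"),
    ("upstream-team", "http://example.org/sparql"),
    ("third-party", "http://example.net/sparql"),
]
plan = plan_packages(entries)
```

The upstream RDF dump and SPARQL end point end up as two resources in one
package, while the third-party end point gets a package of its own.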


> The problem lies in that a record has a single maintainer, single
> license, single version associated, making it quite like a "data set
> instance". Then again, with so many SPARQL end points currently being
> separate from the upstream or original dataset, this does require some
> means of linking things together.
>

So here I think one wants different packages but with relationships
between those packages.


> catalogRecord:X :derivedFrom catalogRecordY
>

That's what we can use Package Relationships (or perhaps resource
relationships!) for. See <http://wiki.ckan.net/Package_Relationships>
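
To show the idea, here is a small Python sketch of 'derived from' links
between packages. It does not use the real CKAN relationships API; the
`Relationship` class, the "derives_from" label and the package names are
assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Relationship:
    subject: str  # name of the derived package
    kind: str     # relationship label, e.g. "derives_from" (hypothetical)
    object: str   # name of the package it was derived from


def upstream_of(package: str, relationships: List[Relationship]) -> Set[str]:
    """Follow 'derives_from' links transitively to find upstream packages."""
    found: Set[str] = set()
    stack = [package]
    while stack:
        current = stack.pop()
        for rel in relationships:
            if (
                rel.subject == current
                and rel.kind == "derives_from"
                and rel.object not in found
            ):
                found.add(rel.object)
                stack.append(rel.object)
    return found


# Hypothetical package names: a SPARQL end point built on an RDF conversion
# that was itself built from the original dataset.
relationships = [
    Relationship("chembl-sparql", "derives_from", "chembl-rdf"),
    Relationship("chembl-rdf", "derives_from", "chembl"),
]
```

With links like these, `upstream_of("chembl-sparql", relationships)` walks
back to both the RDF conversion and the original package, even though each
has its own maintainer, license and version.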


> At the same time, and think I have seen this used, the SPARQL end
> point might be a mere URL along with the 'original' dataset.
>

See previous comments.


> It seems to me that people have been using custom fields, just to make
> the data somewhat consistent.
>

I think it would also be useful to produce a guide for work in a given area
-- it is clear there is going to be some variation between people working
in, say, Chemistry and Economics.


> It would be good that it is more clear how the catalog should be
> filled, and datasets properly annotated, before we start that
> LODD/IsItOpenData hacking session in March...


A big +1

> Egon
>
> PS. Something that is very simple to fix is to add this combined license:
>
> "Creative Commons Attribution Share-Alike"
>
> which is used by, for example, ChEMBL.
>

That license is in fact already in there -- the OKD-Compliant::Creative
Commons Sharealike is in fact Attribution-Sharealike (there is no CC
Sharealike w/o attribution). To clarify this the naming of the license has
now been corrected.

PPS: one question you mention in your article was how to delete things. See
this FAQ:

<http://wiki.ckan.net/FAQ#How_Do_I_Deal_with_Duplicate_Packages>

Rufus