[ckan-discuss] Package Resources Proposal

Wed Feb 2 14:57:34 GMT 2011

Hi all,

In conjunction with David Raznick I'd like to propose a two- or
three-phase improvement to the way we handle package resources in CKAN.
These changes are required to help deliver our work for data.gov.uk, but
I think they are useful enough for all CKAN instances that the changes
should go in core.

I'm sending this to ckan-discuss and ckan-dev because I'd like
to know if anyone thinks that the suggestions in the second phase
might actually make CKAN harder to understand and use, in which case
we can stop at Phase 1 and just use the internal structures for DGU and
UKLP. I'd also like to know if you think the phase 1 structures would
support other projects too.

There is a related ticket at http://ckan.org/ticket/945 and page at
http://ckan.org/wiki/UseCasesResources with further ideas but I hope
the info below will summarise the key aspects.

Please see below...

Cheers,

James

Phase 1
=======

In phase 1 nothing in CKAN will change either in the API or the UI.
The form will be the same. The only changes will be under the hood and
they will be to:

* Give resources new attributes for:

  type
      Can be "data", "metadata", "service" or NULL. There is a need to
      do this for UKLP where some of the data associated with a
      package is metadata, some is data and some could be a service. We
      could have different tables for the different types of resource
      or keep things simple at this stage by just having a type.

  size
      The size in bytes for "data" or "metadata" resources; can be
      NULL

  last_present
      The last time the link checker was able to download the
      resource

  openness
      A rating of how open the resource is (5 stars of openness), or 0
      if the resource isn't currently present.

  Existing package resources would get a type of NULL and the link
  checker would quickly fill in the rest.

  The attributes could be implemented with David's existing work for
  key/value pairs on resources rather than as columns in their own
  right.

* Group sets of related data together

  The UK government requires departments to release certain data sets
  once a month. They'd like to be able to highlight which datasets are
  for which month rather than just have a huge list of resources. We
  therefore need to be able to form a series of data within a *single*
  package. 

  We'd introduce a ``data_group`` table to handle this. It could link
  to the package and have columns:

  package_id
      foreign key to package

  label
      a label that describes this entry in the data_group, e.g. the
      date "May 2010"; can be NULL

  sort_order
      a string used to determine the sort order, e.g. "2010-05"; can
      be NULL

  The alternative way of implementing the same thing is to have a
  package for each release, but some series are huge and would swamp
  the other types of package.

  I'd rather keep the existing package relationship functionality for
  other types of association such as "related to", "depends on" or even
  "you might also be interested in".

* Group different formats of the same data together.

  For example, if the Treasury publishes spending for May 2010 in XLS,
  CSV and XML format, I want to know that the data in each of these
  files is the same so that I can display them all under the "May 2010"
  label obtained from ``data_group``. To do this we can add a new table
  named ``data`` which would link individual resources to a data group.

  There may also be times when we want to show that a set of data is 
  missing when it is supposed to be there. We can do this by having 
  an entry in the ``data`` table but associating no resources with it.

  We may also want certain releases to be uploaded but not released
  until a certain time. We can do that with a flag on the ``data``
  item rather than needing a flag on each resource.

  Columns could be:

  data_id
      Foreign key to the data table

  visible_after
      Date, can be NULL

  The resource table would then have a foreign key to ``data.id``
  rather than ``package.id``
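
To make the three bullets above concrete, here is a minimal sketch of
the tables using Python's built-in ``sqlite3`` module. It is only an
illustration, not CKAN code. Assumptions of the sketch: each
``data_group`` row is one labelled entry in a package's series, the
column linking ``data`` to ``data_group`` is called ``data_group_id``,
dates are stored as ISO strings, and the ``url`` and ``format`` columns
and the helper names ``demo_connection`` and ``releases`` are
hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE data_group (
    id INTEGER PRIMARY KEY,
    package_id INTEGER NOT NULL,  -- would be a foreign key to package
    label TEXT,                   -- e.g. 'May 2010', may be NULL
    sort_order TEXT               -- e.g. '2010-05', may be NULL
);
CREATE TABLE data (
    id INTEGER PRIMARY KEY,
    data_group_id INTEGER NOT NULL REFERENCES data_group(id),  -- assumed name
    visible_after TEXT            -- ISO date string, may be NULL
);
CREATE TABLE resource (
    id INTEGER PRIMARY KEY,
    data_id INTEGER NOT NULL REFERENCES data(id),
    url TEXT NOT NULL,            -- assumed column
    format TEXT,                  -- assumed column
    type TEXT,                    -- 'data', 'metadata', 'service' or NULL
    size INTEGER,
    last_present TEXT,
    openness INTEGER
);
"""


def demo_connection():
    """Build an in-memory database holding one made-up monthly series."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO data_group (id, package_id, label, sort_order) "
        "VALUES (?, ?, ?, ?)",
        [(1, 1, "May 2010", "2010-05"),
         (2, 1, "April 2010", "2010-04"),
         (3, 1, "June 2010", "2010-06")],
    )
    conn.executemany(
        "INSERT INTO data (id, data_group_id, visible_after) VALUES (?, ?, ?)",
        [(1, 1, None),           # May: published
         (2, 2, None),           # April: expected, but no resources attached
         (3, 3, "2010-07-01")],  # June: uploaded but held back
    )
    conn.executemany(
        "INSERT INTO resource (data_id, url, format, type) "
        "VALUES (?, ?, ?, 'data')",
        [(1, "http://example.org/may.xls", "XLS"),
         (1, "http://example.org/may.csv", "CSV"),
         (1, "http://example.org/may.xml", "XML"),
         (3, "http://example.org/june.csv", "CSV")],
    )
    return conn


def releases(conn, package_id, today):
    """Map each visible label to its formats, ordered by sort_order."""
    rows = conn.execute(
        """
        SELECT g.label, r.format
        FROM data_group AS g
        JOIN data AS d ON d.data_group_id = g.id
        LEFT JOIN resource AS r ON r.data_id = d.id
        WHERE g.package_id = ?
          AND (d.visible_after IS NULL OR d.visible_after <= ?)
        ORDER BY g.sort_order, r.format
        """,
        (package_id, today),
    ).fetchall()
    grouped = {}
    for label, fmt in rows:
        formats = grouped.setdefault(label, [])
        if fmt is not None:  # a data entry with no resources stays empty
            formats.append(fmt)
    return grouped
```

With this made-up data, ``releases(demo_connection(), 1, "2010-06-15")``
gives "April 2010" an empty list (data that should be there but is
missing), lists CSV, XLS and XML under "May 2010", and leaves out
"June 2010" until its ``visible_after`` date has passed.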

Summary:

  Replace the ``package_resources`` table with three tables:

  * ``data_group``
  * ``data``
  * ``resource``

  The one drawback is that adding a single resource to a package
  requires three table entries now rather than one. This would all
  happen behind the scenes though in Phase 1 and not affect the 
  UI or API.
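
As a rough illustration of how the three inserts could stay hidden
behind the existing interface, the sketch below wraps them in a single
call. Everything in it is hypothetical: the helper names
``add_resource`` and ``package_resources``, the ``data_group_id`` and
``url`` columns, and the use of ``sqlite3`` rather than CKAN's own
model code.

```python
import sqlite3


def connect():
    """Create an in-memory database with cut-down versions of the tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE data_group (
            id INTEGER PRIMARY KEY,
            package_id INTEGER NOT NULL,
            label TEXT,
            sort_order TEXT
        );
        CREATE TABLE data (
            id INTEGER PRIMARY KEY,
            data_group_id INTEGER NOT NULL REFERENCES data_group(id),
            visible_after TEXT
        );
        CREATE TABLE resource (
            id INTEGER PRIMARY KEY,
            data_id INTEGER NOT NULL REFERENCES data(id),
            url TEXT NOT NULL
        );
    """)
    return conn


def add_resource(conn, package_id, url):
    """Add one resource to a package, creating all three rows at once."""
    with conn:  # commit the three inserts together, or roll all back
        group_id = conn.execute(
            "INSERT INTO data_group (package_id, label, sort_order) "
            "VALUES (?, NULL, NULL)",
            (package_id,),
        ).lastrowid
        data_id = conn.execute(
            "INSERT INTO data (data_group_id, visible_after) VALUES (?, NULL)",
            (group_id,),
        ).lastrowid
        return conn.execute(
            "INSERT INTO resource (data_id, url) VALUES (?, ?)",
            (data_id, url),
        ).lastrowid


def package_resources(conn, package_id):
    """Return the flat list of URLs a caller of the old structure expects."""
    rows = conn.execute(
        """
        SELECT r.url
        FROM resource AS r
        JOIN data AS d ON r.data_id = d.id
        JOIN data_group AS g ON d.data_group_id = g.id
        WHERE g.package_id = ?
        ORDER BY r.id
        """,
        (package_id,),
    )
    return [url for (url,) in rows]
```

A caller only ever sees ``add_resource`` and ``package_resources``, so
the extra ``data_group`` and ``data`` rows (with NULL label, sort order
and date) never show up in the UI or API.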

Phase 2
=======

Once we have this structure in place and supporting the government we
can look at ways CKAN could benefit from exposing some of the same
features to users via a UI.

For example, we could let users drag and drop resources to group
together those that represent the same data in different formats.

Phase 3
=======

We could also look at being more radical still:

* Treat resources as first class objects so that each data set is only
  stored once.

  If two packages link to the same resource, there will
  still only be one entry for it in the database. This would mean all
  the tools we are building to check links and rate data for its
  openness can just run against the unique ``resource`` table records
  without wasting time on duplicates.

  This would have some overhead managing orphan resources though.

* Allow multiple data_groups per package.

  This would allow things like derived data sets to be associated with
  the same package the source data came from without needing a new
  package.

* Allow sharing of data groups

  We could even allow package authors to embed data_groups from other
  packages. This gives a clear accountability model because the person
  submitting the package is still in charge of it, whilst still allowing
  other people to have some control over one part of the data.

  This may be getting a bit complex though!!!
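
To give a feel for the first Phase 3 idea (resources as first class
objects), here is one possible sketch. It assumes a resource is
identified by its URL and that a link table, here called
``data_resource``, replaces the foreign key on ``resource``. Both are
assumptions made for illustration, as are the helper names.

```python
import sqlite3


def connect():
    """In-memory database with unique resources and a link table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE resource (
            id INTEGER PRIMARY KEY,
            url TEXT NOT NULL UNIQUE
        );
        CREATE TABLE data_resource (
            data_id INTEGER NOT NULL,
            resource_id INTEGER NOT NULL REFERENCES resource(id),
            PRIMARY KEY (data_id, resource_id)
        );
    """)
    return conn


def link_resource(conn, data_id, url):
    """Attach a URL to a data entry, reusing the resource row if it exists."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO resource (url) VALUES (?)", (url,))
        (resource_id,) = conn.execute(
            "SELECT id FROM resource WHERE url = ?", (url,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO data_resource (data_id, resource_id) "
            "VALUES (?, ?)",
            (data_id, resource_id),
        )
    return resource_id


def unlink_resource(conn, data_id, resource_id):
    """Detach a resource from a data entry without deleting the resource."""
    with conn:
        conn.execute(
            "DELETE FROM data_resource WHERE data_id = ? AND resource_id = ?",
            (data_id, resource_id),
        )


def orphan_resources(conn):
    """List URLs of resources that no data entry points at any more."""
    rows = conn.execute(
        """
        SELECT r.url
        FROM resource AS r
        LEFT JOIN data_resource AS dr ON dr.resource_id = r.id
        WHERE dr.resource_id IS NULL
        ORDER BY r.id
        """
    )
    return [url for (url,) in rows]
```

A link checker would then only need to walk the ``resource`` table once,
and ``orphan_resources`` shows the kind of housekeeping query the orphan
overhead would involve.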