[ckan-discuss] Package Resources Proposal
james at 3aims.com
Wed Feb 2 14:57:34 GMT 2011
In conjunction with David Raznick I'd like to propose a two/three phase
improvement to the way we handle package resources in CKAN. These
changes are required to help deliver our work for data.gov.uk but I
think they are useful enough for all CKAN instances that the changes
should go in core.
I'm sending this to ckan-discuss and ckan-dev because I'd like
to know if anyone thinks that the suggestions in the second phase
might actually make CKAN harder to understand and use, in which case
we can stop at Phase 1 and just use the internal structures for DGU and
UKLP. I'd also like to know if you think the phase 1 structures would
support other projects too.
There is a related ticket at http://ckan.org/ticket/945 and page at
http://ckan.org/wiki/UseCasesResources with further ideas but I hope
the info below will summarise the key aspects.
Please see below...
In phase 1 nothing in CKAN will change either in the API or the UI.
The form will be the same. The only changes will be under the hood and
they will be to:
* Give resources new attributes for:
Can be "data", "metadata", "service", NULL. There is a need to
do this for UKLP where some of the data associated with a
package is metadata, some is data and some could be a service. We
could have different tables for the different types of resource
or keep things simple at this stage by just having a type.
The size in bytes for "data" or "metadata" resources, can be NULL
The last time the link checker was able to download the resource
A rating of how open the resource is (5 stars of openness), 0
if the resource currently isn't present.
Existing package resources would get a type of NULL and the link
checker would quickly fill in the rest.
The attributes could be implemented with David's existing work for
key/value pairs on resources rather than being columns in their own right.
* Group sets of related data together
The UK government requires departments to release certain data sets
once a month. They'd like to be able to highlight which datasets are
for which month rather than just have a huge list of resources. We
therefore need to be able to form a series of data within a *single* package.
We'd introduce a ``data_group`` table to handle this. It could link
to the package and have columns:
foreign key to package
a label that describes this entry in the data_group eg the date
"May 2010", can be NULL
a string in that column would be used to determine the sort order
eg the string "2010-05", can be NULL
The alternative way of implementing the same thing is to have a
package for each release but some series are huge and would swamp
the other types of package.
I'd rather keep the existing package relationship functionality for
other types of association such as "related to", "depends on" or even
"you might also be interested in".
* Group different formats of the same data together.
For example if the treasury publish spending for May 2010 in XLS, CSV
and XML format I want to know that the data in each of these files is
the same so that I can display them all under the "May 2010" label
obtained from ``data_group``. To do this we can add a new table named
``data`` which would link individual resources to a data group.
There may also be times when we want to show that a set of data is
missing when it is supposed to be there. We can do this by having
an entry in the ``data`` table but associating no resources with it.
We may also want certain releases to be uploaded but not released
until a certain time. We can do that with a flag on the ``data``
item rather than needing a flag on each resource.
Columns could be:
Foreign key to the data table
Date, can be NULL
The resource table would then have a foreign key to ``data.id``
rather than ``package.id``
Replace the ``package_resources`` table with three tables: ``data_group``,
``data`` and ``resource``.
The one drawback is that adding a single resource to a package
requires three table entries now rather than one. This would all
happen behind the scenes though in Phase 1 and not affect the
UI or API.
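To make the three-table idea a bit more concrete, here is a rough sketch
using an in-memory SQLite database. It is only an illustration, not the
proposed migration. The column names (``package_id``, ``label``,
``sort_key``, ``data_group_id``, ``release_date``, ``data_id``, ``url``,
``format``) and the helper functions are made up for the example, and I am
assuming that each ``data`` row points at a ``data_group`` row and each
``resource`` row points at a ``data`` row.

```python
import sqlite3

# Hypothetical schema: only the table names data_group, data and resource
# come from the proposal; every column name here is an assumption.
SCHEMA = """
CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE data_group (
    id INTEGER PRIMARY KEY,
    package_id INTEGER NOT NULL REFERENCES package(id),
    label TEXT,      -- e.g. "May 2010", can be NULL
    sort_key TEXT    -- e.g. "2010-05", can be NULL
);
CREATE TABLE data (
    id INTEGER PRIMARY KEY,
    data_group_id INTEGER NOT NULL REFERENCES data_group(id),
    release_date TEXT  -- can be NULL
);
CREATE TABLE resource (
    id INTEGER PRIMARY KEY,
    data_id INTEGER NOT NULL REFERENCES data(id),
    url TEXT NOT NULL,
    format TEXT
);
"""


def add_release(conn, package_id, label, sort_key):
    """Create the data_group and data rows for one release; return data.id."""
    cur = conn.execute(
        "INSERT INTO data_group (package_id, label, sort_key) VALUES (?, ?, ?)",
        (package_id, label, sort_key),
    )
    cur = conn.execute(
        "INSERT INTO data (data_group_id) VALUES (?)", (cur.lastrowid,)
    )
    return cur.lastrowid


def add_resource(conn, data_id, url, fmt):
    """Attach one file (one format of the same data) to a data row."""
    cur = conn.execute(
        "INSERT INTO resource (data_id, url, format) VALUES (?, ?, ?)",
        (data_id, url, fmt),
    )
    return cur.lastrowid


def list_releases(conn, package_id):
    """Map each label to its formats, ordered by sort_key.

    The LEFT JOIN keeps data rows that have no resources, so a release
    that is expected but missing still shows up with an empty list.
    """
    rows = conn.execute(
        """
        SELECT g.label, r.format
        FROM data_group AS g
        JOIN data AS d ON d.data_group_id = g.id
        LEFT JOIN resource AS r ON r.data_id = d.id
        WHERE g.package_id = ?
        ORDER BY g.sort_key, r.format
        """,
        (package_id,),
    ).fetchall()
    releases = {}
    for label, fmt in rows:
        formats = releases.setdefault(label, [])
        if fmt is not None:
            formats.append(fmt)
    return releases


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO package (id, name) VALUES (1, 'spending')")

# June is declared first but has no files yet; May has three formats.
june = add_release(conn, 1, "June 2010", "2010-06")
may = add_release(conn, 1, "May 2010", "2010-05")
for fmt in ("XLS", "CSV", "XML"):
    add_resource(conn, may, "http://example.com/may-2010." + fmt.lower(), fmt)

releases = list_releases(conn, 1)
```

Adding the first file for a release touches all three tables
(``add_release`` writes two rows and ``add_resource`` writes the third),
which is the overhead mentioned above. Further formats of the same data
only need one more ``resource`` row.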
Once we have this structure in place and supporting the government we
can look at ways CKAN could benefit from exposing some of the same
features to users via a UI.
We could allow users to group resources that represent the same data
but in different formats together via drag and drop.
We could also look at being more radical still:
* Treat resources as first class objects so that each data set is only
If two packages link to the same resource, there will
still only be one entry for it in the database. This would mean all
the tools we are building to check links, and rate data for its
openness can just run against the unique ``resource`` table records
without wasting time on duplicates.
This would have some overhead managing orphan resources though.
* Allow multiple data_groups per package.
This would allow things like derived data sets to be associated with
the same package the source data came from without needing a new package.
* Allow sharing of data groups
We could even allow package authors to embed data_groups from other
packages. This gives a clear accountability model because the person
submitting the package is still in charge of it, whilst still allowing
other people to have some control over one part of the data.
This may be getting a bit complex though!!!
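Going back to the first of these ideas, resources as first class objects,
here is a small sketch of what "only one entry per resource" could look
like. It is plain Python rather than anything CKAN-specific, and the
``ResourceRegistry`` class, its method names, the package names and the
URLs are all invented for illustration. It assumes a resource is
identified by its URL.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceRegistry:
    """Keeps one record per URL, however many packages link to it."""

    # url -> set of package names that link to it
    links: dict = field(default_factory=dict)

    def link(self, package, url):
        """Record that a package uses a resource; the URL is stored once."""
        self.links.setdefault(url, set()).add(package)

    def unlink(self, package, url):
        """Remove a package's link; the resource record itself stays."""
        self.links.get(url, set()).discard(package)

    def to_check(self):
        """URLs a link checker would visit: each linked resource once."""
        return sorted(url for url, pkgs in self.links.items() if pkgs)

    def orphans(self):
        """Resources no package links to any more."""
        return sorted(url for url, pkgs in self.links.items() if not pkgs)


registry = ResourceRegistry()
registry.link("spending-2010", "http://example.com/may.csv")
registry.link("treasury-all", "http://example.com/may.csv")
registry.link("treasury-all", "http://example.com/june.csv")

# Two packages share may.csv, but it is only checked once.
urls_before = registry.to_check()

# Dropping the only link to june.csv leaves an orphan to clean up.
registry.unlink("treasury-all", "http://example.com/june.csv")
orphaned = registry.orphans()
```

The ``orphans`` method is where the extra management overhead shows up:
something has to decide when an unlinked resource record can be removed.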