[ckan-dev] Package Resources Proposal

Seb Bacon seb.bacon at gmail.com
Fri Feb 4 11:41:45 UTC 2011


Hi,

I'm finding this thread a bit confusing, as it's branched in various
directions very quickly.  So I thought I'd reply to James' original
message and re-interpret it, to make sure I've got this right.  Please
let me know!

CKAN is essentially and most importantly a registry of "instances of
data" -- what we call "Resources".  They are the heart of what we do.
The question is how we arrange, publish, and organise them.

We currently have the following ways of organising Resources:

 * Package: a group of Resources that share a license, maintainer and
   author
 * Tag: arbitrary groupings of packages
 * Group: another arbitrary group of packages like tags, but with a
   title, description and date, and the creation/editing of which are
   subject to authorisation
 * Relationship:  a named relationship between two packages, with a comment

The proposal is partly to start to describe a resource's "type" (I see
this as uncontroversial -- right?), but mainly to introduce a new
grouping:

 * DataGroup: an arbitrary group of Resources
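
To make that concrete, here is a rough sketch in Python (using sqlite3)
of how I read the Phase 1 tables from James' message below.  It is not
CKAN code: the ``id`` columns, the ``data_group_id`` link, the ``url``
and ``format`` columns and the ``add_resource`` helper are all my own
assumptions, and I'm assuming each ``data_group`` row is one labelled
release such as "May 2010".

```python
import sqlite3

# Sketch only. Table names follow James' message; everything marked
# "assumed" is illustrative and not part of the proposal.
SCHEMA = """
CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE data_group (
    id INTEGER PRIMARY KEY,                       -- assumed
    package_id INTEGER NOT NULL REFERENCES package(id),
    label TEXT,                                   -- e.g. "May 2010"
    sort_order TEXT                               -- e.g. "2010-05"
);
CREATE TABLE data (
    id INTEGER PRIMARY KEY,                       -- assumed
    data_group_id INTEGER NOT NULL
        REFERENCES data_group(id),                -- assumed link
    visible_after TEXT                            -- date, may be NULL
);
CREATE TABLE resource (
    id INTEGER PRIMARY KEY,
    data_id INTEGER NOT NULL REFERENCES data(id),
    url TEXT NOT NULL,                            -- assumed
    format TEXT,                                  -- assumed
    type TEXT CHECK (type IN ('data', 'metadata', 'service'))
);
"""


def add_resource(conn, package_id, label, sort_order, url, fmt):
    """Add one resource as a new release: three inserts, one per table."""
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO data_group (package_id, label, sort_order) "
        "VALUES (?, ?, ?)",
        (package_id, label, sort_order),
    )
    group_id = cur.lastrowid
    cur.execute("INSERT INTO data (data_group_id) VALUES (?)", (group_id,))
    data_id = cur.lastrowid
    cur.execute(
        "INSERT INTO resource (data_id, url, format) VALUES (?, ?, ?)",
        (data_id, url, fmt),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO package (id, name) VALUES (1, 'spending')")
add_resource(conn, 1, "June 2010", "2010-06",
             "http://example.org/june.xls", "XLS")
add_resource(conn, 1, "May 2010", "2010-05",
             "http://example.org/may.csv", "CSV")

# List resources under their release label, ordered by sort_order.
rows = conn.execute(
    "SELECT g.label, r.format FROM resource r "
    "JOIN data d ON r.data_id = d.id "
    "JOIN data_group g ON d.data_group_id = g.id "
    "ORDER BY g.sort_order"
).fetchall()
counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("data_group", "data", "resource")
}
```

Each call to ``add_resource`` writes to all three tables, which is the
drawback James mentions in his summary.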

Personally, this seems completely sensible to me.  The only mechanism
we currently have for grouping Resources is the Package, but the only
semantic meaning to a package that I can divine is "Resources that
share a license and originator".  There are surely all kinds of use
cases, both real and imaginable, which require us to group together
Resources that don't share a license?

Indeed, from where I'm sitting, I don't actually quite understand the
concept of a package at all; couldn't we just have Resources, Authors,
Maintainers and Licenses?
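
A minimal sketch of what I mean, in Python.  The class and field names
(``Resource``, ``DataGroup``, ``implied_packages``) are hypothetical and
not anything in CKAN today; the point is only that a "package" could be
derived from attributes on the Resources, while a DataGroup is free to
mix licenses.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    # Hypothetical model: license/author/maintainer live on the resource.
    url: str
    license: str
    author: str
    maintainer: str


@dataclass
class DataGroup:
    # An arbitrary group of resources; nothing has to be shared.
    title: str
    resources: list = field(default_factory=list)


def implied_packages(resources):
    """Group resources by shared (license, author, maintainer)."""
    groups = defaultdict(list)
    for res in resources:
        groups[(res.license, res.author, res.maintainer)].append(res)
    return dict(groups)


a = Resource("http://example.org/a.csv", "OGL", "HMT", "HMT")
b = Resource("http://example.org/a.xml", "OGL", "HMT", "HMT")
c = Resource("http://example.org/c.csv", "CC-BY", "ONS", "ONS")

mixed = DataGroup("Spending, all sources", [a, b, c])
packages = implied_packages(mixed.resources)
```

Here ``mixed`` holds resources under two different licenses, and
``packages`` recovers the two "license and originator" groupings without
a separate package entity.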

As an aside: I think we could handle the publication embargo
requirement with something that applies not just to DataGroups, but
across the system; a "publication_date" on all entities (including
DataGroups) would presumably suffice?
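
Roughly, and only as a sketch: the dict layout and the helper names are
my own invention, and treating a missing ``publication_date`` as
"already public" is an assumption rather than anything that has been
agreed.

```python
from datetime import date


def is_published(entity, today):
    """True once the entity's publication_date has been reached.

    Assumption: no publication_date means the entity is not embargoed.
    """
    pub = entity.get("publication_date")
    return pub is None or pub <= today


def visible(entities, today):
    """Filter any list of entities (resources, DataGroups, ...)."""
    return [e for e in entities if is_published(e, today)]


entities = [
    {"name": "spending-may-2010", "publication_date": date(2011, 1, 1)},
    {"name": "spending-june-2010", "publication_date": date(2011, 3, 1)},
    {"name": "legacy-resource", "publication_date": None},
]
names = [e["name"] for e in visible(entities, date(2011, 2, 4))]
```

The same check applies to every kind of entity, so no per-type embargo
flag would be needed.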

As another aside, I don't quite understand the Use Cases for Groups,
but that doesn't matter so much right now...

Seb



On 2 February 2011 14:57, James Gardner <james at 3aims.com> wrote:
> Hi all,
>
> In conjunction with David Raznick I'd like to propose a two/three phase
> improvement to the way we handle package resources in CKAN. These
> changes are required to help deliver our work for data.gov.uk but I
> think they are useful enough for all CKAN instances that the changes
> should go in core.
>
> I'm sending this to ckan-discuss and ckan-dev because I'd like
> to know if anyone thinks that the suggestions in the second phase
> might actually make CKAN harder to understand and use, in which case
> we can stop at Phase 1 and just use the internal structures for DGU and
> UKLP. I'd also like to know if you think the phase 1 structures would
> support other projects too.
>
> There is a related ticket at http://ckan.org/ticket/945 and page at
> http://ckan.org/wiki/UseCasesResources with further ideas but I hope
> the info below will summarise the key aspects.
>
> Please see below...
>
> Cheers,
>
> James
>
>
>
>
> Phase 1
> =======
>
> In phase 1 nothing in CKAN will change either in the API or the UI.
> The form will be the same. The only changes will be under the hood and
> they will be to:
>
> * Give resources new attributes for:
>
>  type
>      Can be "data", "metadata", "service", NULL. There is a need to
>      do this for UKLP where some of the data associated with a
>      package is metadata, some is data and some could be a service. We
>      could have different tables for the different types of resource
>      or keep things simple at this stage by just having a type.
>
>  size
>      The size in bytes for "data" or "metadata" resources, can be
>      NULL
>
>  last_present
>      The last time the resource was able to be downloaded by the link
>      checker
>
>  openness
>      A rating of how open the resource is (5 stars of openness), 0
>      if the resource currently isn't present.
>
>  Existing package resources would get a type of NULL and the link
>  checker would quickly fill in the rest.
>
>  The attributes could be implemented with David's existing work for
>  key/value pairs on resources rather than be columns in their own
>  right.
>
> * Group sets of related data together
>
>  The UK government requires departments to release certain data sets
>  once a month. They'd like to be able to highlight which datasets are
>  for which month rather than just have a huge list of resources. We
>  therefore need to be able to form a series of data within a *single*
>  package.
>
>  We'd introduce a ``data_group`` table to handle this. It could link
>  to the package and have columns:
>
>  package_id
>      foreign key to package
>
>  label
>      a label that describes this entry in the data_group eg the date
>      "May 2010", can be NULL
>
>  sort_order
>      a string in that column would be used to determine the sort order
>      eg the string "2010-05", can be NULL
>
>  The alternative way of implementing the same thing is to have a
>  package for each release but some series are huge and would swamp
>  the other types of package.
>
>  I'd rather keep the existing package relationship functionality for
>  other types of association such as "related to", "depends on" or even
>  "you might also be interested in".
>
> * Group different formats of the same data together.
>
>  For example if the treasury publish spending for May 2010 in XLS, CSV
>  and XML format I want to know that the data in each of these files is
>  the same so that I can display them all under the "May 2010" label
>  obtained from ``data_group``. To do this we can add a new table named
>  ``data`` which would link individual resources to a data group.
>
>  There may also be times when we want to show that a set of data is
>  missing when it is supposed to be there. We can do this by having
>  an entry in the ``data`` table but associating no resources with it.
>
>  We may also want certain releases to be uploaded but not released
>  until a certain time. We can do that with a flag on the ``data``
>  item rather than needing a flag on each resource.
>
>  Columns could be:
>
>  data_id
>      Foreign key to the data table
>
>  visible_after
>      Date, can be NULL
>
>  The resource table would then have a foreign key to ``data.id``
>  rather than ``package.id``
>
> Summary:
>
>  Replace the ``package_resources`` table with three tables:
>
>  * ``data_group``
>  * ``data``
>  * ``resource``
>
>  The one drawback is that adding a single resource to a package
>  requires three table entries now rather than one. This would all
>  happen behind the scenes though in Phase 1 and not affect the
>  UI or API.
>
> Phase 2
> =======
>
> Once we have this structure in place and supporting the government we
> can look at ways CKAN could benefit from exposing some of the same
> features to users via a UI.
>
> We could allow users to group resources that represent the same data
> but in different formats together via drag and drop.
>
> Phase 3
> =======
>
> We could also look at being more radical still:
>
> * Treat resources as first class objects so that each data set is only
>  stored once.
>
>  If two packages link to the same resource, there will
>  still only be one entry for it in the database. This would mean all
>  the tools we are building to check links, and rate data for its
>  openness can just run against the unique ``resource`` table records
>  without wasting time on duplicates.
>
>  This would have some overhead managing orphan resources though.
>
> * Allow multiple data_groups per package.
>
>  This would allow things like derived data sets to be associated with
>  the same package the source data came from without needing a new
>  package
>
> * Allow sharing of data groups
>
>  We could even allow package authors to embed data_groups from other
>  packages. This gives a clear accountability model because the person
>  submitting the package is still in charge of it, whilst still allowing
>  other people to have some control over one part of the data.
>
>  This may be getting a bit complex though!!!
>
>
>
>
>
>
>
> _______________________________________________
> ckan-dev mailing list
> ckan-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/ckan-dev
>



