[ckan-dev] Package Resources Proposal

Fri Feb 4 11:55:47 UTC 2011

Hi Seb,

> CKAN is essentially and most importantly a registry of "instances of
> data" -- what we call "Resources".  They are the heart of what we do.
> The question is how we arrange, publish, and organise them.

Exactly.

> We currently have the following ways of organising Resources:
>
>   * Package: a group of packages that share a license, maintainer and
>     author

Yep, although the terminology would be group of "resources" here.

>   * Tag: arbitrary groupings of packages
>   * Group: another arbitrary group of packages like tags, but with a
>     title, description and date, and the creation/editing of which are
>     subject to authorisation

Yep, Group is sometimes used to group sets of packages together, eg the
LOD cloud on ckan.net: http://ckan.net/group/lodcloud, and sometimes
used as a way of grouping publishers (eg IATI, as I understand it).

>   * Relationship:  a named relationship between two packages, with a comment

Yep.

> The proposal is partly to start to describe a resource's "type" (I see
> this as uncontroversial -- right?), but mainly to introduce a new
> grouping:
>
>   * DataGroup: an arbitrary group of Resources
>
> Personally, this seems completely sensible to me.  The only grouping
> mechanisms we currently have are via Packages, but the only semantic
> meaning to a package that I can divine is "Resources that share a
> license and originator".  There are surely all kinds of use cases both
> real and imaginable which require us to group Resources together which
> don't share a license?

Exactly. Rufus, David Raznick and I have just agreed a compromise though:
to introduce just a resource_group table at this stage. It can be used
*either* for grouping sets of resources (files) together to represent
the same data *or* for timeseries of data. You can't easily do a
timeseries of groups of related files in this model, but there is no
immediate requirement to do so, and when one comes up we can look at
the problem again. Each package can have multiple resource_groups.
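
For illustration only, here is a minimal sketch of how a resource_group
table might sit between package and resource. Only the table name
resource_group comes from the discussion above. The columns, the use of
SQLite, the example URLs and the helper names (resources_by_group, demo)
are all assumptions made for the sketch, not the agreed schema.

```python
import sqlite3

# Hypothetical schema: every column name here is an illustrative assumption.
SCHEMA = """
CREATE TABLE package (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE resource_group (
    id         INTEGER PRIMARY KEY,
    package_id INTEGER NOT NULL REFERENCES package(id),
    label      TEXT
);
CREATE TABLE resource (
    id                INTEGER PRIMARY KEY,
    resource_group_id INTEGER NOT NULL REFERENCES resource_group(id),
    url               TEXT NOT NULL
);
"""


def resources_by_group(conn, package_id):
    """Return [(label, [url, ...]), ...] for every group in one package."""
    rows = conn.execute(
        """
        SELECT g.id, g.label, r.url
        FROM resource_group AS g
        LEFT JOIN resource AS r ON r.resource_group_id = g.id
        WHERE g.package_id = ?
        ORDER BY g.id, r.id
        """,
        (package_id,),
    ).fetchall()
    grouped = []
    last_id = None
    for group_id, label, url in rows:
        if group_id != last_id:
            grouped.append((label, []))
            last_id = group_id
        if url is not None:  # a group with no resources keeps an empty list
            grouped[-1][1].append(url)
    return grouped


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO package (id, name) VALUES (1, 'spending')")
    conn.executemany(
        "INSERT INTO resource_group (id, package_id, label) VALUES (?, ?, ?)",
        [
            (1, 1, "Spending May 2010"),  # same data in several formats
            (2, 1, "Monthly spending"),   # a timeseries of single files
        ],
    )
    conn.executemany(
        "INSERT INTO resource (resource_group_id, url) VALUES (?, ?)",
        [
            (1, "http://example.org/may-2010.csv"),
            (1, "http://example.org/may-2010.xls"),
            (2, "http://example.org/2010-05.csv"),
            (2, "http://example.org/2010-06.csv"),
        ],
    )
    return resources_by_group(conn, 1)
```

Calling demo() gives one package with two resource_groups: the first
holds two formats of the same data, the second holds a timeseries. The
schema itself does not say which of the two uses a group represents, so
that would have to be a convention or an extra column.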

> Indeed, from where I'm sitting, I don't actually quite understand the
> concept of a package at all; couldn't we just have Resources, Authors,
> Maintainers and Licenses?

Yes, that's even more radical though ;)

> As an aside: I think we could handle the publication embargo
> requirement with something that applies not just to DataGroups, but
> across the system; a "publication_date" on all entities (including
> DataGroups) would presumably suffice?

True, but we don't want to add too many features to the main code that
most instances won't use, so we'll wait for a clearer requirement for
that.

> As another aside, I don't quite understand the Use Cases for Groups,
> but that doesn't matter so much right now...

Well, they serve the publisher model to some extent too. Any other use
cases?

Cheers,

James

>
>
>
> On 2 February 2011 14:57, James Gardner <james at 3aims.com> wrote:
>> Hi all,
>>
>> In conjunction with David Raznick I'd like to propose a two/three phase
>> improvement to the way we handle package resources in CKAN. These
>> changes are required to help deliver our work for data.gov.uk but I
>> think they are useful enough for all CKAN instances that the changes
>> should go in core.
>>
>> I'm sending this to ckan-discuss and ckan-dev because I'd like
>> to know if anyone thinks that the suggestions in the second phase
>> might actually make CKAN harder to understand and use, in which case
>> we can stop at Phase 1 and just use the internal structures for DGU and
>> UKLP. I'd also like to know if you think the phase 1 structures would
>> support other projects too.
>>
>> There is a related ticket at http://ckan.org/ticket/945 and page at
>> http://ckan.org/wiki/UseCasesResources with further ideas but I hope
>> the info below will summarise the key aspects.
>>
>> Please see below...
>>
>> Cheers,
>>
>> James
>>
>>
>>
>>
>> Phase 1
>> =======
>>
>> In phase 1 nothing in CKAN will change either in the API or the UI.
>> The form will be the same. The only changes will be under the hood and
>> they will be to:
>>
>> * Give resources new attributes for:
>>
>>   type
>>       Can be "data", "metadata", "service" or NULL. There is a need to
>>       do this for UKLP where some of the data associated with a
>>       package is metadata, some is data and some could be a service. We
>>       could have different tables for the different types of resource
>>       or keep things simple at this stage by just having a type.
>>
>>   size
>>       The size in bytes for "data" or "metadata" resources, can be
>>       NULL
>>
>>   last_present
>>       The last time the link checker was able to download the
>>       resource
>>
>>   openness
>>       A rating of how open the resource is (5 stars of openness), 0
>>       if the resource currently isn't present.
>>
>>   Existing package resources would get a type of NULL and the link
>>   checker would quickly fill in the rest.
>>
>>   The attributes could be implemented with David's existing work for
>>   key/value pairs on resources rather than be columns in their own
>>   right.
>>
>> * Group sets of related data together
>>
>>   The UK government requires departments to release certain data sets
>>   once a month. They'd like to be able to highlight which datasets are
>>   for which month rather than just have a huge list of resources. We
>>   therefore need to be able to form a series of data within a *single*
>>   package.
>>
>>   We'd introduce a ``data_group`` table to handle this. It could link
>>   to the package and have columns:
>>
>>   package_id
>>       foreign key to package
>>
>>   label
>>       a label that describes this entry in the data_group eg the date
>>       "May 2010", can be NULL
>>
>>   sort_order
>>       a string in that column would be used to determine the sort order
>>       eg the string "2010-05", can be NULL
>>
>>   The alternative way of implementing the same thing is to have a
>>   package for each release but some series are huge and would swamp
>>   the other types of package.
>>
>>   I'd rather keep the existing package relationship functionality for
>>   other types of association such as "related to", "depends on" or even
>>   "you might also be interested in".
>>
>> * Group different formats of the same data together.
>>
>>   For example if the treasury publish spending for May 2010 in XLS, CSV
>>   and XML format I want to know that the data in each of these files is
>>   the same so that I can display them all under the "May 2010" label
>>   obtained from ``data_group``. To do this we can add a new table named
>>   ``data`` which would link individual resources to a data group.
>>
>>   There may also be times when we want to show that a set of data is
>>   missing when it is supposed to be there. We can do this by having
>>   an entry in the ``data`` table but associating no resources with it.
>>
>>   We may also want certain releases to be uploaded but not released
>>   until a certain time. We can do that with a flag on the ``data``
>>   item rather than needing a flag on each resource.
>>
>>   Columns could be:
>>
>>   data_id
>>       Foreign key to the data table
>>
>>   visible_after
>>       Date, can be NULL
>>
>>   The resource table would then have a foreign key to ``data.id``
>>   rather than ``package.id``
>>
>> Summary:
>>
>>   Replace the ``package_resources`` table with three tables:
>>
>>   * ``data_group``
>>   * ``data``
>>   * ``resource``
>>
>>   The one drawback is that adding a single resource to a package
>>   requires three table entries now rather than one. This would all
>>   happen behind the scenes though in Phase 1 and not affect the
>>   UI or API.
>>
>> Phase 2
>> =======
>>
>> Once we have this structure in place and supporting the government we
>> can look at ways CKAN could benefit from exposing some of the same
>> features to users via a UI.
>>
>> We could allow users to group resources that represent the same data
>> but in different formats together via drag and drop.
>>
>> Phase 3
>> =======
>>
>> We could also look at being more radical still:
>>
>> * Treat resources as first class objects so that each data set is only
>>   stored once.
>>
>>   If two packages link to the same resource, there will
>>   still only be one entry for it in the database. This would mean all
>>   the tools we are building to check links, and rate data for its
>>   openness can just run against the unique ``resource`` table records
>>   without wasting time on duplicates.
>>
>>   This would have some overhead managing orphan resources though.
>>
>> * Allow multiple data_groups per package.
>>
>>   This would allow things like derived data sets to be associated with
>>   the same package the source data came from without needing a new
>>   package.
>>
>> * Allow sharing of data groups
>>
>>   We could even allow package authors to embed data_groups from other
>>   packages. This gives a clear accountability model because the person
>>   submitting the package is still in charge of it, whilst still allowing
>>   other people to have some control over one part of the data.
>>
>>   This may be getting a bit complex though!!!
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> ckan-dev mailing list
>> ckan-dev at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/ckan-dev
>>