[ckan-discuss] [ckan-dev] Package Resources Proposal

Rufus Pollock rufus.pollock at okfn.org
Wed Feb 2 16:54:28 GMT 2011


I'm going to keep the discussion of this on CKAN discuss for the
moment, as this is, at least initially, a question about what we want
CKAN to do.

On 2 February 2011 14:57, James Gardner <james at 3aims.com> wrote:
> Hi all,
>
> In conjunction with David Raznick I'd like to propose a two/three phase
> improvement to the way we handle package resources in CKAN.

[...]

I'm going to focus entirely on the use cases / requirements, as these
should drive whatever implementation we choose to do. I've copied and
pasted from the wiki <http://ckan.org/wiki/UseCasesResources>. One
small issue I have at the moment is the use of the term 'data' in the
use cases. I presume this basically means 'single file containing data'
i.e. a resource. Comments below.

Rufus

> 1. Data should be able to be grouped for the same
> information in different formats, so you do not get
> duplicated data.

Not sure what is meant by "do not get duplicated". Would the ability
to sort the package resource table by specific fields not address
this?
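
A rough sketch of what such sorting could look like. The dict layout
and the 'dataset' / 'format' keys are assumptions made up for
illustration, not the actual CKAN resource schema:

```python
# Hypothetical resource records; the keys are illustrative only.
resources = [
    {"url": "http://example.org/spend.xls", "format": "xls", "dataset": "spend"},
    {"url": "http://example.org/staff.csv", "format": "csv", "dataset": "staff"},
    {"url": "http://example.org/spend.csv", "format": "csv", "dataset": "spend"},
]


def sort_resources(resources, *fields):
    """Return a new list ordered by the given fields (missing values sort first)."""
    return sorted(resources, key=lambda r: tuple(r.get(f, "") for f in fields))


# Sorting by dataset and then format puts the different formats of the
# same information next to each other in the table.
by_dataset = sort_resources(resources, "dataset", "format")
```

Sorting on a shared field only places related resources side by side;
it does not record an explicit grouping.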

> 2. There needs to be a mechanism to timeseries data, so that
> search results only display the latest package. This needs
> to be done in a way that the older data is still easily
> accessible. This should be done with the minimum of user
> effort.

More generally, there are sets of data that are split across several
files (e.g. Wikipedia breaks up dumps by page letter). However, isn't
this exactly what packages with multiple resources allow one to do?

So for time series (e.g. monthly release of unemployment data) these
would all go in the same package. Personally I think it would be more
useful for them to consolidate this data rather than have a huge list
of months -- e.g. one consolidated file with all data and, say, 2-3
separate files of the latest months.
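
A minimal sketch of the 'one package, many resources' layout for a
time series. The package name, URLs and the 'period' attribute are
hypothetical, chosen only to illustrate the idea:

```python
from datetime import date

# Hypothetical package holding one resource per monthly release.
package = {
    "name": "unemployment-monthly",
    "resources": [
        {"url": "http://example.org/unemployment-2010-11.csv", "period": date(2010, 11, 1)},
        {"url": "http://example.org/unemployment-2011-01.csv", "period": date(2011, 1, 1)},
        {"url": "http://example.org/unemployment-2010-12.csv", "period": date(2010, 12, 1)},
    ],
}


def latest_first(package, key="period"):
    """Return the package's resources ordered newest first."""
    return sorted(package["resources"], key=lambda r: r[key], reverse=True)


ordered = latest_first(package)
# Search only ever sees the one package; the newest release is the first
# resource and the older ones stay reachable in the same list.
latest, older = ordered[0], ordered[1:]
```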

> 3. The ordering of the data should be presented without the
> need for user input.

What exactly does this mean? Resources are automatically ordered
without user input.

> 4. There needs to more information stored against the data,
> beyond just its format and a description.

Strongly agree, and I believe we have just implemented the ability to
have arbitrary attributes on individual resources.

> 5. Users should be able to refashion the data and post a
> whole new set of this derived data.

To give a concrete example: the original data is in Excel and I want to
convert it to Google Docs / CSV / JSON / RDF. Definitely agree this is
very important. In past discussions [1] we have discussed the two
alternatives (both of which have happened on CKAN): a) a new related
package, b) a new resource in the existing package.

[1]: http://lists.okfn.org/pipermail/ckan-discuss/2010-October/000651.html

> 6. Groups of data should be able to be synced across
> packages/instances. In order for derived data set to be
> associated with existing packages.

Why? Not convinced about this one. Seems to only follow if we think
explicit groups are important.

> 7. When new versions of the data arrive it should be easy to
> copy an old one and change versions/dates as required.

What is the 'old one'? Old resource or old package or ...? Seems this
could be done already. Also beware of terms like 'should be easy' :)

> 8. A user may want to upload a resource separately from the
> package and decide later on where the best place for it is.

Agreed. Resources should be first class entities.

> 9. A place to put a time-to-release date against a piece of
> potential data.

Package or 'data'? Is it really important to have this for a resource
rather than a package? If so, can we do this via the existing
statefulness of the PackageResource association?

> 10. We need a marker to show that the certain data sets are
> missing.

Can we have a concrete use case? (This is a requirement, not a use case ;) )

> 11. Dashboard of what data has been released, when is going
> to be released.

Agree this is important, but not sure what requirement, if any, this
imposes on 'resources'.


