[ckan-discuss] [ckan-dev] Package Resources Proposal

David Read david.read at okfn.org
Fri Feb 4 10:12:09 GMT 2011


On 2 February 2011 18:21, James Gardner <james at 3aims.com> wrote:

> Hi Rufus,
>
> > I'm going to focus entirely on the use cases / requirements, as these
> > should drive whatever implementation we choose to do.
>
> OK, but the three DGU/UKLP use cases outlined in the original are the
> most important (all copied from the original email) but you haven't
> responded to those.
>
> I'll respond to some of the other points, but these should be the main
> focus of discussion in that case:
>
> * Differentiating the type of resource
>
>  There is a need to do this for UKLP where some of the data associated
>  with a package is metadata, some is data and some could be a service.
>  We could have different tables for the different types of resource
>  or keep things simple at this stage by just having a type.
>
> * Regular releases of the same dataset
>
>  The UK government requires departments to release certain data sets
>  once a month. They'd like to be able to highlight which datasets are
>  for which month rather than just have a huge list of resources. We
>  therefore need to be able to form a series of data within a *single*
>  package.
>
> * Group different formats of the same data together.
>
>  For example if the treasury publish spending for May 2010 in XLS, CSV
>  and XML format I want to know that the data in each of these files is
>  the same so that I can display them all under the "May 2010" label
>  obtained from ``data_group``. To do this we can add a new table named
>  ``data`` which would link individual resources to a data group.
>
>  There may also be times when we want to show that a set of data is
>  missing when it is supposed to be there. We can do this by having
>  an entry in the ``data`` table but associating no resources with it.
>
>  We may also want certain releases to be uploaded but not released
>  until a certain time. We can do that with a flag on the ``data``
>  item rather than needing a flag on each resource.
>
> > One
> > small issue I have at the moment is the use of the term 'data' in the
> > use cases. I presume this basically mean 'single file containing data'
> > i.e. a resource. Comments below.
>
> No, "data" should refer to the abstract concept of some data; a
> resource is then a file implementing that data in a particular format.
>
> > Not sure what it means by "do not get duplicated". Would the ability
> > to sort the package resource table by specific fields not address
> > this?
>
> Just means so that you don't get 10 copies of each file displayed for a
> set of time-released data when you are only interested in the XML
> version for each. Having the abstract concept of "data" and then the
> different formats for the data allows this.
>
> > > 2. There needs to be a mechanism to timeseries data, so that
> > > search results only display the latest package. This needs
> > > to be done in a way that the older data is still easily
> > > accessible. This should be done with the minimum of user
> > > effort.
> >
> > More generally there are sets of data (e.g. wikipedia breaks up dumps
> > by page letter). However, isn't this exactly what packages with
> > multiple resources allow one to do?
>
> No because there is no structure explaining what data the files relate
> to, which are duplicates in different formats and which represent
> different data. That's all I'm trying to introduce here, a little bit of
> structure. I don't find a long list of files a good solution and it will
> get worse the longer a timeseries of data runs for.
>
> > So for time series (e.g. monthly release of unemployment data) these
> > would all go in the same package. Personally I think it would be more
> > useful for them to consolidate this data rather than have a huge list
> > of months -- e.g. one consolidated file with all data and, say, 2-3
> > separate files of the latest months.
>
> I agree, but one step at a time! Getting people to publish the data at
> all is the first step and a gap in the timeseries should act as a bit of
> an incentive.
>
> > > 3. The ordering of the data should be presented without the
> > > need for user input.
> >
> > What exactly does this mean? Resources are automatically ordered
> > without user input.
>
> If we know it is a timeseries, we shouldn't need the user to manually
> click up and down to specify that January comes before February.
>
> > > 4. There needs to be more information stored against the data,
> > > beyond just its format and a description.
> >
> > Strongly agree and believe we have already just implemented the
> > ability to have arbitrary attributes on individual resources.
>
> Sure, as I mentioned, I'm happy for the resource attributes to use the
> key/value pairs. I don't feel putting the other structural elements in
> there too is the right solution, but we could do.
>
> > > 5. Users should be able to refashion the data and post a
> > > whole new set of this derived data.
> >
> > To give a concrete example: original data is Excel and I want to
> > convert to Google Docs / CSV / JSON / RDF. Definitely agree this is
> > very important. Past discussions [1] have covered the 2
> > alternatives (both of which have happened on CKAN): a) new related
> > package b) new resource in existing package.
> >
> > [1]: http://lists.okfn.org/pipermail/ckan-discuss/2010-October/000651.html
>
> Great! Well, this would give an option 3 of sharing at the data group
> level once we have a data_group. As explained in the original email,
> this might have some advantages, mainly keeping a clear distinction
> between package owner and derived dataset owner whilst keeping them
> together in the interface and search. (Phase 3 thing though, so we can
> discuss more later.)
>
> > > 6. Groups of data should be able to be synced across
> > > packages/instances, so that a derived data set can be
> > > associated with existing packages.
> >
> > Why? Not convinced about this one. Seems to only follow if we think
> > explicit groups are important.
>
> Groups such as timeseries. Which I think are important.
>
> > > 9. A place to put a time-to-release date against a piece of
> > > potential data.
> >
> > Package or 'data'? Is it really important to have this for a resource
> > rather than a package? If so can we do this via existing statefulness
> > of PackageResource association?
>
> Yes, because say we might not want the next month's spending info to be
> displayed until midnight on the last day of the previous month. We don't
> want to hide the whole package while waiting for that to happen.
>

I think there are two use cases we have wanted to cover here:
1. a pending resource
2. a pending package - e.g. the OS data last April had an embargo on both
the files and the exact details of what was being released, so the whole
package and resources went live together at a certain time.

So we'd make use of pending state of both resources and packages.

> Sorry, what is "statefulness of PackageResource association"?
>

Basically all the objects in CKAN are 'stateful' and most are 'revisioned'
too. Statefulness is the 'state' property which can be 'active', 'pending'
or 'deleted'. Each resource (connected to a package) is represented by a
PackageResource, which uses an SQLAlchemy association. Hope that clears it
up...
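
A minimal sketch of how the two pending cases could fall out of a 'state'
property, in plain Python rather than CKAN's actual model code. The class
names, the visible_resources() helper and the example.org URLs are all
assumptions made up for illustration; only the three state values come from
the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    PENDING = "pending"
    DELETED = "deleted"


@dataclass
class Resource:  # hypothetical stand-in, not CKAN's PackageResource
    url: str
    state: State = State.ACTIVE


@dataclass
class Package:  # hypothetical stand-in for a CKAN package
    name: str
    state: State = State.ACTIVE
    resources: list = field(default_factory=list)


def visible_resources(package):
    """Return the resources a public listing would show."""
    if package.state is not State.ACTIVE:
        return []  # case 2: the whole package is still pending
    # case 1: the package is live, but pending resources stay hidden
    return [r for r in package.resources if r.state is State.ACTIVE]


spending = Package("spending", resources=[
    Resource("http://example.org/jan.csv"),
    Resource("http://example.org/feb.csv", state=State.PENDING),
])
embargoed = Package("os-data", state=State.PENDING, resources=[
    Resource("http://example.org/os.zip", state=State.PENDING),
])
```

Here visible_resources(spending) yields only the January file, while
visible_resources(embargoed) yields nothing until the package itself is
switched to 'active'.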


>
> > > 10. We need a marker to show that the certain data sets are
> > > missing.
> >
> > Can we have a concrete use case (this is a requirement not a use case ;) )
>
> This is the case of showing that the next "data" item in a timeseries
> has no resources next to it, to highlight that someone has failed to
> publish it.
>

The 'marker' should be derived from information about when a release is due
and the time coverage of existing data. I don't think it should be some sort
of resource object as a placemarker, if that's what you're suggesting.

What we do need a placemarker for is a subtly different case, to signify
that there was no data for a period. A user must be able to create a
resource that doesn't have a URL, but does have a time-period. e.g. a month
where there are no spending transactions over £25k.
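
A rough sketch of the distinction, assuming monthly releases and assuming a
month counts as due once it has ended. The Resource class, its 'period'
field and both helper functions are hypothetical names for illustration,
not existing CKAN code. A resource with a period but no URL records "no
data for this month"; the missing-release marker is computed, not stored.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Resource:  # hypothetical, for illustration only
    period: tuple              # (year, month) the resource covers
    url: Optional[str] = None  # None = explicit "no data for this period"


def months_due(start, today):
    """List (year, month) pairs from start up to, not including, today's month."""
    due = []
    year, month = start
    while (year, month) < (today.year, today.month):
        due.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return due


def missing_periods(resources, start, today):
    """Periods that are due but not covered by any resource or placeholder."""
    covered = {r.period for r in resources}
    return [p for p in months_due(start, today) if p not in covered]


resources = [
    Resource((2010, 10), "http://example.org/oct.csv"),
    Resource((2010, 11)),  # placeholder: nothing to report that month
    Resource((2011, 1), "http://example.org/jan.csv"),
]
overdue = missing_periods(resources, start=(2010, 10), today=date(2011, 2, 4))
```

With this data only December 2010 comes out as overdue: November is
covered by the URL-less placeholder, so it is not flagged as a failure to
publish.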


> > > 11. Dashboard of what data has been released, when is going
> > > to be released.
> >
> > Agree this is important but not sure what requirement this imposes on
> > 'resources' if any.
>
> You need to model the timeseries to know what is coming up ;) This
> proposal is the most correct and least impactful way I can think of
> doing that, but open to suggestions?
>
> Cheers,
>
> James
>