[ckan-discuss] [ckan-dev] Package Resources Proposal

Wed Feb 2 18:21:52 GMT 2011

Hi Rufus,

> I'm going to focus entirely on the use cases / requirements, as these
> should drive whatever implementation we choose to do.

OK, but you haven't responded to the three DGU/UKLP use cases outlined
in the original email, and those are the most important.

I'll respond to some of the other points, but these three (all copied
from the original email) should be the main focus of discussion in that
case:

* Differentiating the type of resource

  There is a need to do this for UKLP where some of the data associated
  with a package is metadata, some is data and some could be a service.
  We could have different tables for the different types of resource
  or keep things simple at this stage by just having a type.

* Regular releases of the same dataset

  The UK government requires departments to release certain data sets
  once a month. They'd like to be able to highlight which datasets are
  for which month rather than just have a huge list of resources. We
  therefore need to be able to form a series of data within a *single*
  package. 

* Group different formats of the same data together.

  For example if the treasury publish spending for May 2010 in XLS, CSV
  and XML format I want to know that the data in each of these files is
  the same so that I can display them all under the "May 2010" label
  obtained from ``data_group``. To do this we can add a new table named
  ``data`` which would link individual resources to a data group.

  There may also be times when we want to show that a set of data is 
  missing when it is supposed to be there. We can do this by having 
  an entry in the ``data`` table but associating no resources with it.

  We may also want certain releases to be uploaded but not released
  until a certain time. We can do that with a flag on the ``data``
  item rather than needing a flag on each resource.
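
To make the three use cases above a bit more concrete, here is a rough
sketch of the structure in plain Python. It is only an illustration: the
class and field names (``Resource``, ``Data``, ``DataGroup``, ``type``,
``released``) are my assumptions and not the existing CKAN model, and I
have put the "May 2010" label on the ``data`` item purely for brevity;
where the label actually lives is still open.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """One file (or endpoint) in one particular format."""
    url: str
    format: str            # e.g. "XLS", "CSV", "XML"
    type: str = "data"     # assumed values: "metadata", "data", "service"


@dataclass
class Data:
    """The abstract piece of data, independent of file format."""
    label: str
    resources: List[Resource] = field(default_factory=list)
    released: bool = True  # one flag per data item, not per resource

    def is_missing(self) -> bool:
        # An entry with no resources marks data that should exist but doesn't.
        return not self.resources


@dataclass
class DataGroup:
    """A series of data items within a single package."""
    name: str
    items: List[Data] = field(default_factory=list)


# Hypothetical example data; the URLs are placeholders.
may = Data("May 2010", [
    Resource("http://example.org/spending-2010-05.xls", "XLS"),
    Resource("http://example.org/spending-2010-05.csv", "CSV"),
    Resource("http://example.org/spending-2010-05.xml", "XML"),
])
june = Data("June 2010")  # expected, but nothing published yet
july = Data("July 2010",
            [Resource("http://example.org/spending-2010-07.csv", "CSV")],
            released=False)  # uploaded, but held back for now
series = DataGroup("monthly-spending", [may, june, july])

for item in series.items:
    formats = ", ".join(r.format for r in item.resources) or "(missing)"
    print(item.label, formats, "released" if item.released else "held")
```

The point is just that the three formats for May hang off one ``data``
item, so the interface can show one row per month instead of one row per
file.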

> One
> small issue I have at the moment is the use of the term 'data' in the
> use cases. I presume this basically mean 'single file containing data'
> i.e. a resource. Comments below.

No, "data" should refer to the abstract concept of some data; a
resource is then a file implementing that data in a particular format.

> Not sure what it means by "do not get duplicated". Would the ability
> to sort the package resource table by specific fields not address
> this?

It just means that you don't get 10 copies of each file displayed for a
set of time-released data when you are only interested in the XML
version of each. Having the abstract concept of "data", with the
different formats hanging off it, allows this.
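
As a small illustration of the "only the XML version" case, assuming
each resource records which ``data`` item it belongs to (the tuple
layout and function names below are hypothetical, not anything in CKAN
today):

```python
# Hypothetical flat listing of resources: (data label, format, url).
resources = [
    ("January 2011", "XLS", "http://example.org/2011-01.xls"),
    ("January 2011", "XML", "http://example.org/2011-01.xml"),
    ("February 2011", "CSV", "http://example.org/2011-02.csv"),
    ("February 2011", "XML", "http://example.org/2011-02.xml"),
]


def group_by_data(rows):
    """Collect the formats available for each data item."""
    grouped = {}
    for label, fmt, url in rows:
        grouped.setdefault(label, {})[fmt] = url
    return grouped


def pick_format(rows, fmt):
    """Return one URL per data item in the requested format (None if absent)."""
    return {label: formats.get(fmt)
            for label, formats in group_by_data(rows).items()}


xml_only = pick_format(resources, "XML")
```

Without the link from resource to data item there is nothing to group
on, so sorting the resource table alone can't give you this view.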

> > 2. There needs to be a mechanism to timeseries data, so that
> > search results only display the latest package. This needs
> > to be done in a way that the older data is still easily
> > accessible. This should be done with the minimum of user
> > effort.
> 
> More generally there are sets of data (e.g. wikipedia breaks up dumps
> by page letter). However, isn't this exactly what packages with
> multiple resources allow one to do?

No, because there is no structure explaining what data the files relate
to, which are duplicates in different formats and which represent
different data. That's all I'm trying to introduce here: a little bit of
structure. I don't find a long list of files a good solution, and it will
get worse the longer a timeseries runs.

> So for time series (e.g. monthly release of unemployment data) these
> would all go in the same package. Personally I think it would be more
> useful for them to consolidate this data rather than have a huge list
> of months -- e.g. one consolidated file with all data and, say, 2-3
> separate files of the latest months.

I agree, but one step at a time! Getting people to publish the data at
all is the first step and a gap in the timeseries should act as a bit of
an incentive.

> > 3. The ordering of the data should be presented without the
> > need for user input.
> 
> What exactly does this mean? Resources are automatically ordered
> without user input.

If we know it is a timeseries, we shouldn't need the user to manually
click up and down to specify that January comes before February.
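
For instance, if each item in the series carried the period it covers
(the ``period`` field here is an assumption of mine, not an existing
attribute), the ordering falls out of a plain sort:

```python
from datetime import date

# Hypothetical series items, entered in whatever order they were uploaded.
releases = [
    {"label": "February 2011", "period": date(2011, 2, 1)},
    {"label": "December 2010", "period": date(2010, 12, 1)},
    {"label": "January 2011", "period": date(2011, 1, 1)},
]


def in_series_order(items):
    """Order items by the period they cover, not by upload order."""
    return sorted(items, key=lambda item: item["period"])


ordered_labels = [item["label"] for item in in_series_order(releases)]
```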

> > 4. There needs to be more information stored against the data,
> > beyond just its format and a description.
> 
> Strongly agree and believe we have already just implemented the
> ability to have arbitrary attributes on individual resources.

Sure, as I mentioned, I'm happy for the resource attributes to use the
key/value pairs. I don't feel putting the other structural elements in
there too is the right solution, but we could do.

> > 5. Users should be able to refashion the data and post a
> > whole new set of this derived data.
> 
> To give a concrete example: original data is excel and I want to
> convert to google docs / csv / json / rdf. Definitely agree this very
> important. In discussions in the past [1] we have discussed the 2
> alternatives (both of which have happened on CKAN): a) new related
> package b) new resource in existing package.
> 
> [1]: http://lists.okfn.org/pipermail/ckan-discuss/2010-October/000651.html

Great! Well, once we have a data_group this would give an option 3 of
sharing at the data group level. As explained in the original email,
this might have some advantages, mainly keeping a clear distinction
between package owner and derived dataset owner whilst keeping them
together in the interface and search. (It's a Phase 3 thing though, so
we can discuss more later.)

> > 6. Groups of data should be able to be synced across
> > packages/instances. In order for derived data set to be
> > associated with existing packages.
> 
> Why? Not convinced about this one. Seems to only follow if we think
> explicit groups are important.

Groups such as timeseries, which I think are important.

> > 9. A place to put a time-to-release date against a piece of
> > potential data.
> 
> Package or 'data'? Is it really important to have this for a resource
> rather than a package? If so can we do this via existing statefulness
> of PackageResource association?

Yes, because we might not want, say, next month's spending info to be
displayed until midnight on the last day of the previous month. We don't
want to hide the whole package while waiting for that to happen.
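
Roughly what I have in mind, assuming a hypothetical ``release_at``
value on each ``data`` item (``None`` meaning "no embargo"); the
timestamps are just illustrative:

```python
from datetime import datetime

# Hypothetical data items within one package.
items = [
    {"label": "January 2011", "release_at": None},
    {"label": "February 2011", "release_at": datetime(2011, 1, 31, 0, 0)},
    {"label": "March 2011", "release_at": datetime(2011, 2, 28, 0, 0)},
]


def visible_items(items, now):
    """Keep items with no embargo or whose release time has passed."""
    return [item for item in items
            if item["release_at"] is None or item["release_at"] <= now]


now = datetime(2011, 2, 2, 18, 21)
visible_labels = [item["label"] for item in visible_items(items, now)]
```

The package itself and the already-released months stay visible
throughout; only the embargoed item is filtered out.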

Sorry, what is "statefulness of PackageResource association"?

> > 10. We need a marker to show that the certain data sets are
> > missing.
> 
> Can we have a concrete use case (this is a requirement not a use case ;) )

This is the case where the next "data" item in a timeseries is shown
with no resources next to it, to highlight that someone has failed to
publish it.
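
A minimal sketch of how the gaps could be found, assuming a monthly
series and a hypothetical mapping from each period to the resources
published for it:

```python
from datetime import date


def months_between(start, end):
    """Yield the first day of each month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def missing_periods(start, end, published):
    """Return the expected months that have no resources against them."""
    return [period for period in months_between(start, end)
            if not published.get(period)]


# Hypothetical published resources, keyed by the month they cover.
published = {
    date(2010, 11, 1): ["http://example.org/2010-11.csv"],
    date(2010, 12, 1): [],  # entry exists but nothing attached
    date(2011, 1, 1): ["http://example.org/2011-01.csv"],
}

gaps = missing_periods(date(2010, 11, 1), date(2011, 2, 1), published)
```

Here December (an empty entry) and February (no entry at all) both come
back as gaps, which is the marker the interface would display.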

> > 11. Dashboard of what data has been released, when is going
> > to be released.
> 
> Agree this is important but not sure what requirement this imposes on
> 'resources' if any.

You need to model the timeseries to know what is coming up ;) This
proposal is the most correct and least impactful way I can think of
doing that, but I'm open to suggestions.

Cheers,

James