[ckan-discuss] implied datasets

Tue May 24 09:36:38 BST 2011

On 23 May 2011 13:41, William Waites <ww at styx.org> wrote:

> Hello all,
>
> I'm writing two versions of this question. It is motivated by an RDF
> case but it is more general so I put it to the ckan list in other
> terms. If you look at this graph here,
>
>    http://semantic.ckan.net/group/?group=http://ckan.net/group/lld
>
> you'll see that there is an obvious partition, with a group of three
> large datasets floating off to the side. This is not really a
> partition.
>
> Suppose the datasets we are considering are CSV files. Suppose we
> have two CSV files and they have a column in common, something that
> is useful for doing a join, like an ISBN or an ISSN or whatever. We
> might want to say, in CKAN, that these two datasets are connected,
> that you can sensibly do a join operation in a certain way in a
> certain column. We might use the package relationship thing for this,
> or we might adopt a convention for using the extras for it, it doesn't
> really matter.
>
> What does matter is when we add a third dataset that also has a column
> with the same meaning. Now, how do we express that you can join it
> with the others? If we do the same thing with the extras we quickly
> get into a situation where we have O(n^2) such links in the number of
> datasets. If we make fewer explicit links, users of this data have to
> have some way of dealing with transitivity but that only works iff all
> the datasets in question have the same entries in that column.
>
> So what I suggest to do for this case is invent a dataset that doesn't
> really exist but could in principle (though it's probably not worth
> the trouble to create it explicitly and maintain it). This virtual
> dataset would have a single column with the same meaning as these
> common columns and it would have the union of all the values for that
> column in any of the datasets.
>
> Then each of the datasets would have an extra or relationship or
> whatever that says, you can sensibly join column X with this
> dataset. A transitive join between any two real datasets through this
> virtual one makes sense because it has all possible values, and we
> only have to maintain O(n) extras and a small amount of extra metadata
> for the virtual dataset.
>

I like this idea.
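
To make the link counting concrete, here is a rough sketch in plain Python. It does not use any CKAN API; the dataset names, the ISBN-like values and the "virtual-isbn" name are all made up for illustration.

```python
from itertools import combinations

# Hypothetical datasets: name -> set of values in the shared column (e.g. ISBNs).
datasets = {
    "library-a": {"isbn-1", "isbn-2"},
    "library-b": {"isbn-2", "isbn-3"},
    "library-c": {"isbn-3", "isbn-4"},
    "library-d": {"isbn-1", "isbn-4"},
}

# Pairwise approach: one link per unordered pair of datasets, n*(n-1)/2 in total.
pairwise_links = list(combinations(sorted(datasets), 2))

# Virtual-dataset approach: one link from each real dataset to the hub, n in total.
VIRTUAL = "virtual-isbn"  # hypothetical name for the implied dataset
hub_links = [(name, VIRTUAL) for name in sorted(datasets)]

# The implied dataset's single column is the union of all values seen anywhere.
virtual_column = set().union(*datasets.values())


def join_via_hub(left, right):
    """Values on which two real datasets can be joined, found through the hub."""
    return {v for v in virtual_column if v in datasets[left] and v in datasets[right]}


print(len(pairwise_links), "pairwise links vs", len(hub_links), "hub links")
print(sorted(join_via_hub("library-a", "library-b")))
```

With four datasets the difference is small, but the pairwise count grows quadratically while the hub count grows by one per dataset added.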

> Is this an acceptable use of CKAN, to record a dataset that doesn't
> actually exist but could in principle for this purpose?
>

I think so. This is similar to (though not the same as) the concept of
virtual or meta packages in standard software packaging (e.g. in Debian).
There one often has virtual packages which don't really exist but which are
'provided' by some other package (a classic example is 'Email', which is
provided by 'Postfix', 'Exim', etc.). This again solves an O(n^2) problem:
you don't want something that requires email to have to say "I need one of
the following N packages". Instead it can just require the 'virtual' Email
package, with this requirement satisfied by some specific other package
such as Postfix.
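
As a toy illustration of the 'provides' idea (not how any real package manager is implemented; the index below and its package names are invented for the example):

```python
# Hypothetical package index: real package -> virtual names it provides.
provides = {
    "postfix": {"email"},
    "exim": {"email"},
    "nginx": {"httpd"},
}


def providers(virtual_name):
    """Real packages that declare they provide the given virtual name."""
    return sorted(pkg for pkg, names in provides.items() if virtual_name in names)


def is_satisfied(requirement, installed):
    """True if the requirement is installed directly or via any provider."""
    return requirement in installed or any(
        pkg in installed for pkg in providers(requirement)
    )


print(providers("email"))
print(is_satisfied("email", {"postfix"}))
```

A package that needs email states one requirement on the virtual name, and each provider states one 'provides', so nobody has to enumerate the alternatives.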

Rufus