[ckan-discuss] implied datasets

Mon May 23 13:41:16 BST 2011

Hello all,

I'm writing two versions of this question. It is motivated by an RDF
case but it is more general so I put it to the ckan list in other
terms. If you look at this graph here,

    http://semantic.ckan.net/group/?group=http://ckan.net/group/lld

you'll see that there is an obvious partition, with a group of three
large datasets floating off to the side. This is not really a 
partition.

Suppose the datasets we are considering are CSV files. Suppose we
have two CSV files and they have a column in common, something that
is useful for doing a join, like an ISBN or an ISSN or whatever. We
might want to say, in CKAN, that these two datasets are connected,
that you can sensibly do a join operation in a certain way in a
certain column. We might use the package relationship thing for this,
or we might adopt a convention for using the extras for it, it doesn't
really matter.

What does matter is when we add a third dataset that also has a column
with the same meaning. Now, how do we express that you can join it
with the others? If we do the same thing with the extras we quickly
get into a situation where we have O(n^2) such links in the number of
datasets. If we make fewer explicit links, users of this data have to
have some way of dealing with transitivity, but that works only if all
the datasets in question have the same entries in that column.
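
To make both problems concrete, here is a rough sketch in Python. The
dataset names and ISBN-like values are made up for illustration, and
each dataset is reduced to the set of values in its join column.

```python
from itertools import combinations

# Hypothetical join-column values (e.g. ISBNs) for three datasets.
datasets = {
    "a": {"isbn1", "isbn2", "isbn3"},
    "b": {"isbn2"},
    "c": {"isbn2", "isbn3"},
}

def pairwise_links(names):
    """One explicit link per unordered pair: n*(n-1)/2 links in total."""
    return list(combinations(sorted(names), 2))

def join_keys(x, y):
    """Values on which a direct join of x and y would match."""
    return datasets[x] & datasets[y]

def join_keys_via(x, mid, y):
    """Values that survive when x is joined to y only through mid."""
    return datasets[x] & datasets[mid] & datasets[y]

links = pairwise_links(datasets)          # 3 links for 3 datasets
direct = join_keys("a", "c")              # what a and c really share
through_b = join_keys_via("a", "b", "c")  # what survives going via b
```

With these values the direct join of a and c matches on two entries,
but following the links a-b and b-c matches on only one, because b
lacks the other. That is the transitivity problem, and the number of
pairwise links grows quadratically as datasets are added.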

So what I suggest for this case is to invent a dataset that doesn't
really exist but could in principle (though it's probably not worth
the trouble to create it explicitly and maintain it). This virtual
dataset would have a single column with the same meaning as these 
common columns and it would have the union of all the values for that
column in any of the datasets.

Then each of the datasets would have an extra or relationship or
whatever that says you can sensibly join column X with this
dataset. A transitive join between any two real datasets through this
virtual one makes sense because it has all possible values, and we
only have to maintain O(n) extras and a small amount of extra metadata
for the virtual dataset.
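
As a sketch of why this works, again in Python with made-up values:
the name "virtual-isbn" and the "joins_with" key are hypothetical
placeholders for whatever extra or relationship convention is chosen,
not an existing CKAN feature.

```python
# Hypothetical join-column values (e.g. ISBNs) for three real datasets.
datasets = {
    "a": {"isbn1", "isbn2", "isbn3"},
    "b": {"isbn2"},
    "c": {"isbn2", "isbn3"},
}

# The virtual dataset: a single column holding the union of all values.
virtual = set().union(*datasets.values())

# One link per real dataset, so O(n) links overall. The key name
# "joins_with" is an assumption made for this example.
hub_links = [
    {"dataset": name, "column": "isbn", "joins_with": "virtual-isbn"}
    for name in sorted(datasets)
]

def join_via_virtual(x, y):
    """Join x to the virtual dataset, then the virtual dataset to y."""
    return (datasets[x] & virtual) & (virtual & datasets[y])

# Going through the virtual dataset loses nothing compared with a
# direct join, because the virtual column contains every value.
lossless = all(
    join_via_virtual(x, y) == datasets[x] & datasets[y]
    for x in datasets
    for y in datasets
)
```

Because every real column is a subset of the virtual one, the detour
through it never drops a match, which is what makes the transitive
join safe here.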

Is it an acceptable use of CKAN to record, for this purpose, a dataset
that doesn't actually exist but could in principle?

-w

-- 
William Waites                <mailto:ww at styx.org>
http://river.styx.org/ww/        <sip:ww at styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45