[ckan-discuss] implied datasets

Tue May 24 00:21:03 BST 2011

Excellent suggestion.

Others can speak to the computer science. I can only speak to how I would
like to be able to use the system as a developer.

I would like to be able to nominate a column as a key and have CKAN do the
leg work.

I feel that as elements such as Google Refine and storage become integrated
into CKAN, being able to join multiple datasets together in the browser would
be a significant boost to the system's usefulness.
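
To give a rough idea of what I mean by nominating a key column, here is a
minimal sketch in plain Python. Nothing in it is CKAN API: the function name
join_on_key and the sample data are made up for illustration, and it assumes
both files have already been parsed into rows keyed by column name.

```python
import csv
import io


def join_on_key(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on a nominated key column.

    Hypothetical helper for illustration only; not part of CKAN.
    """
    # Index the right-hand rows by key value so each lookup is cheap.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        # Rows whose key value has no match on the right are dropped.
        for right in index.get(left[key], []):
            merged = dict(left)
            merged.update(right)
            joined.append(merged)
    return joined


# Made-up sample data standing in for two CSV resources.
books = list(csv.DictReader(io.StringIO(
    "isbn,title\n"
    "111,Alpha\n"
    "222,Beta\n"
)))
loans = list(csv.DictReader(io.StringIO(
    "isbn,library\n"
    "111,Central\n"
    "333,North\n"
)))

result = join_on_key(books, loans, "isbn")
```

Only the row with ISBN 111 appears in both files, so that is the only row
in the joined result.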

Tim McNamara

On 24 May 2011 00:41, William Waites <ww at styx.org> wrote:

> Hello all,
>
> I'm writing two versions of this question. It is motivated by an RDF
> case but it is more general so I put it to the ckan list in other
> terms. If you look at this graph here,
>
>    http://semantic.ckan.net/group/?group=http://ckan.net/group/lld
>
> you'll see that there is an obvious partition, with a group of three
> large datasets floating off to the side. This is not really a
> partition.
>
> Suppose the datasets we are considering are CSV files. Suppose we
> have two CSV files and they have a column in common, something that
> is useful for doing a join, like an ISBN or an ISSN or whatever. We
> might want to say, in CKAN, that these two datasets are connected,
> that you can sensibly do a join operation in a certain way in a
> certain column. We might use the package relationship thing for this,
> or we might adopt a convention for using the extras for it, it doesn't
> really matter.
>
> What does matter is when we add a third dataset that also has a column
> with the same meaning. Now, how do we express that you can join it
> with the others? If we do the same thing with the extras we quickly
> get into a situation where we have O(n^2) such links in the number of
> datasets. If we make fewer explicit links, users of this data have to
> have some way of dealing with transitivity but that only works iff all
> the datasets in question have the same entries in that column.
>
> So what I suggest to do for this case is invent a dataset that doesn't
> really exist but could in principle (though it's probably not worth
> the trouble to create it explicitly and maintain it). This virtual
> dataset would have a single column with the same meaning as these
> common columns and it would have the union of all the values for that
> column in any of the datasets.
>
> Then each of the datasets would have an extra or relationship or
> whatever that says, you can sensibly join column X with this
> dataset. A transitive join between any two real datasets through this
> virtual one makes sense because it has all possible values, and we
> only have to maintain O(n) extras and a small amount of extra metadata
> for the virtual dataset.
>
> Is this an acceptable use of CKAN, to record a dataset that doesn't
> actually exist but could in principle for this purpose?
>
> -w
>
> --
> William Waites                <mailto:ww at styx.org>
> http://river.styx.org/ww/        <sip:ww at styx.org>
> F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45
>
> _______________________________________________
> ckan-discuss mailing list
> ckan-discuss at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/ckan-discuss
>
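
The proposal quoted above can be sketched in a few lines of plain Python.
This is only an illustration under stated assumptions: the dataset names,
the key values and the hub name "virtual-isbn" are all made up, and the
links are modelled as simple tuples rather than as CKAN extras or package
relationships.

```python
from itertools import combinations

# Made-up key columns (think ISBNs) for four real datasets.
datasets = {
    "a": {"111", "222"},
    "b": {"222", "333"},
    "c": {"333", "444"},
    "d": {"111", "444"},
}

# Pairwise approach: one link per pair of datasets, n*(n-1)/2 in total.
pairwise_links = list(combinations(sorted(datasets), 2))

# Virtual-dataset approach: one link from each real dataset to the hub.
VIRTUAL = "virtual-isbn"  # hypothetical name for the virtual dataset
hub_links = [(name, VIRTUAL) for name in sorted(datasets)]

# The virtual dataset's single column is the union of all key values.
virtual_keys = set().union(*datasets.values())


def join_via_virtual(left, right):
    """Return the key values on which two real datasets can be joined,
    found by walking through the virtual dataset's column."""
    return {
        key for key in virtual_keys
        if key in datasets[left] and key in datasets[right]
    }
```

With four datasets the pairwise scheme needs six links and the hub scheme
needs four, and the gap widens as more datasets are added. Because the
virtual column holds every value seen in any dataset, joining through it
gives the same answer as joining the two real datasets directly.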