[okfn-labs] automating connection of CKAN to R stats package

Tue Jul 31 14:10:49 UTC 2012

On Fri, Jul 27, 2012 at 9:02 AM, Friedrich Lindenberg
<friedrich.lindenberg at okfn.org> wrote:

> that's a really nice demo! One thing about this that I have thought
> about for a while is what you do with "head" in your video: finding
> overlapping columns between datasets. Assume you've got a database
> like this:

> http://opendatalabs.org/misc/ckan-dataset-links.db
>
> This has each resource in CKAN (in fact it doesn't - tried running it
> overnight but there are quite a lot of errors coming from ES), and, in a
> second table, the names of each field in these (if they were in the
> DataStore). A final table has the facets, i.e. the 100 most common
> values of that field.
>
> How can you make a link recommender? Would you do it on column name,
> value overlap, ...? Could this happen in pyBossa or even fully
> unsupervised?

> Having such a recommender would make both Ronalds happy: it lowers the
> cost of finding related data and gives you all the junk links you can
> eat....

Friedrich, these are the right questions, and thank you for the link
to the existing work on this.

I have not written it up yet, but I have been working on the
automated/manual inference of relations between datasets to which you
refer. My assumption is that an automatic tool could make
recommendations to be adjudicated manually via pyBossa, which seems to
be your thinking too. It is possible to include a confidence weighting
as part of the output.

Currently I'm ignoring column names (my day job involves not trusting
column names, which have more failure modes than I'd care to
enumerate). When producing a set of linked datasets, deciding on
tidied-up column names is one of the pieces of work to be done.
However, public datasets are likely to have better column names than
the sort of data I normally have to deal with :)

My tool (in the next phase of Project Ronald) looks at what you're
calling "value overlap": if you take two columns, does either contain
unique values, and is one set of values a subset of the other? This
permits a partial ordering over the set of all columns in the set of
tabular datasets. However, the cost of computing this is (I think)
O(M log M * N log N), for M columns and at most N rows per column. In
practice, a lot of the M columns can be ignored, but N might be in the
tens of millions in some cases (e.g., US Census).
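
A minimal sketch of the value-overlap check described above, not the
actual Project Ronald tool. It assumes every column fits in memory as
a Python list, and it takes one particular reading of the test: a link
is proposed from a "child" column to a "parent" column when the parent
holds unique values and the child's distinct values are a subset of
the parent's. The names column_profile and candidate_links, and the
sample data, are hypothetical.

```python
from itertools import permutations


def column_profile(values):
    """Return (set of distinct non-null values, True if no value repeats)."""
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return distinct, len(distinct) == len(non_null)


def candidate_links(columns):
    """Propose (child, parent) column pairs.

    columns maps a (dataset, column) name to a list of values.
    A pair is proposed when the parent column is unique and the
    child's distinct values are a non-empty subset of the parent's.
    """
    profiles = {name: column_profile(vals) for name, vals in columns.items()}
    links = []
    # Compare every ordered pair of columns (brute force, for clarity).
    for child, parent in permutations(profiles, 2):
        child_vals, _ = profiles[child]
        parent_vals, parent_unique = profiles[parent]
        if child_vals and parent_unique and child_vals <= parent_vals:
            links.append((child, parent))
    return links


# Hypothetical sample data: a lookup table and a table that refers to it.
sample_columns = {
    ("states", "code"): ["CA", "NY", "TX"],
    ("census", "state"): ["CA", "CA", "NY"],
    ("census", "population"): [10, 20, 30],
}
links = candidate_links(sample_columns)
```

This brute-force version compares every ordered pair of columns, so it
is only practical once most columns have been ruled out beforehand.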

I envisage that there'll be a data annotation role separate from
uploading/tidying datasets, which is authoritatively saying "the types
of the data in these columns are X, Y & Z and Y always has to be a
member of some set A in some other dataset B", in some
machine-readable way, and that, ultimately, end-user apps will be able
to use this data not just for retrieving the data they need but for
linking datasets prior to use/analysis.
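
To illustrate what such a machine-readable annotation might look like,
here is a rough sketch. The dictionary layout, the "member_of" key and
the violations function are all invented for illustration; they are
not an existing CKAN or DataStore format.

```python
# Hypothetical annotation: column types for one dataset, plus a
# constraint that "state" must be a member of states.code.
annotation = {
    "dataset": "census",
    "columns": {
        "state": {
            "type": "string",
            "member_of": {"dataset": "states", "column": "code"},
        },
        "population": {"type": "integer"},
    },
}


def violations(annotation, datasets):
    """Return (column, value) pairs that break a member_of constraint.

    datasets maps a dataset name to a list of row dictionaries.
    """
    rows = datasets[annotation["dataset"]]
    problems = []
    for column, spec in annotation["columns"].items():
        ref = spec.get("member_of")
        if ref is None:
            continue  # no membership constraint declared for this column
        allowed = {row[ref["column"]] for row in datasets[ref["dataset"]]}
        for row in rows:
            if row[column] not in allowed:
                problems.append((column, row[column]))
    return problems


# Hypothetical sample data with one bad value ("ZZ").
sample_datasets = {
    "states": [{"code": "CA"}, {"code": "NY"}],
    "census": [
        {"state": "CA", "population": 10},
        {"state": "ZZ", "population": 5},
    ],
}
problems = violations(annotation, sample_datasets)
```

An end-user app could read the same annotation to decide which columns
to join on before analysis.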

Mk