[okfn-labs] automating connection of CKAN to R stats package

Mark Wainwright mark.wainwright at okfn.org
Tue Aug 7 14:09:32 UTC 2012


It's worth pointing out that this question of inferring relations is
just the problem that is addressed, at least in the case of linked
data, by the Silk Link Discovery Framework
<http://www4.wiwiss.fu-berlin.de/bizer/silk/>.

Anyway, I agree that it'd be good to write up something for the CKAN
blog (Martin, I'll drop you a note).

Mark

On 31 July 2012 15:10, Martin Keegan <martin.keegan at okfn.org> wrote:
> On Fri, Jul 27, 2012 at 9:02 AM, Friedrich Lindenberg
> <friedrich.lindenberg at okfn.org> wrote:
>
>> that's a really nice demo! One thing about this that I have thought
>> about for a while is what you do with "head" in your video: finding
>> overlapping columns between datasets. Assume you've got a database
>> like this:
>
>> http://opendatalabs.org/misc/ckan-dataset-links.db
>>
>> This has each resource in CKAN (in fact it doesn't - tried running it
>> overnight but there are quite a lot of errors coming from ES), and, in a
>> second table, the names of each field in these (if they were in the
>> DataStore). A final table has the facets, i.e. the 100 most common
>> values of that field.
>>
>> How can you make a link recommender? Would you do it on column name,
>> value overlap, ...? Could this happen in pyBossa or even fully
>> unsupervised?
>
>> Having such a recommender would make both Ronalds happy: it lowers the
>> cost of finding related data and gives you all the junk links you can
>> eat....
>
> Friedrich, these are the right questions, and thank you for the link
> to the existing work on this.
>
> I have not written it up yet, but I have been working on the
> automated/manual inference of relations between datasets to which you
> refer, with the assumption that an automatic tool could make
> recommendations to be adjudicated manually by pyBossa, which seems to
> be your thinking too; it is possible to include a confidence weighting as
> part of the output.
>
> Currently I'm ignoring column names (my day job involves not trusting
> column names, which have more failure modes than I'd care to
> enumerate). When producing a set of linked datasets, deciding on
> tidied-up column names is one of the pieces of work to be done.
> However, public datasets are likely to have better column names than
> the sort of data I normally have to deal with :)
>
> My tool (in the next phase of Project Ronald) looks at what you're
> calling "value overlap": if you take two columns, does either contain
> unique values, and is one set of values a subset of the other? This
> permits a partial ordering over the set of all columns in the set of
> tabular datasets. However, the cost of computing this is (I think)
> M log M * N log N (for M columns and maximum N rows per column). In
> practice, a lot of columns in M can be ignored, but N might be in the
> tens of millions in some cases (e.g., US Census).
>
> I envisage that there'll be a data annotation role separate from
> uploading/tidying datasets, which is authoritatively saying "the types
> of the data in these columns are X, Y & Z and Y always has to be a
> member of some set A in some other dataset B", in some
> machine-readable way, and that, ultimately, end-user apps will be able
> to use this data not just for retrieving the data they need but for
> linking datasets prior to use/analysis.
>
> Mk
>
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okfn-labs
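
For anyone who wants something concrete to poke at, here is a minimal
sketch of the "value overlap" test Martin describes above: is the
candidate parent column unique, and are the other column's values a
subset of it? It is not Martin's tool. The function names, the sample
column names and the confidence score (the fraction of one column's
distinct values found in the other) are all assumptions made up for
illustration, and it compares every pair of columns naively with
in-memory sets, so it says nothing about the cost he mentions.

```python
from itertools import permutations


def column_profile(values):
    """Summarise one column: distinct non-empty values, and whether they are unique."""
    non_empty = [v for v in values if v not in (None, "")]
    distinct = set(non_empty)
    return {"distinct": distinct, "unique": len(distinct) == len(non_empty)}


def suggest_links(columns):
    """Return (child, parent, confidence) suggestions, best first.

    A pair is suggested only if the parent column holds unique values.
    confidence is the fraction of the child's distinct values that also
    appear in the parent; 1.0 means the child is a subset of the parent.
    (Hypothetical scoring, chosen for illustration only.)
    """
    profiles = {name: column_profile(vals) for name, vals in columns.items()}
    suggestions = []
    for child, parent in permutations(profiles, 2):
        c, p = profiles[child], profiles[parent]
        if not p["unique"] or not c["distinct"]:
            continue  # parent cannot act as a key, or child has no values
        confidence = len(c["distinct"] & p["distinct"]) / len(c["distinct"])
        if confidence > 0:
            suggestions.append((child, parent, confidence))
    return sorted(suggestions, key=lambda s: -s[2])


# Made-up sample data: column name -> list of cell values.
columns = {
    "regions.code": ["E1", "E2", "E3"],
    "spending.region": ["E1", "E1", "E3"],
    "spending.amount": ["10", "20", "30"],
}

for child, parent, confidence in suggest_links(columns):
    print(child, "->", parent, confidence)
```

Suggestions scoring below 1.0 are the ones that would most obviously
want a human to adjudicate them, e.g. via pyBossa.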



-- 
Mark Wainwright, CKAN Community Co-ordinator
Open Knowledge Foundation http://okfn.org/
CKAN on Twitter: @CKANproject



