[okfn-labs] automating connection of CKAN to R stats package

Friedrich Lindenberg friedrich.lindenberg at okfn.org
Fri Jul 27 08:02:22 UTC 2012


Hey Martin,

that's a really nice demo! One thing about this that I have thought
about for a while is what you do with "head" in your video: finding
overlapping columns between datasets. Assume you've got a database
like this:

http://opendatalabs.org/misc/ckan-dataset-links.db

This is meant to have each resource in CKAN (in fact it doesn't: I
tried running it overnight, but there are quite a lot of errors coming
from ES) and, in a second table, the names of the fields in each
resource (if it was in the DataStore). A final table has the facets,
i.e. the 100 most common values of each field.

How can you make a link recommender? Would you do it on column name,
value overlap, ...? Could this happen in pyBossa or even fully
unsupervised?
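
To make the question concrete, here is a rough sketch of the fully
unsupervised variant. It scores every pair of columns from different
resources on a weighted mix of column-name similarity and overlap
between their most common values. Everything in it is an assumption
for illustration: the function names, the weights, the threshold and
the sample data are made up, and it reads from a plain dict rather
than from the actual tables in the database above.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Similarity of two column names in [0, 1], ignoring case and separators."""
    def norm(s):
        return s.lower().replace("_", "").replace("-", "").replace(" ", "")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def value_overlap(a, b):
    """Jaccard overlap between two sets of facet values."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend_links(fields, name_weight=0.3, threshold=0.5):
    """Suggest joinable column pairs, best first.

    `fields` maps (resource_id, column_name) to the set of that column's
    most common values. The weight and threshold are arbitrary guesses.
    """
    candidates = []
    for (key_a, vals_a), (key_b, vals_b) in combinations(fields.items(), 2):
        if key_a[0] == key_b[0]:
            continue  # only link columns across different resources
        score = (name_weight * name_similarity(key_a[1], key_b[1])
                 + (1 - name_weight) * value_overlap(vals_a, vals_b))
        if score >= threshold:
            candidates.append((round(score, 3), key_a, key_b))
    return sorted(candidates, reverse=True)


# Hypothetical sample data standing in for the fields and facets tables.
fields = {
    ("spending", "country_code"): {"GB", "DE", "FR", "IT"},
    ("population", "CountryCode"): {"GB", "DE", "FR", "ES"},
    ("population", "year"): {"2010", "2011"},
}
links = recommend_links(fields)
for score, left, right in links:
    print(score, left, right)
```

Pairs that score near the threshold are the obvious ones to hand to
people (e.g. as pyBossa tasks) instead of trusting the score.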

Having such a recommender would make both Ronalds happy: it lowers the
cost of finding related data and gives you all the junk links you can
eat....

- Friedrich

On Fri, Jul 27, 2012 at 1:29 AM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> On 26 July 2012 22:00, Martin Keegan <martin.keegan at okfn.org> wrote:
>> Hello,
>>
>> part of my project exploring automating processing of tabular data has
>> been recorded as a video, which is here:
>> http://mk.ucant.org/media/ckan-to-r.flv; the last three posts on my
>> blog give some more details; if you were at the recent OKF staff
>> summit you'll have seen a failed demo of basically what's in the
>> video, which goes out of its way to show it's not being faked - the
>> real work could be typed in in about 20 seconds.
>
> this is fantastic Martin. Data from CKAN -> R -> integrated and
> analyzed in 20s :-)
>
> BTW for those not able to dig out the posts they are:
>
> Project Ronald, an introduction: http://blog.ucant.org/?p=393
>
> Project Ronald, an example: http://blog.ucant.org/?p=414
>
> Quoting from the second of these posts:
>
> <quote>
> The first objective of Project Ronald is to make it easy to connect
> tabular datasets quickly: given two openly-licensed tabular datasets
> containing a common field, but published by different organisations,
> it ought to be possible to get them downloaded and joined together in
> a few seconds. The approach is to identify the components of a system
> which would do this, implement a minimal version of each, check that
> the system works as a whole, and then go about replacing each
> component with better tools, preferably ones already written and
> matured by someone else.
> </quote>
>
> And finally code on github:
>
> https://github.com/mk270/ronald
>
> Rufus
>
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okfn-labs



