[okfn-labs] OpenSpending

Derek Eder derek.eder at gmail.com
Tue Oct 9 22:01:24 UTC 2012


Hi Friedrich,

We've got a couple of really interesting threads here!

*Look at Cook & OpenSpending*
I still need to research the feasibility of replacing Google Fusion Tables
with the OpenSpending API (by the way, I noticed a few dead links:
http://openspending.org/data-loading.html and
http://openspending.org/settings). In the meantime, there's plenty to
discuss with dedupe and CKAN.

*Dedupe & CKAN*
Dedupe <https://github.com/open-city/dedupe> is a Python library created by
Forest Gregg (he's lurking on this list I believe) and me that uses machine
learning to create optimum blocking rules and weights for de-duplicating
(aka entity resolution or record linkage) any given dataset.

The library is built as an API, with one of the key pieces required being a
user labeling function. This is a method for a user to flag a small number
of record pairs chosen by dedupe as being duplicates or not. Once enough
training data is entered, weights and blocking rules are applied to the
rest of the dataset, and a collection of clustered records (a cluster being
a group of records that dedupe believes refer to the same entity) is
returned.
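
To make the labeling step concrete, here is a minimal sketch of what a
command-line labeling function could look like. It is not dedupe's actual
API: the name label_pairs, the ask parameter and the "match"/"distinct"
keys are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]


def label_pairs(
    pairs: Iterable[Pair],
    ask: Callable[[str], str] = input,
) -> Dict[str, List[Pair]]:
    """Ask a user whether each candidate pair refers to the same entity.

    Hypothetical helper, not part of dedupe. `ask` defaults to the
    built-in input() so it works on the command line, but any callable
    that takes a prompt and returns a string can be swapped in.
    """
    labels: Dict[str, List[Pair]] = {"match": [], "distinct": []}
    for left, right in pairs:
        prompt = f"{left}\n{right}\nSame entity? (y/n) "
        answer = ask(prompt).strip().lower()
        # Keep asking until we get an unambiguous answer.
        while answer not in ("y", "n"):
            answer = ask("Please answer y or n: ").strip().lower()
        key = "match" if answer == "y" else "distinct"
        labels[key].append((left, right))
    return labels
```

Calling label_pairs(pairs) with no second argument prompts on the console;
the resulting labeled pairs are the kind of training data described above.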

We currently have an example script
<https://github.com/open-city/dedupe/blob/master/examples/csv_example.py>
that demonstrates all of this. It takes a CSV file and a labeling function
that prompts on the command line for 'y' or 'n' input. The final result is
another CSV file containing the original data organized into clusters.

Our next task is to create an example script that uses a SQLite database
to handle a much larger dataset (campaign contributions in IL, ~1.7
million records). The release of this script and the subsequent changes to
the core library will mark an alpha release of dedupe.
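
As a rough illustration of why a database helps at that scale, here is a
sketch of blocking done in SQLite, so that only records sharing a key are
ever compared. This is not the planned example script: the donors table,
the candidate_pairs function and the fixed "first three letters of the
name" rule are assumptions for illustration, whereas dedupe learns its
blocking rules from the training data.

```python
import sqlite3


def candidate_pairs(rows):
    """Return (id, id) pairs of records that share a blocking key.

    `rows` is an iterable of (id, name) tuples. The blocking key used
    here (first three characters of the lowercased name) is a fixed,
    hand-written rule for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE donors (id INTEGER PRIMARY KEY, name TEXT, block_key TEXT)"
    )
    conn.executemany(
        "INSERT INTO donors (id, name, block_key) VALUES (?, ?, ?)",
        [(row_id, name, name.strip().lower()[:3]) for row_id, name in rows],
    )
    # An index on the key keeps the self-join from scanning every pair.
    conn.execute("CREATE INDEX idx_block ON donors (block_key)")
    pairs = conn.execute(
        "SELECT a.id, b.id FROM donors a JOIN donors b "
        "ON a.block_key = b.block_key AND a.id < b.id "
        "ORDER BY a.id, b.id"
    ).fetchall()
    conn.close()
    return pairs
```

Only the pairs this returns would need to be scored, rather than every
possible pair in the table.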

One thing I would like to point out is that this script does not handle
the problem of canonicalization (yet). Manual cross-checking is a
possibility, but it would definitely run into some scaling problems.
Regardless, the dedupe library may still be of use to someone working with
a more manageable dataset.

More info:

   - GitHub repo <https://github.com/open-city/dedupe>
   - Dedupe presentation <http://pyvideo.org/video/973/big-data-de-duping>
   (slides <https://docs.google.com/presentation/d/1i2q8BP9OvH8G_okrJtdw9u8XVc7EAEkdk_nIRCLxxU8/edit#slide=id.g11c3aeda_0_0>)
   - Dedupe Google group <https://groups.google.com/forum/?fromgroups#!forum/open-source-deduplication>


Based on the above, do you think dedupe and Nomenklatura could be
integrated? And if not, would dedupe still be of use as a CKAN plugin?

Derek

On Fri, Oct 5, 2012 at 9:21 AM, Friedrich Lindenberg <
friedrich.lindenberg at okfn.org> wrote:

> Hey Derek,
>
> (re-sending with list)
>
> nice to meet you. Short intro: I'm working on OpenSpending.org and
> would be more than happy to support you with your efforts to use our
> platform for the (brilliant!) Cook project - either via Mail or in
> Skype/IRC.
>
> I'm also very keen to learn more about dedupe (and would love to find
> out how it can integrate with nomenklatura, which is just something we
> hacked up to get our ETL processes a bit smoother). Am I right in
> assuming that dedupe could be used as a library in nomenklatura to
> offer more advanced clustering of candidates? What exactly is required
> for training a dataset? I've looked at the sample but I'm not clear on
> how this works to compare several fields and how one would deal with
> the outputs (would you still require manual cross-checking?).
>
> Cheers,
>
>  - Friedrich
>
> On Thu, Oct 4, 2012 at 9:43 PM, Derek Eder <derek.eder at gmail.com> wrote:
> > Rufus,
> >
> > Good to meet you this week at the CfA Summit.
> >
> > I'll definitely look in to plugging LookAtCook.com in to the OpenSpending
> > back end. I'd like to keep the front end working as it is, as its pretty
> > intuitive and you can deep link to each view. The code is on github if
> > you're interested: https://github.com/open-city/look-at-cook
> >
> > I think I also became a CKAN convert yesterday. Going to look in to standing
> > up an instance here in Chicago. Also, here's the dedupe python library I
> > mentioned: https://github.com/open-city/dedupe. I think it would be pretty
> > powerful to plug in to CKAN and/or nomenklatura.
> >
> > Derek
> >
> >
> > On Wed, Oct 3, 2012 at 12:13 PM, Rufus Pollock <rufus.pollock at okfn.org>
> > wrote:
> >>
> >> Just ping re lookatcook.com and connecting re openspending
> >
> >
> >
> >
> > --
> > Derek Eder
> > @derek_eder
> > derekeder.com
> > derek.eder at gmail.com
> >
>
>
>
> --
> Friedrich Lindenberg <friedrich.lindenberg at okfn.org>
>
> OpenSpending - http://openspending.org
> OKFN Labs - http://okfnlabs.org
>
> Twitter: @pudo
> Web: http://pudo.org
>



-- 
Derek Eder
@derek_eder <https://twitter.com/#!/derek_eder>
derekeder.com
derek.eder at gmail.com