[openspending-dev] OpenSpending

Lucy Chambers lucy.chambers at okfn.org
Fri May 3 21:22:58 UTC 2013


Hi All,

Just a note that I'm sat here with Derek Eder, who's behind the great
http://lookatcook.com/ project. This thread started as a conversation about
possibly integrating some of the visualisations these guys use with the
OpenSpending API.

Just wanted to re-kindle the conversation!

Lucy

On Tuesday, 9 October 2012, Derek Eder wrote:

> Hi Friedrich,
>
> We've got a couple of really interesting threads here!
>
> *Look at Cook & OpenSpending*
> I still have some research to do as to the feasibility of swapping out
> Google Fusion Tables and replacing it with the OpenSpending API (by the
> way, I noticed a few dead links: http://openspending.org/data-loading.html
> and http://openspending.org/settings). In the meantime, there's plenty to
> discuss with dedupe and CKAN.
>
> *Dedupe & CKAN*
> Dedupe <https://github.com/open-city/dedupe> is a Python library created
> by Forest Gregg (he's lurking on this list I believe) and me that uses
> machine learning to create optimum blocking rules and weights for
> de-duplicating (aka entity resolution or record linkage) any given dataset.
>
> The library is built as an API, with one of the key pieces required being
> a user labeling function. This is a method for a user to flag a small
> number of record pairs chosen by dedupe as being duplicates or not. Once
> enough training data is entered, weights and blocking rules are applied to
> the rest of the dataset, and a collection of clustered records (a cluster
> being a group of records that dedupe believes refer to the same entity) is
> returned.
>
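A rough sketch of the labeling-then-clustering flow described above, for
anyone new to the thread. It does not use dedupe's actual API: label_pairs,
cluster, normalize and the sample records are all hypothetical names made up
for illustration, and the ask callback stands in for a person answering
'y' or 'n'. It also skips the learning step entirely and only groups the
pairs that were explicitly labeled as matches.

```python
from itertools import combinations


def label_pairs(pairs, ask):
    """Hypothetical labeling step: ask(a, b) returns True for duplicates."""
    labeled = {"match": [], "distinct": []}
    for a, b in pairs:
        labeled["match" if ask(a, b) else "distinct"].append((a, b))
    return labeled


def cluster(record_ids, matched_pairs):
    """Group record ids that are connected by matches (union-find)."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(groups.values())


def normalize(name):
    """Crude stand-in for human judgement: ignore case and full stops."""
    return name.lower().replace(".", "").strip()


# Made-up sample records, keyed by record id.
records = {1: "Acme Corp", 2: "ACME corp.", 3: "Bolt LLC", 4: "acme corp"}

labeled = label_pairs(
    combinations(sorted(records), 2),
    ask=lambda a, b: normalize(records[a]) == normalize(records[b]),
)
clusters = cluster(sorted(records), labeled["match"])
print(clusters)
```

In a real run the ask callback would prompt a person, and the labeled pairs
would serve as training data rather than being clustered directly.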
> We currently have an example script
> <https://github.com/open-city/dedupe/blob/master/examples/csv_example.py>
> that demonstrates all of this by providing a CSV file and a labeling
> function that uses the command line and 'y' or 'n' input for labeling. The
> final result is another CSV file containing the original data organized
> into clusters.
>
> Our next task is to create an example script that uses a SQLite database
> for dealing with a much larger dataset (campaign contributions in IL, ~1.7
> million records). The release of this script and the subsequent changes to
> the core library will mark an alpha release of dedupe.
>
> One thing I would like to point out is that this script does not handle
> the problem of canonicalization (yet). Manual cross-checking is a
> possibility, but would definitely run into some scaling problems.
> Regardless, the dedupe library may still be of use to someone working with
> a more manageable dataset.
>
> More info:
>
>    - Github repo <https://github.com/open-city/dedupe>
>    - Dedupe presentation <http://pyvideo.org/video/973/big-data-de-duping>
>    (slides <https://docs.google.com/presentation/d/1i2q8BP9OvH8G_okrJtdw9u8XVc7EAEkdk_nIRCLxxU8/edit#slide=id.g11c3aeda_0_0>)
>    - Dedupe Google group <https://groups.google.com/forum/?fromgroups#!forum/open-source-deduplication>
>
>
> Based on the above, do you think dedupe and Nomenklatura could be
> integrated? And if not, would dedupe still be of use as a CKAN plugin?
>
> Derek
>
> On Fri, Oct 5, 2012 at 9:21 AM, Friedrich Lindenberg
> <friedrich.lindenberg at okfn.org> wrote:
>
>> Hey Derek,
>>
>> (re-sending with list)
>>
>> nice to meet you. Short intro: I'm working on OpenSpending.org and
>> would be more than happy to support you with your efforts to use our
>> platform for the (brilliant!) Cook project - either via Mail or in
>> Skype/IRC.
>>
>> I'm also very keen to learn more about dedupe (and would love to find
>> out how it can integrate with nomenklatura, which is just something we
>> hacked up to get our ETL processes a bit smoother). Am I right in
>> assuming that dedupe could be used as a library in nomenklatura to
>> offer more advanced clustering of candidates? What exactly is required
>> for training a dataset? I've looked at the sample but I'm not clear on
>> how this works to compare several fields and how one would deal with
>> the outputs (would you still require manual cross-checking?).
>>
>> Cheers,
>>
>>  - Friedrich
>>
>> On Thu, Oct 4, 2012 at 9:43 PM, Derek Eder <derek.eder at gmail.com>
>> wrote:
>> > Rufus,
>> >
>> > Good to meet you this week at the CfA Summit.
>> >
>> > I'll definitely look into plugging LookAtCook.com into the OpenSpending
>> > back end. I'd like to keep the front end working as it is, as it's pretty
>> > intuitive and you can deep link to each view. The code is on GitHub if
>> > you're interested: https://github.com/open-city/look-at-cook
>> >
>> > I think I also became a CKAN convert yesterday. Going to look into
>> > standing up an instance here in Chicago. Also, here's the dedupe Python
>> > library I mentioned: https://github.com/open-city/dedupe. I think it would
>> > be pretty powerful to plug into CKAN and/or nomenklatura.
>> >
>> > Derek
>> >
>> >
>> > On Wed, Oct 3, 2012 at 12:13 PM, Rufus Pollock <rufus.pollock at okfn.org>
>> > wrote:
>> >>
>> >> Just ping re lookatcook.com and connecting re openspending
>> >
>> >
>> >
>> >
>> > --
>> > Derek Eder
>> > @derek_eder
>> > derekeder.com
>> > derek.eder at gmail.com
>> >
>>
>>
>>
>> --
>> Friedrich Lindenberg <friedrich.lindenberg at okfn.org>
>>
>> OpenSpending - http://openspending.org
>> OKFN Labs - http://okfnlabs.org
>>
>> Twitter: @pudo
>> Web: http://pudo.org
>>
>
>
>
> --
> Derek Eder
> @derek_eder <https://twitter.com/#!/derek_eder>
> derekeder.com
> derek.eder at gmail.com
>
>

-- 
*Project Coordinator*
School of Data <http://schoolofdata.org/> and
OpenSpending <http://openspending.org/>
Projects of the Open Knowledge Foundation <http://okfn.org/>
Support our work <http://okfn.org/support/>.