[okfn-labs] String Clustering Fun

Laura James laura.james at okfn.org
Sat May 11 17:06:13 UTC 2013


Nice, David, thanks for sharing!

I'd be interested in the research you've drawn upon for the clustering...

Cheers,

Laura


On 11 May 2013 17:08, David Raznick <david.raznick at okfn.org> wrote:

> Hello
>
> When doing any analysis of messy input data there is normally a stumbling
> block where it is obvious that there are many duplicates in certain
> categories, but I never have a decent way to cluster them together.
> Google Refine has some tools like this, but they are slow and do not really
> fit with my workflow.
> Friedrich's http://nomenklatura.pudo.org/ proposes a solution to this where
> it gives suggestions for a crowd-sourced effort to match.  I am too
> impatient for that...
>
> So my spare time has been dedicated to finding the quickest algorithms for
> string similarity clustering that bring back mostly "useful results".  If
> anyone is interested in the latest academic research on how to do this fast
> then I am happy to bore you with it.
>
> In the spirit of datapipes http://datapipes.okfnlabs.org/ , I have set up
> a very experimental endpoint with the fastest method researched.
>
> Example:
>
> This gist holds the unique values of the UK 25k spend canonical suppliers
> field as a list. There are ~100k entries in the list.
>
>
> https://gist.github.com/kindly/5560247/raw/b74f2f240d6f870e9701eb4ec8ee04086c54ed1a/suppliers25k
>
> The service has two result formats.  "match", which is more human
> readable:
>
>
> http://desolate-thicket-9971.herokuapp.com/cluster?similarity=0.9&format=match&url=https://gist.github.com/kindly/5560247/raw/b74f2f240d6f870e9701eb4ec8ee04086c54ed1a/suppliers25k
>
> "table", a CSV file that adds ids to match back against (if you supply a
> CSV with more than one column, the first will be treated as an id and the
> second as the field to cluster against):
>
>
> http://desolate-thicket-9971.herokuapp.com/cluster?similarity=0.9&format=table&url=https://gist.github.com/kindly/5560247/raw/b74f2f240d6f870e9701eb4ec8ee04086c54ed1a/suppliers25k
>
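
A minimal Python sketch of building the two request URLs quoted above. The endpoint and the similarity, format and url parameters come from the message; the helper name build_cluster_url is only illustrative, and percent-encoding the url parameter is an assumption (the links above pass it unencoded).

```python
from urllib.parse import urlencode

# Endpoint and sample data URL as given in the message.
BASE = "http://desolate-thicket-9971.herokuapp.com/cluster"
GIST = ("https://gist.github.com/kindly/5560247/raw/"
        "b74f2f240d6f870e9701eb4ec8ee04086c54ed1a/suppliers25k")


def build_cluster_url(data_url, similarity=0.9, fmt="match"):
    """Build a GET URL for the clustering endpoint (illustrative helper)."""
    # Only the two formats named in the message are accepted here.
    if fmt not in ("match", "table"):
        raise ValueError("format must be 'match' or 'table'")
    # Assumption: the service accepts a percent-encoded url parameter.
    query = urlencode({"similarity": similarity,
                       "format": fmt,
                       "url": data_url})
    return BASE + "?" + query


if __name__ == "__main__":
    print(build_cluster_url(GIST))               # human-readable "match" output
    print(build_cluster_url(GIST, fmt="table"))  # CSV "table" output
```

The returned string can be opened in a browser or fetched with any HTTP client; nothing is sent over the network by the helper itself.
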
> You can also POST the data if you like and leave off the url param, and
> there is a similarity score to play around with too (set here to a 90%
> Dice threshold).
>
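
To illustrate what a Dice threshold means, here is a rough Python sketch. It assumes the Dice coefficient is taken over sets of lowercased character bigrams; how the service actually tokenizes strings is not stated in the message. The grouping function is a deliberately naive O(n^2) pass for illustration only, not the fast method the message refers to.

```python
def bigrams(text):
    """Return the set of character bigrams of a lowercased string."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient 2*|A & B| / (|A| + |B|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings shorter than two characters have no bigrams;
        # fall back to a case-insensitive equality check.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def naive_clusters(names, threshold=0.9):
    """Greedy grouping: each name joins the first cluster whose first
    member scores at least `threshold`, otherwise it starts a new one."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if dice(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    print(dice("night", "nacht"))  # one shared bigram out of 4 + 4 -> 0.25
    print(naive_clusters(["Acme Ltd", "ACME LTD", "Other Co"]))
```

Lowering the threshold merges more aggressively; at 0.9 only near-identical strings end up together.
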
> Hope people find this interesting.  My very messy Go code for the service
> is at
> https://github.com/kindly/datafunc/blob/master/cluster_web.go
>
> Thanks
>
> David
>
>


-- 

Dr Laura James

Co-Director  | skype: laura.james  |  @LaurieJ <https://twitter.com/LaurieJ>

The Open Knowledge Foundation <http://okfn.org/>

Empowering through Open Knowledge
http://okfn.org/  |  @okfn <http://twitter.com/OKFN>  |  OKF on
Facebook<https://www.facebook.com/OKFNetwork> |
Blog <http://blog.okfn.org/>  |  Newsletter<http://okfn.org/about/newsletter>
