[okfn-labs] String Clustering Fun

David Raznick david.raznick at okfn.org
Sat May 11 16:08:04 UTC 2013


Hello

When doing any analysis of messy input data there is normally a stumbling
block where it is obvious that there are many duplicates in certain
categories but I never have a decent way to cluster them together.
Google Refine has some tools like this but they are slow and do not really
fit with my workflow.
Friedrich's nomenklatura (http://nomenklatura.pudo.org/) proposes a solution
to this where it gives suggestions for a crowd-sourced matching effort.  I am
too impatient for that...

So my spare time has been dedicated to finding the quickest algorithms for
string similarity clustering which bring back mostly "useful results".  If
anyone is interested in the latest academic research on how to do this fast
then I am happy to bore you with it.

In the spirit of datapipes (http://datapipes.okfnlabs.org/), I have set up
a very experimental endpoint with the fastest method I have researched.

Example:

This gist holds the unique values of the canonical suppliers field from the
UK 25k spend data as a list.  There are ~100k entries in the list.

https://gist.github.com/kindly/5560247/raw/b74f2f240d6f870e9701eb4ec8ee04086c54ed1a/suppliers25k

There are two result formats in the service.  The first is "match", which is
more human readable:

http://desolate-thicket-9971.herokuapp.com/cluster?similarity=0.9&format=match&url=https://gist.github.com/kindly/5560247/raw/b74f2f240d6f870e9701eb4ec8ee04086c54ed1a/suppliers25k

The second is "table", a CSV file that adds ids to match back against (if you
supply a CSV with more than one column, the first will be treated as an id
and the second as the field to cluster against):

http://desolate-thicket-9971.herokuapp.com/cluster?similarity=0.9&format=table&url=https://gist.github.com/kindly/5560247/raw/b74f2f240d6f870e9701eb4ec8ee04086c54ed1a/suppliers25k
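For anyone who wants to script this, the request URLs above can be put together roughly like so. This is only a sketch: the helper name clusterURL is made up for illustration, and the only things taken from the service are the /cluster path and the similarity, format and url query parameters seen in the links above. It percent-encodes the url value, which the hand-written links do not, and it does not send the request.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// clusterURL is a hypothetical helper that builds a request URL for the
// experimental /cluster endpoint from the three query parameters used in
// the example links: similarity, format and url.
func clusterURL(base, dataURL, format string, similarity float64) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	q := url.Values{}
	q.Set("similarity", strconv.FormatFloat(similarity, 'f', -1, 64))
	q.Set("format", format) // "match" or "table"
	q.Set("url", dataURL)   // location of the list to cluster
	u.RawQuery = q.Encode() // percent-encodes values, sorts keys
	return u.String(), nil
}

func main() {
	const base = "http://desolate-thicket-9971.herokuapp.com/cluster"
	const data = "https://gist.github.com/kindly/5560247/raw/b74f2f240d6f870e9701eb4ec8ee04086c54ed1a/suppliers25k"

	for _, format := range []string{"match", "table"} {
		u, err := clusterURL(base, data, format, 0.9)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(u)
	}
}
```

The resulting string can be handed to http.Get or curl as usual.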

You can also POST the data if you like and leave off the url param, and
there is a similarity score to play around with too.  (It is set to a 90%
Dice threshold here.)
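To give a feel for what a Dice threshold means, here is a minimal sketch of the Dice coefficient, 2*|A∩B| / (|A|+|B|), computed over character bigrams. The tokenisation (lower-cased character pairs, counted as a multiset) is an assumption for illustration and not necessarily what the service does, and this pairwise version says nothing about the fast clustering method itself. The supplier names in main are invented.

```go
package main

import (
	"fmt"
	"strings"
)

// bigrams returns the multiset of adjacent character pairs in s after
// lower-casing. This tokenisation is an assumption for illustration.
func bigrams(s string) map[string]int {
	r := []rune(strings.ToLower(s))
	out := make(map[string]int)
	for i := 0; i+1 < len(r); i++ {
		out[string(r[i:i+2])]++
	}
	return out
}

// dice returns 2*|A∩B| / (|A|+|B|) over the bigram multisets of a and b.
// The result lies in [0, 1]; 1 means the bigram multisets are identical.
func dice(a, b string) float64 {
	ba, bb := bigrams(a), bigrams(b)
	total, shared := 0, 0
	for g, n := range ba {
		total += n
		// Shared count for a bigram is the smaller of the two counts.
		if m := bb[g]; m < n {
			shared += m
		} else {
			shared += n
		}
	}
	for _, n := range bb {
		total += n
	}
	if total == 0 {
		// Neither string has a bigram (fewer than two characters each):
		// fall back to a case-insensitive equality check.
		if strings.EqualFold(a, b) {
			return 1
		}
		return 0
	}
	return 2 * float64(shared) / float64(total)
}

func main() {
	const threshold = 0.9 // same value as similarity=0.9 in the links above

	// Made-up supplier names, not taken from the gist.
	a, b := "Acme Office Supplies Ltd", "Acme Office Supplies Ltd."
	score := dice(a, b)
	fmt.Printf("%q vs %q: %.3f, same cluster: %v\n", a, b, score, score >= threshold)
}
```

Lowering the threshold pulls looser spellings into the same cluster at the cost of more false matches.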

Hope people find this interesting.  My very messy Go code for the service
is at
https://github.com/kindly/datafunc/blob/master/cluster_web.go

Thanks

David