[okfn-labs] String Clustering Fun

David Raznick david.raznick at okfn.org
Sat May 11 17:59:13 UTC 2013


On Sat, May 11, 2013 at 6:02 PM, Tom Morris <tfmorris at gmail.com> wrote:

> On Sat, May 11, 2013 at 12:08 PM, David Raznick <david.raznick at okfn.org>wrote:
>
>>
>> Google refine has some tools like this but they slow and does not really
>> fit with my workflow.
>>
>
> Do you have performance figures on how your algorithm compares to the
> various Refine algorithms?  We're always looking to improve.
>

Sadly I have not done any benchmarks, especially as my results are not
directly comparable to the ones in Refine.

They are based on some simple normalization, splitting the strings into
trigrams, and using the Dice similarity measure with a 90% threshold (this
is selectable in the webservice).  I have made it a greedy algorithm, so
the shortest string will be the first to match anything above that
threshold.  This is also why I chose Dice over Jaccard similarity, as it
is more symmetrical, even though it is slower.  A greedy algorithm such as
this is not perfect, as it is not perfectly associative, but qualitatively
the results seem good.
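For the curious, the scheme above can be sketched in a few lines of Go.
This is only a minimal illustration of trigrams + Dice + greedy
shortest-first matching as described, not the endpoint's actual code; the
normalization here (lowercasing, whitespace collapsing) and the exact
greedy ordering are my reading of it:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigrams returns the set of character trigrams of a normalized string.
func trigrams(s string) map[string]bool {
	s = strings.ToLower(strings.Join(strings.Fields(s), " "))
	grams := map[string]bool{}
	r := []rune(s)
	for i := 0; i+3 <= len(r); i++ {
		grams[string(r[i:i+3])] = true
	}
	return grams
}

// dice computes the Dice coefficient 2*|A∩B| / (|A|+|B|).
func dice(a, b map[string]bool) float64 {
	if len(a)+len(b) == 0 {
		return 0
	}
	common := 0
	for g := range a {
		if b[g] {
			common++
		}
	}
	return 2 * float64(common) / float64(len(a)+len(b))
}

// cluster greedily groups strings: shortest first, each string joins the
// first existing cluster whose seed is at least threshold similar,
// otherwise it seeds a new cluster.
func cluster(strs []string, threshold float64) [][]string {
	sort.Slice(strs, func(i, j int) bool { return len(strs[i]) < len(strs[j]) })
	var seeds []map[string]bool
	var clusters [][]string
	for _, s := range strs {
		g := trigrams(s)
		placed := false
		for i, seed := range seeds {
			if dice(seed, g) >= threshold {
				clusters[i] = append(clusters[i], s)
				placed = true
				break
			}
		}
		if !placed {
			seeds = append(seeds, g)
			clusters = append(clusters, []string{s})
		}
	}
	return clusters
}

func main() {
	names := []string{"ACME Ltd", "acme ltd.", "Acme Limited", "Widgets Inc"}
	for _, c := range cluster(names, 0.6) {
		fmt.Println(c)
	}
}
```

Note this naive version compares each string against every cluster seed,
which is quadratic; the papers below are about cutting that down.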

It is mostly based on these two papers, though I simplified the parts that
did not make a big enough performance difference.

[1] http://www.cse.unsw.edu.au/~weiw/files/TODS-PPJoin-Final.pdf
[2] http://dbgroup.cs.tsinghua.edu.cn/ligl/papers/sigmod2012-adapt.pdf
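The core trick shared by both papers is prefix filtering: put every
record's tokens into one global rarest-first order, and index only a short
prefix of each record, sized so that any pair that could still reach the
threshold must collide on at least one indexed token.  A rough Go sketch
of that idea for a Dice threshold; the bound ceil(t*n/(2-t)) is my own
derivation (from Dice >= t plus |x∩y| <= |y|), token lists are assumed
pre-sorted, and this is an illustration rather than the webservice code:

```go
package main

import (
	"fmt"
	"math"
)

// prefixLen: with Dice threshold t, any partner of an n-token record must
// share at least ceil(t*n/(2-t)) tokens with it.  So indexing only the
// first n - ceil(t*n/(2-t)) + 1 tokens (in the fixed rarest-first order)
// still guarantees every qualifying pair collides on an indexed token.
func prefixLen(n int, t float64) int {
	minOverlap := int(math.Ceil(t * float64(n) / (2 - t)))
	p := n - minOverlap + 1
	if p < 0 {
		p = 0
	}
	if p > n {
		p = n
	}
	return p
}

// candidates streams records through an inverted index built over prefix
// tokens only, returning for each record id the earlier ids it still
// needs to be verified against with the full similarity measure.
func candidates(records [][]string, t float64) map[int]map[int]bool {
	index := map[string][]int{} // prefix token -> record ids seen so far
	cands := map[int]map[int]bool{}
	for id, toks := range records {
		p := prefixLen(len(toks), t)
		for _, tok := range toks[:p] {
			for _, other := range index[tok] {
				if cands[id] == nil {
					cands[id] = map[int]bool{}
				}
				cands[id][other] = true
			}
			index[tok] = append(index[tok], id)
		}
	}
	return cands
}

func main() {
	// Token lists already sorted rarest-first globally (assumed).
	recs := [][]string{
		{"zq", "ab", "cd", "ef"},
		{"zq", "gh", "ij", "kl"},
		{"mn", "op", "qr", "st"},
	}
	fmt.Println(candidates(recs, 0.5)) // only records 0 and 1 collide
}
```

The candidate set is a superset of the true matches, so each surviving
pair still gets the full Dice check; the win is that most pairs never
collide in the index at all.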

They were adjusted based on certain performance characteristics of Go,
basically by tweaking while looking at profiles.  I may try to comment the
code to explain the compromises.  It clusters the 100,000 supplier names
in around 10s on my laptop, which is good enough for most uses.  I have
not made an adaptive algorithm based on input length yet, but that is an
option.


>
>
>> So my spare time has been dedicated to finding the quickest algorithms
>> for string similarity clustering which bring back mostly "useful results".
>> If anyone is interested in the latest acedemic reaserch to how to do this
>> fast then I am happy to bore you with it.
>>
>
> Rather than boring us, perhaps you could just provide a link to the
> research bibliography along with your definition of "quickest" and "mostly
> 'useful results'".  That would help folks determine whether their metrics
> are aligned with those that you used in the evaluation.
>
>
>>  I have set up an very experimental endpoint with the fastest method
>> researched.
>>
>
> That method being ... ?
>
You can look at the code, though it is a bit messy for now.

>
> Tom
>

