[open-bibliography] Deduplication

Sun Jun 20 16:37:58 UTC 2010

On 10-06-20 15:24, Karen Coyle wrote:
> William, you might want to look at the algorithm that I worked on at
> University of California, and is now being used (undoubtedly in modified
> form) at the Open Library.
>   http://kcoyle.net/merge.html

Hi Karen, this is helpful. It's very similar to
the Aleph/SUNCAT algorithm -- in fact it even looks
like the weights are chosen the same way (their
level 1 threshold is something like 800, yours is
875).
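
As a rough illustration of what a weighted match with a threshold
looks like, here is a minimal sketch. The field names and weights
are invented for this example and are not taken from either
algorithm; only the 875 figure is the one quoted above.

```python
# Hypothetical field weights, made up purely for illustration.
WEIGHTS = {"isbn": 500, "title": 300, "author": 150, "year": 50}

# The level 1 threshold figure mentioned above.
THRESHOLD = 875


def match_score(a, b, weights=WEIGHTS):
    """Sum the weights of the fields that are present and equal in both records."""
    score = 0
    for field, weight in weights.items():
        va, vb = a.get(field), b.get(field)
        if va is not None and va == vb:
            score += weight
    return score


def is_duplicate(a, b, threshold=THRESHOLD):
    """Treat two records as duplicates when their score reaches the threshold."""
    return match_score(a, b) >= threshold
```

The real algorithms also award partial credit and penalties per
field; this sketch only shows the sum-and-compare shape.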

> For efficiency, there is a search step that retrieves possible matches
> on various identifiers (e.g. whatever you've got), and a portion of the
> title. The remainder of matching is done against that "pool" rather than
> the entire database.

It's the "portion of normalised title" that I'm
trying to improve upon. Taking the first 25
characters seems a bit arbitrary. Whatever replaces
it, the result should be the same -- a small number
of records to look at in more detail.
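
For reference, the search step described above can be sketched
roughly as follows. This is a minimal sketch, not the UC or Open
Library code: the record layout ("title", "identifiers"), the
normalisation rules and all function names are assumptions made
for the example. Only the 25 character prefix comes from the
discussion.

```python
import re
import unicodedata
from collections import defaultdict

PREFIX_LEN = 25  # the title prefix length discussed above


def normalise_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace (assumed rules)."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def blocking_keys(record):
    """Keys a record is filed under: each identifier, plus the title prefix."""
    keys = set()
    for ident in record.get("identifiers", []):
        keys.add(("id", ident))
    title = record.get("title")
    if title:
        keys.add(("title", normalise_title(title)[:PREFIX_LEN]))
    return keys


def build_index(records):
    """Map each blocking key to the positions of the records that carry it."""
    index = defaultdict(set)
    for pos, record in enumerate(records):
        for key in blocking_keys(record):
            index[key].add(pos)
    return index


def candidate_pool(record, index):
    """Positions of records sharing at least one blocking key with `record`."""
    pool = set()
    for key in blocking_keys(record):
        pool |= index.get(key, set())
    return pool
```

Detailed matching then runs only against the pool. Swapping the
prefix for some other key derived from the title means changing
blocking_keys() and nothing else.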

Cheers,
-w

-- 
William Waites           <william.waites at okfn.org>
Mob: +44 789 798 9965    Open Knowledge Foundation
Fax: +44 131 464 4948                Edinburgh, UK

RDF Indexing, Clustering and Inferencing in Python
		http://ordf.org/