[open-bibliography] Deduplication
Karen Coyle
kcoyle at kcoyle.net
Sun Jun 20 19:18:01 UTC 2010
Quoting William Waites <william.waites at okfn.org>:
>
> Hi Karen, this is helpful. It's very similar to
> the Aleph/SUNCAT algorithm -- in fact it even looks
> like the weights are chosen the same way (their
> level 1 threshold is something like 800, yours is
> 875).
Ex Libris used the U California algorithm. I was working at U Cal at
that time, and we developed the Aleph merging algorithm together with
Ex Libris in support of the MELVYL database. The idea was to transfer
the merging that had been done in the home-grown MELVYL system to the
Aleph one.
>
> It's the "portion of normalised title" that I'm
> trying to improve upon. Taking the first 25
> characters seems a bit arbitrary. The result should
> in any case be the same -- a small number of records
> to look at in more detail.
It wasn't arbitrary, although YMMV -- when we developed this algorithm
at UCal we had a test database of about 4,000 items (carefully
selected) and we modified weights, string lengths, etc., until we got
as close as we could to the same decisions made by humans. That WAS,
however, in 1982, and some things will undoubtedly have changed. A
couple of things about titles:
- we ran into situations where some records had title + subtitle and
some just had title. We compared the title portion (w/o subtitle) in a
left-anchored match. For retrieving items into the pool, however, we
actually created a short title key that we could query against. But
*efficiency* was more constrained in those days.
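The short-title key and left-anchored comparison described above might
look something like this in Python. This is only a sketch of the idea:
the normalization rules and the use of ":" to split off a subtitle are
my assumptions, not the original MELVYL code.

```python
import re

def normalize_title(title):
    """Lowercase, strip punctuation, collapse whitespace."""
    title = title.lower()
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    return re.sub(r"\s+", " ", title).strip()

def short_title_key(title, length=25):
    """Short key used to retrieve candidate records into the pool;
    25 characters matches the figure discussed in the thread."""
    return normalize_title(title)[:length]

def titles_match(a, b):
    """Left-anchored match on the title proper, with the subtitle
    dropped. Splitting on ':' is an assumption about how subtitles
    are recorded."""
    a_main = normalize_title(a.split(":")[0])
    b_main = normalize_title(b.split(":")[0])
    shorter, longer = sorted((a_main, b_main), key=len)
    return longer.startswith(shorter)
```

This handles the title-plus-subtitle versus title-only case: the record
with only "Moby Dick" still matches one with "Moby Dick: or, The Whale".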
- we ended up creating a small set (20-30, as I recall) of "exception"
titles -- titles that were very long, very regular, but with one word
different that wasn't a typo. These tended to be government documents:
"Report of the commission on the development of natural resources,
subcommittee report from the state of ... Alabama/Alaska/etc." It was
very hard to avoid mis-merging these works -- they'd have the same
date, same publisher, same number of pages, and no identifier.
(Government documents and legal materials are very difficult to match,
in general.)
- we also ended up creating a list of "titles too short" - Poems,
Works, etc. These got a lesser weight so we didn't end up merging Ezra
Pound with e e cummings.
- Because of the way that cataloging handles publisher names, those
are very difficult to match, therefore they were given a low value in
the comparison.
In other words, no matter what algorithm you create, you are going to
find things that can't be correctly identified as "same" or "not same"
using the algorithm. You will have to decide whether you wish to err
on the side of over-merging or under-merging. We did the latter
because our application would have masked the identity of items that
had merged incorrectly.
kc
>
> Cheers,
> -w
>
> --
> William Waites <william.waites at okfn.org>
> Mob: +44 789 798 9965 Open Knowledge Foundation
> Fax: +44 131 464 4948 Edinburgh, UK
>
> RDF Indexing, Clustering and Inferencing in Python
> http://ordf.org/
>
--
Karen Coyle
kcoyle at kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet