[open-bibliography] Deduplication

Mon Jun 21 08:41:33 UTC 2010

To all,

Just to confirm that the algorithm used in Aleph for SUNCAT is 
essentially the one developed for the University of California.

Fred

Karen Coyle wrote:
> Quoting William Waites <william.waites at okfn.org>:
>
>>
>> Hi Karen, this is helpful. It's very similar to
>> the Aleph/SUNCAT algorithm -- in fact it even looks
>> like the weights are chosen the same way (their
>> level 1 threshold is something like 800, yours is
>> 875).
>
> Ex Libris used the U California algorithm. I was working at U Cal at 
> that time, and we developed the Aleph merging together with Ex Libris 
> in support of the MELVYL database. The idea was to transfer the 
> merging that had been done in the home-grown MELVYL system to the 
> Aleph one.
>
>
>>
>> It's the "portion of normalised title" that I'm
>> trying to improve upon. Taking the first 25
>> characters seems a bit arbitrary. The result should
>> in any case be the same -- a small number of records
>> to look at in more detail.
>
> It wasn't arbitrary, although YMMV -- when we developed this algorithm 
> at UCal we had a test database of about 4,000 items (carefully 
> selected) and we modified weights, string lengths, etc., until we got 
> as close as we could to the same decisions made by humans. That WAS, 
> however, in 1982, and some things will undoubtedly have changed. A 
> couple of things about titles:
>
> - we ran into situations where some records had title + subtitle and 
> some just had title. We compared the title portion (w/o subtitle) in a 
> left-anchored match. For retrieving items into the pool, however, we 
> actually created a short title key that we could query against. But 
> *efficiency* was more constrained in those days.
>
> - we ended up creating a small set (20-30, as I recall) of "exception" 
> titles -- titles that were very long, very regular, but with one word 
> different that wasn't a typo. These tended to be government documents:
> "Report of the commission on the development of natural resources, 
> subcommittee report from the state of ... Alabama/Alaska/etc." It was 
> very hard to avoid mis-merging these works -- they'd have the same 
> date, same publisher, same number of pages, and no identifier. 
> (Government documents and legal materials are very difficult to match, 
> in general.)
>
> - we also ended up creating a list of "titles too short" - Poems, 
> Works, etc. These got a lesser weight so we didn't end up merging Ezra 
> Pound with e e cummings.
>
> - Because of the way that cataloging handles publisher names, those 
> are very difficult to match, so they were given a low value in 
> the comparison.
>
> In other words, no matter what algorithm you create, you are going to 
> find things that can't be correctly identified as "same" or "not same" 
> using the algorithm. You will have to decide whether you wish to err 
> on the side of over-merging or under-merging. We did the latter 
> because our application would have masked the identity of items that 
> had merged incorrectly.
>
> kc
>
>>
>> Cheers,
>> -w
>>
>> -- 
>> William Waites           <william.waites at okfn.org>
>> Mob: +44 789 798 9965    Open Knowledge Foundation
>> Fax: +44 131 464 4948                Edinburgh, UK
>>
>> RDF Indexing, Clustering and Inferencing in Python
>>         http://ordf.org/
>>
>
>
>
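For anyone following the thread, here is a rough sketch of the kind of weighted comparison Karen describes. It is not the actual UC or Aleph code. The 25-character title key and the 875 threshold come from the messages above; the individual weights, the field names, the ":" subtitle separator, the prefix-based reading of "left-anchored match" and the two-entry short-title list are all assumptions made up for illustration.

```python
import re
import unicodedata

TITLE_KEY_LENGTH = 25    # "first 25 characters", as mentioned in the thread
LEVEL1_THRESHOLD = 875   # level 1 threshold quoted in the thread

# Hypothetical weights: NOT the real UC/Aleph values, chosen only so the
# example behaves the way the thread describes.
WEIGHTS = {"title": 600, "short_title": 200, "date": 200,
           "pages": 100, "publisher": 50}

# Hypothetical "titles too short" list (the thread names Poems and Works).
SHORT_TITLES = {"poems", "works"}


def normalise(text):
    """Lower-case, strip diacritics and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def title_proper(title):
    """Title without subtitle, assuming ':' separates the two."""
    return title.split(":", 1)[0]


def title_key(title):
    """Short key used to pull candidate records into the pool."""
    return normalise(title_proper(title))[:TITLE_KEY_LENGTH]


def titles_match(a, b):
    """Left-anchored match, read here as: the shorter title is a prefix."""
    a, b = normalise(title_proper(a)), normalise(title_proper(b))
    shorter, longer = sorted((a, b), key=len)
    return bool(shorter) and longer.startswith(shorter)


def score(a, b):
    """Sum the weights of the fields on which two records agree."""
    total = 0
    if titles_match(a["title"], b["title"]):
        names = {normalise(title_proper(r["title"])) for r in (a, b)}
        is_short = bool(names & SHORT_TITLES)
        total += WEIGHTS["short_title"] if is_short else WEIGHTS["title"]
    for field in ("date", "pages"):
        if a.get(field) and a.get(field) == b.get(field):
            total += WEIGHTS[field]
    pub_a = normalise(a.get("publisher", ""))
    if pub_a and pub_a == normalise(b.get("publisher", "")):
        total += WEIGHTS["publisher"]   # deliberately low weight
    return total


def same_work(a, b):
    return score(a, b) >= LEVEL1_THRESHOLD
```

With these made-up weights, two records titled "Poems" with the same date and page count stay below the threshold, while "Moby Dick : or, The whale" and "Moby Dick" with the same date and pages merge. The state-by-state government reports Karen mentions would all share the same 25-character key, so they land in the same candidate pool and need the exception list or a closer look.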

-- 
Fred Guy
SUNCAT Project Manager
EDINA
Causewayside House
158-162 Causewayside
Edinburgh EH9 1PR
Scotland, UK
Tel: +44 (0) 131 651 3875
Fax: +44 (0)131 650 3308
Email: f.guy at ed.ac.uk
http://edina.ac.uk
