[open-bibliography] Deduplication
Fred Guy
f.guy at ed.ac.uk
Mon Jun 21 08:41:33 UTC 2010
To all,
Just to confirm that the algorithm used in Aleph for SUNCAT is
essentially the one developed for the University of California.
Fred
Karen Coyle wrote:
> Quoting William Waites <william.waites at okfn.org>:
>
>>
>> Hi Karen, this is helpful. It's very similar to
>> the Aleph/SUNCAT algorithm -- in fact it even looks
>> like the weights are chosen the same way (their
>> level 1 threshold is something like 800, yours is
>> 875).
>
> Ex Libris used the U California algorithm. I was working at U Cal at
> that time, and we developed the Aleph merging together with Ex Libris
> in support of the MELVYL database. The idea was to transfer the
> merging that had been done in the home-grown MELVYL system to the
> Aleph one.
>
>
>>
>> It's the "portion of normalised title" that I'm
>> trying to improve upon. Taking the first 25
>> characters seems a bit arbitrary. The result should
>> in any case be the same -- a small number of records
>> to look at in more detail.
>
> It wasn't arbitrary, although YMMV -- when we developed this algorithm
> at UCal we had a test database of about 4,000 items (carefully
> selected) and we modified weights, string lengths, etc., until we got
> as close as we could to the same decisions made by humans. That WAS,
> however, in 1982, and some things will undoubtedly have changed. A
> couple of things about titles:
>
> - we ran into situations where some records had title + subtitle and
> some just had title. We compared the title portion (w/o subtitle) in a
> left-anchored match. For retrieving items into the pool, however, we
> actually created a short title key that we could query against. But
> *efficiency* was more constrained in those days.
>
> - we ended up creating a small set (20-30, as I recall) of "exception"
> titles -- titles that were very long, very regular, but with one word
> different that wasn't a typo. These tended to be government documents:
> "Report of the commission on the development of natural resources,
> subcommittee report from the state of ... Alabama/Alaska/etc." It was
> very hard to avoid mis-merging these works -- they'd have the same
> date, same publisher, same number of pages, and no identifier.
> (Government documents and legal materials are very difficult to match,
> in general.)
>
> - we also ended up creating a list of "titles too short" - Poems,
> Works, etc. These got a lesser weight so we didn't end up merging Ezra
> Pound with e e cummings.
>
> - Because of the way that cataloging handles publisher names, those
> are very difficult to match, so they were given a low weight in
> the comparison.
>
> In other words, no matter what algorithm you create, you are going to
> find things that can't be correctly identified as "same" or "not same"
> using the algorithm. You will have to decide whether you wish to err
> on the side of over-merging or under-merging. We did the latter
> because our application would have masked the identity of items that
> had merged incorrectly.
>
> kc
>
>>
>> Cheers,
>> -w
>>
>> --
>> William Waites <william.waites at okfn.org>
>> Mob: +44 789 798 9965 Open Knowledge Foundation
>> Fax: +44 131 464 4948 Edinburgh, UK
>>
>> RDF Indexing, Clustering and Inferencing in Python
>> http://ordf.org/
>>
>
>
>
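For readers following the thread, here is a minimal sketch of the kind of
weighted comparison described in the quoted message: a left-anchored match
on the title portion without subtitle, a reduced weight for "titles too
short", a low weight for publisher, and a threshold chosen to favour
under-merging. This is not the UC or Aleph code. The weights, the threshold
of 800, the splitting of subtitles on ":", the field names and all function
names are assumptions made up for illustration; only the 25-character
title portion and the general scheme come from the messages above.

```python
import re

# Illustrative values only; the real UC/Aleph weights are not given in the thread.
WEIGHTS = {"title": 600, "date": 150, "pages": 100, "publisher": 50}
SHORT_TITLE_WEIGHT = 200            # lesser weight for "titles too short"
MATCH_THRESHOLD = 800               # high threshold: prefer under-merging
TITLES_TOO_SHORT = {"poems", "works"}
TITLE_PREFIX_LEN = 25               # portion of normalised title compared


def normalise(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def title_proper(title):
    """Title portion without subtitle (assumes ':' separates the two)."""
    return normalise(title.split(":", 1)[0])


def title_key(title):
    """Short title key, usable for pulling candidates into the pool."""
    return title_proper(title)[:TITLE_PREFIX_LEN]


def titles_match(a, b):
    """Left-anchored match: one title key must be a prefix of the other."""
    ka, kb = title_key(a), title_key(b)
    if not ka or not kb:
        return False
    return ka.startswith(kb) or kb.startswith(ka)


def score(rec_a, rec_b):
    """Sum the weights of the fields on which two records agree."""
    total = 0
    if titles_match(rec_a["title"], rec_b["title"]):
        short = (title_proper(rec_a["title"]) in TITLES_TOO_SHORT
                 or title_proper(rec_b["title"]) in TITLES_TOO_SHORT)
        total += SHORT_TITLE_WEIGHT if short else WEIGHTS["title"]
    for field in ("date", "pages", "publisher"):
        va, vb = rec_a.get(field), rec_b.get(field)
        if va and vb and normalise(str(va)) == normalise(str(vb)):
            total += WEIGHTS[field]
    return total


def is_duplicate(rec_a, rec_b):
    """True only when the score reaches the (assumed) threshold."""
    return score(rec_a, rec_b) >= MATCH_THRESHOLD
```

With these assumed numbers, two records titled "Poems" that share date,
pages and publisher stay below the threshold, while "Report on widgets"
and "Report on widgets: a study" with the same date and pages are merged
even if the publisher strings differ. The long, regular government-document
titles mentioned above would still be mis-merged by a 25-character
comparison; the sketch does not model the exception list used for those.
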
--
Fred Guy
SUNCAT Project Manager
EDINA
Causewayside House
158-162 Causewayside
Edinburgh EH9 1PR
Scotland, UK
Tel: +44 (0) 131 651 3875
Fax: +44 (0)131 650 3308
Email: f.guy at ed.ac.uk
http://edina.ac.uk
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.