[open-bibliography] Deduplication
Karen Coyle
kcoyle at kcoyle.net
Sun Jun 20 19:18:01 UTC 2010
Quoting William Waites <william.waites at okfn.org>:
>
> Hi Karen, this is helpful. It's very similar to
> the Aleph/SUNCAT algorithm -- in fact it even looks
> like the weights are chosen the same way (their
> level 1 threshold is something like 800, yours is
> 875).
Ex Libris used the U California algorithm. I was working at U Cal at
that time, and we developed the Aleph merging algorithm together with
Ex Libris in support of the MELVYL database. The idea was to transfer
the merging that had been done in the home-grown MELVYL system to the
Aleph one.
>
> It's the "portion of normalised title" that I'm
> trying to improve upon. Taking the first 25
> characters seems a bit arbitrary. The result should
> in any case be the same -- a small number of records
> to look at in more detail.
It wasn't arbitrary, although YMMV -- when we developed this algorithm
at UCal we had a test database of about 4,000 items (carefully
selected) and we modified weights, string lengths, etc., until we got
as close as we could to the same decisions made by humans. That WAS,
however, in 1982, and some things will undoubtedly have changed. A
couple of things about titles:
- we ran into situations where some records had title + subtitle and
some just had title. We compared the title portion (w/o subtitle) in a
left-anchored match. For retrieving items into the pool, however, we
actually created a short title key that we could query against. But
*efficiency* was more constrained in those days.
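The short-title key and left-anchored comparison described above might
look something like this in Python. This is only a sketch of the idea:
the normalization rules and the use of ":" to split off a subtitle are
my assumptions, not the original MELVYL code.

```python
import re

def normalize_title(title):
    """Lowercase, strip punctuation, collapse whitespace."""
    title = title.lower()
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    return re.sub(r"\s+", " ", title).strip()

def short_title_key(title, length=25):
    """Short key used to retrieve candidate records into the pool;
    25 characters matches the figure discussed in the thread."""
    return normalize_title(title)[:length]

def titles_match(a, b):
    """Left-anchored match on the title proper, with the subtitle
    dropped. Splitting on ':' is an assumption about how subtitles
    are recorded."""
    a_main = normalize_title(a.split(":")[0])
    b_main = normalize_title(b.split(":")[0])
    shorter, longer = sorted((a_main, b_main), key=len)
    return longer.startswith(shorter)
```

This handles the title-plus-subtitle versus title-only case: the record
with only "Moby Dick" still matches one with "Moby Dick: or, The Whale".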
- we ended up creating a small set (20-30, as I recall) of "exception"
titles -- titles that were very long, very regular, but with one word
different that wasn't a typo. These tended to be government documents:
"Report of the commission on the development of natural resources,
subcommittee report from the state of ... Alabama/Alaska/etc." It was
very hard to avoid mis-merging these works -- they'd have the same
date, same publisher, same number of pages, and no identifier.
(Government documents and legal materials are very difficult to match,
in general.)
- we also ended up creating a list of "titles too short" - Poems,
Works, etc. These got a lesser weight so we didn't end up merging Ezra
Pound with e e cummings.
- Because of the way that cataloging handles publisher names, those
are very difficult to match, therefore they were given a low value in
the comparison.
In other words, no matter what algorithm you create, you are going to
find things that can't be correctly identified as "same" or "not same"
using the algorithm. You will have to decide whether you wish to err
on the side of over-merging or under-merging. We did the latter
because our application would have masked the identity of items that
had merged incorrectly.
kc
>
> Cheers,
> -w
>
> --
> William Waites <william.waites at okfn.org>
> Mob: +44 789 798 9965 Open Knowledge Foundation
> Fax: +44 131 464 4948 Edinburgh, UK
>
> RDF Indexing, Clustering and Inferencing in Python
> http://ordf.org/
>
--
Karen Coyle
kcoyle at kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet