[open-bibliography] Deduplication
Karen Coyle
kcoyle at kcoyle.net
Sun Jun 20 14:24:26 UTC 2010
William, you might want to look at the algorithm that I worked on at
the University of California, which is now being used (undoubtedly in
modified form) at the Open Library.
http://kcoyle.net/merge.html
For efficiency, there is a search step that retrieves possible matches
on various identifiers (e.g. whatever you've got), and a portion of
the title. The remainder of matching is done against that "pool"
rather than the entire database.
Some things to keep in mind: identifiers can be typo'd or assigned to
the wrong item, so we added a bit of the title and a date to the
identifier match to guard against this.
Also, note that there are degrees of matching, like dates +- 2, that
help make up for common errors in the data.
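
To give a flavor of the pooling step, here is a rough Python sketch
(illustrative only, not the actual UC or Open Library code; the field
names and index structures are made up):

    def normalize(s):
        """Lowercase and strip punctuation so title fragments compare cleanly."""
        return " ".join("".join(c for c in s.lower()
                                if c.isalnum() or c.isspace()).split())

    def candidate_pool(record, isbn_index, title_index):
        """Retrieve possible matches on identifiers plus a portion of the
        title; the remainder of the matching runs against this pool only."""
        pool = set()
        for isbn in record.get("isbns", []):
            pool.update(isbn_index.get(isbn, ()))
        # a bit of the title guards against typo'd or misassigned identifiers
        pool.update(title_index.get(normalize(record["title"])[:25], ()))
        return pool

    def dates_match(a, b, tolerance=2):
        """One of the 'degrees of matching': publication dates within +- 2."""
        return (a is not None and b is not None
                and abs(int(a) - int(b)) <= tolerance)
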
And... keep in mind that this was used on library data only, so if you
have data from other sources, some things might need to change.
Somewhere in the OL open source you should be able to find the code
that is being used in that project.
kc
Quoting William Waites <william.waites at okfn.org>:
> Hello,
>
> Just wanted to give a quick update on this. Copying
> the bibliography list and a few possibly interested
> people as well. We are working on deduplicating some
> records of medical publications for Harvard Medical
> School.
>
> The task is, given several library catalogues, to
> identify duplicates or potential duplicates. This
> is similar to work done by EDINA for SUNCAT a few
> years ago, to some software called Aleph, and to others.
>
> Broadly the algorithm looks like this:
>
> 1. compare identifiers where the syntax is known
> and matches imply a high probability of
> identity, e.g. ISBN, ISSN
> 2. if probability of a match is over a certain
> threshold, stop.
> 3. compare text fields like title and author to
> determine whether a match is likely and assign a
> probability of match (sketched just below)
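>
> Roughly, in Python, the pipeline might look like this
> (purely illustrative; the weights, thresholds and field
> names are invented):
>
>     import difflib
>
>     def similarity(a, b):
>         """Placeholder string metric in [0, 1]; swap in Levenshtein
>         or Jaro-Winkler here."""
>         return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
>
>     def match_probability(a, b):
>         # 1. identifiers with known syntax: a shared ISBN or ISSN is
>         #    strong evidence of identity
>         if set(a.get("isbns", ())) & set(b.get("isbns", ())):
>             return 0.95   # 2. already past any sensible threshold: stop
>         if a.get("issn") and a.get("issn") == b.get("issn"):
>             return 0.95
>         # 3. otherwise fall back to text fields like title and author
>         return (0.7 * similarity(a["title"], b["title"])
>                 + 0.3 * similarity(a.get("author", ""), b.get("author", "")))
>
>     def is_duplicate(a, b, threshold=0.85):
>         return match_probability(a, b) >= threshold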
>
> An optimisation of step 3 involves removing some
> common words or stop-words from the text fields first,
> as suggested in the report for EDINA by Xinglong Wang.
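>
> For example (the stop-word list here is just a stand-in
> for whatever the report recommends):
>
>     STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "for"}
>
>     def strip_stop_words(title):
>         """Drop common/stop words before the expensive comparison."""
>         return " ".join(w for w in title.lower().split()
>                         if w not in STOP_WORDS)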
>
> The comparison in 3 is expensive. As implemented in
> Aleph and SUNCAT, you would take a text string
> and compare it to the corresponding field for each
> record in the database using a string similarity
> metric such as the Levenshtein or Jaro-Winkler
> distances. Not only is calculating the string
> similarity itself relatively expensive but the number
> of such operations grows with the size of the
> database.
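>
> In code terms, the naive version is a linear scan, one
> similarity() call (as sketched above) per record in the
> database -- illustrative only:
>
>     def most_similar(query_title, records):
>         """O(N) string-metric calls per lookup; this is the cost
>         we would like to avoid."""
>         return max(records,
>                    key=lambda r: similarity(query_title, r["title"]))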
>
> This has led to a very interesting side-track:
> investigating ways of making step 3 a cheaper operation.
>
> My current thinking is to try to find a way to map
> strings into an n-dimensional space (n being
> configurable in the software). This mapping should
> be done in such a way that if two strings map to
> nearby points (in a Euclidean sense) in this space,
> then their string similarity is high. In this way the
> question of "which strings in the database are most
> similar to x" becomes "which points are closest to
> pos(x)" and we can bring techniques of spatial
> search to bear.
>
> How do we map a string into a point in space? First
> choose several reference strings (b1, ..., bn) and
> then simply apply our distance function to each:
>
> pos(x) = (d(x,b1), ..., d(x,bn))
>
> Since n is much smaller than the number of strings
> in the database, this is a much cheaper calculation,
> and it only needs to be done once for each string.
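>
> A rough sketch of this in Python (illustrative, not
> production code; a real version would put the points in a
> spatial index such as a k-d tree rather than scan them):
>
>     def levenshtein(a, b):
>         """Plain edit distance, one candidate for d."""
>         prev = list(range(len(b) + 1))
>         for i, ca in enumerate(a, 1):
>             cur = [i]
>             for j, cb in enumerate(b, 1):
>                 cur.append(min(prev[j] + 1,        # deletion
>                                cur[j - 1] + 1,     # insertion
>                                prev[j - 1] + (ca != cb)))  # substitution
>             prev = cur
>         return prev[-1]
>
>     def pos(x, refs, d=levenshtein):
>         """Map string x to the point (d(x,b1), ..., d(x,bn))."""
>         return tuple(d(x, b) for b in refs)
>
>     def build_index(strings, refs):
>         """Embed every database string exactly once."""
>         return {s: pos(s, refs) for s in strings}
>
>     def nearest(query, points, refs, k=5):
>         """'Which strings are most similar to x' becomes 'which
>         points are closest to pos(x)' -- brute force here, for
>         clarity."""
>         q = pos(query, refs)
>         def dist2(s):
>             return sum((pi - qi) ** 2 for pi, qi in zip(points[s], q))
>         return sorted(points, key=dist2)[:k]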
>
> Research questions:
>
> * How to choose the reference strings? My first
> thought was to generate them randomly. Peter
> Buneman suggests using a technique called
> multidimensional scaling applied to a selection
> of strings in the database and looking for
> clustering. The strings corresponding most
> closely to the centre of mass of a cluster
> might then be used as reference strings (a
> rough sketch of this follows the list).
>
> * How many dimensions? The more dimensions, the
> harder it is to do a spatial search. The fewer
> dimensions the less likely we are to have a
> useful correspondence between spatial distance
> and string distance.
>
> * Which string similarity metric works best for
> this purpose -- edit distance, Levenshtein
> or Jaro-Winkler metrics?
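>
> For the first question, Peter's suggestion might look
> something like the following (an untested sketch assuming
> scikit-learn and reusing levenshtein() from above; the
> number of dimensions and cluster count are arbitrary):
>
>     import numpy as np
>     from sklearn.cluster import KMeans
>     from sklearn.manifold import MDS
>
>     def choose_reference_strings(sample, n_refs, metric=levenshtein):
>         """Embed a sample of database strings with multidimensional
>         scaling, cluster the points, and take the string nearest
>         each cluster centre as a reference string."""
>         dist = np.array([[metric(a, b) for b in sample] for a in sample])
>         coords = MDS(n_components=2,
>                      dissimilarity="precomputed").fit_transform(dist)
>         centres = KMeans(n_clusters=n_refs).fit(coords).cluster_centers_
>         return [sample[int(np.argmin(np.linalg.norm(coords - c, axis=1)))]
>                 for c in centres]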
>
> Hopefully you won't consider these investigations
> to be off-track for the deduplication task at hand;
> should they prove fruitful, it could be a very useful
> technique...
>
> Any feedback or suggestions would be most welcome.
>
> Cheers,
> -w
>
> --
> William Waites <william.waites at okfn.org>
> Mob: +44 789 798 9965 Open Knowledge Foundation
> Fax: +44 131 464 4948 Edinburgh, UK
>
> RDF Indexing, Clustering and Inferencing in Python
> http://ordf.org/
>
> _______________________________________________
> open-bibliography mailing list
> open-bibliography at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-bibliography
>
--
Karen Coyle
kcoyle at kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet