[open-bibliography] Deduplication
William Waites
william.waites at okfn.org
Sun Jun 20 13:54:19 UTC 2010
Hello,
Just wanted to give a quick update on this. Copying
the bibliography list and a few possibly interested
people as well. We are working on deduplicating some
records of medical publications for Harvard Medical
School.
The task is, given several library catalogues, to
identify duplicates or potential duplicates. This
is similar to work done by EDINA for SUNCAT a few
years ago, and to functionality found in software
such as Aleph.
Broadly, the algorithm looks like this:
1. compare identifiers where the syntax is known
and a match implies a high probability of
identity, e.g. ISBN, ISSN
2. if the probability of a match is over a certain
threshold, stop.
3. compare text fields like title and author to
determine whether a match is likely, and assign
a probability to the match
An optimisation of step 3 involves first removing
common words (stop-words) from the text fields, as
suggested in the report for EDINA by Xinglong Wang.
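To make the skeleton concrete, here is a rough sketch
in Python. The field names, weights, threshold value
and stop-word list are placeholders for illustration,
not what we are actually running:

    # A minimal sketch of the matching skeleton described above.
    # Field names, weights and the stop-word list are illustrative
    # placeholders, not the actual implementation.

    STOP_WORDS = {"the", "a", "an", "of", "and", "journal", "proceedings"}

    def normalise(text):
        """Lower-case, tokenise and strip stop-words from a text field."""
        tokens = text.lower().split()
        return " ".join(t for t in tokens if t not in STOP_WORDS)

    def match_probability(rec_a, rec_b, similarity):
        """Estimate the probability that two records describe the same item.

        `similarity` is any string similarity function returning a value
        in [0, 1], e.g. a Jaro-Winkler implementation."""
        # Step 1: identifiers with known syntax give near-certain matches.
        for field in ("isbn", "issn"):
            if rec_a.get(field) and rec_a.get(field) == rec_b.get(field):
                return 0.99  # step 2: over the threshold, stop here
        # Step 3: fall back to comparing normalised text fields.
        title_sim = similarity(normalise(rec_a["title"]),
                               normalise(rec_b["title"]))
        author_sim = similarity(normalise(rec_a["author"]),
                                normalise(rec_b["author"]))
        return 0.7 * title_sim + 0.3 * author_sim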
The comparison in 3 is expensive. As implemented in
Aleph and SUNCAT, you would take a text string
and compare it to the corresponding field for each
record in the database using a string similarity
metric such as the Levenshtein or Jaro-Winkler
distances. Not only is calculating the string
similarity itself relatively expensive, but the
number of such operations grows with the size of
the database.
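For concreteness, the naive approach amounts to a
linear scan, computing an O(|a|*|b|) edit distance
against every stored value. A pure-Python sketch
(illustrative only):

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def most_similar(query, titles):
        """Naive linear scan: compare the query against every stored title."""
        return min(titles, key=lambda t: levenshtein(query, t))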
This has led to a very interesting side-track:
investigating ways of making step 3 a cheaper
operation. My current thinking is to try to find a
way to map strings into an n-dimensional space (n
being configurable in the software). This mapping
should be done in such a way that strings which are
near each other (in a Euclidean sense) in this space
have a high string similarity. In this way the
question "which strings in the database are most
similar to x" becomes "which points are closest to
pos(x)", and we can bring techniques of spatial
search to bear.
How do we map a string into a point in space? First
choose several reference strings (b1, ..., bn), and
then simply apply our distance function:
pos(x) = (d(x, b1), ..., d(x, bn))
Since n is much smaller than the number of strings
in the database, this is a much cheaper calculation,
and it only needs to be done once for each string.
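A sketch of what I have in mind, using scipy's
KD-tree for the spatial part (illustrative only;
none of this is in the code yet, and the class and
function names are made up):

    import numpy as np
    from scipy.spatial import cKDTree

    def embed(s, references, distance):
        # The i-th coordinate of the point is the distance from s
        # to the i-th reference string: pos(x) = (d(x,b1), ..., d(x,bn)).
        return [distance(s, b) for b in references]

    class StringIndex:
        """Index strings by their position relative to n reference
        strings, so that "which strings are most similar to x" becomes
        a KD-tree nearest-neighbour query on pos(x)."""

        def __init__(self, strings, references, distance):
            self.strings = list(strings)
            self.references = references
            self.distance = distance
            self.tree = cKDTree([embed(s, references, distance)
                                 for s in self.strings])

        def nearest(self, query, k=5):
            # Returns k *candidates*; they still need to be confirmed
            # with the real string metric afterwards.
            point = embed(query, self.references, self.distance)
            _, idx = self.tree.query(point, k=k)
            return [self.strings[i] for i in np.atleast_1d(idx)]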
Research questions:
* How to choose the reference strings? My first
thought was to generate them randomly. Peter
Buneman suggests applying a technique called
multidimensional scaling to a selection of
strings in the database and looking for
clustering; the strings corresponding most
closely to the centre of mass of each cluster
might then serve well as reference strings
(a rough sketch of this follows below the list).
* How many dimensions? The more dimensions, the
harder it is to do a spatial search. The fewer
dimensions the less likely we are to have a
useful correspondence between spatial distance
and string distance.
* Which string similarity metric works best for
this purpose -- Levenshtein (edit) distance or
the Jaro-Winkler metric?
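To make Peter's suggestion a little more concrete,
something along these lines is what I imagine trying,
here sketched with scikit-learn's MDS and k-means
(hypothetical and untested; the sample size, number
of references and embedding dimension are arbitrary):

    import numpy as np
    from sklearn.manifold import MDS
    from sklearn.cluster import KMeans

    def choose_references(sample, distance, n_refs=8):
        """Pick reference strings from a sample: embed the sample with
        multidimensional scaling on the pairwise distance matrix,
        cluster, and take the string nearest each cluster centre.

        Note the O(m^2) distance computations, so the sample must
        stay modest (and larger than n_refs)."""
        m = len(sample)
        dist = np.zeros((m, m))
        for i in range(m):
            for j in range(i + 1, m):
                dist[i, j] = dist[j, i] = distance(sample[i], sample[j])

        # MDS embedding that approximately preserves the string
        # distances; the embedding dimension is set to n_refs
        # arbitrarily here.
        coords = MDS(n_components=n_refs,
                     dissimilarity="precomputed").fit_transform(dist)

        # Cluster, then take the sample string closest to each
        # cluster's centre of mass as a reference string.
        km = KMeans(n_clusters=n_refs).fit(coords)
        refs = []
        for centre in km.cluster_centers_:
            nearest = np.argmin(np.linalg.norm(coords - centre, axis=1))
            refs.append(sample[nearest])
        return refs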
Hopefully you won't consider these investigations
to be off-track for the deduplication task at hand;
should they prove fruitful, this could be a very
useful technique...
Any feedback or suggestions would be most welcome.
Cheers,
-w
--
William Waites <william.waites at okfn.org>
Mob: +44 789 798 9965 Open Knowledge Foundation
Fax: +44 131 464 4948 Edinburgh, UK
RDF Indexing, Clustering and Inferencing in Python
http://ordf.org/