[open-bibliography] Deduplication

Sun Jun 20 13:54:19 UTC 2010

Hello,

Just wanted to give a quick update on this. Copying
the bibliography list and a few possibly interested
people as well. We are working on deduplicating some
records of medical publications for Harvard Medical
School.

The task is, given several library catalogues, to
identify duplicates or potential duplicates. This
is similar to work done by EDINA for SUNCAT a few
years ago, some software called Aleph and others.

Broadly the algorithm looks like this:

  1. compare identifiers where the syntax is known
     and matches imply a high probability of
     identity, e.g. ISBN, ISSN
  2. if probability of a match is over a certain
     threshold, stop.
  3. compare text fields like title and author to
     determin if a match is likely, ascertain a
     probabilty of match

An optimisation of step 3 involves removing some
common or stop-words from the text fields first as
suggested in the report for EDINA by Xinglong Wang.

The comparison in 3 is expensive. As implemented in
Aleph and SUNCAT, you would take take a text string
and compare it to the corresponding field for each
record in the database using a string similarity
metric such as the Levenshtein or Jaro-Winkler
distances. Not only is calculating the string
similarity itself relatively expensive but the number
of such operations grows with the size of the
database.

This has led to a very interesting side-track to
investigate ways of making 3 a cheaper operation.

My current thinking is to try to find a way to map
strings into an n-dimensional space (n being
configurable in the software). This mapping should
be done in such a way that if they are near to each
other (in a euclidean sense) in this space then
their similarities will be great. In this way the
question of "which strings in the database are most
similar to x" becomes "which points are closest to
pos(x)" and we can bring techniques of spatial
search to bear.

How do we map a string into a point in space? First
choose several reference strings, (b1,...,bn) and
then simply take our distance function and apply it,

    pos(x) = (d(x,b1), ..., d(x,bn))

since n is much smaller than the number of strings
in the database, this is a much simpler calculation
and only needs to be done once for each string.

Research questions:

    * How to choose the reference strings? My first
      thought was to generate them randomly. Peter
      Buneman suggests using a technique called
      multidimensional scaling applied to a selection
      of strings in the database and looking for
      clustering. The strings corresponding most
      closely to the centre of mass of a cluster
      might be suitably used as reference strings.

    * How many dimensions? The more dimensions, the
      harder it is to do a spatial search. The fewer
      dimensions the less likely we are to have a
      useful correspondence between spatial distance
      and string distance.

    * Which string similarity metric works best for
      this purpose -- edit distance, Levenshtein
      or Jaro-Winkler metrics?

Hopefully you won't consider these investigations
to be off-track for the deduplication task at hand,
should they prove fruitful it could be a very useful
technique...

Any feedback or suggestions would be most welcome.

Cheers,
-w

-- 
William Waites           <william.waites at okfn.org>
Mob: +44 789 798 9965    Open Knowledge Foundation
Fax: +44 131 464 4948                Edinburgh, UK

RDF Indexing, Clustering and Inferencing in Python
		http://ordf.org/