[open-bibliography] Deduplication
Robin Houston
robin.houston at gmail.com
Sun Jun 20 15:03:56 UTC 2010
Hello William,
I don't have much background here, so I apologise if any of this is wildly
inappropriate.
This deduplication problem sounds pretty gnarly to me even without
introducing any speculative new ideas. Is there a good reason not to attempt
some of the more tried-and-tested approaches first? Of course there's an
enormous literature on text search and indexing, and ideally you'd need an
expert to point out the relevant parts (unless you are such an expert
yourself, in which case apologies!)
But a couple of non-expert remarks:
– the efficiency of step 3 could be hugely improved (from O(n^2) to O(n log
n)) just by indexing the text fields, even if you do it in a simple way like
dumping them all into a trie (there's a sketch of what I mean below). It's
not obvious to me (perhaps because I'm being stupid) that your proposed
algorithm would be essentially any better than this, even if it could be
made to work.
– bioinformaticians have a lot of experience of doing inexact string
searches of short strings against large databases, in the context of mapping
short DNA reads to a reference genome. The best algorithms (at least as of
last year) use the Burrows-Wheeler transform to create a compressed index,
with the effect that the whole index of the human genome can be held in RAM
on a non-ridiculous machine. Of course keeping the index in main memory
results in massive speed improvements. Whether such a strategy is practical
or useful in your case depends on both the size of the index and how fast
the algorithm needs to run. http://dx.doi.org/10.1093/bioinformatics/btp324
(There's a toy illustration of the backward-search idea below as well.)
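To make the first remark concrete, here is a minimal sketch in Python of
what I mean by a simple index: a trie over normalised title strings, so that
an exact lookup costs time proportional to the length of the title rather
than a scan of the whole database. The class and method names are purely
illustrative, not from any existing codebase:

    # Toy trie index over normalised title strings.
    class TrieNode:
        def __init__(self):
            self.children = {}
            self.record_ids = []   # records whose title ends at this node

    class TitleTrie:
        def __init__(self):
            self.root = TrieNode()

        def add(self, title, record_id):
            node = self.root
            for ch in title.lower():
                node = node.children.setdefault(ch, TrieNode())
            node.record_ids.append(record_id)

        def lookup(self, title):
            """Record ids whose title matches exactly, in O(len(title))."""
            node = self.root
            for ch in title.lower():
                node = node.children.get(ch)
                if node is None:
                    return []
            return node.record_ids

You would still fall back to a fuzzy comparison for the near-misses, but an
index like this (or just a database index on the normalised field) filters
out the vast majority of candidates first.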
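And for the second remark, a toy illustration of the Burrows-Wheeler idea:
build the transform from sorted rotations, then count exact occurrences of a
pattern by "backward search". The real read-mapping tools use sampled rank
structures instead of the naive str.count calls below, which is what keeps
the index compressed and in RAM; this is only meant to show the mechanism:

    def bwt(text):
        """Burrows-Wheeler transform via sorted rotations (toy version)."""
        text += "\0"        # sentinel, lexicographically smallest character
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rot[-1] for rot in rotations)

    def count_occurrences(bwt_text, pattern):
        """Count exact occurrences of pattern using backward search."""
        # C[c] = number of characters in the text strictly smaller than c
        first_column = sorted(bwt_text)
        C = {}
        for i, c in enumerate(first_column):
            C.setdefault(c, i)
        lo, hi = 0, len(bwt_text)
        for c in reversed(pattern):
            if c not in C:
                return 0
            lo = C[c] + bwt_text[:lo].count(c)
            hi = C[c] + bwt_text[:hi].count(c)
            if lo >= hi:
                return 0
        return hi - lo

    print(count_occurrences(bwt("abracadabra"), "abra"))   # -> 2

Whether anything like this beats a plain database index for bibliographic
titles is exactly the sort of question that depends on the index size and
the query rate.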
Happy to say more about any of this if it's at all useful.
All the best,
Robin
On 20 June 2010 14:54, William Waites <william.waites at okfn.org> wrote:
> Hello,
>
> Just wanted to give a quick update on this. Copying
> the bibliography list and a few possibly interested
> people as well. We are working on deduplicating some
> records of medical publications for Harvard Medical
> School.
>
> The task is, given several library catalogues, to
> identify duplicates or potential duplicates. This
> is similar to work done by EDINA for SUNCAT a few
> years ago, and to existing software such as Aleph.
>
> Broadly the algorithm looks like this:
>
> 1. compare identifiers where the syntax is known
> and matches imply a high probability of
> identity, e.g. ISBN, ISSN
> 2. if probability of a match is over a certain
> threshold, stop.
> 3. compare text fields like title and author to
> determine whether a match is likely, and assign
> a probability of match
>
> An optimisation of step 3 involves removing common
> words or stop words from the text fields first, as
> suggested in the report for EDINA by Xinglong Wang.
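>
> A rough sketch of steps 1-3 in Python might look
> like the following. The field names, the stop word
> list and the 0.99 figure are all just illustrative,
> not what we have actually implemented:
>
> # Illustrative sketch of the matching pipeline.
> STOP_WORDS = {"the", "a", "an", "of", "and", "journal"}
>
> def normalise(text):
>     tokens = text.lower().split()
>     return " ".join(t for t in tokens if t not in STOP_WORDS)
>
> def match_probability(rec_a, rec_b, similarity):
>     # Step 1: identifiers with known syntax, e.g. ISBN/ISSN,
>     # give a near-certain answer.
>     for field in ("isbn", "issn"):
>         if rec_a.get(field) and rec_a.get(field) == rec_b.get(field):
>             return 0.99              # step 2: over threshold, stop
>     # Step 3: fall back to comparing normalised text fields
>     # with a string similarity metric (see below).
>     return similarity(normalise(rec_a["title"]),
>                       normalise(rec_b["title"]))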
>
> The comparison in 3 is expensive. As implemented in
> Aleph and SUNCAT, you would take a text string
> and compare it to the corresponding field for each
> record in the database using a string similarity
> metric such as the Levenshtein or Jaro-Winkler
> distances. Not only is calculating the string
> similarity itself relatively expensive, but the
> number of such operations also grows with the size
> of the database.
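>
> For concreteness, the kind of metric in question --
> here a plain dynamic-programming Levenshtein
> distance, turned into a similarity in [0, 1]; in
> practice a C implementation would be used:
>
> def levenshtein(a, b):
>     """Edit distance between strings a and b."""
>     previous = list(range(len(b) + 1))
>     for i, ca in enumerate(a, 1):
>         current = [i]
>         for j, cb in enumerate(b, 1):
>             current.append(min(previous[j] + 1,       # deletion
>                                current[j - 1] + 1,    # insertion
>                                previous[j - 1] + (ca != cb)))  # substitution
>         previous = current
>     return previous[-1]
>
> def similarity(a, b):
>     """Normalised similarity in [0, 1]; 1.0 means identical."""
>     if not a and not b:
>         return 1.0
>     return 1.0 - levenshtein(a, b) / float(max(len(a), len(b)))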
>
> This has led to a very interesting side-track:
> investigating ways of making step 3 a cheaper
> operation.
>
> My current thinking is to try to find a way to map
> strings into an n-dimensional space (n being
> configurable in the software). The mapping should
> be such that strings which land near each other
> (in a Euclidean sense) in this space are likely to
> be very similar. In this way the question "which
> strings in the database are most similar to x"
> becomes "which points are closest to pos(x)", and
> we can bring techniques of spatial search to bear.
>
> How do we map a string into a point in space? First
> choose several reference strings (b1, ..., bn), then
> simply apply our distance function d to each:
>
> pos(x) = (d(x,b1), ..., d(x,bn))
>
> Since n is much smaller than the number of strings
> in the database, this is a much cheaper calculation,
> and it only needs to be done once for each string.
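>
> In code the whole idea is only a few lines (titles,
> refs and the choice of metric are placeholders here,
> and the naive nearest() would be replaced by a k-d
> tree, R-tree or similar spatial index):
>
> def pos(x, refs, d):
>     """Map string x to a point in n-dimensional space."""
>     return tuple(d(x, b) for b in refs)
>
> # Precompute positions once per record,
> points = dict((rec_id, pos(title, refs, levenshtein))
>               for rec_id, title in titles.items())
>
> # then "which strings are most similar to x" becomes a
> # nearest-neighbour query in the embedded space.
> def nearest(x, refs, points, d, k=10):
>     px = pos(x, refs, d)
>     def euclidean(p, q):
>         return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
>     return sorted(points, key=lambda r: euclidean(points[r], px))[:k]
>
> One nice property: if d is a true metric then, by the
> triangle inequality, each coordinate of pos(x) and
> pos(y) differs by at most d(x,y), so strings that are
> close under d are guaranteed to map to nearby points.
> The converse does not hold, so the spatial neighbours
> are candidates to verify with the real metric rather
> than final answers.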
>
> Research questions:
>
> * How to choose the reference strings? My first
> thought was to generate them randomly. Peter
> Buneman suggests applying a technique called
> multidimensional scaling to a selection of
> strings in the database and looking for
> clusters. The strings lying closest to the
> centre of mass of each cluster could then be
> used as reference strings (a sketch of this
> follows the list).
>
> * How many dimensions? The more dimensions, the
> harder it is to do a spatial search. The fewer
> dimensions, the less likely we are to have a
> useful correspondence between spatial distance
> and string distance.
>
> * Which string similarity metric works best for
> this purpose -- edit distance, Levenshtein
> or Jaro-Winkler metrics?
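>
> On the first question, here is roughly what I
> understand Peter's suggestion to mean, sketched
> with scikit-learn's MDS and k-means purely as
> stand-ins (untested; the sample size and cluster
> count would need experimenting with):
>
> import numpy as np
> from sklearn.manifold import MDS
> from sklearn.cluster import KMeans
>
> def choose_reference_strings(sample, d, n):
>     """Pick n reference strings from a sample of strings."""
>     # Pairwise distances over a manageable sample.
>     D = np.array([[d(a, b) for b in sample] for a in sample])
>     # Embed the sample so Euclidean distance approximates d.
>     coords = MDS(n_components=n,
>                  dissimilarity="precomputed").fit_transform(D)
>     # Cluster, then take the sample string nearest each
>     # cluster centre ("centre of mass") as a reference string.
>     km = KMeans(n_clusters=n).fit(coords)
>     refs = []
>     for centre in km.cluster_centers_:
>         idx = np.argmin(((coords - centre) ** 2).sum(axis=1))
>         refs.append(sample[idx])
>     return refs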
>
> Hopefully you won't consider these investigations
> to be off-track for the deduplication task at hand;
> should they prove fruitful, this could be a very
> useful technique...
>
> Any feedback or suggestions would be most welcome.
>
> Cheers,
> -w
>
> --
> William Waites <william.waites at okfn.org>
> Mob: +44 789 798 9965 Open Knowledge Foundation
> Fax: +44 131 464 4948 Edinburgh, UK
>
> RDF Indexing, Clustering and Inferencing in Python
> http://ordf.org/
>