[open-bibliography] Disambiguation, deduplication and 'ideals'

Wed Sep 1 09:15:13 UTC 2010

On Tue, 2010-08-31 at 19:18 -0700, Jim Pitman wrote:
> I'd like to hear more about your meshing tool. I've made several 
> naive starts at this problem in simple cases, especially in aggregating personal
> bibliographies (deduplicating bibitems) and in deduplicating lists of authors.
> A simple tool and adequate data framework and UI for these tasks would be most welcome.
> --Jim

I've been aiming to tackle this in a map-reduce style - one set of
processes flags similarities and the other set 'reduces' them to create
the bundle-closure structure. This works in practice, and so far it is
not a concern to me.
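
To make the two stages concrete, here is a minimal sketch in Python. It
assumes that "bundle-closure" means the transitive closure over flagged
pairs (if A matches B and B matches C, all three land in one bundle),
which is an interpretation rather than a description of the actual tool.
The function names and the toy is_similar test are hypothetical.

```python
from itertools import combinations


def map_flag_pairs(records, is_similar):
    """'Map' stage: yield pairs of record ids flagged as similar."""
    for (id_a, rec_a), (id_b, rec_b) in combinations(sorted(records.items()), 2):
        if is_similar(rec_a, rec_b):
            yield id_a, id_b


def reduce_to_bundles(ids, pairs):
    """'Reduce' stage: merge flagged pairs into bundles with union-find."""
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    bundles = {}
    for i in ids:
        bundles.setdefault(find(i), set()).add(i)
    return sorted(sorted(bundle) for bundle in bundles.values())


# Toy data; the case-insensitive equality test stands in for a real matcher.
records = {
    "r1": "Disambiguation and deduplication",
    "r2": "disambiguation and deduplication",
    "r3": "Record linkage",
}
pairs = list(map_flag_pairs(records, lambda a, b: a.lower() == b.lower()))
bundles = reduce_to_bundles(records, pairs)
```

Comparing every pair is quadratic in the number of records, so a real
map stage would need some form of blocking or indexing first.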

I've just written up a description of what happens at the 'reduce' stage
- what actually happens when things are found to be similar/'same'.

The map stage is the one I am working on heavily. I, too, initially
tried some naive approaches, attempting a weighted similarity using
Levenshtein distances and hoping to hit the 80:20 sweet spot, but I
was getting nowhere near decent levels of hits.
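
For illustration, a naive weighted Levenshtein similarity along these
lines might look like the sketch below. The field names and weights are
made up for the example and are not the ones actually tried.

```python
def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalise the distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def weighted_similarity(rec_a, rec_b, weights):
    """Weighted average of per-field similarities (weights are assumptions)."""
    total = sum(weights.values())
    score = sum(w * similarity(rec_a.get(field, ""), rec_b.get(field, ""))
                for field, w in weights.items())
    return score / total


# Hypothetical field weights for a bibliographic record.
WEIGHTS = {"title": 0.6, "author": 0.4}
```

One reason this underperforms: character-level edit distance punishes
reordering, so "Pitman, Jim" and "Jim Pitman" score below 0.5 even though
they name the same person.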

I started to try other string similarity metrics, including one that
split the fields into tokens before similarity matching and balanced
for omissions/additions.
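
A rough sketch of a token-based metric of this kind follows. It is one
possible reading, not the metric actually used: each token is paired
with its best match on the other side, and averaging in both directions
means tokens missing from either side pull the score down. The
per-token comparison uses difflib.SequenceMatcher from the standard
library as a stand-in.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lower-case the field and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def best_match_average(source, target):
    """Average, over source tokens, of the best score against any target token."""
    return sum(max(SequenceMatcher(None, s, t).ratio() for t in target)
               for s in source) / len(source)


def token_similarity(a, b):
    """Order-insensitive similarity, symmetric in its two arguments."""
    toks_a, toks_b = tokens(a), tokens(b)
    if not toks_a or not toks_b:
        return 0.0  # an empty field gives no evidence either way
    # Averaging both directions penalises omissions and additions alike.
    return (best_match_average(toks_a, toks_b)
            + best_match_average(toks_b, toks_a)) / 2
```

With this, token_similarity("Pitman, Jim", "Jim Pitman") is 1.0, while
dropping a token, as in "Jim Pitman" against "Pitman", scores below 1.0.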

The short story is that this whole area is not a new one, so I stopped
trying to re-invent wheels and accepted that there is a reason why
people use the more complex methods.

(Good run down of string similarity metrics here:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
)

So, I am currently adapting a library called Febrl for this purpose -
(Freely Extensible Biomedical Record Linkage) - which uses a combination
of these string similarity metrics in a Fellegi-Sunter approach to
dedupe.

http://datamining.anu.edu.au/projects/linkage.html
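
For readers unfamiliar with Fellegi-Sunter, the core idea can be
sketched in a few lines of plain Python. This does not use Febrl's API.
The m and u probabilities (the chance that a field agrees for a true
match and for a non-match, respectively) and the two thresholds are
invented for the example; in practice they are estimated from data.

```python
import math

# Hypothetical (m, u) probabilities per field:
#   m = P(field agrees | records are the same item)
#   u = P(field agrees | records are different items)
FIELD_PROBS = {
    "title": (0.95, 0.01),
    "author": (0.90, 0.05),
    "year": (0.98, 0.10),
}


def match_weight(agreements, probs=FIELD_PROBS):
    """Sum log-likelihood-ratio weights over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = probs[field]
        if agrees:
            total += math.log2(m / u)              # agreement adds weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Two thresholds (assumed values) split pairs into three classes."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"  # left for clerical review
```

Here the per-field agreement is a plain boolean, which in practice would
come from thresholding a string similarity score such as the ones above.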

Ben

PS There is a useful paper surveying the effectiveness of various string
similarity techniques here:

http://www.isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf

Worth reading.