[open-bibliography] Disambiguation, deduplication and 'ideals'
Benjamin O'Steen
bosteen at gmail.com
Wed Sep 1 09:15:13 UTC 2010
On Tue, 2010-08-31 at 19:18 -0700, Jim Pitman wrote:
> I'd like to hear more about your meshing tool. I've made several
> naive starts at this problem in simple cases, especially in aggregating personal
> bibliographies (deduplicating bibitems) and in deduplicating lists of authors.
> A simple tool and adequate data framework and UI for these tasks would be most welcome.
> --Jim
I've been aiming to tackle this in a map-reduce style: one set of
processes flags similarities, and the other set 'reduces' them to create
the bundle-closure structure. This works in practice and, so far, is not
a concern to me.
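As a rough illustration of that split (a minimal sketch, not the actual
meshing tool): the map stage below flags pairs of records that a
similarity test accepts, and the reduce stage merges flagged pairs into
bundles with a union-find. Reading 'closure' as the transitive closure of
the flagged pairs is an assumption, and the names map_stage, reduce_stage
and is_similar are hypothetical.

```python
from itertools import combinations


def map_stage(records, is_similar):
    """Flag every pair of record ids whose values pass the similarity test.

    `is_similar` is a hypothetical predicate supplied by the caller.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(records), 2)
        if is_similar(records[a], records[b])
    ]


def reduce_stage(record_ids, flagged_pairs):
    """Merge flagged pairs into bundles (transitive closure via union-find)."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in flagged_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    bundles = {}
    for rid in record_ids:
        bundles.setdefault(find(rid), set()).add(rid)
    return sorted(sorted(bundle) for bundle in bundles.values())


# Made-up records, with a deliberately crude similarity test.
records = {
    "r1": "Pitman, J. Combinatorial stochastic processes",
    "r2": "pitman, j. combinatorial stochastic processes",
    "r3": "Cohen, W. A comparison of string distance metrics",
}
pairs = map_stage(records, lambda x, y: x.lower() == y.lower())
bundles = reduce_stage(records, pairs)
```

Keeping the two stages separate means the similarity test can be swapped
out without touching the bundling logic.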
I've just written up a description of what happens at the 'reduce' stage
- what actually happens when things are found to be similar/'same'.
The map stage is the one I am working on most heavily. I, too, first
tried some naive approaches, starting with a weighted similarity based
on Levenshtein distances and hoping to hit the 80:20 sweet spot, but I
was getting nowhere near decent levels of hits.
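For readers unfamiliar with the naive approach, here is a minimal sketch
of a weighted similarity built on Levenshtein distance. It is not the
code that was actually tried; the field names and weights are made up for
illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the edit distance into [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def weighted_similarity(rec_a, rec_b, weights):
    """Weighted average of per-field similarities (weights are assumptions)."""
    total = sum(weights.values())
    return sum(
        weight * similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in weights.items()
    ) / total


# Hypothetical weights: author counts twice as much as title.
WEIGHTS = {"author": 2.0, "title": 1.0}
reordered = similarity("Pitman, Jim", "Jim Pitman")
```

Note that `reordered` comes out below 0.5 even though both strings name
the same person, which is one way a character-level metric can miss
obvious matches.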
I started to try other string similarity metrics, including one that
split the fields into tokens before similarity matching and
balancing for omissions/additions.
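A minimal sketch of the token-based idea follows. The exact metric used
is not described above, so Jaccard and an overlap coefficient are
stand-ins chosen for illustration; the function names are hypothetical.

```python
import re


def tokens(field):
    """Lower-case the field and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", field.lower()))


def jaccard(a, b):
    """Shared tokens divided by all tokens; insensitive to word order."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def overlap(a, b):
    """Shared tokens divided by the smaller token set.

    This forgives omissions/additions: a field that is a subset of the
    other still scores 1.0.
    """
    tokens_a, tokens_b = tokens(a), tokens(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))


same_order_free = jaccard("Pitman, Jim", "Jim Pitman")
extra_initial = overlap("Pitman, Jim", "Pitman, Jim W.")
```

Both `same_order_free` and `extra_initial` evaluate to 1.0 here, whereas
plain Jaccard on the second pair gives 2/3 because of the extra token.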
The short story is that this whole area is not a new one, so I stopped
trying to re-invent wheels and accepted that there is a reason why
people use the more complex methods.
(A good rundown of string similarity metrics is here:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html)
So, I am currently adapting a library called Febrl (Freely Extensible
Biomedical Record Linkage) for this purpose. It uses a combination of
these string similarity metrics in a Fellegi-Sunter approach to
dedupe.
http://datamining.anu.edu.au/projects/linkage.html
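For context, a minimal sketch of Fellegi-Sunter scoring in plain Python.
This is not Febrl's API. Each field has an m-probability (the chance the
field agrees when two records are a true match) and a u-probability (the
chance it agrees when they are not). Agreement adds log2(m/u),
disagreement adds log2((1-m)/(1-u)), and the summed weight is compared
against two thresholds. The m/u values and thresholds below are made up,
and agreement is treated as yes/no for simplicity, whereas in practice a
string similarity score would decide how strongly a field agrees.

```python
import math

# Hypothetical (m, u) probabilities per field; real values are estimated from data.
M_U = {
    "author": (0.95, 0.01),
    "title": (0.90, 0.05),
    "year": (0.98, 0.10),
}


def match_weight(agreements, m_u=M_U):
    """Sum the Fellegi-Sunter log-likelihood weights over all fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = m_u[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Three-way decision; both thresholds are illustrative assumptions."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


all_agree = classify(match_weight({"author": True, "title": True, "year": True}))
title_differs = classify(match_weight({"author": True, "title": False, "year": True}))
none_agree = classify(match_weight({"author": False, "title": False, "year": False}))
```

With these made-up numbers the three cases come out as "match",
"possible match" and "non-match" respectively; the middle band is the set
of pairs that would be left for a person to review.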
Ben
PS: There is a useful paper surveying the effectiveness of various string
similarity techniques here:
http://www.isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf
Worth reading.