[open-bibliography] Deduplication

Ben O'Steen bosteen at gmail.com
Mon Jun 21 10:51:00 UTC 2010


On Sun, 2010-06-20 at 16:03 +0100, Robin Houston wrote:
> 
> – the efficiency of step 3 could be hugely improved (from O(n^2) to
> O(n log n)) just by indexing the text fields, even if you do it in a
> simple way like dumping them all into a trie.


There is a lot of fast, reusable and - most importantly - already
written code in the Lucene project that might help out here. I've had
some results from the MoreLikeThis class of query, for example -
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ -
but I haven't yet experimented with term-weighting ('boosting') with
that approach.
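
To make the indexing point concrete, here is a rough sketch in Python.
It is not Lucene and not MoreLikeThis, just a toy inverted index over
one text field, used to pick candidate pairs instead of comparing every
record with every other. The names (tokens, candidate_pairs,
min_shared) and the sample titles are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations
import re

def tokens(text):
    """Lowercase word tokens; a deliberately crude normaliser."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def candidate_pairs(records, min_shared=2):
    """Return pairs of record ids that share at least min_shared tokens."""
    index = defaultdict(set)          # token -> ids of records containing it
    for rec_id, title in records.items():
        for tok in tokens(title):
            index[tok].add(rec_id)

    shared = defaultdict(int)         # (id_a, id_b) -> number of shared tokens
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair for pair, n in shared.items() if n >= min_shared}

# Hypothetical sample data showing the "title - subtitle vs title" case.
records = {
    "r1": "Origin of Species",
    "r2": "On the Origin of Species - by means of natural selection",
    "r3": "Principles of Geology",
}
pairs = candidate_pairs(records)
```

Only the pairs that come out of the index need the expensive
comparison. Very common tokens (such as "of" here) still pull many
records together, so a real index would want stop words or term
weighting, which is where something like Lucene earns its keep.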

This is a thorny issue as (in my opinion) we aren't simply looking for
mistypings and fuzzed text; we are looking through data entered by
people who have slightly different ideas of what should be entered in
each field for a given item. Karen has already highlighted the type of
error that typically comes from a constraint in the form ("title -
subtitle vs title" in the "title" field), but there are other errors,
especially when it comes to proper nouns and names. I have seen various
journal abbreviations appear in catalogue metadata, and collapsing
these to a single form is a problem.


As for managing the accuracy of the curated data, I would suggest paying
very close attention to (and therefore recording in a data structure) the
route that the information took from source to datastore and how records
and entities were merged. I favour the route described by a Southampton
research group for managing co-reference (in the semantic web) -
http://eprints.ecs.soton.ac.uk/15245/ - essentially, every mention of an
entity in a record is given a unique id, and the decision to say that
one id is the same as another is recorded in a 'bundle', with
appropriate metadata.

I'd manufacture this unique id by taking the SHA-256 hash of
"{record-id}:{field}:{value}". The record of merges can be simple:

bundle:1  contains   1ef343.., 9ab20..,
          createdby  <me>
          heuristic  <URI to Exact text + weighted fields info>
          label      "John Smith"

bundle:2  contains   bundle:1, 833ef0.., etc
          createdby  <me>
          heuristic  <URI to Levenshtein+weighted match code info>
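
A minimal sketch of the id and the bundle record in Python, under some
assumptions of mine: the string is encoded as UTF-8 and the hash is kept
as a hex digest. The names mention_id and Bundle, the record ids and the
heuristic URI are all hypothetical.

```python
import hashlib
from dataclasses import dataclass

def mention_id(record_id, field_name, value):
    """SHA-256 hex digest of "{record-id}:{field}:{value}"."""
    key = f"{record_id}:{field_name}:{value}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

@dataclass
class Bundle:
    bundle_id: str
    contains: list        # mention ids and/or ids of other bundles
    created_by: str
    heuristic: str        # URI describing the matching rule that was used
    label: str = ""

# Two mentions of (probably) the same person in two hypothetical records.
a = mention_id("rec-001", "author", "John Smith")
b = mention_id("rec-002", "author", "Smith, John")

bundle1 = Bundle(
    bundle_id="bundle:1",
    contains=[a, b],
    created_by="<me>",
    heuristic="http://example.org/heuristics/exact-text",  # placeholder URI
    label="John Smith",
)
```

Because the id is derived from the record, field and value, re-importing
the same source gives the same ids, so existing bundles keep pointing at
the right mentions.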

It is the bundles that are used in the frontend, with this data
structure in the background, allowing us to unpick merges when they
happen erroneously (as they will).
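
Unpicking could look something like the following Python sketch. It
assumes the bundles live in a plain dict keyed by bundle id, which is my
simplification and not part of the Southampton approach; the functions
members and unpick are hypothetical, and the short ids stand in for the
truncated hashes in the listing.

```python
def members(bundle_id, bundles):
    """Expand a bundle recursively into the set of mention ids it covers."""
    found = set()
    for item in bundles[bundle_id]["contains"]:
        if item in bundles:               # nested bundle: recurse into it
            found |= members(item, bundles)
        else:                             # plain mention id
            found.add(item)
    return found

def unpick(bundle_id, bundles):
    """Undo one merge decision; the bundles it contained are left intact."""
    parents = [b for b, data in bundles.items() if bundle_id in data["contains"]]
    if parents:
        raise ValueError(f"{bundle_id} is still used by {parents}")
    return bundles.pop(bundle_id)

# Placeholder ids standing in for full SHA-256 digests.
bundles = {
    "bundle:1": {"contains": ["1ef343", "9ab20"], "label": "John Smith"},
    "bundle:2": {"contains": ["bundle:1", "833ef0"]},
}

everything = members("bundle:2", bundles)   # what the frontend would show
```

If the fuzzy match behind bundle:2 turns out to be wrong, calling
unpick("bundle:2", bundles) removes only that decision: bundle:1 and the
underlying mention ids are untouched, so nothing has to be re-derived
from the sources.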


Ben




