[open-bibliography] Deduplication

Mon Jun 21 10:51:00 UTC 2010

On Sun, 2010-06-20 at 16:03 +0100, Robin Houston wrote:
> 
> – the efficiency of step 3 could be hugely improved (from O(n^2) to
> O(n log n)) just by indexing the text fields, even if you do it in a
> simple way like dumping them all into a trie.

There is a lot of fast, reusable and - most importantly - already
written code in the Lucene project that might help out here. I've had
some results using the MoreLikeThis class of query, for example -
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ -
but I haven't yet experimented with term-weighting ('boosting') with
that approach.

This is a thorny issue as (in my opinion) we aren't looking simply for
typos and fuzzed text; we are looking through data entered by
people who have slightly different ideas of what should be entered in
each field for a given item. Karen has already highlighted the type of
error that typically comes from a constraint in the form ("title -
subtitle" vs "title" in the "title" field), but there are other errors,
especially when it comes to proper nouns and names. I have seen various
journal abbreviations appear in catalogue metadata, and collapsing
these is a problem.

As for managing the accuracy of the curated data, I would suggest paying
very close attention to (and therefore recording in a data structure) the
route that the information took from source to datastore and how records
and entities were merged. I favour the approach described by a Southampton
research group for managing co-reference (in the semantic web) -
http://eprints.ecs.soton.ac.uk/15245/ - essentially, every mention of an
entity in a record is given a unique id, and the decision to say that
one id is the same as another is recorded in a 'bundle', with
appropriate metadata.

I'd manufacture this unique id by taking the SHA-256 hash of
"{record-id}:{field}:{value}". The record of merges can be simple:

bundle:1  contains   1ef343.., 9ab20..,
          createdby  <me>
          heuristic  <URI to Exact text + weighted fields info>
          label      "John Smith"

bundle:2  contains   bundle:1, 833ef0.., etc
          createdby  <me>
          heuristic  <URI to levenshtein+weighted match code info>
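
As a rough sketch of how the ids and bundles above might look in Python
(the names mention_id and Bundle, and the example record ids and values,
are placeholders of mine for illustration, not existing code):

```python
import hashlib
from dataclasses import dataclass
from typing import List


def mention_id(record_id: str, field_name: str, value: str) -> str:
    """Return the SHA-256 hex digest of "{record-id}:{field}:{value}"."""
    key = f"{record_id}:{field_name}:{value}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


@dataclass
class Bundle:
    """One recorded merge decision (field names are assumptions)."""
    bundle_id: str
    contains: List[str]   # mention ids and/or ids of other bundles
    created_by: str
    heuristic: str        # URI describing the matching heuristic used
    label: str = ""


# Two mentions of (possibly) the same person in two hypothetical records.
a = mention_id("rec-1", "author", "John Smith")
b = mention_id("rec-2", "author", "Smith, John")

bundle1 = Bundle("bundle:1", [a, b], "<me>", "<URI to heuristic info>",
                 label="John Smith")
```

Because the id is derived only from the record id, field and value, the
same mention always hashes to the same id, so re-importing a source does
not create new ids for mentions that have already been bundled.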

It is the bundles that are used in the frontend, with this data
structure in the background, allowing us to unpick merges when they
happen erroneously (as they will do).
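
To illustrate the unpicking, here is a minimal Python sketch, assuming
the bundles are held in a plain dict (the store layout, the function
names resolve and unpick, and the shortened ids are all hypothetical):

```python
# Hypothetical in-memory store: bundle id -> list of members.
# A member is either a mention id (a hash) or the id of another bundle.
bundles = {
    "bundle:1": ["1ef343", "9ab20"],
    "bundle:2": ["bundle:1", "833ef0"],
}


def resolve(bundle_id, store):
    """Return the set of mention ids reachable from a bundle."""
    mentions = set()
    seen = set()
    stack = [bundle_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against accidental cycles between bundles
        seen.add(current)
        for member in store[current]:
            if member in store:
                stack.append(member)   # nested bundle: expand it later
            else:
                mentions.add(member)   # plain mention id
    return mentions


def unpick(bundle_id, member, store):
    """Undo one merge decision by removing a member from a bundle."""
    store[bundle_id].remove(member)


before = resolve("bundle:2", bundles)
unpick("bundle:2", "833ef0", bundles)   # this merge turned out to be wrong
after = resolve("bundle:2", bundles)
```

Only the one bad decision is removed; bundle:1 and the mention ids
themselves are untouched, so the earlier merge still stands.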

Ben