[open-bibliography] Deduplication

Karen Coyle kcoyle at kcoyle.net
Mon Jun 21 14:02:57 UTC 2010


Ben, what you describe below raises another question in the whole
"merging" issue: at what level do you merge (record or field), and what
do you do once two items are determined to be the same? There are a
lot of considerations here, such as how much duplicate data you can
keep around and how well you can resolve that duplication at
display time.

In the original design for MELVYL, we identified the source of each  
field in the MARC record, and kept all variant fields:

245 $a Moby Dick (UCLA, UCSD)
245 $a Moby Dick, or, The Whale (UCB)
245 $a Moby Dick or The Whale (UCSC)

One "source" was chosen for the record display; all were indexed.
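
A rough sketch of that field-level approach, in Python rather than anything resembling the actual MELVYL code. The class name, the method names, and the idea of a preference-ordered list of sources are assumptions made for illustration; how the display source was really chosen is not described here.

```python
from collections import defaultdict


class MergedField:
    """One field (e.g. 245 $a) holding every variant value and its sources."""

    def __init__(self, tag):
        self.tag = tag
        # variant value -> list of sources that contributed it
        self.variants = defaultdict(list)

    def add(self, value, source):
        self.variants[value].append(source)

    def display_value(self, preferred_sources):
        """Return the variant supplied by the first preferred source found.

        The preference list is a hypothetical selection rule.
        """
        for source in preferred_sources:
            for value, sources in self.variants.items():
                if source in sources:
                    return value
        # Fall back to the first variant recorded, if any.
        return next(iter(self.variants), None)

    def index_terms(self):
        """Every variant is indexed, whichever one is displayed."""
        return list(self.variants)


title = MergedField("245 $a")
title.add("Moby Dick", "UCLA")
title.add("Moby Dick", "UCSD")
title.add("Moby Dick, or, The Whale", "UCB")
title.add("Moby Dick or The Whale", "UCSC")

shown = title.display_value(["UCB", "UCLA"])
searchable = title.index_terms()
```

Here `shown` is the UCB form of the title, while `searchable` still carries all three variants, so a search on any of them finds the record.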

In the Aleph design, all records were kept, with a "same" pointer
between them, but again, one was designated as the user display. Merging
and unmerging were a matter of changing the "set" that the record's
pointer belonged to.
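
The record-level approach could be sketched like this. It is only an illustration of "keep every record, change which set it points to"; the class, its attribute names, and the rule for picking a new display record after an unmerge are all assumptions, not Aleph's actual data model.

```python
class RecordStore:
    """Keeps every record; merging only changes set membership."""

    def __init__(self):
        self.records = {}   # record id -> record data (never discarded)
        self.set_of = {}    # record id -> id of the "same" set it points to
        self.display = {}   # set id -> record id designated for display
        self._next_set = 0

    def _new_set(self, rec_id):
        set_id = self._next_set
        self._next_set += 1
        self.set_of[rec_id] = set_id
        self.display[set_id] = rec_id
        return set_id

    def add(self, rec_id, data):
        self.records[rec_id] = data
        self._new_set(rec_id)

    def members(self, set_id):
        return [r for r, s in self.set_of.items() if s == set_id]

    def merge(self, rec_id, into_rec_id):
        """Point rec_id's whole set at the set of into_rec_id."""
        old, target = self.set_of[rec_id], self.set_of[into_rec_id]
        if old == target:
            return
        for r in self.members(old):
            self.set_of[r] = target
        del self.display[old]

    def unmerge(self, rec_id):
        """Move rec_id back into a set of its own."""
        old = self.set_of[rec_id]
        if len(self.members(old)) == 1:
            return
        self._new_set(rec_id)
        if self.display[old] == rec_id:
            # Assumed rule: fall back to the first remaining member.
            self.display[old] = self.members(old)[0]


store = RecordStore()
store.add("ucla-1", {"245a": "Moby Dick"})
store.add("ucb-1", {"245a": "Moby Dick, or, The Whale"})
store.add("ucsc-1", {"245a": "Moby Dick or The Whale"})
store.merge("ucb-1", "ucla-1")
store.merge("ucsc-1", "ucla-1")
store.unmerge("ucsc-1")   # undo one merge; no record data was lost
```

Because no record is ever rewritten, undoing a bad merge costs one pointer change.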

In both of these, we wanted to avoid displaying variant data, which
would just confuse users. (Also, we were going for a traditional
library metadata display.) So you do have to consider how you will
reconstruct your data for display.

On another note, it has been said that there are lots of variants  
among authors. We actually weighted authors quite low in our  
implementation for that reason. Some of the data elements that ended  
up being very important for merging were surprising, such as  
pagination -- which, because librarians record the highest numbered  
page from the item in hand, turned out to be a fairly accurate piece  
of data.
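
To make the weighting idea concrete, here is a minimal sketch of a weighted field comparison. Only two things come from our experience above: authors get a low weight and pagination a high one. The actual numbers, the inclusion of title and date, the field names, and the normalization are invented for the example.

```python
# Hypothetical weights: author low, pagination high.
WEIGHTS = {"title": 4, "pagination": 3, "date": 2, "author": 1}


def normalize(value):
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in value.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def match_score(a, b, weights=WEIGHTS):
    """Weighted share of fields that agree, over fields present in both."""
    agreed = 0
    possible = 0
    for field, weight in weights.items():
        if field in a and field in b:
            possible += weight
            if normalize(a[field]) == normalize(b[field]):
                agreed += weight
    return agreed / possible if possible else 0.0


rec_a = {"title": "Moby Dick", "author": "Melville, Herman",
         "pagination": "634 p.", "date": "1851"}
rec_b = {"title": "Moby Dick", "author": "Melville, H.",
         "pagination": "634 p.", "date": "1851"}

score = match_score(rec_a, rec_b)
```

The two author forms disagree, but because author carries only 1 of the 10 weight points the pair still scores 0.9. A threshold for "same item" would have to be tuned against real data.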

kc

Quoting Ben O'Steen <bosteen at gmail.com>:


> I favour the route described by a Southampton
> research group for managing co-reference (in the semantic web) -
> http://eprints.ecs.soton.ac.uk/15245/ - essentially, every mention of an
> entity in a record is given a unique id, and the decision to say that
> one id is the same as another is recorded in a 'bundle', with
> appropriate metadata.
>
> I'd manufacture this unique id by taking the SHA256 hash of
> "{record-id}:{field}:{value}" - the record of merges can be simple:
>
> bundle:1  contains  1ef343.., 9ab20..,
>           createdby  <me>
>           heuristic  <URI to Exact text + weighted fields info>
>           label      "John Smith"
>
> bundle:2  contains   bundle:1, 833ef0.., etc
>           createdby  <me>
>           heuristic  <URI to levenshtein+weighted match code info>
>
> It is the bundles that are used in the frontend, with this data
> structure in the background, allowing us to unpick merges when they
> happen erroneously (as they will do)
>
>
> Ben
>
>
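
The bundle scheme in the quoted message could be sketched roughly as follows. The SHA256 of "{record-id}:{field}:{value}" and the contains/createdby/heuristic/label properties come from Ben's description; the class, its methods, the in-memory dict storage, and the `urn:example:` heuristic URIs are placeholders assumed for illustration.

```python
import hashlib


def mention_id(record_id, field, value):
    """SHA256 hex digest of "{record-id}:{field}:{value}"."""
    key = f"{record_id}:{field}:{value}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


class BundleStore:
    """Records merge decisions as bundles that can be unpicked later."""

    def __init__(self):
        self.bundles = {}
        self._counter = 0

    def create(self, members, created_by, heuristic, label=None):
        self._counter += 1
        bundle_id = f"bundle:{self._counter}"
        self.bundles[bundle_id] = {
            "contains": list(members),  # mention ids and/or bundle ids
            "createdby": created_by,
            "heuristic": heuristic,
            "label": label,
        }
        return bundle_id

    def resolve(self, bundle_id):
        """Flatten nested bundles into the set of mention ids they cover."""
        found = set()
        for member in self.bundles[bundle_id]["contains"]:
            if member in self.bundles:
                found |= self.resolve(member)
            else:
                found.add(member)
        return found

    def unpick(self, bundle_id):
        """Undo a merge by dropping its bundle; the mentions are untouched."""
        for other_id, other in self.bundles.items():
            if bundle_id in other["contains"]:
                raise ValueError(f"{bundle_id} is still used by {other_id}")
        return self.bundles.pop(bundle_id)


m1 = mention_id("rec-1", "author", "John Smith")
m2 = mention_id("rec-2", "author", "John Smith")
m3 = mention_id("rec-3", "author", "Jon Smith")

store = BundleStore()
b1 = store.create([m1, m2], "me", "urn:example:exact-text", label="John Smith")
b2 = store.create([b1, m3], "me", "urn:example:levenshtein")
```

If the fuzzy match in `b2` turns out to be wrong, `store.unpick(b2)` removes only that decision and leaves the exact-match bundle `b1` in place.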



-- 
Karen Coyle
kcoyle at kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet