[open-bibliography] Deduplication

Ben O'Steen bosteen at gmail.com
Mon Jun 21 14:36:32 UTC 2010


True, I had mentally skipped ahead to the field-merging/URI+RDFising
aspect which might follow on from record-deduplication :)

Variants are awkward in displays - people expect to see the variants
they already know, not every possible one you have found for each
record field.

Something to consider is the case where a search hit is made on a
variant, but not on the 'preferred' label. People want to see why a
match was made, so this might be a point where showing more than one
value for a field would benefit the user?
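To make that concrete, here is a rough sketch of the idea (the record data, preferred labels, and function name are all made up for illustration): index every variant of a field, display the preferred one, but carry along the variant that actually produced the hit.

```python
# Rough sketch: index all variants of a field, display the preferred
# label, and report the variant that actually produced the search hit.
RECORDS = {
    "rec1": {"title": ["Moby Dick", "Moby Dick, or, The Whale"]},
}
PREFERRED_TITLE = {"rec1": "Moby Dick"}

def search_titles(query):
    """Return (record id, preferred label, matched variant) per hit.

    matched variant is None when the hit was on the preferred label."""
    hits = []
    for rid, fields in RECORDS.items():
        for variant in fields["title"]:
            if query.lower() in variant.lower():
                matched = None if variant == PREFERRED_TITLE[rid] else variant
                hits.append((rid, PREFERRED_TITLE[rid], matched))
                break  # one hit per record is enough here
    return hits
```

A hit on "whale" then comes back with the matching variant attached, so the display can explain the match even though "Moby Dick" remains the preferred label.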

Ben

On Mon, 2010-06-21 at 07:02 -0700, Karen Coyle wrote:
> Ben, what you describe below brings up another question in the whole  
> "merging" issue: at what level do you merge (record or field) and what  
> do you do when two items are determined to be the same? There are a  
> lot of considerations here, such as how much duplicate data you can  
> keep around and what is your ability to resolve that duplication at  
> the time of display ... etc. etc.
> 
> In the original design for MELVYL, we identified the source of each  
> field in the MARC record, and kept all variant fields:
> 
> 245 $a Moby Dick (UCLA, UCSD)
> 245 $a Moby Dick, or, The Whale (UCB)
> 245 $a Moby Dick or The Whale (UCSC)
> 
> One "source" was chosen for the record display, all were indexed.
> 
> In the Aleph design, all records were kept, with a "same" pointer  
> between them, but again, one was designated as the user display. Merging  
> and unmerging was a matter of changing the "set" that the record's  
> pointer belonged to.
> 
> In both of these, we wanted to avoid displaying variant data, which  
> would just confuse users. (Also we were going for a traditional  
> library metadata display). So you do have to consider how you will  
> reconstruct your data for display.
> 
> On another note, it has been said that there are lots of variants  
> among authors. We actually weighted authors quite low in our  
> implementation for that reason. Some of the data elements that ended  
> up being very important for merging were surprising, such as  
> pagination -- which, because librarians record the highest numbered  
> page from the item in hand, turned out to be a fairly accurate piece  
> of data.
> 
> kc
> 
> Quoting Ben O'Steen <bosteen at gmail.com>:
> 
> 
> > I favour the route described by a Southampton
> > research group for managing co-reference (in the semantic web) -
> > http://eprints.ecs.soton.ac.uk/15245/ - essentially, every mention of an
> > entity in a record is given a unique id, and the decision to say that
> > one id is the same as another is recorded in a 'bundle', with
> > appropriate metadata.
> >
> > I'd manufacture this unique id by taking the SHA256 hash of
> > "{record-id}:{field}:{value}" - the record of merges can be simple:
> >
> > bundle:1  contains  1ef343.., 9ab20..,
> >           createdby  <me>
> >           heuristic  <URI to Exact text + weighted fields info>
> >           label      "John Smith"
> >
> > bundle:2  contains   bundle:1, 833ef0.., etc
> >           createdby  <me>
> >           heuristic  <URI to levenshtein+weighted match code info>
> >
> > It is the bundles that are used in the frontend, with this data
> > structure in the background, allowing us to unpick merges when they
> > happen erroneously (as they will do).
> >
> >
> > Ben
> >
> >
> >
> 
> 
> 
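For concreteness, the id-minting and bundle structure described in my quoted message above might look something like this (the record ids, field values, and heuristic labels are made up for illustration):

```python
import hashlib

def mention_id(record_id, field, value):
    # Unique id for one mention of an entity in a record:
    # the SHA-256 hash of "{record-id}:{field}:{value}".
    return hashlib.sha256(
        "{0}:{1}:{2}".format(record_id, field, value).encode("utf-8")
    ).hexdigest()

# Each merge decision is recorded as a 'bundle' with provenance
# metadata, so an erroneous merge can later be unpicked by deleting
# the bundle rather than rewriting the records themselves.
bundles = {
    "bundle:1": {
        "contains": [
            mention_id("rec1", "author", "John Smith"),
            mention_id("rec2", "author", "Smith, John"),
        ],
        "createdby": "me",
        "heuristic": "exact-text+weighted-fields",
        "label": "John Smith",
    },
    "bundle:2": {
        # bundles can nest: a later, fuzzier merge wraps an earlier one
        "contains": ["bundle:1", mention_id("rec3", "author", "J. Smith")],
        "createdby": "me",
        "heuristic": "levenshtein+weighted-fields",
    },
}
```

The frontend works only with the bundles; the mention hashes never change, so unpicking bundle:2 leaves bundle:1 and the underlying mentions intact.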

More information about the open-bibliography mailing list