[open-bibliography] Deduplication
Ben O'Steen
bosteen at gmail.com
Mon Jun 21 14:36:32 UTC 2010
True, I had mentally skipped ahead to the field-merging/URI+RDFising
aspect which might follow on from record-deduplication :)
Variants are awkward in displays - people want to see the variants they
expect, not every possible one you have found for each record
field.
Something to consider is when a search hit is made on a variant, but not
on the 'preferred' label. People want to see why a match was made, so
this might be a point where showing more than one value for a field
might benefit the user?
Ben
On Mon, 2010-06-21 at 07:02 -0700, Karen Coyle wrote:
> Ben, what you describe below brings up another question in the whole
> "merging" issue: at what level do you merge (record or field) and what
> do you do when two items are determined to be the same? There are a
> lot of considerations here, such as how much duplicate data you can
> keep around and what is your ability to resolve that duplication at
> the time of display ... etc. etc.
>
> In the original design for MELVYL, we identified the source of each
> field in the MARC record, and kept all variant fields:
>
> 245 $a Moby Dick (UCLA, UCSD)
> 245 $a Moby Dick, or, The Whale (UCB)
> 245 $a Moby Dick or The Whale (UCSC)
>
> One "source" was chosen for the record display, all were indexed.
>
> In the Aleph design, all records were kept, with a "same" pointer
> between them, but again, one was designated as the user display. Merging
> and unmerging were a matter of changing the "set" that the record's
> pointer belonged to.
>
> In both of these, we wanted to avoid displaying variant data, which
> would just confuse users. (Also we were going for a traditional
> library metadata display). So you do have to consider how you will
> reconstruct your data for display.
>
> On another note, it has been said that there are lots of variants
> among authors. We actually weighted authors quite low in our
> implementation for that reason. Some of the data elements that ended
> up being very important for merging were surprising, such as
> pagination -- which, because librarians record the highest numbered
> page from the item in hand, turned out to be a fairly accurate piece
> of data.
>
> kc
>
> Quoting Ben O'Steen <bosteen at gmail.com>:
>
> > I favour the route described by a Southampton
> > research group for managing co-reference (in the semantic web) -
> > http://eprints.ecs.soton.ac.uk/15245/ - essentially, every mention of an
> > entity in a record is given a unique id, and the decision to say that
> > one id is the same as another is recorded in a 'bundle', with
> > appropriate metadata.
> >
> > I'd manufacture this unique id by taking the SHA256 hash of
> > "{record-id}:{field}:{value}" - the record of merges can be simple:
> >
> > bundle:1 contains 1ef343.., 9ab20..,
> > createdby <me>
> > heuristic <URI to Exact text + weighted fields info>
> > label "John Smith"
> >
> > bundle:2 contains bundle:1, 833ef0.., etc
> > createdby <me>
> > heuristic <URI to levenshtein+weighted match code info>
> >
> > It is the bundles that are used in the frontend, with this data
> > structure in the background, allowing us to unpick merges when they
> > happen erroneously (as they will do).
> >
> >
> > Ben
> >
> >
> > _______________________________________________
> > open-bibliography mailing list
> > open-bibliography at lists.okfn.org
> > http://lists.okfn.org/mailman/listinfo/open-bibliography
> >
>
>
>