[open-bibliography] Deduplication
Karen Coyle
kcoyle at kcoyle.net
Mon Jun 21 14:02:57 UTC 2010
Ben, what you describe below brings up another question in the whole
"merging" issue: at what level do you merge (record or field) and what
do you do when two items are determined to be the same? There are a
lot of considerations here, such as how much duplicate data you can
keep around and what is your ability to resolve that duplication at
the time of display ... etc. etc.
In the original design for MELVYL, we identified the source of each
field in the MARC record, and kept all variant fields:
245 $a Moby Dick (UCLA, UCSD)
245 $a Moby Dick, or, The Whale (UCB)
245 $a Moby Dick or The Whale (UCSC)
One "source" was chosen for the record display, all were indexed.
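To make the field-level approach concrete, here is a minimal Python sketch. It is not the MELVYL code: the class and method names (MergedRecord, add, display_value, index_values) and the "245a" tag spelling are invented for illustration, and picking UCB as the display source is just an assumption for the example.

```python
from collections import defaultdict


class MergedRecord:
    """One merged record that keeps every variant field with its sources."""

    def __init__(self, display_source):
        # Hypothetical: the single source whose fields are shown to users.
        self.display_source = display_source
        # field tag -> variant value -> set of contributing sources
        self.fields = defaultdict(lambda: defaultdict(set))

    def add(self, source, tag, value):
        """Record that `source` contributed `value` for field `tag`."""
        self.fields[tag][value].add(source)

    def display_value(self, tag):
        """Return the variant supplied by the display source, if any."""
        for value, sources in self.fields.get(tag, {}).items():
            if self.display_source in sources:
                return value
        return None

    def index_values(self, tag):
        """Return every variant, since all of them are indexed."""
        return list(self.fields.get(tag, {}))


record = MergedRecord(display_source="UCB")
record.add("UCLA", "245a", "Moby Dick")
record.add("UCSD", "245a", "Moby Dick")
record.add("UCB", "245a", "Moby Dick, or, The Whale")
record.add("UCSC", "245a", "Moby Dick or The Whale")

print(record.display_value("245a"))
print(record.index_values("245a"))
```

Identical values from different sources collapse into one variant with several sources attached, which mirrors the "(UCLA, UCSD)" line above.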
In the Aleph design, all records were kept, with a "same" pointer
between them, but again, one was designated as the user display. Merging
and unmerging was a matter of changing the "set" that the record's
pointer belonged to.
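The record-level approach can be sketched the same way. Again, this is only an illustration under my own assumptions, not the Aleph implementation: RecordStore, its methods, the integer set ids, and the rule for picking a new display record when the old one leaves a set are all hypothetical.

```python
import itertools


class RecordStore:
    """Keeps every record intact; sameness is only a pointer to a set."""

    def __init__(self):
        self._next_set = itertools.count(1)
        self.set_of = {}   # record id -> set id (the "same" pointer)
        self.display = {}  # set id -> record id chosen for user display

    def add(self, record_id):
        """Add a record in a new set of its own and return the set id."""
        set_id = next(self._next_set)
        self.set_of[record_id] = set_id
        self.display[set_id] = record_id
        return set_id

    def members(self, set_id):
        return sorted(r for r, s in self.set_of.items() if s == set_id)

    def _move(self, record_id, new_set):
        old_set = self.set_of[record_id]
        self.set_of[record_id] = new_set
        remaining = self.members(old_set)
        if not remaining:
            # The old set is now empty, so it no longer needs a display record.
            self.display.pop(old_set, None)
        elif self.display.get(old_set) == record_id:
            # Assumption: fall back to the first remaining member.
            self.display[old_set] = remaining[0]

    def merge(self, record_id, target_set):
        """Merging only changes which set the record points to."""
        self._move(record_id, target_set)

    def unmerge(self, record_id):
        """Unmerging points the record at a fresh set of its own."""
        new_set = next(self._next_set)
        self._move(record_id, new_set)
        self.display[new_set] = record_id
        return new_set


store = RecordStore()
set_a = store.add("ucla-1")
set_b = store.add("ucb-7")
store.merge("ucb-7", set_a)
merged_members = store.members(set_a)
set_c = store.unmerge("ucb-7")
```

Because no record data is rewritten, an erroneous merge costs nothing to undo.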
In both of these, we wanted to avoid displaying variant data, which
would just confuse users. (Also, we were going for a traditional
library metadata display.) So you do have to consider how you will
reconstruct your data for display.
On another note, it has been said that there are lots of variants
among authors. We actually weighted authors quite low in our
implementation for that reason. Some of the data elements that ended
up being very important for merging were surprising, such as
pagination -- which, because librarians record the highest numbered
page from the item in hand, turned out to be a fairly accurate piece
of data.
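A weighted comparison along these lines might look like the sketch below. The numbers are purely illustrative: I have only said that authors were weighted low and that pagination turned out to matter, so the actual weights, the inclusion of title and date, and the names WEIGHTS, normalize and match_score are assumptions made for the example.

```python
# Illustrative integer weights only; the real values are not given here.
WEIGHTS = {"title": 4, "pagination": 3, "date": 2, "author": 1}


def normalize(value):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(str(value).lower().split())


def match_score(a, b, weights=WEIGHTS):
    """Return the share of available weight on which two records agree."""
    agreed = 0
    available = 0
    for field, weight in weights.items():
        if field not in a or field not in b:
            continue  # a field missing from either record is not counted
        available += weight
        if normalize(a[field]) == normalize(b[field]):
            agreed += weight
    return agreed / available if available else 0.0


rec1 = {"title": "Moby Dick", "author": "Melville, Herman",
        "pagination": "634", "date": "1851"}
rec2 = {"title": "Moby  dick", "author": "Herman Melville",
        "pagination": "634", "date": "1851"}
rec3 = {"title": "Moby Dick", "author": "Melville, Herman",
        "pagination": "312", "date": "1851"}

same_book = match_score(rec1, rec2)       # authors differ, the rest agrees
other_edition = match_score(rec1, rec3)   # only pagination differs
```

With these made-up weights, a variant author string costs little, while a pagination mismatch pulls the score down sharply. The record values here are sample data, not taken from any catalog.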
kc
Quoting Ben O'Steen <bosteen at gmail.com>:
> I favour the route described by a Southampton
> research group for managing co-reference (in the semantic web) -
> http://eprints.ecs.soton.ac.uk/15245/ - essentially, every mention of an
> entity in a record is given a unique id, and the decision to say that
> one id is the same as another is recorded in a 'bundle', with
> appropriate metadata.
>
> I'd manufacture this unique id by taking the SHA256 hash of
> "{record-id}:{field}:{value}" - the record of merges can be simple:
>
> bundle:1 contains 1ef343.., 9ab20..,
> createdby <me>
> heuristic <URI to Exact text + weighted fields info>
> label "John Smith"
>
> bundle:2 contains bundle:1, 833ef0.., etc
> createdby <me>
> heuristic <URI to levenshtein+weighted match code info>
>
> It is the bundles that are used in the frontend, with this data
> structure in the background, allowing us to unpick merges when they
> happen erroneously (as they will do)
>
>
> Ben
>
>
> _______________________________________________
> open-bibliography mailing list
> open-bibliography at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-bibliography
>
--
Karen Coyle
kcoyle at kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet