[open-bibliography] Disambiguation, deduplication and 'ideals'

Ben O'Steen bosteen at gmail.com
Tue Aug 31 09:54:32 UTC 2010


In my work on meshing bibliographic datasets together, I've been using a
conceptual tool that I would like to hear views on.

I am creating nodes for the ideals of the things described on records,
whether those are people, journals or even the bibliographic document
itself. The ideal represents the best and most complete data for that thing
- something we'll never really achieve, but that's not the point. This ideal
serves as a node, a hook, on which we can join up records which describe the
same thing (person, FRBR manifestation, etc) but which hold differing data
about it.

It's easiest to consider this for 'deduplication' of, say, article
references. Consider two records, one from the RIS feed from PubMed and one
from a citation in a PLoS article. These are found to be references to the
same article but, as you would expect, they differ, not just in terms of
data but also in terms of the source or author of that reference.

The way I am tackling this is by creating a node for the ideal bibliographic
reference each record aspires to. When dupes are believed to be found, these
ideal nodes are joined into a bundle using sameas (in a different store),
and this bundle has some provenance triples recording the how, when and why
of this merging (using Open Provenance Model verbs/classes).

Eg:

:bibrec  ---> record node from pubmed

:citerec  ---> plos record

_i suffix ---> ideal node

- running the analyser on the records suggests the two are dupes, with a
certain confidence score from a certain weighted matching (call this
'heur.v0.13')
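
As a rough illustration of what a weighted matching with a confidence score
can look like (this is not heur.v0.13 itself; the fields, the weights, the
two sample records and the function names are all assumptions made up for
the sketch):

```python
from difflib import SequenceMatcher

# Hypothetical field weights; the fields and weights used by the real
# heuristic are not given in this message.
WEIGHTS = {"title": 0.6, "authors": 0.3, "year": 0.1}


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values, ignoring case."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def match_confidence(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of per-field similarities between two records."""
    total = sum(weights.values())
    score = sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in weights.items()
    )
    return score / total


# Made-up records standing in for :bibrec and :citerec.
bibrec = {"title": "An example article title",
          "authors": "Smith J; Jones A", "year": "2009"}
citerec = {"title": "An Example Article Title.",
           "authors": "J. Smith, A. Jones", "year": "2009"}

confidence = match_confidence(bibrec, citerec)
```

The score would then be compared against whatever threshold is judged good
enough to propose a bundle.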

Create ideal nodes Just In Time:

:bibrec hasIdeal :bibrec_i
:citerec hasIdeal :citerec_i

Make the bundle:

:b1 a Bundle
   sameas :bibrec_i
   sameas :citerec_i
   opmv:wasGeneratedBy :p1
   created: 2010-08-......

:p1 a opmv:Process
  opmv:controlledBy :Ben
  opmv:used :bibrec
  opmv:used :citerec

:confidence a ConfidenceReport
  opmv:wasGeneratedBy :p1
  hasReport <url of doc>  # for time being
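
The steps above can be sketched in a few lines of Python. This is only a
minimal sketch: triples are plain tuples rather than anything in a real
store, make_bundle is a made-up helper name, and the creation date defaults
to the current time because the example above does not give one.

```python
from datetime import datetime, timezone


def make_bundle(bundle_id, process_id, records, agent, created=None):
    """Return (subject, predicate, object) triples for one bundle."""
    created = created or datetime.now(timezone.utc).isoformat()
    triples = []

    # Create the ideal nodes just in time, one per record.
    ideals = []
    for rec in records:
        ideal = rec + "_i"
        ideals.append(ideal)
        triples.append((rec, "hasIdeal", ideal))

    # The bundle joins the ideal nodes together.
    triples.append((bundle_id, "a", "Bundle"))
    for ideal in ideals:
        triples.append((bundle_id, "sameas", ideal))
    triples.append((bundle_id, "opmv:wasGeneratedBy", process_id))
    triples.append((bundle_id, "created", created))

    # Provenance: who ran the merge, and on which records.
    triples.append((process_id, "a", "opmv:Process"))
    triples.append((process_id, "opmv:controlledBy", agent))
    for rec in records:
        triples.append((process_id, "opmv:used", rec))
    return triples


triples = make_bundle(":b1", ":p1", [":bibrec", ":citerec"], ":Ben")
```

The confidence report is left out here; it would be a couple more triples
hanging off the same process node.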

This structure lets me create an aggregated RDF dataset with the best-guess
ideal records at any one time. Also, bundles can be merged later if
required, creating a tree structure - the top bundle instance and the 'leaf'
records form a congruent closure and are thus exportable as such without the
admin structure triples necessary for ongoing maintenance. The bundle notion
comes from the excellent work by the team at Southampton, including Hugh
Glaser, Ian Millard et al (google for coreference on the semantic web).
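
To make the tree idea concrete, here is a small sketch of merging bundles
and exporting only the top bundle with its leaf records. The BundleTree
class, the extra bundles :b2 and :b3 and the record :otherrec are all
invented for the illustration; this is not the Southampton implementation.

```python
class BundleTree:
    """Bundles that can be merged under new parent bundles."""

    def __init__(self):
        self.parent = {}   # bundle -> parent bundle (absent for a top bundle)
        self.members = {}  # bundle -> leaf records placed directly in it

    def add_bundle(self, bundle, records):
        self.members[bundle] = set(records)

    def merge(self, new_bundle, children):
        """Create new_bundle as the parent of existing bundles."""
        self.members.setdefault(new_bundle, set())
        for child in children:
            self.parent[child] = new_bundle

    def top(self, bundle):
        """Walk up the tree to the top bundle."""
        while bundle in self.parent:
            bundle = self.parent[bundle]
        return bundle

    def export(self):
        """Map each top bundle to all of its leaf records, dropping the
        intermediate bundles used for maintenance."""
        closure = {}
        for bundle, records in self.members.items():
            closure.setdefault(self.top(bundle), set()).update(records)
        return closure


tree = BundleTree()
tree.add_bundle(":b1", [":bibrec", ":citerec"])
tree.add_bundle(":b2", [":otherrec"])
tree.merge(":b3", [":b1", ":b2"])
closure = tree.export()
```

After the merge, the export holds a single entry for :b3 containing all
three records, with :b1 and :b2 no longer visible.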

Using this technique for entities like people is actually very similar, if I
use the words 'person' and 'persona' for the ideal and the data in a record
respectively. The persona can have alternative spellings, time-dependent
details like a fleeting institutional affiliation, and so on. The
(difficult) trick is spotting when two personas refer to the same person,
but the process for merging is the same even if the creation of an
aggregated record for each is different.
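
One possible shape for the aggregated person record, assuming the personas
in a bundle are simple dicts (the field names, the sample names and the
aggregate_person helper are made up for this sketch):

```python
def aggregate_person(personas):
    """Fold the personas in one bundle into a best-guess person record."""
    person = {"names": [], "affiliations": []}
    for persona in personas:
        name = persona.get("name")
        if name and name not in person["names"]:
            person["names"].append(name)  # keep every spelling seen
        affiliation = persona.get("affiliation")
        if affiliation:
            # Affiliations are time-dependent, so keep the source record
            # alongside each one rather than picking a single winner.
            entry = (affiliation, persona["source"])
            if entry not in person["affiliations"]:
                person["affiliations"].append(entry)
    return person


personas = [
    {"source": ":bibrec", "name": "Smith, J.",
     "affiliation": "Example University"},
    {"source": ":citerec", "name": "Jane Smith"},
]
person = aggregate_person(personas)
```

Nothing is thrown away: alternative spellings sit side by side, and each
affiliation stays tied to the record it came from.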

Ben

(please forgive misspellings and the lack of url references, but I am typing
this from a waiting room)