[open-bibliography] Disambiguation, deduplication and 'ideals'

Wed Sep 1 02:18:08 UTC 2010

This all sounds right to me in principle.
I cannot judge whether the RDF implementation is the best possible, but 
something like this seems like the way to go. My main concern is that
the conceptualization be simple enough that the entry barrier to providing
data in the desired format is not too high.
I'd like to hear more about your meshing tool. I've made several 
naive starts at this problem in simple cases, especially in aggregating personal
bibliographies (deduplicating bibitems) and in deduplicating lists of authors.
A simple tool and adequate data framework and UI for these tasks would be most welcome.
--Jim

----------------------------------------------
Jim Pitman
Director, Bibliographic Knowledge Network Project
http://www.bibkn.org/

Professor of Statistics and Mathematics
University of California
367 Evans Hall # 3860
Berkeley, CA 94720-3860

ph: 510-642-9970  fax: 510-642-7892
e-mail: pitman at stat.berkeley.edu
URL: http://www.stat.berkeley.edu/users/pitman
----------------------------------------------

"Ben O'Steen" <bosteen at gmail.com> wrote:

> In my work on meshing bibliographic datasets together, I've been using a
> conceptual tool that I would like to hear views on.
>
> I am creating nodes for the ideals of things on records - whether that is
> for people, journals or even the bibliographic document itself. The ideal
> represents the best and most complete data for that thing - something we'll
> never really achieve, but that's not the point. This ideal serves as a node,
> a hook, on which we can join up records which describe the same thing
> (person, FRBR manifestation, etc.) but which hold differing data for it.
>
> It's easy to consider it for 'deduplications' of, say, article references.
> Consider two records, one from the RIS feed from PubMed and one from a
> citation in a PLoS article. These are found to be references to the same
> article but, as you can expect, they differ, not just in terms of data but
> also in terms of the source or author of that reference.
>
> The way I am tackling this is by creating a node for the ideal bibliographic
> reference each aspires to. When dupes are believed to be found, these
> ideal nodes are joined into a bundle using sameas (in a different store), and
> this bundle has some provenance triples recording the how, when and why of
> this merging (using Open Provenance Model verbs/classes).
>
> Eg:
>
> :bibrec  ---> record node from pubmed
>
> :citerec  ---> plos record
>
> _i suffix ---> ideal node
>
> - running analyser on record suggests two records are dupes, with a certain
> confidence score from a certain weighted matching (call this 'heur.v0.13')
>
> Create ideal nodes Just In Time:
>
> :bibrec hasIdeal :bibrec_i
> :citerec hasIdeal :citerec_i
>
> Make the bundle:
>
> :b1 a Bundle
>    sameas :bibrec_i
>    sameas :citerec_i
>    opmv:wasGeneratedBy :p1
>    created: 2010-08-......
>
> :p1 a opmv:Process
>   opmv:controlledBy :Ben
>   opmv:used :bibrec
>   opmv:used :citerec
>
> :confidence a ConfidenceReport
>   opmv:wasGeneratedBy :p1
>   hasReport <url of doc>  # for time being
>
> This structure lets me create an aggregated RDF dataset with the best guess
> ideal records at any one time. Also, bundles can be merged later if required,
> creating a tree structure - the top bundle instance and the 'leaf' records
> form a congruent closure and are thus exportable as such without the admin
> structure triples necessary for ongoing maintenance. The bundle notion comes
> from the excellent work by the team at Southampton, including Hugh Glaser,
> Ian Millard et al. (google for coreference on the semantic web)
>
> Using this technique for entities like people is actually very similar. I
> use the words 'person' and 'persona' for the ideal and the data in a record
> respectively. The persona can have alternative spellings, and time-dependent
> details like a fleeting institutional affiliation, and so on. The
> (difficult) trick is spotting when two personas refer to the same person,
> but the process for merging is the same even if the creation of an
> aggregated record for each is different.
>
> Ben
>
> (please forgive misspellings and the lack of url references, but I am typing
> this from a waiting room)
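
The bundle scheme in the quoted message can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not Ben's
actual tool: the names BundleStore, ideal_for, merge and bundle_of are
hypothetical, ideal nodes are plain strings with the "_i" suffix, provenance
is kept as simple dictionaries rather than OPMV triples, and a union-find
forest stands in for the separate sameas store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Bundle:
    """Ideal nodes believed to denote the same thing, plus merge provenance."""
    ideals: set
    provenance: list = field(default_factory=list)


class BundleStore:
    """Hypothetical in-memory stand-in for the separate 'sameas' store."""

    def __init__(self):
        self._parent = {}   # ideal node -> parent ideal node (union-find forest)
        self._bundles = {}  # root ideal node -> Bundle

    def ideal_for(self, record):
        """Create the ideal node for a record just in time ('_i' suffix)."""
        ideal = record + "_i"
        if ideal not in self._parent:
            self._parent[ideal] = ideal
            self._bundles[ideal] = Bundle(ideals={ideal})
        return ideal

    def _find(self, ideal):
        """Return the root ideal node of the bundle containing `ideal`."""
        while self._parent[ideal] != ideal:
            # Path halving: point this node at its grandparent as we walk up.
            self._parent[ideal] = self._parent[self._parent[ideal]]
            ideal = self._parent[ideal]
        return ideal

    def merge(self, rec_a, rec_b, agent, method, confidence):
        """Join the bundles of two records and record who/how/when."""
        root_a = self._find(self.ideal_for(rec_a))
        root_b = self._find(self.ideal_for(rec_b))
        event = {
            "used": (rec_a, rec_b),
            "controlledBy": agent,
            "method": method,          # e.g. the matcher label 'heur.v0.13'
            "confidence": confidence,  # score reported by the matcher
            "created": datetime.now(timezone.utc).isoformat(),
        }
        if root_a == root_b:
            # Already bundled together: keep the extra evidence only.
            self._bundles[root_a].provenance.append(event)
            return self._bundles[root_a]
        self._parent[root_b] = root_a
        absorbed = self._bundles.pop(root_b)
        kept = self._bundles[root_a]
        kept.ideals |= absorbed.ideals
        kept.provenance += absorbed.provenance + [event]
        return kept

    def bundle_of(self, record):
        """Look up the current bundle for a record."""
        return self._bundles[self._find(self.ideal_for(record))]
```

Calling merge(":bibrec", ":citerec", agent=":Ben", method="heur.v0.13",
confidence=0.9) would place both ideal nodes in one bundle (the 0.9 is a
made-up score). A later merge with a third record folds the bundles together
and keeps the earlier provenance, which roughly mirrors the "bundles can be
merged later" remark. Exporting the aggregated dataset without the admin
triples would then amount to reading only the ideals of each bundle.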