[open-bibliography] Disambiguation, deduplication and 'ideals'

Tue Aug 31 10:09:45 UTC 2010

Mutatis mutandis, the Europeana project (i.e. its conceptual model, EDM) uses the same idea: there
is an "ideal" (as you call it) for each object, and ore:Proxy resources (descriptions) associated
with it. Take a look at http://version1.europeana.eu/web/europeana-project/technicaldocuments/
(mainly the EDM primer).

Dan

----------------------------------------------------------------------------
Dan Matei, director
Institutul de Memorie Culturala [Institute for Cultural Memory] (CIMEC)
Piața Presei Libere nr. 1, CP 33-90
013701 București [Bucharest], Romania
Tel. (+4)21 317 90 72, Fax (+4)21 317 90 64
www.cimec.ro

> -----Original Message-----
> From: open-bibliography-bounces at lists.okfn.org 
> [mailto:open-bibliography-bounces at lists.okfn.org] On Behalf 
> Of Ben O'Steen
> Sent: 31 august 2010 12:55
> To: List for Working Group on Open Bibliographic Data
> Subject: [open-bibliography] Disambiguation, deduplication 
> and 'ideals'
> 
> In my work on meshing bibliographic datasets together, I've 
> been using a conceptual tool that I would like to hear views on.
> 
> I am creating nodes for the ideals of things on records - 
> whether that is for people, journals or even the 
> bibliographic document itself. The ideal represents the best 
> and most complete data for that thing - something we'll never 
> really achieve, but that's not the point. This ideal serves 
> as a node, a hook, on which we can join up records that 
> describe the same thing (person, FRBR manifestation, etc.) but 
> hold differing data for it.
> 
> It's easy to picture this for deduplicating, say, article 
> references. Consider two records, one from the RIS feed from 
> PubMed and one from a citation in a PLoS article. These are 
> found to be references to the same article but, as you can 
> expect, they differ, not just in terms of data but also in 
> terms of the source or author of that reference. 
> 
> The way I am tackling this is by creating a node for the 
> ideal bibliographic reference each aspires to. When dupes 
> are believed to be found, these ideal nodes are joined into a 
> bundle using sameas (in a different store), and this bundle 
> has some provenance triples recording the how, when and why 
> of the merge (using Open Provenance Model verbs/classes).
> 
> Eg:
> 
> :bibrec  ---> record node from pubmed
> 
> :citerec  ---> plos record
> 
> _i suffix ---> ideal node
> 
> - running analyser on record suggests two records are dupes, 
> with a certain confidence score from a certain weighted 
> matching (call this 'heur.v0.13')
> 
> Create ideal nodes Just In Time:
> 
> :bibrec hasIdeal :bibrec_i
> :citerec hasIdeal :citerec_i
> 
> Make the bundle:
> 
> :b1 a Bundle
>    sameas :bibrec_i
>    sameas :citerec_i
>    opmv:wasGeneratedBy :p1
>    created: 2010-08-......
> 
> :p1 a opmv:Process
>   opmv:controlledBy :Ben
>   opmv:used :bibrec
>   opmv:used :citerec
> 
> :confidence a ConfidenceReport
>   opmv:wasGeneratedBy :p1
>   hasReport <url of doc>  # for time being
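
The example above can be sketched in a few lines of Python. This is only an
illustration, not the code actually used: triples are plain tuples rather
than a real RDF store, predicate names are copied from the example with
their capitalisation normalised, and the function names, the default
timestamp and the example.org report URL are assumptions made for the sketch.

```python
from datetime import datetime, timezone


def ideal_of(record):
    """Name the ideal node for a record, using the _i suffix convention."""
    return record + "_i"


def make_bundle(bundle, process, report, records, agent, report_url, created=None):
    """Return the triples for one merge decision as a set of 3-tuples."""
    # Assumption: if no timestamp is supplied, use the current UTC time.
    created = created or datetime.now(timezone.utc).isoformat()
    triples = {
        (bundle, "a", "Bundle"),
        (bundle, "opmv:wasGeneratedBy", process),
        (bundle, "created", created),
        (process, "a", "opmv:Process"),
        (process, "opmv:controlledBy", agent),
        (report, "a", "ConfidenceReport"),
        (report, "opmv:wasGeneratedBy", process),
        (report, "hasReport", report_url),
    }
    for rec in records:
        # Ideal nodes are created just in time, one per record.
        triples.add((rec, "hasIdeal", ideal_of(rec)))
        # The bundle points at the ideals, never at the records themselves.
        triples.add((bundle, "sameas", ideal_of(rec)))
        # The process records which source records it looked at.
        triples.add((process, "opmv:used", rec))
    return triples


triples = make_bundle(
    bundle=":b1",
    process=":p1",
    report=":confidence",
    records=[":bibrec", ":citerec"],
    agent=":Ben",
    report_url="http://example.org/report",  # placeholder, not a real report
)
```

Keeping the sameas links on the bundle, not on the records, means a bad merge
can be undone by dropping the bundle without touching any source record.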
> 
> This structure lets me create an aggregated RDF dataset with 
> the best-guess ideal records at any one time. Also, bundles 
> can be merged later if required, creating a tree structure: 
> the top bundle instance and the 'leaf' records form a 
> congruent closure and are thus exportable as such, without the 
> admin structure triples necessary for ongoing maintenance. 
> The bundle notion comes from the excellent work by the team 
> at Southampton, including Hugh Glaser, Ian Millard et al. 
> (google for coreference on the semantic web).
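
One way to read the export step is sketched below in Python: walk each ideal
node up the bundle tree to its top bundle, then emit only the links from top
bundles to leaf records, leaving the provenance and intermediate bundle
triples behind. The tuple representation and function names are assumptions
for illustration, and the sketch assumes each node sits in at most one bundle
(a tree with no cycles).

```python
def top_bundle(node, parent):
    """Follow child -> containing-bundle links until no parent remains."""
    while node in parent:
        node = parent[node]
    return node


def export_closure(triples):
    """Collapse bundle trees to (top bundle, 'sameas', record) triples."""
    parent = {}           # ideal node or sub-bundle -> bundle containing it
    ideal_to_record = {}  # ideal node -> the record it was created for
    for s, p, o in triples:
        if p == "sameas":
            parent[o] = s
        elif p == "hasIdeal":
            ideal_to_record[o] = s
    # Ideals that were never bundled have nothing to export yet.
    return {
        (top_bundle(ideal, parent), "sameas", record)
        for ideal, record in ideal_to_record.items()
        if ideal in parent
    }


store = [
    (":bibrec", "hasIdeal", ":bibrec_i"),
    (":citerec", "hasIdeal", ":citerec_i"),
    (":otherrec", "hasIdeal", ":otherrec_i"),  # hypothetical third record
    (":b1", "sameas", ":bibrec_i"),
    (":b1", "sameas", ":citerec_i"),
    (":b2", "sameas", ":b1"),                  # later merge: b2 absorbs b1
    (":b2", "sameas", ":otherrec_i"),
    (":b1", "opmv:wasGeneratedBy", ":p1"),     # admin triple, not exported
]
exported = export_closure(store)
```

Here all three records end up attached to :b2, and neither :b1 nor the
provenance triple appears in the exported set.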
> 
> Using this technique for entities like people is actually 
> very similar. I use the words 'person' and 'persona' for 
> the ideal and the data in a record respectively. The persona 
> can have alternative spellings, time-dependent details 
> like a fleeting institutional affiliation, and so on. The 
> (difficult) trick is spotting when two personas refer to the 
> same person, but the process for merging is the same even if 
> the creation of an aggregated record for each is different. 
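
To make the "spotting" step concrete, here is a minimal Python sketch of a
weighted matcher that gives a confidence score for two personas. It is not
the heuristic referred to above: the field names, the weights and the use of
difflib string similarity are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical weights: a fleeting affiliation counts for little,
# a name or an email address for much more.
WEIGHTS = {"name": 0.6, "email": 0.3, "affiliation": 0.1}


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def confidence(p1, p2, weights=WEIGHTS):
    """Weighted mean similarity over the fields both personas have."""
    total = 0.0
    score = 0.0
    for field, weight in weights.items():
        if p1.get(field) and p2.get(field):
            total += weight
            score += weight * similarity(p1[field], p2[field])
    # With no field in common there is no evidence either way.
    return score / total if total else 0.0


# Made-up personas for illustration only.
persona_a = {"name": "J. Smith", "affiliation": "Example University"}
persona_b = {"name": "John Smith", "affiliation": "Example Institute"}
score = confidence(persona_a, persona_b)
```

A score above some chosen threshold would then trigger the same bundle
creation as for bibliographic records, with the score and the heuristic's
version recorded in the confidence report.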
> 
> Ben
> 
> (please forgive misspellings and the lack of url references, 
> but I am typing this from a waiting room)
> 
>