[open-bibliography] Disambiguation, deduplication and 'ideals'
Dan Matei
dan at cimec.ro
Tue Aug 31 10:09:45 UTC 2010
Mutatis mutandis, the Europeana project (i.e. its conceptual model, EDM) uses the same idea: there
are "ideals" (as you call them) for each object, and ore:Proxy resources (descriptions) associated with it.
Take a look at http://version1.europeana.eu/web/europeana-project/technicaldocuments/ (mainly the EDM
primer).
Dan
----------------------------------------------------------------------------
Dan Matei, director
Institutul de Memorie Culturala [Institute for Cultural Memory] (CIMEC)
Piața Presei Libere nr. 1, CP 33-90
013701 București [Bucharest], Romania
Tel. (+4)21 317 90 72, Fax (+4)21 317 90 64
www.cimec.ro
> -----Original Message-----
> From: open-bibliography-bounces at lists.okfn.org
> [mailto:open-bibliography-bounces at lists.okfn.org] On Behalf
> Of Ben O'Steen
> Sent: 31 august 2010 12:55
> To: List for Working Group on Open Bibliographic Data
> Subject: [open-bibliography] Disambiguation, deduplication
> and 'ideals'
>
> In my work on meshing bibliographic datasets together, I've
> been using a conceptual tool that I would like to hear views on.
>
> I am creating nodes for the ideals of things on records -
> whether that is for people, journals or even the
> bibliographic document itself. The ideal represents the best
> and most complete data for that thing - something we'll never
> really achieve, but that's not the point. This ideal serves
> as a node, a hook, on which we can join up records which
> describe the same thing (person, FRBR manifestation, etc.) but
> which hold differing data for it.
>
> It's easy to consider it for 'deduplication' of, say, article
> references. Consider two records, one from the RIS feed from
> PubMed and one from a citation in a PLoS article. These are
> found to be references to the same article but, as you can
> expect, they differ, not just in terms of data but also in
> terms of the source or author of that reference.
>
> The way I am tackling this is by creating a node for the
> ideal bibliographic reference each aspires to and when dupes
> are believed to be found, these ideal nodes are joined into a
> bundle using sameas (in a different store) and this bundle
> has some provenance triples recording the how, when and why
> of this merging (using Open Provenance Model verbs/classes).
>
> Eg:
>
> :bibrec ---> record node from PubMed
>
> :citerec ---> PLoS record
>
> _i suffix ---> ideal node
>
> - running the analyser on the records suggests the two are dupes,
> with a certain confidence score from a certain weighted
> matching (call this 'heur.v0.13')
>
> Create ideal nodes Just In Time:
>
> :bibrec hasIdeal :bibrec_i
> :citerec hasIdeal :citerec_i
>
> Make the bundle:
>
> :b1 a Bundle
> sameas :bibrec_i
> sameas :citerec_i
> opmv:wasGeneratedBy :p1
> created: 2010-08-......
>
> :p1 a opmv:Process
> opmv:controlledBy :Ben
> opmv:used :bibrec
> opmv:used :citerec
>
> :confidence a ConfidenceReport
> opmv:wasGeneratedBy :p1
> hasReport <url of doc> # for time being
>
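A minimal sketch of the structure in the example above, written as plain Python tuples rather than against a real triple store. The helper names (ideal_of, make_bundle) are illustrative assumptions, the timestamp is a placeholder, and the predicate names are copied from the message as written; they have not been checked against the OPMV vocabulary.

```python
def ideal_of(record):
    """Ideal-node name for a record, following the '_i' suffix convention."""
    return record + "_i"


def make_bundle(bundle, process, agent, records, created):
    """Build (subject, predicate, object) triples for one bundle.

    Hypothetical helper: it mirrors the layout of the example, not a real API.
    """
    triples = []
    # Create the ideal nodes just in time, one per record.
    for rec in records:
        triples.append((rec, "hasIdeal", ideal_of(rec)))
    # The bundle ties the ideals together and points at its provenance.
    triples.append((bundle, "a", "Bundle"))
    for rec in records:
        triples.append((bundle, "sameas", ideal_of(rec)))
    triples.append((bundle, "opmv:wasGeneratedBy", process))
    triples.append((bundle, "created", created))
    # The process records who controlled the merge and which records it used.
    triples.append((process, "a", "opmv:Process"))
    triples.append((process, "opmv:controlledBy", agent))
    for rec in records:
        triples.append((process, "opmv:used", rec))
    return triples


# "<timestamp>" is a placeholder; the original example elides the date.
triples = make_bundle(":b1", ":p1", ":Ben", [":bibrec", ":citerec"], "<timestamp>")
for t in triples:
    print(*t)
```

Keeping the record-to-ideal links separate from the bundle triples is what allows the bundle and its provenance to live in a different store from the records themselves.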
> This structure lets me create an aggregated RDF dataset with
> the best guess ideal records at any one time. Also, bundles
> can be merged later if required, creating a tree structure -
> the top bundle instance and the 'leaf' records form a
> congruent closure and are thus exportable as such without the
> admin structure triples necessary for ongoing maintenance.
> The bundle notion comes from the excellent work by the team
> at Southampton, including Hugh Glaser, Ian Millard et al.
> (google for coreference on the semantic web)
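One way to picture the tree of merged bundles and the export of the top bundle with its leaves is sketched below. This is an assumption about how such a tree could be held in memory, not the structure actually used; the Bundle class, export_closure, and the names :b2, :b3 and :otherrec_i are hypothetical.

```python
class Bundle:
    """A bundle whose members are ideal-node names or nested bundles."""

    def __init__(self, name, members):
        self.name = name
        self.members = list(members)

    def leaves(self):
        """Collect the ideal nodes at the bottom of the tree, in order."""
        found = []
        for member in self.members:
            if isinstance(member, Bundle):
                found.extend(member.leaves())
            else:
                found.append(member)
        return found


def export_closure(top):
    """Link the top bundle directly to every leaf, skipping the
    intermediate bundles that only matter for maintenance."""
    return [(top.name, "sameas", leaf) for leaf in top.leaves()]


b1 = Bundle(":b1", [":bibrec_i", ":citerec_i"])
b2 = Bundle(":b2", [":otherrec_i"])   # hypothetical third record
top = Bundle(":b3", [b1, b2])         # later merge of the two bundles

closure = export_closure(top)
for triple in closure:
    print(*triple)
```

The intermediate bundles :b1 and :b2 stay in the tree for bookkeeping, but they do not appear in the exported triples.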
>
> Using this technique for entities like people is actually
> very similar, if I use the words 'person' and 'persona' for
> the ideal and the data in a record respectively. The persona
> can have alternative spellings, and time-dependent details
> like a fleeting institutional affiliation, and so on. The
> (difficult) trick is spotting when two personas refer to the
> same person, but the process for merging is the same even if
> the creation of an aggregated record for each is different.
>
> Ben
>
> (please forgive misspellings and the lack of url references,
> but I am typing this from a waiting room)
>
>