[open-bibliography] Disambiguation, deduplication and 'ideals'

William Waites william.waites at okfn.org
Wed Sep 1 07:01:40 UTC 2010


 On 10-09-01 03:45, Karen Coyle wrote:
> Doesn't a lot of this depend on how you define "same"? [...]
> Hopefully, once you determine what you mean by "same" then you can
> determine what you want to apply OWL sameAs to. 

Yes. The meaning of owl:sameAs is well defined. It means,

    If x owl:sameAs y, then the following are true:
        * for every p,o such that the triple (x,p,o) exists, the
           triple (y,p,o) is implied
        * for every s,o such that the triple (s,x,o) exists, the
           triple (s,y,o) is implied
        * for every s,p such that the triple (s,p,x) exists, the
           triple (s,p,y) is implied
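
To make the three rules concrete, here is a rough Python sketch that
applies them to a set of plain (s, p, o) tuples until nothing new is
implied. The function name, the prefixed strings and the example data
are made up for illustration; it implements only the substitution
rules as stated above, not a full OWL reasoner.

```python
SAME_AS = "owl:sameAs"

def same_as_closure(triples):
    """Apply the three substitution rules until nothing new is implied."""
    known = set(triples)
    while True:
        implied = set()
        for (x, p, y) in known:
            if p != SAME_AS:
                continue
            # x owl:sameAs y: copy every triple mentioning x, with y in its place
            for (s, q, o) in known:
                if s == x:
                    implied.add((y, q, o))   # subject position
                if q == x:
                    implied.add((s, y, o))   # predicate position
                if o == x:
                    implied.add((s, q, y))   # object position
        if implied <= known:
            return known
        known |= implied

# Hypothetical example data
data = {
    ("ex:a", SAME_AS, "ex:b"),
    ("ex:a", "dc:title", "War and Peace"),
    ("ex:c", "dc:relation", "ex:a"),
}
closed = same_as_closure(data)
```

Every statement about ex:a is now also a statement about ex:b, with
no way to restrict which properties carry over. That is what makes
the predicate so strong.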

As you point out, a weaker sameAs degenerates into
"similar to", or more precisely "similar to in the relevant
respects". Evaluating relevance means taking into account
the *intent* of someone using the information and the
*context* of any query that might eventually be made over
the data. If anyone can come up with a tractable theory of
similarity and relevance that holds generally they deserve
at least a Nobel prize.

Ben agreed in a private mail to me (it may have been
intended for the list; there wasn't anything particularly
private in it) that owl:sameAs is probably too strong a
predicate for what he would like to accomplish.

Even if you define a weaker version of sameAs for the
intended use cases, call it similarTo, you still have to
figure out how to arrange the data so that interesting
properties, e.g. names and titles, get put in the right
place so that you can make unambiguous queries that
don't return duplicates. For example you might move
all name variants up to a PersonBundle and your
queries would always involve that and not individual
Person resources.

So what is needed to make this workable is a class, a
subclass of Bundle, for each type of thing that can be
deduplicated and a generic similarTo predicate that
points to the original resource, together with rules
specific to that type of thing that say which properties
get copied. (Alternatively, a generic Bundle and a
number of thing-specific similarTo variants, with a
corresponding set of rules.)
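
As a rough sketch of the first arrangement, again in Python over
plain (s, p, o) tuples: the bundle points at each original resource
with similarTo, and a per-type rule table says which properties get
copied up. COPY_RULES, build_bundle, the ex: identifiers and the
choice of foaf:name and dc:title are all assumptions for
illustration, not an existing vocabulary or API.

```python
# Hypothetical per-type rules: which properties are copied onto the bundle
COPY_RULES = {
    "PersonBundle": {"foaf:name"},
    "WorkBundle": {"dc:title"},
}

def build_bundle(bundle_id, bundle_type, members, triples):
    """Return the triples describing a bundle over the given members."""
    copied = COPY_RULES[bundle_type]
    out = {(bundle_id, "rdf:type", bundle_type)}
    for member in members:
        # the bundle points back at each original resource
        out.add((bundle_id, "similarTo", member))
        for (s, p, o) in triples:
            if s == member and p in copied:
                out.add((bundle_id, p, o))
    return out

# Hypothetical example: two Person records judged to be similar
triples = {
    ("ex:p1", "foaf:name", "Tolstoy, Leo"),
    ("ex:p2", "foaf:name", "Leo Tolstoy"),
    ("ex:p1", "ex:sourceRecord", "rec-1"),
}
bundle = build_bundle("ex:bundle1", "PersonBundle", ["ex:p1", "ex:p2"], triples)

# Queries for names go to the bundle, never to the individual Persons
names = sorted(o for (s, p, o) in bundle
               if s == "ex:bundle1" and p == "foaf:name")
```

Only the properties named in the rule for that bundle type are
copied; everything else stays on the original resources, so a query
against the bundle sees each name variant once.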

Cheers,
-w

-- 
William Waites           <william.waites at okfn.org>
Mob: +44 789 798 9965    Open Knowledge Foundation
Fax: +44 131 464 4948                Edinburgh, UK

RDF Indexing, Clustering and Inferencing in Python
		http://ordf.org/



