[open-bibliography] Disambiguation, deduplication and 'ideals'

Ben O'Steen bosteen at gmail.com
Wed Sep 1 07:56:32 UTC 2010


Sorry for the direct reply to Will; it wasn't secret. I just hadn't
checked the address after I hit 'Reply-to' :)

Essentially, I said that I felt that the notion of 'sameas' worked fine
when 'merging' or connecting the ideals (or their subsequent bundles).

The reason I used the word Ideal is that it was meant as a mythical,
perfect metadata record from which all other record representations
might be derived. The desired implication is that all manifestations,
all records are incomplete representations of this data.

This is not to say that an aggregation of triples from all the sources
would get us closer to an Ideal - far from it. Any representation has a
bias, whether personal, conventional or otherwise. For example, the same
record might have representations in MARC, RIS, or any number of
RDF-modelled versions - it could use a DC/PRISM model or FRBR/RDA in RDF
or some other variant, heavy in literals or heavy in URIs.

In response to Karen's point about the definition of 'same', it is down
to the context and an acknowledgement that you can't please all of the
people, all of the time. I think the phrase "same, for the intents and
purposes of ...." helps point out what I am intending. 

There is nothing wrong with duplication. We deduplicate to make certain
tasks much easier, and it is those tasks that define the 'sameness' that
we should aim for.
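
As a rough illustration of "same, for the intents and purposes of ...",
here is a minimal Python sketch in which each task supplies its own
definition of sameness as a key function. The records, field names and
task names are all made up for the example; nothing here comes from a
real dataset or tool.

```python
# Hypothetical records; the field names are illustrative only.
records = [
    {"id": "a", "title": "Moby-Dick", "year": 1851},
    {"id": "b", "title": "Moby Dick", "year": 1851},
    {"id": "c", "title": "moby dick", "year": 1992},
]


def normalise(title):
    # Lower-case and keep only letters and digits, so that spacing and
    # punctuation variants of a title compare equal.
    return "".join(ch for ch in title.lower() if ch.isalnum())


# Each (hypothetical) task defines what 'same' means for its purposes.
SAMENESS = {
    "reading-list": lambda r: normalise(r["title"]),
    "citation": lambda r: (normalise(r["title"]), r["year"]),
}


def group_same(records, task):
    """Group record ids that count as 'the same' for the given task."""
    key = SAMENESS[task]
    groups = {}
    for r in records:
        groups.setdefault(key(r), []).append(r["id"])
    return sorted(groups.values())
```

With these records, a reading list would treat all three as one item,
while a citation check would keep the 1992 record apart from the other
two.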

This is why the technique of ideals and closures I described doesn't
affect or alter the source data and why nothing irreversible happens
without a very, very good reason. I am going to pair things up
erroneously, and I am not going to be able to deduplicate whole datasets
before they are used. What I can do is make unpicking and merging on the
fly as painless as possible, and account for the fact that records may
change afterwards; this technique is my attempt to do that.

The ideal records won't have any real metadata on them, save for admin
triples - they will point to aggregations or format variations of what
is my 'best guess' version of a record at the current time.
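
A minimal sketch of that idea in Python, assuming a simple in-memory
store. The class name IdealRegistry, its methods and the record URIs are
hypothetical and only illustrate the shape: an ideal holds pointers and
an admin log, never the source metadata, so a wrong pairing can be
undone without touching the sources.

```python
import itertools


class IdealRegistry:
    """Hypothetical store of 'ideal' placeholders that only point at sources."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.members = {}  # ideal id -> set of source record URIs
        self.log = []      # admin entries: (action, ideal id, record URI)

    def new_ideal(self, *record_uris):
        ideal = f"ideal:{next(self._ids)}"
        self.members[ideal] = set()
        for uri in record_uris:
            self.link(ideal, uri)
        return ideal

    def link(self, ideal, uri):
        # Record a 'best guess' that this source record represents the ideal.
        self.members[ideal].add(uri)
        self.log.append(("link", ideal, uri))

    def unlink(self, ideal, uri):
        # Undo a mistaken pairing; the source record itself is never modified.
        self.members[ideal].discard(uri)
        self.log.append(("unlink", ideal, uri))

    def merge(self, keep, absorb):
        # Fold one ideal into another by moving its pointers across.
        for uri in sorted(self.members.pop(absorb)):
            self.link(keep, uri)


# Usage: pair records up, merge two ideals, then unpick one bad pairing.
reg = IdealRegistry()
first = reg.new_ideal("marc:123", "ris:456")
second = reg.new_ideal("rdf:789")
reg.merge(first, second)
reg.unlink(first, "ris:456")
```

Because every link and unlink is logged, the current grouping can be
replayed or revised later if a source record changes.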

Ben

On Wed, 2010-09-01 at 08:01 +0100, William Waites wrote:
> On 10-09-01 03:45, Karen Coyle wrote:
> > Doesn't a lot of this depend on how you define "same"? [...]
> > Hopefully, once you determine what you mean by "same" then you can
> > determine what you want to apply OWL sameAs to. 
> 
> Yes. The meaning of owl:sameAs is well defined. It means,
> 
>     If x owl:sameAs y, then the following are true:
>         * for every p,o such that the triple (x,p,o) exists, the
>            triple (y,p,o) is implied
>         * for every s,o such that the triple (s,x,o) exists, the
>            triple (s,y,o) is implied
>         * for every s,p such that the triple (s,p,x) exists, the
>            triple (s,p,y) is implied
> 
> As you point out, having a weaker sameAs degenerates
> into "similar to", actually, "similar to in the relevant respects".
> Evaluating relevance means taking into account the *intent*
> of someone using the information, the *context* of any
> query that might eventually be made over the data. If
> anyone can come up with a tractable theory of similarity
> and relevance that holds generally they deserve at least a
> Nobel prize.
> 
> Ben agreed in a private mail to me (that may have been
> intended for the list, there wasn't anything particularly
> private in it) that owl:sameAs is probably too strong a
> predicate for what he would like to accomplish.
> 
> Even if you define a weaker version of sameAs for the
> intended use cases, call it similarTo, you still have to
> figure out how to arrange the data so that interesting
> properties, e.g. names and titles, get put in the right
> place so that you can make unambiguous queries that
> don't return duplicates. For example you might move
> all name variants up to a PersonBundle and your
> queries would always involve that and not individual
> Person resources.
> 
> So what is needed to make this workable is a class,
> subclass of Bundle, for each type of thing that can be
> deduplicated and a generic similarTo predicate that
> points to the original resource, together with rules
> specific to that type of thing that say which properties
> get copied. (Alternatively, a generic Bundle and a
> number of thing-specific similarTo variants, and a
> corresponding set of rules).
> 
> Cheers,
> -w
> 
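
To make the quoted owl:sameAs rule concrete, here is a minimal Python
sketch that applies the three substitutions (subject, predicate, object)
to a set of triples until nothing new is implied. The function name and
the example URIs are assumptions for illustration, and only the one
direction stated in the quoted rule is applied; this is not a reasoner.

```python
def expand_same_as(triples, x, y):
    """Return the triples plus every triple implied by replacing x with y."""
    implied = set(triples)
    frontier = list(implied)
    while frontier:
        s, p, o = frontier.pop()
        candidates = []
        if s == x:
            candidates.append((y, p, o))  # x in subject position
        if p == x:
            candidates.append((s, y, o))  # x in predicate position
        if o == x:
            candidates.append((s, p, y))  # x in object position
        for triple in candidates:
            if triple not in implied:
                implied.add(triple)
                frontier.append(triple)
    return implied


# Hypothetical data: two records asserted to be owl:sameAs each other.
data = {
    ("ex:rec1", "dc:title", "Moby Dick"),
    ("ex:rec1", "ex:sourceFormat", "MARC"),
    ("ex:list", "ex:includes", "ex:rec1"),
}
expanded = expand_same_as(data, "ex:rec1", "ex:rec2")
```

Note that ex:rec2 inherits every property of ex:rec1, including the
format-specific one, which is why the predicate is too strong for a
"similar in the relevant respects" link.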

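The Bundle arrangement described in the quoted message could look
roughly like the following Python sketch: a typed bundle, a generic
similarTo link to each original resource, and a per-type rule saying
which properties get copied up. The rule table, the ex:similarTo
predicate name and the sample data are all hypothetical.

```python
# Hypothetical rule table: which properties are copied up, per bundle type.
COPY_RULES = {
    "ex:PersonBundle": {"foaf:name"},
    "ex:WorkBundle": {"dc:title"},
}


def build_bundle(bundle_uri, bundle_type, member_uris, triples):
    """Return the triples describing a bundle over the given members."""
    copied = COPY_RULES[bundle_type]
    members = set(member_uris)
    out = {(bundle_uri, "rdf:type", bundle_type)}
    for member in members:
        # Weaker-than-sameAs link back to the untouched original resource.
        out.add((bundle_uri, "ex:similarTo", member))
    for s, p, o in triples:
        # Only the properties named in the rule are moved up to the bundle.
        if s in members and p in copied:
            out.add((bundle_uri, p, o))
    return out


source = {
    ("ex:p1", "foaf:name", "Herman Melville"),
    ("ex:p2", "foaf:name", "Melville, Herman"),
    ("ex:p2", "ex:recordSource", "catalogue B"),
}
bundle = build_bundle("ex:b1", "ex:PersonBundle", ["ex:p1", "ex:p2"], source)
```

Queries for a person's name variants would then go through ex:b1 rather
than the individual Person resources.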




