[openbiblio-dev] Thoughts on Deduping and Congruence Closure

William Waites william.waites at okfn.org
Sat Aug 7 18:02:38 UTC 2010


Deduping involves handling owl:sameAs. If anyone has thoughts
or suggestions I'd like to hear them; this is a murky area not
handled well by any current tools.

What I propose to do in the ORDF back end (possibly behind a
configuration setting, in case there are unforeseen implications)
is the following.

Suppose two duplicate graphs are identified, g1 and g2. Pick
one of them arbitrarily, say g1, to be the canonical home for the
data. Migrate all triples from g2 to g1, leaving g2 empty, then
insert a single /g2 owl:sameAs g1/ triple into g2.
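
Roughly, in plain rdflib terms (just a sketch; the function and
variable names are invented for illustration, not the ORDF API):

    from rdflib import ConjunctiveGraph, Namespace, URIRef

    OWL = Namespace("http://www.w3.org/2002/07/owl#")

    def merge_graphs(store, g1_uri, g2_uri):
        # store is assumed to be a ConjunctiveGraph (or anything with a
        # compatible get_context()). Move every triple from g2 into g1,
        # then leave a single owl:sameAs stub in g2 pointing at g1.
        g1 = store.get_context(URIRef(g1_uri))
        g2 = store.get_context(URIRef(g2_uri))
        for triple in list(g2):
            g1.add(triple)
            g2.remove(triple)
        g2.add((URIRef(g2_uri), OWL.sameAs, URIRef(g1_uri)))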

When a get() is made for g2, the store will now check whether g2
contains only a single triple and that triple is an /owl:sameAs/;
if so, it will transparently return g1.
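
Continuing with the same store and OWL namespace as in the sketch
above, the check might look something like:

    def get(store, uri):
        # Return the graph named by uri, unless it holds nothing but a
        # single owl:sameAs triple, in which case transparently return
        # the graph it points at.
        g = store.get_context(URIRef(uri))
        triples = list(g)
        if len(triples) == 1:
            s, p, o = triples[0]
            if s == URIRef(uri) and p == OWL.sameAs:
                return store.get_context(o)
        return g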

If a further operation tries to add triples to g2, the store will
see the /owl:sameAs/ pointing at g1 already in place and migrate
the new triples there instead.
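
In other words an add on g2 effectively becomes an add on g1,
something like (again hypothetical names, reusing get() above):

    def add_triples(store, uri, triples):
        # get() hands back g1 when uri names a sameAs stub, so the new
        # triples land in the canonical graph.
        target = get(store, uri)
        for t in triples:
            target.add(t)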

If a third duplicate, g3, is later identified, we need to ensure
that its /owl:sameAs/ points at g1 and not at g2, so that reads never
have to traverse the moral equivalent of a chain of symlinks...
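
One way to guarantee that is to resolve any existing chain of stubs
before writing the new one. A hypothetical helper:

    def resolve(store, uri):
        # Follow sameAs stubs until we reach a graph that actually holds
        # data; its URI is the canonical one to point new duplicates at.
        # The seen set guards against accidental cycles.
        current, seen = URIRef(uri), set()
        while current not in seen:
            seen.add(current)
            triples = list(store.get_context(current))
            if len(triples) == 1 and triples[0][1] == OWL.sameAs:
                current = triples[0][2]
            else:
                break
        return current

So when g3 is matched against g2 we would do something like
merge_graphs(store, resolve(store, g2), g3), which ends up pointing
g3 directly at g1.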

I'm told this concept is called Congruence Closure, where given
a set of things that are the same, one of them is chosen to be
the real one and all others become pointers to it.
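
For what it's worth, the representative-plus-pointers part is
exactly what a union-find structure gives you; the in-memory
version with path compression looks like this (illustrative only):

    parent = {}

    def find(x):
        # Return the canonical representative of x's set, flattening
        # the pointer chain (path compression) as we go.
        parent.setdefault(x, x)
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]

    def union(x, y):
        # Merge y's set into x's; afterwards find(y) == find(x).
        parent[find(y)] = find(x)

The /owl:sameAs/ stubs above are the persistent analogue of the
parent pointers, and resolving chains before writing new stubs plays
the role of path compression.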

Thoughts? Ideas? Is this sort of stealth dereferencing by the
storage layer a reasonable strategy?

Cheers,
-w

-- 
William Waites           <william.waites at okfn.org>
Mob: +44 789 798 9965    Open Knowledge Foundation
Fax: +44 131 464 4948                Edinburgh, UK

RDF Indexing, Clustering and Inferencing in Python
		http://ordf.org/
