[open-bibliography] Disambiguation, deduplication and 'ideals'

Wed Sep 1 13:01:30 UTC 2010

Hi. I've got a couple of points on this subject, wearing two different "hats".

/Chris-the-eprints-guy:/
EPrints 3.2.1 onwards provides some basic linked data, but the relevant
thing is that it also produces good URIs suitable for linking via a
de-duping service. As you would expect, it mints URIs for its metadata
records, its documents and the files in its documents (OK, those are
just URLs of files).

It also makes URIs for people, conferences, conference locations,
publications and organisations. It does this by hashing the data known
about them, plus a few assumptions.
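
As a rough illustration of the hashing idea (a minimal sketch, not the
actual EPrints code: the base URI, the choice of SHA-1, the
normalisation and the function names are all assumptions made up for
this example), a URI for a person or conference could be minted from
whatever is known about it like this:

```python
import hashlib

# Hypothetical base URI, not a real EPrints setting.
BASE_URI = "http://example.org/id"


def normalise(value):
    """Lower-case and collapse whitespace so trivial variants hash alike."""
    return " ".join(value.lower().split())


def mint_uri(kind, *fields):
    """Mint a URI for an entity by hashing the data known about it.

    kind   -- entity type, e.g. "person" or "conference"
    fields -- the known data, e.g. family name and given name
    """
    key = "|".join(normalise(field) for field in fields)
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return f"{BASE_URI}/{kind}/{digest}"


if __name__ == "__main__":
    print(mint_uri("person", "Gutteridge", "Christopher"))
    print(mint_uri("person", "  gutteridge ", "CHRISTOPHER"))  # same URI
```

Two records that agree on the hashed fields after normalisation get the
same URI, which is what gives a de-duping service something to link on.
Records that differ in any hashed field get different URIs, so this is
a starting point rather than de-duplication in itself.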

An example of our URIs can be seen here, for a conference paper:
http://eprints.ecs.soton.ac.uk/cgi/export/eprint/21498/RDFN3/ecs-eprint-21498.n3?mimetype=text/plain

My hope is that this will be a good starting point for de-duping
services. We are open to suggestions from people who want to actively 
use our data, and not that open to people who want us to do it "right" 
based on their set of beliefs.

/Chris-the-RDF-guy:/
It's OK to say two similar things are the same, if they are the same for
the purposes of your dataset. A dataset about movie scripts would 
consider the Adam West Batman very different from the Batman Begins 
Batman. However a dataset mapping what comic book characters appear in 
movies would probably treat both as the same, and declare their #batman 
to be the same as all the comic book incarnations and all the movie 
incarnations. They are all the sameAs your concept of batman FOR YOUR 
PURPOSES.

sameAs is a subjective term. Whether to honour it semantically should
depend on the source of the assertion. Sources should aim to be clear
and consistent in how they sameAs things.
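
To make the "based on the source" point concrete, here is a minimal
sketch in Python. It assumes sameAs assertions are available as
(subject, object, source) tuples and that the consumer keeps its own
set of sources it trusts; the identifiers and source names are invented
for the example.

```python
def same_as_clusters(assertions, trusted_sources):
    """Group identifiers into clusters, honouring sameAs only from trusted sources.

    assertions      -- iterable of (subject, object, source) tuples
    trusted_sources -- set of source names whose sameAs we choose to honour
    """
    parent = {}

    def find(node):
        # Union-find lookup with path compression.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for subject, obj, source in assertions:
        root_s, root_o = find(subject), find(obj)  # registers both nodes
        if source in trusted_sources:
            parent[root_s] = root_o  # merge only on a trusted assertion

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    assertions = [
        ("#batman", "film:adam-west-batman", "comic-characters-in-movies"),
        ("#batman", "film:batman-begins-batman", "comic-characters-in-movies"),
    ]
    # A consumer that trusts the mapping dataset sees one Batman...
    print(same_as_clusters(assertions, {"comic-characters-in-movies"}))
    # ...while a movie-script dataset that ignores it keeps them apart.
    print(same_as_clusters(assertions, set()))
```

The same assertions give different answers depending on who is asking,
which is the point: the merge is a decision made by the consumer for
its own purposes, not a property of the data.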

Just accept that sameAs is going to drift in meaning, now that we have
a tide of non-logic-obsessed people building RDF who don't give a durn
for your OWL.

On 01/09/10 09:19, Ben O'Steen wrote:
> On Wed, 2010-09-01 at 05:08 +0200, Thomas Krichel wrote:
>
>> Karen Coyle writes
>>
>>> As you can see, the questions go on and on!
>>
>>    Deduplication is also service context dependent. ...
>>
>
> I absolutely agree and I'll also say that when you are de-duplicating
> for any of these reasons, you will be using some probabilistic method of
> some kind, 99% of the time ;) Whether it's a Fellegi-Sunter based whole
> record dedupe, or single field (eg id) matching, there will be false
> positives and false negatives.
>
> Your success rate will always be <100%, and the degree of success will
> vary depending on who did it and for what purpose.
>
> Ben

-- 
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248

/ Lead Developer, EPrints Project, http://eprints.org/
/ Web Projects Manager, ECS, University of Southampton, http://www.ecs.soton.ac.uk/
/ Webmaster, Web Science Trust, http://www.webscience.org/
