[open-bibliography] Disambiguation, deduplication and 'ideals'
Christopher Gutteridge
cjg at ecs.soton.ac.uk
Wed Sep 1 13:01:30 UTC 2010
Hi. I've got a couple of points on this subject, wearing two different "hats".
Chris-the-eprints-guy:
EPrints 3.2.1 onwards provides some basic linked data, but the relevant
thing here is that it also produces good URIs suitable for linking via a
de-duping service. As you would expect, it mints URIs for its metadata
records, its documents and the files in its documents (OK, those are
just the URLs of the files).
It also makes URIs for people, conferences, conference locations,
publications and organisations. This is done by hashing the data we know
about them, plus a few assumptions.
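To give a rough idea of what "hashing known data" can look like, here is a minimal sketch in Python. It is not the actual EPrints code: the base URI, the `ext-` prefix, the choice of SHA-1, the normalisation rules and the sample conference strings are all assumptions made up for illustration.

```python
import hashlib

# Hypothetical base URI; a real repository would use its own domain.
BASE = "http://example.org/id"

def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so trivial variants hash alike."""
    return " ".join(value.lower().split())

def mint_uri(kind: str, *known: str) -> str:
    """Build a stable URI for an entity from whatever is known about it.

    The same known data always yields the same URI, so two records that
    mention the same conference (with the same strings) share a URI.
    """
    key = "|".join(normalise(v) for v in known)
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return f"{BASE}/{kind}/ext-{digest}"

# Two spellings of the same (made-up) conference give the same URI.
a = mint_uri("event", "Example Conference 2010", "Southampton, UK")
b = mint_uri("event", "  example   conference 2010", "SOUTHAMPTON, UK")
```

The "few assumptions" are hidden in what you choose to hash: here the sketch assumes that name plus location is enough to identify a conference, which will sometimes be wrong.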
An example of our URIs can be seen here, in the record for a conference paper:
http://eprints.ecs.soton.ac.uk/cgi/export/eprint/21498/RDFN3/ecs-eprint-21498.n3?mimetype=text/plain
My hope is that this will be a good starting point for de-duping
services. We are open to suggestions from people who want to actively
use our data, and not that open to people who want us to do it "right"
based on their set of beliefs.
Chris-the-RDF-guy:
It's OK to say two similar things are the same, if they are the same for
the purposes of your dataset. A dataset about movie scripts would
consider the Adam West Batman very different from the Batman Begins
Batman. However a dataset mapping what comic book characters appear in
movies would probably treat both as the same, and declare their #batman
to be the same as all the comic book incarnations and all the movie
incarnations. They are all the sameAs your concept of batman FOR YOUR
PURPOSES.
sameAs is a subjective term. Honouring it semantically should be based
on the source of the assertion. Sources should aim to be clear and
consistent in how they sameAs things.
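A small sketch of what "honouring it based on the source" could mean in practice, in Python. The source names, the `ex:` identifiers and the function itself are hypothetical; this is one possible approach, not a description of any existing tool. Each sameAs assertion carries the name of whoever made it, and only assertions from sources you trust for your purposes get merged.

```python
from collections import defaultdict

def merge_same_as(assertions, trusted_sources):
    """Group identifiers linked by sameAs, honouring only trusted sources.

    assertions: iterable of (subject, object, source) triples.
    Returns a sorted list of sorted groups of identifiers.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, source in assertions:
        root_a, root_b = find(a), find(b)  # registers both identifiers
        if source in trusted_sources:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(g) for g in groups.values())

assertions = [
    ("ex:batman-1966", "ex:batman", "comics-in-movies"),
    ("ex:batman-2005", "ex:batman", "comics-in-movies"),
    ("ex:batman-1966", "ex:batman-2005", "some-other-source"),
]
for_comics = merge_same_as(assertions, {"comics-in-movies"})
for_scripts = merge_same_as(assertions, set())
```

With the comics-in-movies source trusted, all three identifiers collapse into one group; a movie-script dataset that trusts none of these sources keeps them apart.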
Just accept that sameAs is going to drift its meaning as we now have a
tide of non-logic-obsessed people building RDF, who don't give a durn
for your OWL.
On 01/09/10 09:19, Ben O'Steen wrote:
> On Wed, 2010-09-01 at 05:08 +0200, Thomas Krichel wrote:
>
>> Karen Coyle writes
>>
>>
>>> As you can see, the questions go on and on!
>>>
>>
>> Deduplication is also service context dependent. ...
>>
>
> I absolutely agree and I'll also say that when you are de-duplicating
> for any of these reasons, you will be using some probabilistic method of
> some kind, 99% of the time ;) Whether it's a Fellegi-Sunter based whole
> record dedupe, or single field (eg id) matching, there will be false
> positives and false negatives.
>
> Your success rate will always be <100%, and the degree of success will
> vary depending on who and for what purpose this was done.
>
> Ben
--
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
/ Lead Developer, EPrints Project, http://eprints.org/
/ Web Projects Manager, ECS, University of Southampton, http://www.ecs.soton.ac.uk/
/ Webmaster, Web Science Trust, http://www.webscience.org/