[open-bibliography] Disambiguation, deduplication and 'ideals'

Karen Coyle kcoyle at kcoyle.net
Wed Sep 1 02:45:12 UTC 2010


Doesn't a lot of this depend on how you define "same"? This is rather  
complex today due to the replication of resources in various formats.  
Is it the same if one citation is to the print copy of the article,  
and another cites a postprint in a repository? (The latter may have  
different formatting or different pagination depending on how it was  
produced.) Is it that the texts are the same, or that the actual  
publication is the same? (Which would mean that the article in the  
journal and the postprint are considered different resources.) Do you  
intend to distinguish between preprints and published articles? Would  
you consider a piece of music on CD and its online copy to be the  
same, even though the compression is different? As you can see, the  
questions go on and on!

There's also the question of confidence: what degree of confidence do  
you need before declaring "same"? Do you prefer to err on the side of  
over-bundling or under-bundling? This interacts with your definition  
of sameness -- for example, let's say you decide that if you have two  
texts with the same author, same title, and same year, they are the  
same. You won't check for format, length, etc. That would probably  
bring together pre-prints, post-prints, and possibly even different  
versions of the same text. So in that way you have defined "same"  
rather broadly. Or you could decide that two texts are the same only  
if they both cite precisely the same publication with the same  
pagination and full date. That is a different definition of same, with  
more confidence.
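
To make the two definitions concrete, here is a rough Python sketch. The record shape, the field names and the sample values are all made up for illustration; nothing here comes from an actual open-bibliography data model, and the normalization is deliberately crude.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    # Hypothetical record shape; field names are illustrative only.
    author: str
    title: str
    year: int
    journal: Optional[str] = None
    pages: Optional[str] = None
    full_date: Optional[str] = None


def norm(text):
    # Crude normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def same_broad(a, b):
    # Broad "same": author, title and year agree; format and length are ignored.
    return (norm(a.author), norm(a.title), a.year) == (
        norm(b.author), norm(b.title), b.year)


def strict_key(c):
    # Strict "same": publication, pagination and full date must all be present.
    if c.journal is None or c.pages is None or c.full_date is None:
        return None
    return (norm(c.journal), norm(c.pages), c.full_date)


def same_strict(a, b):
    # A record lacking any strict field never matches (under-bundling).
    ka, kb = strict_key(a), strict_key(b)
    return ka is not None and ka == kb


# Invented sample data: a print citation, a second citation to the same
# print article, and a repository postprint with no journal pagination.
printed = Citation("A. Author", "An Example Article", 2009,
                   journal="Journal of Examples", pages="10-20",
                   full_date="2009-03-01")
printed_again = Citation("A. Author", "An  example article", 2009,
                         journal="Journal of Examples", pages="10-20",
                         full_date="2009-03-01")
postprint = Citation("A. Author", "An Example Article", 2009)
```

With these sample records, same_broad(printed, postprint) holds while same_strict(printed, postprint) does not, which is the over-bundling versus under-bundling trade-off in miniature.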

Hopefully, once you determine what you mean by "same", you can then  
determine what you want to apply owl:sameAs to.
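
It may help to see how much owl:sameAs commits you to. The following is a naive Python sketch of the rule William quotes below, plus symmetry, run over plain tuples. It is not a real OWL reasoner; the triples are a paraphrase of Ben's bundle, and typing :bibrec_i as :Person is an assumption taken from William's example.

```python
SAME_AS = "owl:sameAs"


def same_as_closure(triples):
    """Apply symmetry and property copying until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for (x, p, y) in inferred:
            if p != SAME_AS:
                continue
            # sameAs is symmetric: x sameAs y implies y sameAs x.
            new.add((y, SAME_AS, x))
            # { ?x owl:sameAs ?y . ?x ?p ?o } => { ?y ?p ?o }
            for (s, q, o) in inferred:
                if s == x:
                    new.add((y, q, o))
        if new <= inferred:
            return inferred
        inferred |= new


# Illustrative triples only (assumed, not actual project data).
triples = {
    (":b1", "rdf:type", ":Bundle"),
    (":b1", SAME_AS, ":bibrec_i"),
    (":b1", SAME_AS, ":citerec_i"),
    (":bibrec_i", "rdf:type", ":Person"),
}

closure = same_as_closure(triples)
```

In the resulting closure every property ends up on all three resources, so :b1 is typed :Person and both records are typed :Bundle, which is probably not what anyone intends by a bundle.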

As you can tell, I have agonized over this one for many a long hour. :-)

kc

Quoting William Waites <william.waites at okfn.org>:

>  On 10-08-31 10:54, Ben O'Steen wrote:
>> :b1 a Bundle
>>
>>    sameas :bibrec_i
>>    sameas :citerec_i
>>    opmv:wasGeneratedBy :p1
>>    created: 2010-08-......
>>
>
> I think this is reasonable as far as it goes. What you
> haven't treated is how the properties present on
> bibrec_i, citerec_i get migrated around.
>
> The OWL rule that we're trying to bypass is,
>
> { ?x owl:sameAs ?y . ?x ?p ?o } => { ?y ?p ?o }
>
> so do all properties from bibrec_i and citerec_i
> get migrated to b1 and then any post-dedup queries
> would typically be made against b1?
>
> Also mind the implication of sameas. It commutes
> so what you are also saying here is,
>
>     :b1 a Person .
>     :bibrec_i a Bundle .
>     :citerec_i a Bundle .
>
> So maybe you don't really mean owl:sameAs on the
> bundle.
>
> But if you don't mean owl:sameAs there, then where
> do you put the properties? (Potential answer: pick
> either bibrec_i or citerec_i arbitrarily and have a
> specific predicate in the bundle to indicate that it
> is the "primary" resource).
>
> Cheers,
> -w
>
> --
> William Waites           <william.waites at okfn.org>
> Mob: +44 789 798 9965    Open Knowledge Foundation
> Fax: +44 131 464 4948                Edinburgh, UK
>
> RDF Indexing, Clustering and Inferencing in Python
> 		http://ordf.org/
>
> _______________________________________________
> open-bibliography mailing list
> open-bibliography at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-bibliography
>



-- 
Karen Coyle
kcoyle at kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet




