[open-bibliography] Misciting identifiers (openbiblio workshop follow-up)

Alexander Dutton alexander.dutton at oucs.ox.ac.uk
Mon May 23 14:52:30 UTC 2011



Hi all,

At the openbiblio workshop it was suggested that I try to quantify
what proportion of citations are incorrect in some way. To start
doing this, I've been visualising the networks produced when one
groups references to papers by the identifiers they share. It seems
that DOIs are far more often wrong than any other type of identifier.


Were the world perfect, each group would contain at most one of each
type of identifier, and each reference would be a close match for
every other reference in that group. Where a reference miscites an
identifier, it will generally stick out from the network it becomes
part of.
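
As a rough illustration of the grouping described above, here is a
minimal sketch using a union-find over references and identifiers. It is
not the code behind the graphs; the input shape (a dict of reference ids
to sets of (scheme, value) pairs) and the function names are assumptions
made for the example.

```python
from collections import defaultdict


def group_by_identifiers(references):
    """Group references that share any identifier.

    references: dict mapping a reference id to a set of
    (scheme, value) tuples, e.g. ("doi", "10.1234/example").
    This input shape is an assumption for the sketch.
    """
    parent = {}

    def find(x):
        # Register unseen nodes, then walk to the root with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for ref, idents in references.items():
        find(("ref", ref))  # keep references that have no identifiers
        for ident in idents:
            union(("ref", ref), ("id", ident))

    groups = defaultdict(lambda: {"refs": set(), "ids": set()})
    for node in list(parent):
        kind, value = node
        groups[find(node)]["refs" if kind == "ref" else "ids"].add(value)
    return list(groups.values())


def has_duplicate_scheme(group):
    """True if a group holds more than one identifier of the same scheme."""
    counts = defaultdict(int)
    for scheme, _value in group["ids"]:
        counts[scheme] += 1
    return any(n > 1 for n in counts.values())
```

A group for which has_duplicate_scheme() returns True (two PMIDs, say)
breaks the "at most one of each type" expectation and is a candidate
for a miscitation somewhere among its references.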

So, I'd like to present a lot of pretty graphs:
<http://alexdutton.co.uk/reclusterings/>. Here, coloured nodes are
references to papers (either citations, NLM XML headers from the OA
subset, or the result of querying the PubMed API for a header for a
particular PMID). The different colours represent groups of references
clustered by similarity. The white nodes are identifiers, with the
shape representing the scheme (DOI, PMID, PMC, URI, etc). The edges
mean "this paper is associated with this identifier".
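
For anyone wanting to draw something similar, the encoding above could
be written out as Graphviz DOT along these lines. This is only a
sketch: the shape chosen for each scheme, the input shapes and the
function name are all assumptions, and nothing here says how the
graphs linked above were actually produced.

```python
# Assumed scheme-to-shape mapping; the shapes actually used are not stated.
SHAPES = {"doi": "box", "pmid": "ellipse", "pmc": "diamond", "uri": "hexagon"}


def to_dot(references, colours):
    """Render references and their identifiers as an undirected DOT graph.

    references: dict of reference id -> set of (scheme, value) tuples.
    colours: dict of reference id -> Graphviz colour name for its cluster.
    Values are assumed not to contain double quotes.
    """
    lines = ["graph reclustering {"]
    seen = set()
    for ref, idents in sorted(references.items()):
        # Coloured node for the reference, filled with its cluster colour.
        lines.append(
            f'  "ref:{ref}" [style=filled, fillcolor="{colours[ref]}", label=""];'
        )
        for scheme, value in sorted(idents):
            node = f"{scheme}:{value}"
            if node not in seen:
                # White node for the identifier, shaped by its scheme.
                seen.add(node)
                shape = SHAPES.get(scheme, "octagon")
                lines.append(
                    f'  "{node}" [shape={shape}, style=filled, fillcolor="white"];'
                )
            # Edge: "this paper is associated with this identifier".
            lines.append(f'  "ref:{ref}" -- "{node}";')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be passed to a Graphviz layout program to
produce an image.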

Having done this it seems that there are a few different types of
miscitation, so I've picked out some exemplars:


<http://alexdutton.co.uk/reclusterings/00000119.png>

The lone green node is a citation that likely has an incorrect DOI.


<http://alexdutton.co.uk/reclusterings/00000103.png>

This is probably a failure of my clustering, though I can't be sure
without checking.


<http://alexdutton.co.uk/reclusterings/00000514.png>

This looks to be a bug in the data returned by the PubMed API. A number
of PMIDs elicit a DOI of just '10.1093/aje/', e.g.
<http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&rettype=xml&id=15583373>.
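
Values like that could be filtered out before clustering with a check
along the following lines. This is a heuristic sketch, not a DOI
validator: treating an empty suffix or a trailing slash as "incomplete"
is an assumption that happens to catch '10.1093/aje/', and the function
name is made up for the example.

```python
def looks_incomplete(doi):
    """Heuristic: True if a DOI string has no usable suffix.

    Assumes a usable DOI has the form "10.<registrant>/<suffix>" with a
    non-empty suffix that does not end in a slash.
    """
    doi = doi.strip()
    prefix, sep, suffix = doi.partition("/")
    if not prefix.startswith("10.") or not sep:
        return True
    return suffix == "" or suffix.endswith("/")


# The value reported above is flagged; the second DOI is a made-up example.
for candidate in ["10.1093/aje/", "10.1234/example.5678"]:
    print(candidate, looks_incomplete(candidate))
```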


<http://alexdutton.co.uk/reclusterings/00000142.png>

This one is awkward in that it's impossible to work out which of the
citations of the DOI in the middle is correct.


<http://alexdutton.co.uk/reclusterings/00000534.png>

This one looks rather clear cut.


<http://alexdutton.co.uk/reclusterings/00000546.png>

This one looks odd as the green cluster is so widely separated. The
fault here seems to be with the paper with PMID 20948869, where
reference 26 (the far left green one) and reference 27 (the rightmost
red one) have their PMIDs swapped.
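
A swap like this could in principle be spotted automatically by
comparing each cited title with the record fetched for its own PMID and
for its neighbours' PMIDs. The sketch below is one way to do that with
difflib from the standard library; the input shapes, the 0.6 threshold
and the function names are assumptions, and it is not how the case
above was found.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two titles (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_swapped_pmids(cited, fetched, threshold=0.6):
    """Return pairs of reference numbers whose PMIDs look swapped.

    cited: list of (reference number, pmid, title as cited).
    fetched: dict of pmid -> title returned by a lookup; every cited
    pmid is assumed to be present.
    """
    swaps = []
    for i, (ref_a, pmid_a, title_a) in enumerate(cited):
        for ref_b, pmid_b, title_b in cited[i + 1:]:
            # How well each reference matches the record for its own PMID.
            own = max(similarity(title_a, fetched[pmid_a]),
                      similarity(title_b, fetched[pmid_b]))
            # How well each matches the record for the other's PMID.
            crossed = min(similarity(title_a, fetched[pmid_b]),
                          similarity(title_b, fetched[pmid_a]))
            if own < threshold <= crossed:
                swaps.append((ref_a, ref_b))
    return swaps
```

Comparing titles alone is crude; authors, year and journal would make
the comparison more robust.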


<http://alexdutton.co.uk/reclusterings/00000303.png>

Also fairly clear cut.


<http://alexdutton.co.uk/reclusterings/00000256.png>

There are lots of these. Looking at the data behind them, the record
that comes back from the API looks nothing like the data in the
reference.


-- 
Alexander Dutton
OpenCitations developer
Department of Zoology, University of Oxford, ℡ 01865 (6)13483