[openbiblio-dev] Medline and UKPMC bibliography

Peter Murray-Rust pm286 at cam.ac.uk
Thu Feb 2 19:37:36 UTC 2012

I have re-uploaded

The PubMed "RIS" export is very seriously broken with respect to the Thomson
Reuters "RIS" spec. I am assuming that PubMed guessed at the format and got it
wrong. Here are some of the PubMed problems:
* entries should start with TY, PMs don't
* entries should end with ER, PMs don't
* lines should start with a tag matching [A-Z][A-Z0-9]\s\s\-\s, PM tags can
be three or four characters long
* PM makes up lots of their own tags. Not sure whether this is allowed
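
As a rough illustration of the first three checks, here is a minimal Python
sketch. The function name ris_problems and the sample records are made up
for this example and are not part of the existing parser; the pattern is the
one quoted above, relaxed slightly so that a bare "ER  -" with no trailing
space still passes.

```python
import re

# Two-character tag, two spaces, a hyphen, then a space or end of line.
TAG_LINE = re.compile(r"^[A-Z][A-Z0-9]  -( |$)")


def ris_problems(record_lines):
    """Return a list of human-readable problems found in one record."""
    lines = [ln.rstrip("\r\n") for ln in record_lines if ln.strip()]
    if not lines:
        return ["empty record"]
    problems = []
    if not lines[0].startswith("TY  -"):
        problems.append("does not start with TY")
    if not lines[-1].startswith("ER  -"):
        problems.append("does not end with ER")
    for ln in lines:
        if not TAG_LINE.match(ln):
            problems.append("bad tag line: " + ln)
    return problems


if __name__ == "__main__":
    # Hypothetical PubMed-style record with a four-character tag.
    sample = ["PMID- 123\n", "DP  - 2009 Sep-Oct\n", "PG  - 559-68\n"]
    for problem in ris_problems(sample):
        print(problem)
```

Note that whitespace-prefixed continuation lines would also be reported as
bad tag lines by this check unless they are joined to their tag line first.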

There is a different problem with whitespace. In PM we have:
DP  - 2009 Sep-Oct
TI  - [Red-breasted goose colonies on the Taimyr Peninsula: factors responsible for the
      proximity of goose nests to nests of peregrine falcons, rough-legged
      and snowy owls].
PG  - 559-68

The current parser skips the two whitespace-prefixed lines whereas it
should append them to the preceding line.
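
A minimal sketch of that joining step, in Python. The name
fold_continuations is hypothetical and this is not the current parser's
code; it assumes that any non-blank line starting with whitespace continues
the previous line and that a single space is the right separator.

```python
def fold_continuations(lines):
    """Join whitespace-prefixed lines onto the line before them."""
    folded = []
    for line in lines:
        line = line.rstrip("\r\n")
        # A non-blank line that begins with whitespace is a continuation.
        if folded and line.strip() and line[0].isspace():
            folded[-1] = folded[-1] + " " + line.strip()
        else:
            folded.append(line)
    return folded


if __name__ == "__main__":
    record = [
        "DP  - 2009 Sep-Oct",
        "TI  - [Red-breasted goose colonies on the Taimyr Peninsula: "
        "factors responsible for the",
        "      proximity of goose nests to nests of peregrine falcons, "
        "rough-legged",
        "      and snowy owls].",
        "PG  - 559-68",
    ]
    for line in fold_continuations(record):
        print(line)
```

Run on the record above, this gives three lines (DP, the full TI, PG)
instead of five.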

This is a typical mess - we have this all the time in chemistry. People
"improve" file formats and break software. My guess is that we should have
a separate parser for MEDLINE. Otherwise we have to put lots of if/thens into
the RIS parser.
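
One way the split could look, as a sketch only: sniff the first non-blank
line and hand the file to one parser or the other. The name guess_format is
hypothetical, and the "PMID-" test is an assumption about how the PubMed
export begins, not something checked against the uploaded files.

```python
def guess_format(text):
    """Guess 'ris', 'medline' or 'unknown' from the first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("TY  -"):
            return "ris"
        # Assumption: PubMed/MEDLINE records open with a PMID- line.
        if line.startswith("PMID-"):
            return "medline"
        return "unknown"
    return "unknown"


if __name__ == "__main__":
    print(guess_format("TY  - JOUR\nTI  - A title\nER  - \n"))
    print(guess_format("\nPMID- 123\nDP  - 2009 Sep-Oct\n"))
```

The RIS parser then stays strict, and the MEDLINE quirks live in their own
code.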


Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge