[open-bibliography] PubMed
Thomas Krichel
krichel at openlib.org
Mon Nov 9 14:47:32 UTC 2015
Daniel Mietchen writes
> Why not go ahead and post it here?
ok. But pubmed is not open data, as far as I know.
> If it's indeed too technical to be discussed here, someone might
> forward it to a more appropriate venue.
The issue is in fact simple: how do I get a complete copy of the
pubmed data? I still have to understand what the difference
between entrez, medline and pubmed is, but by complete copy I
mean all the records that one can find on the
web site.
I am a pubmed vendor, so I have access to the ftp site and the
data therein. From
https://www.nlm.nih.gov/databases/journal.html
I know that
| The approximately 2% of the records not exported to MEDLINE/PubMed
| licensees are those tagged [PubMed - as supplied by publisher] in
| PubMed.
I suspect that a lot of the most recent additions are temporarily in
this category. These are the ones that I am keen on getting. Waiting
is not an option.
I assume they are included in the API described at
http://www.ncbi.nlm.nih.gov/books/NBK25498/
How do I get access to all of those records, and only those? One
way that I can come up with is to
1. generate a list of suspected pmids
2. check that I don't have data for them
3. submit them to the API
4. check the response to see which ones I did not get a record for,
and queue those for resubmission.
It's an approach more in tune with the Vikings, the Huns etc. than
the supposedly civilized 21st century. Is there any smarter way? I
wrote to the NLM last week; no response yet.
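
A minimal sketch of steps 1 to 4, for illustration only. It assumes
the E-utilities efetch endpoint with db=pubmed and retmode=xml, and
that each returned record carries its id in a MedlineCitation/PMID
element. The function names (candidate_pmids, harvest_batch and so
on) are made up for this sketch; whether the "as supplied by
publisher" records actually come back this way is exactly the open
question.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint: the E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def candidate_pmids(known, low, high):
    """Steps 1 and 2: every number in [low, high] not already held."""
    return [p for p in range(low, high + 1) if p not in known]


def efetch_url(pmids):
    """Build the efetch request URL for a batch of pmids."""
    query = urllib.parse.urlencode(
        {"db": "pubmed", "retmode": "xml", "id": ",".join(map(str, pmids))}
    )
    return EFETCH + "?" + query


def fetch_xml(pmids):
    """Step 3: submit the batch to the API and return the raw XML."""
    with urllib.request.urlopen(efetch_url(pmids)) as response:
        return response.read()


def returned_pmids(xml_bytes):
    """Collect the pmids of the records present in an efetch response."""
    root = ET.fromstring(xml_bytes)
    return {int(e.text) for e in root.findall(".//MedlineCitation/PMID")}


def harvest_batch(pmids, fetch=fetch_xml):
    """Step 4: split a batch into pmids received and pmids to requeue."""
    got = returned_pmids(fetch(pmids))
    retry = [p for p in pmids if p not in got]
    return got, retry
```

A real run would also have to split the candidates into batches and
throttle its requests according to NCBI's usage guidelines, both of
which this sketch leaves out.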
Step 1 is particularly problematic. Last night's data shows I have
24997267 records and the maximum number is 26544013. Presumably I
could first try to harvest that interval, then, in later runs, start
a little lower and go a little higher. For step 4 I could use a queue
rule saying I will not query a record if the current wait would be
smaller than the sum of the previous waits. But that would involve
keeping historic harvesting data and periodically processing it. It
is probably best to work in ascending order even though this may
introduce a periodicity in the harvested numbers.
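
One possible reading of the queue rule for step 4, sketched under
assumptions: a "wait" is taken to be the time between two successive
attempts on the same pmid, and a record is due again only once the
time since its last attempt is at least the sum of its previous
waits. The min_wait floor and the names due_for_retry and widen are
hypothetical. Under this reading the sum of previous waits is just
the last attempt time minus the first, so only two timestamps per
pmid would need to be kept.

```python
def previous_waits(attempt_times):
    """Gaps between successive attempts on one pmid."""
    return [b - a for a, b in zip(attempt_times, attempt_times[1:])]


def due_for_retry(attempt_times, now, min_wait=1.0):
    """True if the pmid may be queried again at time `now`.

    The current wait must reach the sum of all previous waits, with
    min_wait (an assumed floor) covering the first retry, when there
    are no previous waits yet.
    """
    if not attempt_times:
        return True  # never tried, so nothing to wait for
    current_wait = now - attempt_times[-1]
    # sum(previous_waits) equals attempt_times[-1] - attempt_times[0]
    return current_wait >= max(min_wait, sum(previous_waits(attempt_times)))


def widen(low, high, margin):
    """Next run's interval: start a little lower, go a little higher."""
    return max(1, low - margin), high + margin
```

With attempts at times 0, 10 and 20, the previous waits sum to 20, so
the record becomes due again at time 40; the waits roughly double
from one retry to the next.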
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel