[open-bibliography] PubMed

Thomas Krichel krichel at openlib.org
Mon Nov 9 14:47:32 UTC 2015


  Daniel Mietchen writes

> Why not go ahead and post it here?

  ok. But pubmed is not open data, as far as I know. 

> If it's indeed too technical to be discussed here, someone might
> forward it to a more appropriate venue.

  The issue is in fact simple: how do I get a complete copy of the
  pubmed data? I still have to understand what the difference
  between entrez, medline and pubmed is, but by a complete copy I
  mean all the records that one can find on the web site.

  I am a pubmed vendor, so I have access to the ftp site and the 
  data therein.  From

https://www.nlm.nih.gov/databases/journal.html

  I know that

| The approximately 2% of the records not exported to MEDLINE/PubMed
| licensees are those tagged [PubMed - as supplied by publisher] in
| PubMed.

  I suspect that a lot of the most recent additions are temporarily in
  this category. These are the ones that I am keen on getting. Waiting
  is not an option. 

  I assume they are included in the API described at

http://www.ncbi.nlm.nih.gov/books/NBK25498/

  How do I get access to all of those records, and only those? One
  way that I can come up with is to

  1. generate a list of suspected pmids
  2. check I don't have data for them
  3. submit them to the API
  4. check the responses to see which ones I did not get a record for,
     and queue those for resubmission.
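
  A minimal sketch of these four steps in Python, under some
  assumptions: the EFetch endpoint URL and its db/id/retmode
  parameters are taken from the E-utilities documentation, the
  response is assumed to be a PubmedArticleSet in which every
  MedlineCitation carries its own PMID as a direct child, and the
  batch size of 200 and all function names are purely illustrative.
  The actual HTTP call and any rate limiting are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# EFetch endpoint of the E-utilities (assumed; check the NCBI documentation).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def candidates(low, high, have):
    """Steps 1 and 2: PMIDs in [low, high] that are not held locally."""
    return [p for p in range(low, high + 1) if p not in have]


def batches(pmids, size=200):
    """Split the candidate list into chunks for one request each."""
    for i in range(0, len(pmids), size):
        yield pmids[i:i + size]


def efetch_url(batch):
    """Step 3: build the request URL for one batch of PMIDs."""
    query = urlencode({"db": "pubmed", "retmode": "xml",
                       "id": ",".join(map(str, batch))})
    return EFETCH + "?" + query


def returned_pmids(xml_text):
    """Collect the PMIDs of the records present in a response."""
    root = ET.fromstring(xml_text)
    found = set()
    for citation in root.iter("MedlineCitation"):
        pmid = citation.find("PMID")  # direct child only, not cited PMIDs
        if pmid is not None and pmid.text:
            found.add(int(pmid.text))
    return found


def missing(batch, xml_text):
    """Step 4: PMIDs that were requested but not answered."""
    got = returned_pmids(xml_text)
    return [p for p in batch if p not in got]
```

  Whatever missing() returns for a batch would go onto the
  resubmission queue; everything else can be stored right away.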

  It's an approach more in tune with the Vikings, the Huns etc. than
  the supposedly civilized 21st century. Is there any smarter way?  I
  wrote to the NLM last week; no response yet.
  
  Step 1 is particularly problematic. Last night's data shows I have
  24997267 records and the maximum number is 26544013. Presumably I
  could first try to harvest that interval, then, in later runs, start
  a little lower and go a little higher. For step 4, I could use a
  queue rule saying I will not query a record again if the current
  wait would be smaller than the sum of the previous waits.  But that
  would involve keeping historic harvesting data and periodically
  processing it.  It is probably best to work in ascending order, even
  though this may introduce a periodicity in the harvested numbers.
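
  A rough sketch of the interval widening and of the queue rule, as
  one possible reading of the above. The margin of 1000 and the first
  wait of one hour are made-up values, and both function names are
  hypothetical. Since the previous waits add up to the time between
  the first and the last attempt, the rule as read here only needs
  those two timestamps per pmid, not the full harvesting history.

```python
def next_interval(low, high, margin=1000, floor=1):
    """Widen the PMID interval a little on both sides for a later run."""
    return max(floor, low - margin), high + margin


def is_due(attempts, now, first_wait=3600):
    """Decide whether a still-missing PMID should be queried again.

    attempts: ascending timestamps (in seconds) of earlier queries.
    The sum of all previous waits telescopes to attempts[-1] - attempts[0].
    first_wait is an assumed minimum, needed because that sum is zero
    after a single attempt.
    """
    if not attempts:
        return True  # never queried so far
    previous_waits = attempts[-1] - attempts[0]
    required = max(first_wait, previous_waits)
    # Query only once the current wait has reached the required length.
    return now - attempts[-1] >= required


# The interval from the figures above, widened for a later run.
low, high = next_interval(24997267, 26544013)
```

  Under this reading the gap between queries for the same pmid
  doubles each time, so records that never show up cost less and
  less.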


-- 

  Cheers,

  Thomas Krichel                  http://openlib.org/home/krichel
                                              skype:thomaskrichel


