[openbiblio-dev] Distributed scraping using flockscrape

Fri Jan 21 12:03:57 UTC 2011

The key to scraping PMC is OAI-PMH. To be efficient, we need to
divide up the 'history' of the site into date-range chunks and start a
sequential harvest for each one.

The dates that the OAI-PMH service API uses are the 'acquisition' dates,
the date on which a given record appeared in PMC's db. Note that this
has nothing to do with publication date!

You ask the API for the records that appeared within a date range, and
it passes back a partial set of records and a 'resumptionToken'. You
need this token to construct the URL for the next part of the set for
the requested date range. When you don't get a resumptionToken in the
response, you should have all the records you asked for.
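
A minimal Python sketch of that loop, for illustration only.
`fetch_page` is a hypothetical callable (not part of any library) that
takes the previous token, or None for the initial request, and returns
the raw response body plus the next token, or None when there isn't
one. A fake three-page service stands in for PMC here.

```python
def harvest_range(fetch_page):
    """Collect every partial response for one date range.

    fetch_page is a hypothetical callable:
        fetch_page(token) -> (body, next_token)
    It is called with None for the initial request.
    """
    pages = []
    token = None
    while True:
        body, token = fetch_page(token)
        pages.append(body)
        if not token:
            # No resumptionToken came back, so the set should be complete.
            return pages


def fake_fetch(token):
    """Stand-in for the real service: three pages, then no token."""
    replies = {
        None: ("page 1", "t1"),
        "t1": ("page 2", "t2"),
        "t2": ("page 3", None),
    }
    return replies[token]


print(harvest_range(fake_fetch))
```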

The reason we need to chunk up by date is to allow more than one
'client' to run one of these sequential harvests at a time and also to
contain the damage when, for some periods, we hit server errors, bad
XML and so on, things which don't allow us to get the next resumption
token. (From my experience of it, it certainly uses Perl and string
concatenation to build up the responses - you absolutely cannot
guarantee valid XML!)

What we need is a task that can deal with the two modes of download
(initial request and resumption requests).

Some figures:
PubMed Central claims to have started in 2000 - that would make for
roughly 286 two-week chunks.
An oft-repeated phrase is that an article is added to PubMed every three seconds.
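
To show how the two-week chunks could be generated, here is a rough
Python sketch. The function name `date_chunks` and the start date of
1 January 2000 are my assumptions for illustration; OAI-PMH treats both
`from` and `until` as inclusive, so each chunk ends the day before the
next one starts.

```python
from datetime import date, timedelta


def date_chunks(start, end, days=14):
    """Yield (from_date, until_date) pairs covering start..end inclusive.

    Chunks do not overlap: both bounds are inclusive, so the next chunk
    begins the day after the previous one ends.
    """
    one_day = timedelta(days=1)
    current = start
    while current <= end:
        until = min(current + timedelta(days=days) - one_day, end)
        yield current, until
        current = until + one_day


# Assumed start date; the real first acquisition date may differ.
for from_date, until_date in date_chunks(date(2000, 1, 1), date(2000, 1, 31)):
    print(from_date.isoformat(), until_date.isoformat())
```

Each pair can then be handed to a separate client as one unit of work.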

URLs:

Initial request looks like: (Caps signify variables)

http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&from=FROMDATE&until=TODATE

Resumption request looks like:

http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&resumptionToken=RESUMPTIONTOKEN
(That's the official way - we may need to stick in the from and until
params too, if the service gets confused on certain ranges)
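
A small Python sketch of building both kinds of URL, using the base URL
and parameters given above. The function names are mine, and the
optional from/until on the resumption URL is only the workaround
mentioned in the note, not something the protocol asks for. Dates are
passed as YYYY-MM-DD strings.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"


def initial_url(from_date, until_date):
    """URL for the first request of a date range."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "pmc",
        "from": from_date,
        "until": until_date,
    }
    return BASE_URL + "?" + urlencode(params)


def resumption_url(token, from_date=None, until_date=None):
    """URL for a follow-up request; urlencode escapes the token."""
    params = {"verb": "ListRecords", "resumptionToken": token}
    if from_date and until_date:
        # Workaround only: the official form carries just the token.
        params["from"] = from_date
        params["until"] = until_date
    return BASE_URL + "?" + urlencode(params)


print(initial_url("2000-01-01", "2000-01-14"))
print(resumption_url("RESUMPTIONTOKEN"))
```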

The resumption tokens coming from the service seem to encode the last
internal article id you got to and the dates - not entirely guessable,
as the ids are erratic and no doubt used in a db query. (Incidentally,
a likely db injection vector too... but I digress)

Response:
Should be an XML response that sort of looks like this:

<OAI-PMH ...>
...
  <ListRecords>
     <record>
        <metadata>
           the meat of the record response is in here. Should be a
           chunk of NLM DTD xml.
        </metadata>
        ...
     </record>
     <record>
        ...
     </record>
     <!--  optionally: -->
     <resumptionToken>Magiccodegoeshere</resumptionToken>
  </ListRecords>
</OAI-PMH>

Errors:
We'll probably hit server errors, blocks and so on, but there are some
OAI-PMH-specific error responses to watch out for:

An OAI-PMH error looks something like:

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-05-01T09:18:29Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="ERRORCODE">Error Description</error>
</OAI-PMH>

The ERRORCODEs to watch out for:

"badResumptionToken" - The service doesn't like the resumption token
which it no doubt just gave you.
"noRecordsMatch" - There are no records for that given combo of from +
until. Doesn't mean that the records don't exist, just that the
service can't find them in the db. Could be a timeout or something
similar.
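
A rough Python sketch for spotting these in a raw response without
parsing the XML. The names `oai_error_code`, `WATCHED_CODES` and
`needs_another_look` are made up for illustration, and treating both
codes as "look at this range again" is my assumption, based on the
point above that neither can be taken at face value.

```python
import re

# Matches the code attribute of an <error ...> element in the raw text.
ERROR_RE = re.compile(r'<error\b[^>]*\bcode="([^"]+)"')

# Assumption: both codes mean the range should be re-examined later.
WATCHED_CODES = {"badResumptionToken", "noRecordsMatch"}


def oai_error_code(body):
    """Return the first OAI-PMH error code in body, or None."""
    match = ERROR_RE.search(body)
    return match.group(1) if match else None


def needs_another_look(body):
    """True if the response carries one of the watched error codes."""
    return oai_error_code(body) in WATCHED_CODES


sample = '<OAI-PMH><error code="noRecordsMatch">No records</error></OAI-PMH>'
print(oai_error_code(sample), needs_another_look(sample))
```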

Suggestions:

- Use a regex to get the resumptionToken - can't guarantee that the
XML won't choke a parser (except perhaps BeautifulStoneSoup)
- Use 2 week chunks, as we may find dates that the service fails to
supply. Hard to overlap data from a failed range without working out
how far it got in terms of acquisition date.
- First step is to get a full response for each date range and to deal
with the XML later. The records will need pulling out, aggregating and
maybe turning into a much larger document per period.
- Conversion to something else can wait! :)
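
For the first suggestion, a regex along these lines might do. This is a
sketch, not tested against PMC itself; `extract_token` is a name I made
up. It copes with attributes on the element and treats an empty
resumptionToken element (which the OAI-PMH spec puts on the last page
of a set) the same as a missing one.

```python
import re
from html import unescape

# Works on the raw text, so broken XML elsewhere in the response is ignored.
TOKEN_RE = re.compile(r"<resumptionToken\b[^>]*>([^<]*)</resumptionToken>")


def extract_token(body):
    """Return the resumption token in body, or None if absent or empty."""
    match = TOKEN_RE.search(body)
    if not match:
        return None
    # Undo XML escaping such as &amp; before reusing the token in a URL.
    token = unescape(match.group(1).strip())
    return token or None


broken = "<OAI-PMH><record><oops></record><resumptionToken>Magiccodegoeshere</resumptionToken>"
print(extract_token(broken))
```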

Ben

On 19 January 2011 09:45, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> There was some discussion last week that we may have need for a
> distributed scraping setup for a large scrape of data that we would
> like to do.
>
> Friedrich (in cc) has developed FlockScrape:
>
> <http://flockscrape.pudo.org/>
>
> Originally for scraping company info in Germany. Looks like this could
> be useful for us.
>
> Rufus
>