[openbiblio-dev] Distributed scraping using flockscrape

Ben O'Steen bosteen at gmail.com
Fri Jan 21 12:35:20 UTC 2011


Just checked the previous info - I had 'pmc' in the URLs... whoops
(copied and pasted from a pass I attempted at the OA subset).

Should be:

URLs:

The initial request looks like this (caps signify variables):

http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc_fm&from=FROMDATE&until=TODATE

http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&resumptionToken=RESUMPTIONTOKEN
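
For reference, a rough Python sketch of how a client might build those two
request forms (the helper names and example dates are just mine, for
illustration):

from urllib.parse import urlencode

BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"

def initial_url(from_date, until_date, prefix="pmc_fm"):
    # first request for a date range: verb + metadataPrefix + from/until
    return BASE + "?" + urlencode({
        "verb": "ListRecords",
        "metadataPrefix": prefix,
        "from": from_date,
        "until": until_date,
    })

def resumption_url(token):
    # every follow-up request: verb + the resumptionToken from the last page
    return BASE + "?" + urlencode({
        "verb": "ListRecords",
        "resumptionToken": token,
    })

print(initial_url("2011-01-01", "2011-01-14"))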

Ben

On 21 January 2011 12:32, Ben O'Steen <bosteen at gmail.com> wrote:
> A 'ListIdentifiers' pass might not be a bad idea at all - it would give
> us all the OAI identifiers at a low cost to the service. We may then
> be able to generate resumption tokens to get around bad data. I'm
> guessing that the token logic is "get 25 records, starting with
> id=...., ordered by date, and don't go past 'until-date'".
>
> Batch-loading is a good thing to bring up - maybe chunking per day for
> the ListIdentifiers request to find any potential hotspots (see the
> sketch below)?
>
> It's worth pointing out to everyone else that the formats provided by
> PMC are not like normal OAI formats (they aren't intended to be
> equivalent forms of the same data) - the metadataPrefix "pmc_fm" is
> meant to be all of a record's metadata, and "pmc" is meant to be the
> full text of the articles.
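>
> A quick Python sketch of that per-day ListIdentifiers pass (the function
> name is mine, and the count ignores resumption tokens, so for busy days
> it's only a lower bound):
>
> import re
> from urllib.parse import urlencode
> from urllib.request import urlopen
>
> BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"
>
> def count_identifiers(day):
>     # one ListIdentifiers request for a single day
>     url = BASE + "?" + urlencode({
>         "verb": "ListIdentifiers",
>         "metadataPrefix": "pmc_fm",
>         "from": day,
>         "until": day,
>     })
>     body = urlopen(url).read().decode("utf-8", "replace")
>     return len(re.findall(r"<identifier>", body))
>
> print(count_identifiers("2011-01-20"))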
>
>
>
> On 21 January 2011 12:15, ianibbo at gmail.com <ianibbo at gmail.com> wrote:
>> Just wondering Ben...
>>
>> Do you think it would cope any better with just doing a
>> ListIdentifiers pass first, then following that with a ListMetadataFormats
>> and an OAI GetRecord per available prefix? It might also help by not
>> confusing chunks when one of the constituent records is badly formed.
>> We've had some success with this approach on bad targets in the past...
>> Also dynamically shrinking the date range on big chunks... That works
>> until someone batch-loads 10,000 records in one day and you're hosed.
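>>
>> (Sketching the idea in Python purely for illustration - not the actual
>> tools, and the prefix list is just a guess:)
>>
>> from urllib.parse import urlencode
>> from urllib.request import urlopen
>>
>> BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"
>>
>> def get_record(identifier, prefix):
>>     # one GetRecord per identifier per metadata prefix, so a single
>>     # badly formed record only breaks its own response, not a whole chunk
>>     url = BASE + "?" + urlencode({
>>         "verb": "GetRecord",
>>         "identifier": identifier,
>>         "metadataPrefix": prefix,
>>     })
>>     return urlopen(url).read()
>>
>> # e.g. for each identifier from the ListIdentifiers pass:
>> # for prefix in ("pmc_fm", "pmc"):
>> #     data = get_record(oai_id, prefix)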
>>
>> I'm kind of interested in tools and techniques to rescue data from bad
>> OAI servers generally, so I might have a go at this with some groovy
>> tools I've been building... Anyone wanna race :P
>>
>> Cheers
>> Ian.
>>
>> On 21 January 2011 12:03, Ben O'Steen <bosteen at gmail.com> wrote:
>>> Key to scraping PMC is OAI-PMH - meaning, to be efficient, we need to
>>> divide up the 'history' of the site into chunks and start the
>>> sequential harvesting.
>>>
>>> The dates that the OAI-PMH service API uses are the 'acquisition' dates -
>>> the date at which a given record appeared in PMC's db. Note that this
>>> has nothing to do with the publication date!
>>>
>>> You ask the API for the records that appeared within a date
>>> range, and it passes back a partial set of records and a
>>> 'resumptionToken'. You need this token to construct the next URL to
>>> get the next part of the set for the requested date range. When you
>>> don't get a resumptionToken in the response, you should have all the
>>> records you asked for.
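>>>
>>> As a rough Python sketch (my names, and deliberately using a regex for
>>> the token rather than a parser - see the suggestions further down), the
>>> loop for one date range might be:
>>>
>>> import re
>>> from urllib.parse import urlencode
>>> from urllib.request import urlopen
>>>
>>> BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"
>>> TOKEN_RE = re.compile(r"<resumptionToken[^>]*>(.*?)</resumptionToken>", re.S)
>>>
>>> def harvest(from_date, until_date, prefix):
>>>     # yield each raw XML page for the date range; stop when no (or an
>>>     # empty) resumptionToken comes back
>>>     params = {"verb": "ListRecords", "metadataPrefix": prefix,
>>>               "from": from_date, "until": until_date}
>>>     while True:
>>>         body = urlopen(BASE + "?" + urlencode(params)).read().decode("utf-8", "replace")
>>>         yield body
>>>         match = TOKEN_RE.search(body)
>>>         if not match or not match.group(1).strip():
>>>             break
>>>         # follow-up requests carry only the verb and the token
>>>         params = {"verb": "ListRecords", "resumptionToken": match.group(1)}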
>>>
>>> The reason we need to chunk up by date is to allow more than
>>> one 'client' to start one of these sequential requests, and also in
>>> anticipation that for some periods we may hit server errors, bad XML
>>> and so on - things which don't allow us to get the next resumption
>>> token. (From my experience of it, it certainly uses Perl and string
>>> concatenation to build up the responses - you absolutely cannot
>>> guarantee valid XML!)
>>>
>>> If we can manufacture a task that can deal with the two modes of
>>> download (initial request and resumption requests), that should cover
>>> the whole harvest.
>>>
>>> Some figures:
>>> PubMed Central claims to have started in 2000 - that would make for
>>> roughly 286 two-week chunks.
>>> An oft-repeated claim is that an article is added to PubMed every three seconds.
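>>>
>>> Generating those chunks is trivial - something like this (the start and
>>> end dates are just guesses):
>>>
>>> from datetime import date, timedelta
>>>
>>> def two_week_chunks(start=date(2000, 1, 1), end=date(2011, 1, 21)):
>>>     # yield (from, until) pairs in YYYY-MM-DD form, two weeks at a time
>>>     step = timedelta(days=14)
>>>     current = start
>>>     while current < end:
>>>         chunk_end = min(current + step - timedelta(days=1), end)
>>>         yield current.isoformat(), chunk_end.isoformat()
>>>         current += step
>>>
>>> print(len(list(two_week_chunks())))  # in the ballpark of the ~286 above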
>>>
>>> URLs:
>>>
>>> The initial request looks like this (caps signify variables):
>>>
>>> http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&from=FROMDATE&until=TODATE
>>>
>>> http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&resumptionToken=RESUMPTIONTOKEN
>>> (That's the official way - we may need to stick in the from and until
>>> params too, if the service gets confused on certain ranges)
>>>
>>> The resumption tokens coming from the service seem to encode the last
>>> internal article id you got to, plus the dates - not entirely guessable,
>>> as the ids are erratic and no doubt used in a db query. (Incidentally,
>>> a likely db injection vector too... but I digress.)
>>>
>>> Response:
>>> Should be an XML response that sort of looks like this:
>>>
>>> <OAI-PMH ...>
>>>   ...
>>>   <ListRecords>
>>>     <record>
>>>       <metadata>
>>>         the meat of the record response is in here - should be a
>>>         chunk of NLM DTD XML
>>>       </metadata>
>>>       ...
>>>     </record>
>>>     <record>
>>>       ...
>>>     </record>
>>>     <!-- optionally: -->
>>>     <resumptionToken>Magiccodegoeshere</resumptionToken>
>>>   </ListRecords>
>>> </OAI-PMH>
>>>
>>> Errors:
>>> We'll probably hit server errors, blocks and so on, but there are some
>>> OAI-PMH specific error responses to watch out for:
>>>
>>> An OAI-PMH error looks something like this:
>>>
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
>>>         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
>>>         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
>>>  <responseDate>2002-05-01T09:18:29Z</responseDate>
>>>  <request>http://arXiv.org/oai2</request>
>>>  <error code="ERRORCODE">Error Description</error>
>>> </OAI-PMH>
>>>
>>> The ERRORCODEs to watch out for:
>>>
>>> "badResumptionToken" - The service doesn't like the resumption token
>>> which it no doubt just gave you
>>> "noRecordsMatch" - There's no records for that given combo of from +
>>> until. Doesn't mean that the records don't exist, just that the
>>> service can't find it in the db. Could be a timeout or something
>>> similar.
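>>>
>>> A quick way to spot these (again a regex rather than a parser, and the
>>> function name is mine):
>>>
>>> import re
>>>
>>> ERROR_RE = re.compile(r'<error\s+code="([^"]+)"')
>>>
>>> def oai_error(body):
>>>     # return the OAI-PMH error code in a response, or None if there isn't one
>>>     match = ERROR_RE.search(body)
>>>     return match.group(1) if match else None
>>>
>>> # e.g.:
>>> # if oai_error(body) == "badResumptionToken": redo the chunk from the start
>>> # if oai_error(body) == "noRecordsMatch": retry later or split the range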
>>>
>>>
>>> Suggestions:
>>>
>>> - Use a regex to get the resumptionToken - we can't guarantee that the
>>> XML won't choke a parser (without BeautifulStoneSoup, perhaps).
>>> - Use two-week chunks, as we may find dates that the service fails to
>>> supply. It's hard to overlap data from a failed range without working
>>> out how far it got in terms of acquisition date.
>>> - The first step is to get a full response for each date range and to
>>> deal with the XML later - it will need pulling out, aggregating and
>>> maybe turning into a much larger document per period (see the sketch
>>> after this list).
>>> - Conversion to something else can wait! :)
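>>>
>>> Pulling that together, the per-chunk task could be as dumb as this
>>> (assuming something like the harvest() generator sketched earlier; the
>>> file layout is just a suggestion):
>>>
>>> import os
>>>
>>> def dump_chunk(pages, label, outdir="pmc_raw"):
>>>     # save every raw XML page for one date range to disk; pulling the
>>>     # records out and aggregating them can happen later, offline
>>>     os.makedirs(outdir, exist_ok=True)
>>>     for i, page in enumerate(pages):
>>>         path = os.path.join(outdir, "%s_%04d.xml" % (label, i))
>>>         with open(path, "w", encoding="utf-8") as f:
>>>             f.write(page)
>>>
>>> # e.g. with the earlier harvest() sketch:
>>> # dump_chunk(harvest("2011-01-01", "2011-01-14", "pmc"),
>>> #            "2011-01-01_2011-01-14")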
>>>
>>> Ben
>>>
>>> On 19 January 2011 09:45, Rufus Pollock <rufus.pollock at okfn.org> wrote:
>>>> There was some discussion last week that we may need a distributed
>>>> scraping setup for a large scrape of data that we would like to do.
>>>>
>>>> Friedrich (in cc) has developed FlockScrape:
>>>>
>>>> <http://flockscrape.pudo.org/>
>>>>
>>>> It was originally built for scraping company info in Germany. It looks
>>>> like this could be useful for us.
>>>>
>>>> Rufus
>>>>
>>>
>>> _______________________________________________
>>> openbiblio-dev mailing list
>>> openbiblio-dev at lists.okfn.org
>>> http://lists.okfn.org/mailman/listinfo/openbiblio-dev
>>>
>>
>>
>>
>> --
>> Ian Ibbotson
>> W: http://ianibbo.me
>> E: ianibbo at gmail.com
>> skype: ianibbo
>> twitter: ianibbo
>>
>



