[openbiblio-dev] Distributed scraping using flockscrape
Ben O'Steen
bosteen at gmail.com
Fri Jan 21 12:32:32 UTC 2011
A 'ListIdentifiers' pass might not be a bad idea at all - it would give
us all the OAI identifiers at a low cost to the service. We may then
be able to generate resumption tokens to get around bad data. I'm
guessing that the token logic is "get 25 records, starting with
id=...., ordered by date, and don't go past the 'until' date".
Batch-loading is a good thing to bring up - maybe chunking per day for
the ListIdentifiers requests to find any potential hotspots?
It's worth pointing out to everyone else that the formats provided by
PMC are not like normal OAI formats (they aren't intended to be
equivalent forms of the same data): a metadataPrefix of "pmc_fm" is
meant to be all of a record's metadata, and "pmc" is meant to be the
full text of the article.
On 21 January 2011 12:15, ianibbo at gmail.com <ianibbo at gmail.com> wrote:
> Just wondering Ben...
>
> Do you think it would cope any better with doing a ListIdentifiers
> first, then following that with a ListMetadataFormats, and a GetRecord
> per available prefix? It might also help by not breaking whole chunks
> when one of the constituent records is badly formed. We've had some
> success with this approach on bad targets in the past... also
> dynamically shrinking the date range on big chunks. That works until
> someone batch-loads 10,000 records in one day and you're hosed.
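>
> Roughly the flow I mean, sketched in Python for illustration (the verbs
> are standard OAI-PMH; the fetch helper and the example date range are
> just assumptions, and pagination of ListIdentifiers is omitted):
>
>   import re
>   import urllib2
>
>   BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"
>   fetch = lambda url: urllib2.urlopen(url).read()
>
>   # 1. ListIdentifiers to enumerate the records cheaply
>   ids = re.findall(r"<identifier>(.*?)</identifier>",
>                    fetch(BASE + "?verb=ListIdentifiers&metadataPrefix=pmc"
>                                 "&from=2011-01-01&until=2011-01-14"))
>   # 2. ListMetadataFormats to see which prefixes the target offers
>   prefixes = re.findall(r"<metadataPrefix>(.*?)</metadataPrefix>",
>                         fetch(BASE + "?verb=ListMetadataFormats"))
>   # 3. GetRecord per identifier and prefix, so a badly formed record
>   #    only breaks one small request rather than a whole chunk
>   for oai_id in ids:
>       for prefix in prefixes:
>           record = fetch(BASE + "?verb=GetRecord&identifier=" + oai_id
>                               + "&metadataPrefix=" + prefix)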
>
> I'm kind of interested in tools and techniques for rescuing data from bad
> OAI servers generally, so I might have a go at this with some groovy
> tools I've been building... anyone wanna race? :P
>
> Cheers
> Ian.
>
> On 21 January 2011 12:03, Ben O'Steen <bosteen at gmail.com> wrote:
>> Key to scraping PMC is OAI-PMH - meaning that, to be efficient, we need
>> to divide up the 'history' of the site into chunks and start the
>> sequential harvesting.
>>
>> The dates that the OAI-PMH service uses are 'acquisition' dates:
>> the date on which a given record appeared in PMC's db. Note that this
>> has nothing to do with the publication date!
>>
>> You ask the API for the records that appeared within a date
>> range, and it passes back a partial set of records and a
>> 'resumptionToken'. You need this token to construct the next URL to
>> get the next part of the set for the requested date range. When you
>> don't get a resumptionToken in the response, you should have all the
>> records you asked for.
>>
>> The reason we need to chunk up by date is to allow more than one
>> 'client' to start one of these sequential harvests, and also in
>> anticipation that, for some periods, we may hit server errors, bad XML
>> and so on - things which don't allow us to get the next resumption
>> token. (From my experience of it, the service certainly uses Perl and
>> string concatenation to build up the responses - you absolutely cannot
>> guarantee valid XML!)
>>
>> If we can manufacture a task that can deal with the two modes of
>> download (initial request and resumption requests), the harvest should
>> be straightforward to distribute.
>>
>> Some figures:
>> PubMed Central claims to have started in 2000 - roughly eleven years of
>> history, which works out to about 286 two-week chunks.
>> The oft-repeated claim is that an article is added to PubMed every three seconds.
>>
>> URLs:
>>
>> Initial request looks like: (Caps signify variables)
>>
>> http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&from=FROMDATE&until=TODATE
>>
>> http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&resumptionToken=RESUMPTIONTOKEN
>> (That's the official way - we may need to stick in the from and until
>> params too, if the service gets confused on certain ranges)
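>>
>> As a sketch of the two request modes chained together (Python; the
>> regex-for-the-token approach is the one suggested further down, and
>> retries and error handling are left out):
>>
>>   import re
>>   import urllib
>>   import urllib2
>>
>>   BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"
>>
>>   def harvest(fromdate, todate):
>>       """Yield each raw ListRecords page for one date range."""
>>       url = (BASE + "?verb=ListRecords&metadataPrefix=pmc"
>>              + "&from=" + fromdate + "&until=" + todate)
>>       while True:
>>           page = urllib2.urlopen(url).read()
>>           yield page
>>           # Regex rather than an XML parser - the response may not be valid XML
>>           m = re.search(r"<resumptionToken[^>]*>([^<]+)</resumptionToken>", page)
>>           if not m:
>>               break  # no token means this date range is finished
>>           url = (BASE + "?verb=ListRecords&resumptionToken="
>>                  + urllib.quote(m.group(1)))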
>>
>> The resumption tokens coming from the service seem to encode the last
>> internal article id you got to, plus the dates - not entirely guessable,
>> as the ids are erratic and no doubt used in a db query. (Incidentally,
>> a likely db injection vector too... but I digress.)
>>
>> Response:
>> Should be an XML response that sort of looks like this:
>>
>> <OAI-PMH ...>
>>   ...
>>   <ListRecords>
>>     <record>
>>       <metadata> the meat of the record response is in here - should
>>       be a chunk of NLM DTD XML </metadata>
>>       ...
>>     </record>
>>     <record>
>>       ....
>>     </record>
>>     <!-- optionally: -->
>>     <resumptionToken>Magiccodegoeshere</resumptionToken>
>>   </ListRecords>
>> </OAI-PMH>
>>
>> Errors:
>> We'll probably hit server errors, blocks and so on, but there are some
>> OAI-PMH-specific error responses to watch out for:
>>
>> An OAI-PMH error looks something like this:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>> xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
>> http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
>> <responseDate>2002-05-01T09:18:29Z</responseDate>
>> <request>http://arXiv.org/oai2</request>
>> <error code="ERRORCODE">Error Description</error>
>> </OAI-PMH>
>>
>> The ERRORCODEs to watch out for:
>>
>> "badResumptionToken" - the service doesn't like the resumption token
>> which it no doubt just gave you.
>> "noRecordsMatch" - there are no records for that combination of 'from'
>> and 'until'. This doesn't mean that the records don't exist, just that
>> the service can't find them in the db. It could be a timeout or
>> something similar.
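>>
>> Checking for these could be as simple as something like this (a sketch -
>> what we do with a failed range, e.g. requeue it, is up to the task design):
>>
>>   import re
>>
>>   def oai_error(page):
>>       """Return the OAI-PMH error code from a response page, or None."""
>>       m = re.search(r'<error code="([^"]+)"', page)
>>       return m.group(1) if m else None
>>
>>   # e.g. on badResumptionToken, restart the whole date range;
>>   # on noRecordsMatch, treat the range as 'maybe empty, retry once later'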
>>
>>
>> Suggestions:
>>
>> - Use a regex to get the resumptionToken - we can't guarantee that the
>> XML won't choke a parser (without BeautifulStoneSoup, perhaps).
>> - Use two-week chunks, as we may find dates that the service fails to
>> supply. It's hard to overlap data from a failed range without working
>> out how far it got in terms of acquisition date.
>> - The first step is to get a full response for each date range and to
>> deal with the XML later (see the sketch after this list). It will need
>> pulling out, aggregating and maybe turning into a much larger document
>> per period.
>> - Conversion to something else can wait! :)
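>>
>> For the 'full response per date range' step, something along these lines
>> (a sketch - the filenames and directory layout are just a suggestion, and
>> harvest() is the loop sketched above):
>>
>>   import os
>>
>>   def save_range(fromdate, todate, outdir="raw"):
>>       """Write each raw ListRecords page for a date range to disk."""
>>       if not os.path.exists(outdir):
>>           os.makedirs(outdir)
>>       for n, page in enumerate(harvest(fromdate, todate)):
>>           name = "%s_%s_part%03d.xml" % (fromdate, todate, n)
>>           open(os.path.join(outdir, name), "wb").write(page)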
>>
>> Ben
>>
>>
>> On 19 January 2011 09:45, Rufus Pollock <rufus.pollock at okfn.org> wrote:
>>> There was some discussion last week that we may need a
>>> distributed scraping setup for a large data scrape that we would
>>> like to do.
>>>
>>> Friedrich (in cc) has developed FlockScrape:
>>>
>>> <http://flockscrape.pudo.org/>
>>>
>>> It was originally built for scraping company information in Germany, but
>>> it looks like it could be useful for us.
>>>
>>> Rufus
>>>
>>
>> _______________________________________________
>> openbiblio-dev mailing list
>> openbiblio-dev at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/openbiblio-dev
>>
>
>
>
> --
> Ian Ibbotson
> W: http://ianibbo.me
> E: ianibbo at gmail.com
> skype: ianibbo
> twitter: ianibbo
>