[openbiblio-dev] Distributed scraping using flockscrape

ianibbo at gmail.com ianibbo at gmail.com
Fri Jan 21 12:15:58 UTC 2011


Just wondering Ben...

Do you think it would cope any better with just doing a ListIdentifiers
first, then following that with a ListMetadataFormats, and an OAI
GetRecord per available prefix? It might also help by not confusing
chunks when one of the constituent records is badly formed. We've had
some success with this approach on bad targets in the past... also
dynamically shrinking the date range on big chunks. That works until
someone batch-loads 10,000 records in one day and you're hosed.
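
To make that concrete, something like this is the shape I mean -
sketched in Python for the sake of argument rather than the groovy
stuff, with the PMC endpoint filled in and the parsing done with a
deliberately dumb regex (resumption handling for ListIdentifiers is
left out, and harvest_one_by_one / oai_get are just illustrative names):

import re
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"

def oai_get(params):
    # One raw OAI-PMH request; hand back the body as text, bad bytes and all.
    with urlopen(BASE + "?" + urlencode(params)) as resp:
        return resp.read().decode("utf-8", errors="replace")

def harvest_one_by_one(from_date, until_date, prefix="pmc"):
    # ListIdentifiers first, then a GetRecord per identifier, so one badly
    # formed record only poisons its own response rather than a whole chunk.
    listing = oai_get({"verb": "ListIdentifiers", "metadataPrefix": prefix,
                       "from": from_date, "until": until_date})
    for oai_id in re.findall(r"<identifier>(.*?)</identifier>", listing):
        yield oai_id, oai_get({"verb": "GetRecord", "metadataPrefix": prefix,
                               "identifier": oai_id})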

I'm kind of interested in tools and techniques for rescuing data from
bad OAI servers generally, so I might have a go at this with some
groovy tools I've been building... anyone wanna race? :P

Cheers
Ian.

On 21 January 2011 12:03, Ben O'Steen <bosteen at gmail.com> wrote:
> Key to scraping PMC is OAI-PMH - meaning, to be efficient, we need to
> divide up the 'history' of the site into chunks and start sequential
> harvesting on each chunk.
>
> The dates that the OAI-PMH service API uses are the 'acquisition'
> dates - the date at which a given record appeared in PMC's db. Note
> that this has nothing to do with the publication date!
>
> You ask the API for the records that appeared within a date range,
> and it passes back a partial set of records and a 'resumptionToken'.
> You need this token to construct the next URL, which gets you the
> next part of the set for the requested date range. When you don't get
> a resumptionToken in the response, you should have all the records
> you asked for.
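>
> Per date-range chunk, the loop is roughly this (a sketch only -
> fetch() and extract_resumption_token() stand in for whatever HTTP
> call and token-scraping the real task ends up using):
>
> pages = []
> token = None
> while True:
>     if token is None:
>         # first request for this date range
>         body = fetch({"verb": "ListRecords", "metadataPrefix": "pmc",
>                       "from": FROMDATE, "until": TODATE})
>     else:
>         # follow-up request using the token from the previous response
>         body = fetch({"verb": "ListRecords", "resumptionToken": token})
>     pages.append(body)
>     token = extract_resumption_token(body)  # None once the set is complete
>     if token is None:
>         break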
>
> The reason we need to chunk up by date is to allow more than one
> 'client' to start one of these sequential requests, and also in
> anticipation that, for some periods, we may hit server errors, bad
> XML and so on - things which don't allow us to get the next
> resumption token. (From my experience of it, the service certainly
> uses Perl and string concatenation to build up the responses - you
> absolutely cannot guarantee valid XML!)
>
> If we can manufacture a task that can deal with the two modes of
> download (initial request and resumption requests), we can farm the
> date-range chunks out to as many clients as we like.
>
> Some figures:
> PubMed Central claims to have started in 2000 - that would make for
> 286 two-week chunks.
> An oft-repeated claim is that an article is added to PubMed every
> three seconds.
>
> URLs:
>
> The initial request looks like this (caps signify variables):
>
> http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&from=FROMDATE&until=TODATE
>
> A resumption request looks like this (officially, at least - we may
> need to stick the from and until params in too, if the service gets
> confused on certain ranges):
>
> http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&resumptionToken=RESUMPTIONTOKEN
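>
> Generating the initial-request URLs for every two-week chunk is then
> just (another sketch, standard library only):
>
> from datetime import date, timedelta
> from urllib.parse import urlencode
>
> BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"
>
> def chunk_urls(start=date(2000, 1, 1), end=date(2011, 1, 21), days=14):
>     # One initial ListRecords URL per two-week slice of acquisition dates.
>     # Adjacent chunks share a boundary day - better a duplicate than a gap.
>     while start < end:
>         stop = min(start + timedelta(days=days), end)
>         yield BASE + "?" + urlencode({"verb": "ListRecords",
>                                       "metadataPrefix": "pmc",
>                                       "from": start.isoformat(),
>                                       "until": stop.isoformat()})
>         start = stop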
>
> The resumption tokens coming from the service seem to encode the last
> internal article id you got to, plus the dates - not entirely
> guessable, as the ids are erratic and no doubt used in a db query.
> (Incidentally, a likely db injection vector too... but I digress.)
>
> Response:
> Should be an XML response that sort of looks like this:
>
> <OAIPMH ...>
> ...
>  <ListRecords>
>     <record>
>        <metadata>  the meat of the record response is in here. Should
> be a chunk of NLM DTD xml.</metadata>
>        ...
>     </record>
>     <record>
>     ....
>     <!--  optionally: -->
>     <resumptionToken>Magiccodegoeshere</resumptionToken>
>   </ListRecords>
> </OAIPMH>
>
> Errors:
> We'll probably hit server errors, blocks and so on, but there are some
> OAIPMH specific error responses to watch out for:
>
> An OAI-PMH error looks something like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
>         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
>         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
>  <responseDate>2002-05-01T09:18:29Z</responseDate>
>  <request>http://arXiv.org/oai2</request>
>  <error code="ERRORCODE">Error Description</error>
> </OAI-PMH>
>
> The ERRORCODEs to watch out for:
>
> "badResumptionToken" - The service doesn't like the resumption token
> which it no doubt just gave you
> "noRecordsMatch" - There's no records for that given combo of from +
> until. Doesn't mean that the records don't exist, just that the
> service can't find it in the db. Could be a timeout or something
> similar.
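>
> A crude check for these (regex again, rather than trusting a parser;
> oai_error is just an illustrative name):
>
> import re
>
> def oai_error(body):
>     # Return (code, message) if the response is an OAI-PMH error, else None.
>     m = re.search(r'<error\s+code="([^"]+)"[^>]*>(.*?)</error>', body, re.S)
>     return (m.group(1), m.group(2)) if m else None
>
> My guess is badResumptionToken means "restart the whole date range",
> while noRecordsMatch is worth retrying a couple of times before
> writing the range off as genuinely empty.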
>
>
> Suggestions:
>
> - Use a regex to get the resumptionToken - we can't guarantee that the
> XML won't choke a parser (without BeautifulStoneSoup, perhaps); see
> the sketch after this list.
> - Use two-week chunks, as we may find dates for which the service
> fails to supply data. It's hard to overlap data from a failed range
> without working out how far it got in terms of acquisition date.
> - The first step is to get a full response for each date range and
> deal with the XML later. It will need pulling out, aggregating and
> maybe turning into one much larger document per period.
> - Conversion to something else can wait! :)
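>
> For the token-grabbing, something like this (a sketch - keep the
> pattern loose, since we can't trust the markup):
>
> import re
>
> TOKEN_RE = re.compile(r"<resumptionToken[^>]*>\s*(.*?)\s*</resumptionToken>",
>                       re.S)
>
> def extract_resumption_token(body):
>     # Returns the token text, or None when the list is complete (no
>     # token at all, or an empty element on the final page).
>     m = TOKEN_RE.search(body)
>     if not m:
>         return None
>     return m.group(1) or None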
>
> Ben
>
>
> On 19 January 2011 09:45, Rufus Pollock <rufus.pollock at okfn.org> wrote:
>> There was some discussion last week that we may need a distributed
>> scraping setup for a large scrape of data that we would like to do.
>>
>> Friedrich (in cc) has developed FlockScrape:
>>
>> <http://flockscrape.pudo.org/>
>>
>> He built it originally for scraping company info in Germany, but it
>> looks like it could be useful for us.
>>
>> Rufus
>>
>



-- 
Ian Ibbotson
W: http://ianibbo.me
E: ianibbo at gmail.com
skype: ianibbo
twitter: ianibbo


