[pdb-discuss] british library screen scraping
Rufus Pollock
rufus.pollock at okfn.org
Fri Apr 14 20:42:24 UTC 2006
Nathan Lewis wrote:
>
> Hi Rufus,
>
> I don't know python well enough to fix your code though it does look
> quite similar. I suspect Python's mechanize works differently to
> WWW::Mechanize in perl.

I suspect so too :). Anyway I learnt plenty from porting (having never
used mechanize before ...)

> Anyway I will continue with mine since there isn't much left to do.

Please do. You're doing a great job.

> I am running mysql 5.0 here but it should be importable even if you
> are running an older version.

A mysql dump should be fine, though flat files might be even easier.
> But one question, do we want to search on other years? Assuming we do,
> what range? What is the most recent year something could fall out of UK
> copyright?
We want all years up until 1955 (frankly we could do with *all* of the
data). However, the number of works seems to grow *rapidly*, e.g. for
1954 I think there are over 130,000 works. So for the time being I'd
suggest we just practise on 1900 (or, if we want a bit more, say
1900-1910). The best thing is probably to make the script configurable,
e.g. so we can pass it a date range.
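Something like the following is all I have in mind -- a rough, untested
sketch, where scrape_year() is just a stand-in for the per-year scraping
logic in the scripts below:

#!/usr/bin/env python
# Rough sketch only: take a year or a year range on the command line,
# e.g. "python scrape.py 1900 1910", and call the per-year scraping code
# once per year. scrape_year() is a placeholder for the real logic.
import sys

def scrape_year(year):
    # placeholder: the real code would run the Advanced search for `year`
    # and save out each detail page, as in the scripts below
    print('would scrape works published in %s' % year)

def main(argv):
    start = int(argv[1])
    if len(argv) > 2:
        end = int(argv[2])
    else:
        end = start
    for year in range(start, end + 1):
        scrape_year(year)

if __name__ == '__main__':
    main(sys.argv)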
Thanks for your efforts on this. We're making great progress.
~rufus
> On Apr 13, 2006, at 11:16 PM, Rufus Pollock wrote:
>
>> Great work Nathan! Unfortunately I am useless at perl so my port to
>> python is below. Having run my version a bit I don't think everything
>> is working quite right (among other things the br.back() seems to
>> break the session so I only get one run through -- though this could
>> be fixed by moving the for loop to the outside).
>>
>> Of course if you have already got this working this is rather academic
>> and we can move on to analyzing the individual pages. If you already
>> have a cache just post a mysqldump or the tarred set of files up on a
>> page somewhere and send the link.
>>
>> ~rufus
>>
>> #!/usr/bin/env python
>> import time
>> import re
>> from mechanize import Browser
>>
>> url = 'http://cadensa.bl.uk/cgi-bin/webcat'
>> year = 1900
>> verbose = True
>>
>> def _print(msg):
>>     if verbose:
>>         print(msg)
>>
>> def main():
>>     br = Browser()
>>     br.addheaders = [ ('User-agent', 'Firefox') ]
>>     br.set_handle_robots(False)
>>     _print('Opening url: %s' % url)
>>     br.open(url)
>>
>>     # follow the link whose text is 'Advanced search'
>>     _print('Going to Advanced search page')
>>     br.follow_link(text='Advanced search')
>>     assert br.viewing_html()
>>
>>     br.select_form(name='searchform')
>>     _print('Getting results for year %s' % year)
>>     br['pubyear'] = str(year)
>>     br.submit()
>>
>>     for base in (1, 21, 41, 61, 81, 101, 121):
>>         if base > 1:
>>             # intended to jump to the next block of 20 hits
>>             br.select_form('hitlist')
>>             br['form_type'] = 'JUMP^%s' % base
>>         for ii in range(19):
>>             ii += base
>>             # open the detail page for hit number ii
>>             br.select_form('hitlist')
>>             br.button = 'VIEW^%s' % ii
>>             response = br.submit()
>>
>>             # TODO: save page to disk instead of just printing it
>>             _print('Saving page: %s' % response.geturl())
>>             _print('*****************************')
>>             _print(response.read())
>>             _print('*****************************')
>>
>>             time.sleep(1) # give server a break
>>             br.back()
>>             _print('After back page is: %s' % br.geturl())
>>
>> if __name__ == '__main__':
>>     main()
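
One more thought on the br.back() problem I mention above: what I mean by
moving the loop outside is roughly to re-run the search for each detail
page rather than relying on the browser history. A rough, untested sketch
below -- it reuses the url, year and mechanize calls from the script
above, and borrows the extra submit after the JUMP step from Nathan's
perl version, so treat it as a guess rather than working code:

#!/usr/bin/env python
# Untested sketch: fetch each of the ~140 detail pages with a fresh
# search, avoiding br.back() entirely.
import time
from mechanize import Browser

url = 'http://cadensa.bl.uk/cgi-bin/webcat'
year = 1900

def fetch_detail_page(index):
    br = Browser()
    br.addheaders = [ ('User-agent', 'Firefox') ]
    br.set_handle_robots(False)
    br.open(url)
    br.follow_link(text='Advanced search')
    br.select_form(name='searchform')
    br['pubyear'] = str(year)
    br.submit()
    # jump to the block of 20 hits containing `index`, then view the record
    base = ((index - 1) // 20) * 20 + 1
    if base > 1:
        br.select_form('hitlist')
        br['form_type'] = 'JUMP^%s' % base
        br.submit()
    br.select_form('hitlist')
    br.button = 'VIEW^%s' % index
    return br.submit()

def main():
    for index in range(1, 141):
        response = fetch_detail_page(index)
        outfile = open('page%s.html' % index, 'w')
        outfile.write(response.read())
        outfile.close()
        time.sleep(1)  # give the server a break

if __name__ == '__main__':
    main()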
>>
>> Nathan Lewis wrote:
>>
>>> Hi Everyone,
>>> So I have a scraper written to save out the detail pages for each
>>> recording listed in the year 1900 search on the British Library site.
>>> I haven't written the HTML parsing for the saved pages yet; that will
>>> be next.
>>> Before doing that, what kind of database are you using? Would a
>>> mysqldump be useful?
>>> The code is below. You should only need to have perl and the
>>> WWW::Mechanize CPAN module installed to run it.
>>> Cheers,
>>> Nathan
>>> #!/usr/bin/perl
>>> use WWW::Mechanize;
>>>
>>> my $mech = WWW::Mechanize->new();
>>> my $url = 'http://cadensa.bl.uk/cgi-bin/webcat';
>>> $mech->agent('FireFox'); # to be a little less obvious
>>> $mech->get( $url );
>>> $mech->follow_link( text_regex => qr/Advanced search/ );
>>> #print $mech->content;
>>>
>>> # the search form begins on line 735 of the html
>>> $mech->submit_form(
>>>     form_name => 'searchform',
>>>     fields    => { pubyear => 1900 },
>>>     # enters the year 1900 and submits the form
>>> );
>>>
>>> foreach my $base (1,21,41,61,81,101,121) {
>>>     if( $base > 1 ) {
>>>         # jump to the next block of 20 results in the hit list
>>>         $mech->submit_form(
>>>             form_name => 'hitlist',
>>>             fields    => { form_type => "JUMP^$base" },
>>>         );
>>>     }
>>>     foreach my $i (0..19) {
>>>         $i += $base;
>>>         # open the detail page for hit number $i
>>>         $mech->submit_form(
>>>             form_name => 'hitlist',
>>>             button    => "VIEW^$i",
>>>         );
>>>         if( open( my $fh, ">page$i.html" ) ) {
>>>             print $fh $mech->content;
>>>             close $fh;
>>>         } else {
>>>             print $mech->content; next;
>>>         }
>>>         sleep 1 if( $i % 2 ); # give the server a rest
>>>         $mech->back();
>>>     }
>>> }
>
>
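
On the parsing side: I haven't looked at the detail-page HTML yet, so the
following is only a placeholder sketch of the general shape -- it just
walks the saved page*.html files and pulls out the page title and the
text of each table row (BeautifulSoup here is just an example choice,
any HTML parser would do). The real field extraction will depend on the
actual markup:

#!/usr/bin/env python
# Placeholder sketch: walk the pages saved by the scraper and print a
# crude summary. Real parsing depends on the detail-page markup.
import glob
from BeautifulSoup import BeautifulSoup

def parse_page(filename):
    soup = BeautifulSoup(open(filename).read())
    title = ''
    if soup.title:
        title = soup.title.string
    # crude first pass: the visible text of every table row
    rows = []
    for tr in soup.findAll('tr'):
        rows.append(' '.join(tr.findAll(text=True)).strip())
    return title, rows

if __name__ == '__main__':
    for filename in sorted(glob.glob('page*.html')):
        title, rows = parse_page(filename)
        print('%s: %s (%d table rows)' % (filename, title, len(rows)))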