[pdb-discuss] British Library screen scraping

Rufus Pollock rufus.pollock at okfn.org
Thu Apr 13 22:16:29 UTC 2006


Great work Nathan! Unfortunately I am useless at Perl, so my port to 
Python is below. Having run my version a bit, I don't think everything 
is working quite right: among other things, br.back() seems to break 
the session, so I only get one run through. This could be fixed by 
moving the for loop to the outside, i.e. redoing the search for each 
hit rather than navigating back (see the sketch below).
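
Something like the following is what I have in mind -- an untested 
sketch, where fetch_hit is a hypothetical helper that reuses the url 
and year globals from the script below, and which assumes 20 hits per 
results page as in Nathan's version:

# Untested sketch: sidestep br.back() by starting a fresh session for
# each hit. fetch_hit is a hypothetical name; url and year are the
# globals defined in the full script below.
def fetch_hit(ii):
    br = Browser()
    br.addheaders = [('User-agent', 'Firefox')]
    br.set_handle_robots(False)
    br.open(url)
    br.follow_link(text='Advanced search')
    br.select_form(name='searchform')
    br['pubyear'] = str(year)
    br.submit()
    base = ((ii - 1) // 20) * 20 + 1  # first hit on this hit's page
    if base > 1:
        br.select_form(name='hitlist')
        br['form_type'] = 'JUMP^%s' % base
        br.submit()
    br.select_form(name='hitlist')
    return br.submit(name='VIEW^%s' % ii)

Slower, since it redoes the search for every hit, but it never touches 
back().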

Of course, if you have already got this working, this is rather 
academic and we can move on to analyzing the individual pages. If you 
already have a cache, just post a mysqldump or a tarred set of the 
files up on a page somewhere and send the link.

~rufus

#!/usr/bin/env python
import time
import re
from mechanize import Browser

url = 'http://cadensa.bl.uk/cgi-bin/webcat'
year = 1900
verbose = True
def _print(msg):
    if verbose:
        print(msg)

def main():
    br = Browser()
    br.addheaders = [('User-agent', 'Firefox')]  # be a little less obvious
    br.set_handle_robots(False)
    _print('Opening url: %s' % url)
    br.open(url)

    # follow the link whose text is 'Advanced search'
    _print('Going to Advanced search page')
    br.follow_link(text='Advanced search')
    assert br.viewing_html()

    # fill in the year and submit the advanced search form
    br.select_form(name='searchform')
    _print('Getting results for year %s' % year)
    br['pubyear'] = str(year)
    br.submit()

    # results come 20 to a page: jump to each page, then view each hit
    for base in (1, 21, 41, 61, 81, 101, 121):
        if base > 1:
            br.select_form(name='hitlist')
            br['form_type'] = 'JUMP^%s' % base
            br.submit()  # actually perform the jump
        for offset in range(20):  # 20 hits per page, as in the Perl
            ii = base + offset
            br.select_form(name='hitlist')
            # click the VIEW button for this hit
            response = br.submit(name='VIEW^%s' % ii)

            # save the details page to disk, as the Perl version does
            _print('Saving page: %s' % response.geturl())
            fh = open('page%s.html' % ii, 'w')
            fh.write(response.read())
            fh.close()

            time.sleep(1)  # give the server a break
            br.back()  # NB: this is what seems to break the session
            _print('After back page is: %s' % br.geturl())

if __name__ == '__main__':
    main()

Nathan Lewis wrote:
> 
> Hi Everyone,
> 
> So I have a scraper written to save out the details pages for each 
> recording listed in the year 1900 search on the British Library site. 
> The code is below. I haven't written the HTML parsing for the saved 
> pages yet; that will be next.
> 
> Before doing that what kind of database are you using? Would a mysqldump 
> be useful?
> 
> 
> To run it you should need only Perl and the WWW::Mechanize CPAN 
> module installed.
> 
> Cheers,
> 
> Nathan
> 
> 
> #!/usr/bin/perl
> 
> use WWW::Mechanize;
> 
> my $mech = WWW::Mechanize->new();
> 
> my $url = 'http://cadensa.bl.uk/cgi-bin/webcat';
> $mech->agent('FireFox');    # to be a little less obvious
> $mech->get( $url );
> 
> $mech->follow_link( text_regex => qr/Advanced search/);
> 
> #print $mech->content;
> 
> # form begins on line 735 of the html
> $mech->submit_form(
>         form_name => 'searchform',
>         fields => { pubyear => 1900 },
>         # enters the year 1900 and submits the form
> );
> 
> foreach my $base (1,21,41,61,81,101,121) {
> 
>         if( $base > 1 ) {
>                 $mech->submit_form(
>                         form_name => 'hitlist',
>                         fields => { form_type => "JUMP^$base" },
>                 );
>         }
>         foreach my $i (0..19) {
>                 $i += $base;
>                 $mech->submit_form(
>                         form_name => 'hitlist',
>                         button => "VIEW^$i",
>                 );
> 
>                 if( open( my $fh, ">page$i.html" ) ) {
>                         print $fh $mech->content;
>                         close $fh;
>                 } else {
>                         print $mech->content; next
>                 }
>                 sleep 1 if $i % 2;      # give the server a rest
>                 $mech->back();
>         }
> }