[pdb-discuss] british library screen scraping

Wed Apr 12 22:01:08 UTC 2006

Hi Everyone,

I have written a scraper that saves the details page for each 
recording listed in the year 1900 search on the British Library site. 
I haven't written the HTML parsing for the saved pages yet. That will 
be next.

Before I do that, what kind of database are you using? Would a 
mysqldump be useful?

The code is below. To run it you should only need Perl and the 
WWW::Mechanize CPAN module installed.

Cheers,

Nathan

#!/usr/bin/perl

use strict;
use warnings;

use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors

my $url = 'http://cadensa.bl.uk/cgi-bin/webcat';
$mech->agent('FireFox');    # to be a little less obvious
$mech->get( $url );

$mech->follow_link( text_regex => qr/Advanced search/);

#print $mech->content;

# form begins on line 735 of the html
$mech->submit_form(
         form_name => 'searchform',
         fields => { pubyear => 1900 },
         # enters the year 1900 and submits the form
);

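# result pages start at records 1, 21, 41, ... with 20 hits per page
foreach my $base (1,21,41,61,81,101,121) {

         # the first page is already showing; jump to the later ones
         if( $base > 1 ) {
                 $mech->submit_form(
                         form_name => 'hitlist',
                         fields => { form_type => "JUMP^$base" },
                 );
         }
         foreach my $offset (0..19) {
                 my $i = $base + $offset;    # record number on the hit list
                 $mech->submit_form(
                         form_name => 'hitlist',
                         button => "VIEW^$i",
                 );

                 # save the details page; if the file can't be written,
                 # dump the page to stdout instead
                 if( open( my $fh, '>', "page$i.html" ) ) {
                         print $fh $mech->content;
                         close $fh;
                 } else {
                         warn "cannot write page$i.html: $!\n";
                         print $mech->content;
                 }
                 sleep 1         if( $i % 2);    # give the server a rest
                 $mech->back();    # always return to the hit list
         }
}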
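
The paging arithmetic in the loops above can be checked on its own, 
without touching the server. The sketch below is only an illustration: 
it assumes, as the script does, seven result pages of 20 hits each, 
and the names $per_page, @bases and @records are made up for this 
example. It lists the record numbers that would be passed to VIEW^n.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: 20 hits per result page and 7 full pages, as hard-coded above.
my $per_page = 20;
my @bases    = map { 1 + $per_page * $_ } 0 .. 6;    # 1, 21, ..., 121

# Expand each page base into the record numbers used for VIEW^n.
my @records;
for my $base (@bases) {
    push @records, $base + $_ for 0 .. $per_page - 1;
}

printf "%d records, from %d to %d\n",
    scalar @records, $records[0], $records[-1];
# prints "140 records, from 1 to 140"
```

If the search returns fewer hits than that, the last page will not be 
full, so the list of bases would probably need adjusting.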