[pdb-discuss] british library screen scraping

Tue Apr 11 21:40:48 UTC 2006

FYI: Here is Perl code that uses WWW::Mechanize to go to the British
Library Sound Archive and search on the year 1900.

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

# Fetch the catalogue front page.
my $mech = WWW::Mechanize->new();
my $url = 'http://cadensa.bl.uk/cgi-bin/webcat';
$mech->get( $url );

# Follow the "Advanced search" link, then enter the year 1900
# and submit the search form.
$mech->follow_link( text_regex => qr/Advanced search/ );
$mech->submit_form(
         form_name => 'searchform',
         fields => { pubyear => 1900 },
);

# Print the HTML of the results page.
print $mech->content;
__END__

From the retrieved page we can easily extract records such as:

1CD0028844 D1 S1 BD31 SYMPOSIUM
Faust (Act 4)/Gounod
Unnamed Male Chorus
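
As a rough sketch of that extraction step, the snippet below splits one
such record into named fields. It assumes the record has already been
reduced to three text lines as shown above, and the field names
(shelfmark, title, composer, performer) are my guesses from this single
example, not labels used by the catalogue. parse_record is a
hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a three-line text record into a hash.
# Field names are assumptions based on the one example record.
sub parse_record {
    my ($text) = @_;

    # Keep non-blank lines and trim surrounding whitespace.
    my @lines = grep { /\S/ } split /\n/, $text;
    for (@lines) { s/^\s+//; s/\s+$//; }
    return unless @lines >= 3;

    # The second line looks like "title/composer"; split on the first slash.
    my ( $title, $composer ) = split m{/}, $lines[1], 2;

    return {
        shelfmark => $lines[0],
        title     => $title,
        composer  => $composer,
        performer => $lines[2],
    };
}

my $record = parse_record(<<'END');
1CD0028844 D1 S1 BD31 SYMPOSIUM
Faust (Act 4)/Gounod
Unnamed Male Chorus
END

print "$record->{title} by $record->{composer}\n";
```

Records whose second line has no slash would come back with an undefined
composer, so real data would probably need more checks than this.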

It appears we can also get the full details of a record by submitting
the form with its "Details" button:
<input type="submit" value="Details" name="VIEW^1" id="VIEW1" 
class="itemdetails">
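
A possible way to find those Details buttons is sketched below. It runs
on a pasted copy of the input element rather than the live page, and
the regex is only a rough stand-in for a proper HTML parser.
details_button_names is a hypothetical helper, and I am assuming the
buttons sit in the form that WWW::Mechanize treats as current.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the name attribute of every <input>
# whose class is "itemdetails".
sub details_button_names {
    my ($html) = @_;
    my @names;
    while ( $html =~ /<input\b([^>]*)>/gi ) {
        my $attrs = $1;
        next unless $attrs =~ /\bclass="itemdetails"/i;
        push @names, $1 if $attrs =~ /\bname="([^"]*)"/i;
    }
    return @names;
}

my $sample = <<'END';
<input type="submit" value="Details" name="VIEW^1" id="VIEW1"
class="itemdetails">
END

my @names = details_button_names($sample);
print "$_\n" for @names;

# With a live $mech sitting on the results page, a button could then
# be pressed and the detail page read back like this (untested here):
#   $mech->click_button( name => $names[0] );
#   print $mech->content;
#   $mech->back;
```

In the live script, details_button_names( $mech->content ) would supply
the names, and click_button would be called once per name with a
$mech->back in between.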

I hope this helps,

Nathan