[pdb-discuss] british library screen scraping
Rufus Pollock
rufus.pollock at okfn.org
Thu Apr 13 22:16:29 UTC 2006
Great work Nathan! Unfortunately I am useless at Perl, so my port to
Python is below. Having run my version a bit, I don't think everything is
working quite right (among other things, br.back() seems to break the
session, so I only get one run through -- though this could be fixed by
moving the for loop to the outside).
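The fix mentioned in parentheses -- moving the loop outside -- could be sketched like this: rather than calling br.back() after each detail page, redo the whole navigation for every hit. This is just a sketch, untested against the live site; the hit numbering (pages of 20 starting at 1, 21, 41, ...) is assumed from the JUMP^/VIEW^ values in the script below.

```python
# Hypothetical sketch of "moving the for loop to the outside": re-drive the
# search from the top for every hit, so no step ever depends on br.back().
# Slower, but each hit's navigation is independent of the previous one.

URL = 'http://cadensa.bl.uk/cgi-bin/webcat'
YEAR = 1900

def hit_indices(n_pages=7, per_page=20):
    # flat list of (jump_base, hit_number) pairs; numbering assumed from the
    # JUMP^1, JUMP^21, ... and VIEW^n values used by the hitlist form
    pairs = []
    for page in range(n_pages):
        base = 1 + page * per_page
        for offset in range(per_page):
            pairs.append((base, base + offset))
    return pairs

def fetch_hit(br, base, hit):
    # hypothetical driver: redo the search, jump to the right page of
    # results, then view one hit -- no br.back() involved
    br.open(URL)
    br.follow_link(text='Advanced search')
    br.select_form(name='searchform')
    br['pubyear'] = str(YEAR)
    br.submit()
    if base > 1:
        br.select_form('hitlist')
        br['form_type'] = 'JUMP^%s' % base
        br.submit()
    br.select_form('hitlist')
    return br.submit(name='VIEW^%s' % hit)
```

The outer loop then collapses to `for base, hit in hit_indices(): fetch_hit(br, base, hit)`.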
Of course if you have already got this working this is rather academic
and we can move on to analyzing the individual pages. If you already
have a cache just post a mysqldump or the tarred set of files up on a
page somewhere and send the link.
~rufus
#!/usr/bin/env python
import time

from mechanize import Browser

url = 'http://cadensa.bl.uk/cgi-bin/webcat'
year = 1900
verbose = True

def _print(msg):
    if verbose:
        print(msg)

def main():
    br = Browser()
    br.addheaders = [('User-agent', 'Firefox')]
    br.set_handle_robots(False)
    _print('Opening url: %s' % url)
    br.open(url)
    # follow the link whose text is 'Advanced search'
    _print('Going to Advanced search page')
    br.follow_link(text='Advanced search')
    assert br.viewing_html()
    br.select_form(name='searchform')
    _print('Getting results for year %s' % year)
    br['pubyear'] = str(year)
    br.submit()
    for base in (1, 21, 41, 61, 81, 101, 121):
        if base > 1:
            # jump to the next page of 20 hits
            br.select_form('hitlist')
            br['form_type'] = 'JUMP^%s' % base
            br.submit()
        for ii in range(20):
            ii += base
            br.select_form('hitlist')
            # click the VIEW button for this hit
            response = br.submit(name='VIEW^%s' % ii)
            _print('Saving page: %s' % response.geturl())
            fh = open('page%s.html' % ii, 'w')
            fh.write(response.read())
            fh.close()
            time.sleep(1)  # give the server a break
            br.back()
            _print('After back page is: %s' % br.geturl())

if __name__ == '__main__':
    main()
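For the parsing step that comes next, a minimal starting point might look like the sketch below. The actual fields on the detail pages are unknown to me, so it only pulls out the <title> of each cached page as a placeholder; swap in the real elements once we can look at a saved page. (It uses Python's standard-library HTMLParser.)

```python
# Minimal sketch of parsing a cached detail page: collect the text inside
# <title>...</title>. Purely a placeholder until the real fields are known.
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def page_title(html):
    # feed one page's HTML and return its title text
    p = TitleGrabber()
    p.feed(html)
    return p.title.strip()
```

Running page_title() over each saved page%s.html would give a quick sanity check that the scraper captured real record pages rather than error screens.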
Nathan Lewis wrote:
>
> Hi Everyone,
>
> So I have a scraper written to save out the details pages for each
> recording listed in the year 1900 search on the british library. The
> code is below. I haven't written the html parsing yet to parse the saved
> pages. That will be next.
>
> Before doing that what kind of database are you using? Would a mysqldump
> be useful?
>
>
> The code is below. You should need only to have perl and the
> WWW::Mechanize CPAN module installed to run it.
>
> Cheers,
>
> Nathan
>
>
> #!/usr/bin/perl
>
> use WWW::Mechanize;
>
> my $mech = WWW::Mechanize->new();
>
> my $url = 'http://cadensa.bl.uk/cgi-bin/webcat';
> $mech->agent('FireFox'); # to be a little less obvious
> $mech->get( $url );
>
> $mech->follow_link( text_regex => qr/Advanced search/);
>
> #print $mech->content;
>
> # form begins on line 735 of the html
> $mech->submit_form(
> form_name => 'searchform',
> fields => { pubyear => 1900 },
> # enters the year 1900 and submits the form
> );
>
> foreach my $base (1,21,41,61,81,101,121) {
>
> if( $base > 1 ) {
> $mech->submit_form(
> form_name => 'hitlist',
> fields => { form_type => "JUMP^$base" },
> );
> }
> foreach my $i (0..19) {
> $i += $base;
> $mech->submit_form(
> form_name => 'hitlist',
> button => "VIEW^$i",
> );
>
> if( open( my $fh, ">page$i.html" ) ) {
> print $fh $mech->content;
> close $fh;
> } else {
> print $mech->content; next
> }
> sleep 1 if( $i % 2); # give the server a rest
> $mech->back();
> }
> }
>
More information about the pdb-discuss
mailing list