[pdb-discuss] british library screen scraping
Rufus Pollock
rufus.pollock at okfn.org
Fri Apr 14 20:42:24 UTC 2006
Nathan Lewis wrote:
>
> Hi Rufus,
>
> I don't know python well enough to fix your code though it does look
> quite similar. I suspect Python's mechanize works differently to
> WWW::Mechanize in perl.

I suspect so too :). Anyway I learnt plenty from porting (having never
used mechanize before ...)

> Anyway I will continue with mine since there isn't much left to do.

Please do. You're doing a great job.

> I am running mysql 5.0 here but it should be importable even if you
> are running an older version.

A mysql dump should be fine, though flat files might be even easier.
> But one question, do we want to search on other years? Assuming we do,
> what range? What is the most recent year something could fall out of UK
> copyright?
We want all years up until 1955 (frankly we could do with *all* of the
data). However, the number of works seems to grow *rapidly*, e.g. for
1954 I think there are over 130,000 works. So for the time being I'd
suggest we just practise on 1900 (or, if we want a bit more, say
1900-1910). The best thing is probably to make the script configurable,
e.g. so we can pass it a date range.
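Something like the following is all I have in mind -- a rough, untested
sketch, where scrape_year() is just a stand-in for the per-year scraping
logic in the scripts below:

#!/usr/bin/env python
# Rough sketch only: take a year or a year range on the command line,
# e.g. "python scrape.py 1900 1910", and call the per-year scraping code
# once per year. scrape_year() is a placeholder for the real logic.
import sys

def scrape_year(year):
    # placeholder: the real code would run the Advanced search for `year`
    # and save out each detail page, as in the scripts below
    print('would scrape works published in %s' % year)

def main(argv):
    start = int(argv[1])
    if len(argv) > 2:
        end = int(argv[2])
    else:
        end = start
    for year in range(start, end + 1):
        scrape_year(year)

if __name__ == '__main__':
    main(sys.argv)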
Thanks for your efforts on this. We're making great progress.
~rufus
> On Apr 13, 2006, at 11:16 PM, Rufus Pollock wrote:
>
>> Great work Nathan! Unfortunately I am useless at perl so my port to
>> python is below. Having run my version a bit I don't think everything
>> is working quite right (among other things the br.back() seems to
>> break the session so I only get one run through -- though this could
>> be fixed by moving the for loop to the outside).
>>
>> Of course if you have already got this working this is rather academic
>> and we can move on to analyzing the individual pages. If you already
>> have a cache just post a mysqldump or the tarred set of files up on a
>> page somewhere and send the link.
>>
>> ~rufus
>>
>> #!/usr/bin/env python
>> import time
>> import re
>> from mechanize import Browser
>>
>> url = 'http://cadensa.bl.uk/cgi-bin/webcat'
>> year = 1900
>> verbose = True
>>
>> def _print(msg):
>>     if verbose:
>>         print(msg)
>>
>> def main():
>>     br = Browser()
>>     br.addheaders = [ ('User-agent', 'Firefox') ]
>>     br.set_handle_robots(False)
>>     _print('Opening url: %s' % url)
>>     br.open(url)
>>
>>     # follow the link whose text is 'Advanced search'
>>     _print('Going to Advanced search page')
>>     br.follow_link(text='Advanced search')
>>     assert br.viewing_html()
>>
>>     br.select_form(name='searchform')
>>     _print('Getting results for year %s' % year)
>>     br['pubyear'] = str(year)
>>     br.submit()
>>
>>     for base in (1, 21, 41, 61, 81, 101, 121):
>>         if base > 1:
>>             # intended to jump to the next block of 20 hits
>>             br.select_form('hitlist')
>>             br['form_type'] = 'JUMP^%s' % base
>>         for ii in range(19):
>>             ii += base
>>             # open the detail page for hit number ii
>>             br.select_form('hitlist')
>>             br.button = 'VIEW^%s' % ii
>>             response = br.submit()
>>
>>             # TODO: save page to disk instead of just printing it
>>             _print('Saving page: %s' % response.geturl())
>>             _print('*****************************')
>>             _print(response.read())
>>             _print('*****************************')
>>
>>             time.sleep(1) # give server a break
>>             br.back()
>>             _print('After back page is: %s' % br.geturl())
>>
>> if __name__ == '__main__':
>>     main()
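
One more thought on the br.back() problem I mention above: what I mean by
moving the loop outside is roughly to re-run the search for each detail
page rather than relying on the browser history. A rough, untested sketch
below -- it reuses the url, year and mechanize calls from the script
above, and borrows the extra submit after the JUMP step from Nathan's
perl version, so treat it as a guess rather than working code:

#!/usr/bin/env python
# Untested sketch: fetch each of the ~140 detail pages with a fresh
# search, avoiding br.back() entirely.
import time
from mechanize import Browser

url = 'http://cadensa.bl.uk/cgi-bin/webcat'
year = 1900

def fetch_detail_page(index):
    br = Browser()
    br.addheaders = [ ('User-agent', 'Firefox') ]
    br.set_handle_robots(False)
    br.open(url)
    br.follow_link(text='Advanced search')
    br.select_form(name='searchform')
    br['pubyear'] = str(year)
    br.submit()
    # jump to the block of 20 hits containing `index`, then view the record
    base = ((index - 1) // 20) * 20 + 1
    if base > 1:
        br.select_form('hitlist')
        br['form_type'] = 'JUMP^%s' % base
        br.submit()
    br.select_form('hitlist')
    br.button = 'VIEW^%s' % index
    return br.submit()

def main():
    for index in range(1, 141):
        response = fetch_detail_page(index)
        outfile = open('page%s.html' % index, 'w')
        outfile.write(response.read())
        outfile.close()
        time.sleep(1)  # give the server a break

if __name__ == '__main__':
    main()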
>>
>> Nathan Lewis wrote:
>>
>>> Hi Everyone,
>>> So I have a scraper written to save out the detail pages for each
>>> recording listed in the year 1900 search on the British Library site.
>>> I haven't written the HTML parsing for the saved pages yet; that will
>>> be next.
>>> Before doing that, what kind of database are you using? Would a
>>> mysqldump be useful?
>>> The code is below. You should only need to have perl and the
>>> WWW::Mechanize CPAN module installed to run it.
>>> Cheers,
>>> Nathan
>>> #!/usr/bin/perl
>>> use WWW::Mechanize;
>>>
>>> my $mech = WWW::Mechanize->new();
>>> my $url = 'http://cadensa.bl.uk/cgi-bin/webcat';
>>> $mech->agent('FireFox'); # to be a little less obvious
>>> $mech->get( $url );
>>> $mech->follow_link( text_regex => qr/Advanced search/ );
>>> #print $mech->content;
>>>
>>> # the search form begins on line 735 of the html
>>> $mech->submit_form(
>>>     form_name => 'searchform',
>>>     fields    => { pubyear => 1900 },
>>>     # enters the year 1900 and submits the form
>>> );
>>>
>>> foreach my $base (1,21,41,61,81,101,121) {
>>>     if( $base > 1 ) {
>>>         # jump to the next block of 20 results in the hit list
>>>         $mech->submit_form(
>>>             form_name => 'hitlist',
>>>             fields    => { form_type => "JUMP^$base" },
>>>         );
>>>     }
>>>     foreach my $i (0..19) {
>>>         $i += $base;
>>>         # open the detail page for hit number $i
>>>         $mech->submit_form(
>>>             form_name => 'hitlist',
>>>             button    => "VIEW^$i",
>>>         );
>>>         if( open( my $fh, ">page$i.html" ) ) {
>>>             print $fh $mech->content;
>>>             close $fh;
>>>         } else {
>>>             print $mech->content; next;
>>>         }
>>>         sleep 1 if( $i % 2 ); # give the server a rest
>>>         $mech->back();
>>>     }
>>> }
>
>
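
On the parsing side: I haven't looked at the detail-page HTML yet, so the
following is only a placeholder sketch of the general shape -- it just
walks the saved page*.html files and pulls out the page title and the
text of each table row (BeautifulSoup here is just an example choice,
any HTML parser would do). The real field extraction will depend on the
actual markup:

#!/usr/bin/env python
# Placeholder sketch: walk the pages saved by the scraper and print a
# crude summary. Real parsing depends on the detail-page markup.
import glob
from BeautifulSoup import BeautifulSoup

def parse_page(filename):
    soup = BeautifulSoup(open(filename).read())
    title = ''
    if soup.title:
        title = soup.title.string
    # crude first pass: the visible text of every table row
    rows = []
    for tr in soup.findAll('tr'):
        rows.append(' '.join(tr.findAll(text=True)).strip())
    return title, rows

if __name__ == '__main__':
    for filename in sorted(glob.glob('page*.html')):
        title, rows = parse_page(filename)
        print('%s: %s (%d table rows)' % (filename, title, len(rows)))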