[pdb-discuss] british library screen scraping

Rufus Pollock rufus.pollock at okfn.org
Fri Apr 21 09:39:24 UTC 2006


Forgot to cc this to the list yesterday ...

Nathan Lewis wrote:
> 
> Ok, here it is as a flat file. This was produced using the Data::Dumper 
> module in perl but the output is very readable if not pretty. This is 
> the info for the 136 recordings from the year 1900.

Sorry for the delay in replying - I'm away from the net most of the time
this week. Anyway I have now taken a proper look and this looks really
good -- I can only say once again: great work! My only correction is
that I think the data dump erroneously reuses the performer as author:
looking at the BL search results it is hard to identify the author in
general (where shown it seems integrated into the title e.g. 'Faust (Act
4)/Gounod' -- Gounod is the author).
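
For what it's worth, here is a rough Python sketch of how the author might be pulled out of a title like the one above. It assumes the author, when present, is whatever follows the last '/' in the title; that convention is only a guess from the single example, and the function name split_title_author is made up for illustration. Titles that contain a slash for other reasons would be mis-split, so the output would still need checking by hand.

```python
def split_title_author(raw_title):
    """Split a title such as 'Faust (Act 4)/Gounod' into (title, author).

    Assumption (not confirmed against the BL data): when an author is
    present, it is the text after the last '/' in the title.
    Returns (title, None) when no author can be identified.
    """
    title, sep, author = raw_title.rpartition("/")
    if not sep:
        # No slash at all: treat the whole string as the title.
        return raw_title.strip(), None
    title, author = title.strip(), author.strip()
    if not title or not author:
        # A leading or trailing slash tells us nothing useful.
        return raw_title.strip(), None
    return title, author


if __name__ == "__main__":
    print(split_title_author("Faust (Act 4)/Gounod"))
```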

As you mentioned in a follow-up mail, our next task would be to start
cross-correlating works with authors (i.e. composers) -- esp. necessary
where the author is not given -- and then find birth/death dates for these
people (maybe using Wikipedia). However, this is something that might
have to be done by hand.

Anyway we have made a fantastic start, and now that we know we have a
data source our next task is to get a move on with the web interface
so we can start editing/browsing the data we have. This in turn will
define our db schema, and we can then customize the Perl to dump our
results straight into the db.

~rufus

> On Apr 14, 2006, at 9:42 PM, Rufus Pollock wrote:
> 
>> Nathan Lewis wrote:
>>
>>> Hi Rufus,
>>> I don't know Python well enough to fix your code, though it does look 
>>> quite similar. I suspect Python's mechanize works differently to 
>>> WWW::Mechanize in Perl. Anyway I will continue with mine since there
>>
>>
>> I suspect so too :). Anyway I learnt plenty from porting (having never 
>> used mechanize before ...)
>>
>>> isn't much left to do. I am running mysql 5.0 here but it should be
>>
>>
>> please do. You're doing a great job.
>>
>>> importable even if you are running an older version.
>>
>>
>> mysql dump should be fine though flat files might be even easier.
>>
>>> But one question, do we want to search on other years? Assuming we 
>>> do, what range? What is the most recent year something could fall out 
>>> of UK copyright?
>>
>>
>> We want all years up until 1955 (frankly we could do with *all* of the 
>> data). However the number of works seems to grow *rapidly*, e.g. for 
>> 1954 I think there are over 130,000 works. Thus for the time being I'd 
>> suggest we could just practise on 1900 (or if we want a bit more, say 
>> 1900-1910). The best thing is probably to make the script configurable 
>> (e.g. we can pass it a date range).
>>
>> Thanks for your efforts on this. We're making great progress.
>>
>> ~rufus
>>




