[pdb-discuss] british library screen scraping

Rufus Pollock rufus.pollock at okfn.org
Sat Jun 17 21:58:27 UTC 2006


Dear Nathan,

Thanks a lot -- you're a real star! After some painful digging around in 
the internals of Python's mechanize (the benefits of open source) I 
finally fixed the remaining bugs and got a Python version of your 
scraper working today, so I now have something with which to feed the 
parser. The aim is to get something up by the end of next week, when Tom 
is going to talk about PD Burn at the iCommons summit in Rio.

Regards,

Rufus

Nathan Lewis wrote:
> 
> Hi Rufus,
> 
> Here you go
> 
> 
> Nathan
> 
> 
> On Jun 16, 2006, at 6:32 PM, Rufus Pollock wrote:
> 
>> Dear Nathan,
>>
>> Could you post your 'parser' code that you use to extract the work 
>> metadata from the html files as I'd like to try porting to python (my 
>> perl's terrible ...). Thanks once again in advance.
>>
>> Regards,
>>
>> Rufus
>>
>> Nathan Lewis wrote:
>>
>>> Hi Rufus,
>>> I wrote the code to explicitly pull out both the author and the 
>>> performer from the pages. There is an HTML comment in the pages, <!-- 
>>> Print the author, if one exists -->, that I used to find the author. I 
>>> think it was just that the library pages listed the author as one of 
>>> the performers. Could you do a random check?
>>> Personally I would advocate defining the db schema next because it 
>>> would make setting up a web interface much easier.
>>> Matching up the authors sounds to me like the most difficult task. I 
>>> would be surprised if Wikipedia had entries for more than a handful 
>>> of them, and programmatically matching up names is always harder than 
>>> you expect. I think I will limit my involvement to coding.
>>> Cheers,
>>> Nathan
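For illustration, the author-extraction approach Nathan describes -- using the "Print the author, if one exists" comment as a landmark in the page -- might be sketched in Python roughly as follows. The surrounding markup here is invented for the example; the real BL pages will differ.

```python
# Hypothetical sketch: locate the marker comment and read the first
# run of tag-free text that follows it. The sample markup below is
# made up; only the comment string itself comes from the discussion.
import re

MARKER = "<!-- Print the author, if one exists -->"

def extract_author(html):
    """Return the author text following the marker comment, or None."""
    idx = html.find(MARKER)
    if idx == -1:
        return None
    tail = html[idx + len(MARKER):]
    # Grab the first stretch of text between two tags after the comment.
    match = re.search(r">([^<>]+)<", tail)
    return match.group(1).strip() if match else None

page = MARKER + "<td>Gounod, Charles</td>"
print(extract_author(page))  # Gounod, Charles
```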
>>> On Apr 20, 2006, at 8:52 AM, Rufus Pollock wrote:
>>>
>>>> Nathan Lewis wrote:
>>>>
>>>>> Ok, here it is as a flat file. This was produced using the 
>>>>> Data::Dumper module in perl but the output is very readable if not 
>>>>> pretty. This is the info for the 136 recordings from the year 1900.
>>>>
>>>> Sorry for the delay in replying - I'm away from the net most of the 
>>>> time this week. Anyway I have now taken a proper look and this looks 
>>>> really good -- I can only say once again: great work! My only 
>>>> correction is that I think the data dump erroneously reuses the 
>>>> performer as author: looking at the BL search results it is hard to 
>>>> identify the author in general (where shown it seems integrated into 
>>>> the title e.g. 'Faust (Act 4)/Gounod' -- Gounod is the author).
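If the convention Rufus observes holds -- the author, where shown, follows the final '/' in the title -- the split could be sketched like this. This is a guess at the pattern from the one example given, not a confirmed rule for the BL data.

```python
# Sketch of splitting the composer out of a BL-style title such as
# 'Faust (Act 4)/Gounod'. Assumes the author, where present, follows
# the last '/'; titles without a '/' have no author recorded.
def split_title_author(title):
    """Return (work, author); author is None if the title has no '/'."""
    if "/" in title:
        work, _, author = title.rpartition("/")
        return work.strip(), author.strip()
    return title.strip(), None

print(split_title_author("Faust (Act 4)/Gounod"))  # ('Faust (Act 4)', 'Gounod')
print(split_title_author("Home sweet home"))       # ('Home sweet home', None)
```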
>>>>
>>>> As you mentioned in a follow up mail our next task would be to start 
>>>> cross correlating works with author (i.e. composer) -- esp. 
>>>> necessary where author not given -- and then find birth/death dates 
>>>> for these people (maybe using wikipedia). However this is something 
>>>> that might have to be done by hand.
>>>>
>>>> Anyway, we have made a fantastic start, and now that we know we have 
>>>> a data source our next task is to get a move on with the web 
>>>> interface so we can start editing/browsing the data we have. This in 
>>>> turn will define our db schema, and we can then customize the perl to 
>>>> dump our results straight into the db.
>>>>
>>>> ~rufus
>>>>
>>>>> On Apr 14, 2006, at 9:42 PM, Rufus Pollock wrote:
>>>>>
>>>>>> Nathan Lewis wrote:
>>>>>>
>>>>>>> Hi Rufus,
>>>>>> I don't know python well enough to fix your code, though it does 
>>>>>> look quite similar. I suspect that Python's mechanize works 
>>>>>> differently to WWW::Mechanize in perl. Anyway, I will continue 
>>>>>> with mine since there
>>>>>>
>>>>>> I suspect so too :). Anyway I learnt plenty from porting (having 
>>>>>> never used mechanize before ...)
>>>>>>
>>>>>>> isn't much left to do. I am running mysql 5.0 here but it should be
>>>>>>
>>>>>> please do. You're doing a great job.
>>>>>>
>>>>>>> importable even if you are running an older version.
>>>>>>
>>>>>> mysql dump should be fine though flat files might be even easier.
>>>>>>
>>>>>>> But one question, do we want to search on other years? Assuming 
>>>>>>> we do, what range? What is the most recent year something could 
>>>>>>> fall out of UK copyright?
>>>>>>
>>>>>> We want all years up until 1955 (frankly, we could do with *all* of 
>>>>>> the data). However, the number of works seems to grow *rapidly*; 
>>>>>> e.g. for 1954 I think there are over 130,000 works. Thus for the 
>>>>>> time being I'd suggest we just practise on 1900 (or, if we want a 
>>>>>> bit more, say 1900-1910). The best thing is probably to make the 
>>>>>> script configurable (e.g. so we can pass it a date range).
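The date-range configurability suggested above could look something like the following skeleton. The flag names and the `scrape_year` stub are hypothetical (argparse is also more modern than the 2006 scripts discussed here); the actual per-year fetch would sit where the placeholder is.

```python
# Minimal sketch of a scraper driven by a --start/--end year range,
# as the discussion suggests. scrape_year is a stand-in for the
# real mechanize-based fetch of one year's records.
import argparse

def scrape_year(year):
    # Placeholder: the real version would fetch and parse one year
    # of BL search results here.
    print("would scrape %d" % year)
    return year

def main(argv=None):
    parser = argparse.ArgumentParser(description="BL recordings scraper")
    parser.add_argument("--start", type=int, default=1900)
    parser.add_argument("--end", type=int, default=1900)
    args = parser.parse_args(argv)
    return [scrape_year(y) for y in range(args.start, args.end + 1)]

years = main(["--start", "1900", "--end", "1902"])
```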
>>>>>>
>>>>>> Thanks for your efforts on this. We're making great progress.
>>>>>>
>>>>>> ~rufus
>>>>>>




More information about the pd-discuss mailing list