[pdb-discuss] british library screen scraping

Rufus Pollock rufus.pollock at okfn.org
Mon Jun 19 21:43:53 UTC 2006


You might want to hit reply-to-all so that your reply goes to the list as 
well as to me when you respond. Full comments below.

Nathan Lewis wrote:
> Hi Rufus,
> 
> Not a problem. By the way, there was one bit in the scraper that was 
> hard-coded to the particular page it was looking at, namely the 
> foreach my $base (1,21,41,61,81,101,121) bit. Those numbers were 
> supplied by me rather than deduced by the code. All that should be

Yes I know. I've also still got that hardcoded.

> needed is to read those numbers from the page and then you / we can pull 
> data for every year desired.

As you say, we could just do some minimal parsing of the original results 
page to work out the maximal number. An alternative that might be 
simpler would be to just keep going until you get an exception, e.g.

count = 0
while True:
    base = count * 20  # 20 is the number of results per page
    # ... fetch the results page starting at offset `base` ...
    for ii in range(20):
        index = base + ii
        # ... pull out result number `index` ...
        # eventually an exception is raised when we run out of results
    count += 1
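
If we went the parsing route instead, something like the following 
would do it. A rough sketch only -- the 'of <total>' wording is just my 
guess at what the BL results page actually prints, so the regex would 
need checking against the real page:

import re

def total_results(html):
    # Assumption: the results page announces the total somewhere in a
    # phrase like 'Results 1 to 20 of 136'. The exact wording is a guess.
    match = re.search(r'of\s+([\d,]+)', html)
    if match is None:
        raise ValueError('could not find the result count on the page')
    return int(match.group(1).replace(',', ''))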

> Are you going to use that Gauss thing for hosting it?

Not sure yet. As a start it might be easier just to write some code to 
auto-create wiki pages rather than using the virtual page -- virtual 
pages can't be edited :(
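
To give a flavour of the auto-creation idea, here's the kind of thing I 
have in mind -- completely untested, and the wiki URL, form index and 
field name below are all made up (they would depend on the wiki 
software):

import mechanize

def create_wiki_page(title, body):
    # Hypothetical: drive the wiki's ordinary edit form with mechanize.
    br = mechanize.Browser()
    # Made-up URL scheme -- substitute the real wiki's edit URL.
    br.open('http://example.org/wiki/%s?action=edit' % title)
    br.select_form(nr=0)   # assume the first form on the page is the edit form
    br['content'] = body   # 'content' is a guessed field name
    br.submit()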

> By the way, could I look at your Python code?

Please do. I've just committed it to the public Subversion project 
repository (along with some of your code), whose URL is:

   http://project.knowledgeforge.net/pdw/svn/

(nb: you can just browse this using your web browser)

You want to look in trunk, specifically at:

http://project.knowledgeforge.net/pdw/svn/trunk/bin/sound_archive_crawl.py

You're *very* welcome to use the repository too; all you need to do is 
sign up for an account on https://www.knowledgeforge.net/, tell me your 
username (so I can give you commit access), and then use a Subversion 
client to check out the repository.
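
For example, a checkout from the command line would look something like 
this (the 'pdw' directory name is just a suggestion):

   svn checkout http://project.knowledgeforge.net/pdw/svn/trunk pdw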

Regards,

Rufus

> On Jun 17, 2006, at 10:58 PM, Rufus Pollock wrote:
> 
>> Dear Nathan,
>>
>> Thanks a lot -- you're a real star! After some painful digging around 
>> in the internals of Python's mechanize (the benefits of open source) I 
>> finally fixed the remaining bugs and got a Python version of your 
>> scraper working today, so I now have something with which to feed the 
>> parser. The aim is to get something up by the end of next week, when 
>> Tom is going to talk about PD Burn at the iCommons summit in Rio.
>>
>> Regards,
>>
>> Rufus
>>
>> Nathan Lewis wrote:
>>
>>> Hi Rufus,
>>> Here you go
>>> Nathan
>>> On Jun 16, 2006, at 6:32 PM, Rufus Pollock wrote:
>>>
>>>> Dear Nathan,
>>>>
>>>> Could you post your 'parser' code that you use to extract the work 
>>>> metadata from the HTML files, as I'd like to try porting it to 
>>>> Python (my Perl's terrible ...). Thanks once again in advance.
>>>>
>>>> Regards,
>>>>
>>>> Rufus
>>>>
>>>> Nathan Lewis wrote:
>>>>
>>>>> Hi Rufus,
>>>>> I wrote the code to explicitly pull out both the author and the 
>>>>> performer from the pages. There is an HTML comment in the pages, 
>>>>> <!-- Print the author, if one exists -->, that I used to find the 
>>>>> author. I think it was just that the library pages listed the 
>>>>> author as one of the performers. Could you do a random check?
>>>>> Personally I would advocate defining the db schema next because it 
>>>>> would make setting up a web interface much easier.
>>>>> Matching up the authors sounds to me like the most difficult task. 
>>>>> I would be surprised if Wikipedia had entries for more than a 
>>>>> handful of them, and programmatically matching up names is always 
>>>>> harder than you expect. I think I will limit my involvement to coding.
>>>>> Cheers,
>>>>> Nathan
>>>>> On Apr 20, 2006, at 8:52 AM, Rufus Pollock wrote:
>>>>>
>>>>>> Nathan Lewis wrote:
>>>>>>
>>>>>>> Ok, here it is as a flat file. This was produced using the 
>>>>>>> Data::Dumper module in Perl, but the output is very readable if 
>>>>>>> not pretty. This is the info for the 136 recordings from the year 
>>>>>>> 1900.
>>>>>>
>>>>>> Sorry for the delay in replying - I'm away from the net most of 
>>>>>> the time this week. Anyway I have now taken a proper look and this 
>>>>>> looks really good -- I can only say once again: great work! My 
>>>>>> only correction is that I think the data dump erroneously reuses 
>>>>>> the performer as author: looking at the BL search results it is 
>>>>>> hard to identify the author in general (where shown it seems 
>>>>>> integrated into the title e.g. 'Faust (Act 4)/Gounod' -- Gounod is 
>>>>>> the author).
>>>>>>
>>>>>> As you mentioned in a follow-up mail, our next task would be to 
>>>>>> start cross-correlating works with authors (i.e. composers) -- 
>>>>>> especially necessary where the author is not given -- and then 
>>>>>> find birth/death dates for these people (maybe using Wikipedia). 
>>>>>> However, this is something that might have to be done by hand.
>>>>>>
>>>>>> Anyway, we have made a fantastic start, and now that we know we 
>>>>>> have a data source our next task is to get a move on with the web 
>>>>>> interface so we can start editing/browsing the data we have. This 
>>>>>> in turn will define our db schema, and we can then customize the 
>>>>>> Perl to dump our results straight into the db.
>>>>>>
>>>>>> ~rufus
>>>>>>
>>>>>>> On Apr 14, 2006, at 9:42 PM, Rufus Pollock wrote:
>>>>>>>
>>>>>>>> Nathan Lewis wrote:
>>>>>>>>
>>>>>>>>> Hi Rufus,
>>>>>>>>> I don't know Python well enough to fix your code, though it does 
>>>>>>>>> look quite similar. I suspect Python's mechanize works 
>>>>>>>>> differently to WWW::Mechanize in Perl. Anyway I will continue 
>>>>>>>>> with mine since there
>>>>>>>>
>>>>>>>> I suspect so too :). Anyway I learnt plenty from porting (having 
>>>>>>>> never used mechanize before ...)
>>>>>>>>
>>>>>>>>> isn't much left to do. I am running mysql 5.0 here but it 
>>>>>>>>> should be
>>>>>>>>
>>>>>>>> Please do. You're doing a great job.
>>>>>>>>
>>>>>>>>> importable even if you are running an older version.
>>>>>>>>
>>>>>>>> A MySQL dump should be fine, though flat files might be even easier.
>>>>>>>>
>>>>>>>>> But one question, do we want to search on other years? Assuming 
>>>>>>>>> we do, what range? What is the most recent year something could 
>>>>>>>>> fall out of UK copyright?
>>>>>>>>
>>>>>>>> We want all years up until 1955 (frankly we could do with *all* 
>>>>>>>> of the data). However, the number of works seems to grow 
>>>>>>>> *rapidly*; e.g. for 1954 I think there are over 130,000 works. 
>>>>>>>> Thus for the time being I'd suggest we just practise on 1900 
>>>>>>>> (or, if we want a bit more, say 1900-1910). The best thing is 
>>>>>>>> probably to make the script configurable (e.g. so we can pass it 
>>>>>>>> a date range).
>>>>>>>>
>>>>>>>> Thanks for your efforts on this. We're making great progress.
>>>>>>>>
>>>>>>>> ~rufus
>>>>>>>>
> 



