[pdb-discuss] british library screen scraping

Rufus Pollock rufus.pollock at okfn.org
Wed Jun 21 10:17:37 UTC 2006


Nathan Lewis wrote:
> 
> Hi Rufus,
> 
> I actually use subversion every day. Do you have Trac up and running on
> kforge? I also use it every day and find it useful.

Great. Trac is one of the services provided 'out of the box' by 
knowledgeforge.net (it is a two-click install) and it is now available at:

   http://project.knowledgeforge.net/pdw/trac

> Could I pin you down on some points? What is the overall plan for  
> providing this data to the public? It looks like you are planning on  
> using django but I am not certain. What database will you be using?

The simple answer is that this is up to us and hasn't been decided. 
There are several options:

1) Build a site using a web application framework such as Django. A 
/very/ simple prototype that just used the as-shipped admin interface 
was written back in April (as I wrote to the list then).

2) Build something using an existing wiki system such as MoinMoin. This 
could work in one of two ways:
   a) Using the virtual page plugin (mentioned previously) we could 
show data from the db in the wiki very easily. However, this would be 
read-only, so a user could not edit the data.
   b) Write a script to auto-create wiki pages from our data (see the 
sketch after this list). The advantage is that it is easy to add data 
and the interface is familiar. The problem is that data edited in the 
wiki is not saved back to a db, and data validation (foreign keys, data 
types etc.) is hard to do. Perhaps this could be addressed in some 
manner, but it would be quite a bit of work and you would end up moving 
towards (1).

3) ... suggestions for other ideas welcome
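
As a rough sketch of (2b) -- assuming, purely for illustration, that 
the parsed data already sits in a SQLite file called pdw.db with a 
'recording' table (the file, table and column names are mine, nothing 
here is decided) -- the page-generation script could be as simple as:

import os
import sqlite3

OUT_DIR = 'wikipages'   # hypothetical output directory

def make_page(row):
    """Render a single recording as MoinMoin-style wiki markup."""
    title, performer, year = row
    return ("= %s =\n"
            " * Performer: %s\n"
            " * Year: %s\n"
            " * Source: British Library Sound Archive\n"
            % (title, performer, year))

def main():
    conn = sqlite3.connect('pdw.db')
    rows = conn.execute("SELECT title, performer, year FROM recording")
    if not os.path.exists(OUT_DIR):
        os.mkdir(OUT_DIR)
    for ii, row in enumerate(rows):
        path = os.path.join(OUT_DIR, 'Recording%06d.txt' % ii)
        open(path, 'w').write(make_page(row))
    conn.close()

if __name__ == '__main__':
    main()

How the generated pages then get loaded into the wiki is left open.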

In the long run I think we need to do something like (1). However, that 
is substantially more complex. As a first iteration, and in order to 
get something done by Saturday when Tom is going to demo at iCommons, I 
would suggest we work on (2b).

Whatever the case, I think we need to get the data into a db. Once it 
is there we can decide what to do next.
   1. Extract data from site (DONE -- thanks to your efforts)
   2. Parse data from web pages (DONE -- again thanks to you)
   3. Put data in a db (I've nearly finished coding this -- see the 
sketch below)
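
For step 3 the sort of thing I have in mind is below (a sketch only: 
I'm assuming SQLite and that the parser hands back simple dicts; the 
real schema should follow whatever structure we settle on):

import sqlite3

def load(records, db_path='pdw.db'):
    """Insert parsed records (a list of dicts) into a local SQLite db."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS recording "
                 "(id INTEGER PRIMARY KEY, title TEXT, "
                 "performer TEXT, year INTEGER)")
    for rec in records:
        conn.execute("INSERT INTO recording (title, performer, year) "
                     "VALUES (?, ?, ?)",
                     (rec['title'], rec['performer'], rec['year']))
    conn.commit()
    conn.close()

# e.g. load([{'title': 'Faust (Act 4)/Gounod',
#             'performer': 'unknown', 'year': 1900}])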

DB Structure
============

As I said in my last email in April I think we want a structure similar to:

http://project.knowledgeforge.net/pdw/svn/trunk/src/pdw/django/models/pdworks.py

which has:

Artist
Work
   * hasMany Author [Artist]
Performance
   * hasA Work
   * hasMany Performers [Artist]

I'd now suggest renaming Performance -> Recording
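
In relational terms that comes out roughly as below (again just a 
sketch: table and column names are illustrative and the Django models 
file above remains the reference). Each hasA becomes a foreign key and 
each hasMany a join table:

import sqlite3

SCHEMA = """
CREATE TABLE artist    (id INTEGER PRIMARY KEY, name TEXT,
                        birth_year INTEGER, death_year INTEGER);
CREATE TABLE work      (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE work_author (work_id INTEGER REFERENCES work(id),
                          artist_id INTEGER REFERENCES artist(id));
-- Performance renamed to Recording as suggested
CREATE TABLE recording (id INTEGER PRIMARY KEY,
                        work_id INTEGER REFERENCES work(id),
                        year INTEGER);
CREATE TABLE recording_performer
                       (recording_id INTEGER REFERENCES recording(id),
                        artist_id INTEGER REFERENCES artist(id));
"""

conn = sqlite3.connect('pdw.db')
conn.executescript(SCHEMA)
conn.close()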

> I was ready and willing to finish the job of making the British Library 
> screen scraper last April, but I felt unable to move forward without a 
> roadmap.

I thought you'd already finished that job :) -- all the scraping and 
parsing code you had written was working perfectly, and all that 
remained on the data-extraction side was deciding on the database 
structure so that we could put the data in. As you say, the real 
problem was deciding what interface to the data we wanted to provide, 
as that might influence the db structure. In any case, my apologies for 
not getting back to you properly in April.

>  From the whois records of okfn.org and knowledgeforge.net it appears  
> that you have the two domains and at least one server at your disposal.  
> Could we get an account to set up the database and django?

Yes, the OKF has a dedicated server, which also hosts 
http://www.freeculture.org.uk/ and is definitely fully available for 
this. I can set you up with an account.

Regards,

Rufus

> On Jun 19, 2006, at 10:43 PM, Rufus Pollock wrote:
> 
>> You might want to hit reply-to-all so that your reply goes to the 
>> list as well as to me when you respond. Full comments below.
>>
>> Nathan Lewis wrote:
>>
>>> Hi Rufus,
>>> Not a problem. By the way, there was one bit in the scraper that was  
>>> hard coded to the particular page it was looking at, that was the  
>>> foreach my $base (1,21,41,61,81,101,121)  bit. Those numbers were  
>>> supplied by me rather than deduced by the code. All that should be
>>
>>
>> Yes I know. I've also still got that hardcoded.
>>
>>> needed is to read those numbers from the page and then you / we can  
>>> pull data for every year desired.
>>
>>
>> As you say we could just do some minimal parsing of the original 
>> results page to work out what the maximal number is. An alternative 
>> that might be simpler would be to just keep going until you get an 
>> exception, e.g.
>>
>> count = 0
>> while(True):
>>     base = count * 20 # 20 is results per page
>>     ...
>>     for ii in range(20):
>>         ii += base
>>         ....
>>         # eventually get an exception when you run out of results
>>
>>> Are you going to use that Gauss thing for hosting it?
>>
>>
>> Not sure yet. As a start it might be easier just to write some code 
>> to auto-create wiki pages rather than using the virtual page plugin 
>> -- virtual pages can't be edited :(
>>
>>> By the way could I look at your Python code?
>>
>>
>> Please do. I've just committed it into the public subversion project 
>> repository (along with some of your code) whose url is:
>>
>>   http://project.knowledgeforge.net/pdw/svn/
>>
>> (nb: you can just browse this using your web browser)
>>
>> You want to look in trunk, specifically at:
>>
>> http://project.knowledgeforge.net/pdw/svn/trunk/bin/sound_archive_crawl.py
>>
>> You're *very* welcome to use the repository too; all you need to do 
>> is sign up for an account on https://www.knowledgeforge.net/, tell 
>> me your username (so I can give you commit access) and then use a 
>> subversion client to check out the repository.
>>
>> Regards,
>>
>> Rufus
>>
>>> On Jun 17, 2006, at 10:58 PM, Rufus Pollock wrote:
>>>
>>>> Dear Nathan,
>>>>
>>>> Thanks a lot -- you're a real star! After some painful digging 
>>>> around in the internals of Python's mechanize (the benefits of 
>>>> open source) I finally fixed the remaining bugs and got a Python 
>>>> version of your scraper working today, so I now have something 
>>>> with which to feed the parser. The aim is to get something up by 
>>>> the end of next week when Tom is going to talk about PD Burn at 
>>>> the iCommons summit in Rio.
>>>>
>>>> Regards,
>>>>
>>>> Rufus
>>>>
>>>> Nathan Lewis wrote:
>>>>
>>>>> Hi Rufus,
>>>>> Here you go
>>>>> Nathan
>>>>> On Jun 16, 2006, at 6:32 PM, Rufus Pollock wrote:
>>>>>
>>>>>> Dear Nathan,
>>>>>>
>>>>>> Could you post your 'parser' code that you use to extract the 
>>>>>> work metadata from the html files, as I'd like to try porting it 
>>>>>> to Python (my Perl's terrible ...). Thanks once again in advance.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Rufus
>>>>>>
>>>>>> Nathan Lewis wrote:
>>>>>>
>>>>>>> Hi Rufus,
>>>>>>> I wrote the code to explicitly pull out both the author and the 
>>>>>>> performer from the pages. There is an html comment in the pages  
>>>>>>> <!-- Print the author, if one exists --> that I used to find the  
>>>>>>> author. I think it was just that the library pages listed the  
>>>>>>> author as one of the performers. Could you do a random check?
>>>>>>> Personally I would advocate defining the db schema next because  
>>>>>>> it would make setting up a web interface much easier.
>>>>>>> Matching up the authors sounds to me like the most difficult  
>>>>>>> task. I would be surprised if Wikipedia had entries for more 
>>>>>>> than a handful of them, and programmatically matching up names 
>>>>>>> is always harder than you expect. I think I will limit my 
>>>>>>> involvement to coding.
>>>>>>> Cheers,
>>>>>>> Nathan
>>>>>>> On Apr 20, 2006, at 8:52 AM, Rufus Pollock wrote:
>>>>>>>
>>>>>>>> Nathan Lewis wrote:
>>>>>>>>
>>>>>>>>> Ok, here it is as a flat file. This was produced using the  
>>>>>>>>> Data::Dumper module in perl but the output is very readable if  
>>>>>>>>> not pretty. This is the info for the 136 recordings from the  
>>>>>>>>> year 1900.
>>>>>>>>
>>>>>>>> Sorry for the delay in replying - I'm away from the net most of  
>>>>>>>> the time this week. Anyway I have now taken a proper look and  
>>>>>>>> this looks really good -- I can only say once again: great 
>>>>>>>> work!  My only correction is that I think the data dump 
>>>>>>>> erroneously  reuses the performer as author: looking at the BL 
>>>>>>>> search results  it is hard to identify the author in general 
>>>>>>>> (where shown it  seems integrated into the title e.g. 'Faust 
>>>>>>>> (Act 4)/Gounod' --  Gounod is the author).
>>>>>>>>
>>>>>>>> As you mentioned in a follow-up mail, our next task would be to 
>>>>>>>> start cross-correlating works with authors (i.e. composers) -- 
>>>>>>>> especially necessary where the author is not given -- and then 
>>>>>>>> find birth/death dates for these people (maybe using Wikipedia). 
>>>>>>>> However, this is something that might have to be done by hand.
>>>>>>>>
>>>>>>>> Anyway we have made a fantastic start, and now that we know we 
>>>>>>>> have a data source our next task is to get a move on with the 
>>>>>>>> web interface so we can start editing/browsing the data we 
>>>>>>>> have. This in turn will define our db schema, and we can then 
>>>>>>>> customize the Perl to dump our results straight into the db.
>>>>>>>>
>>>>>>>> ~rufus
>>>>>>>>
>>>>>>>>> On Apr 14, 2006, at 9:42 PM, Rufus Pollock wrote:
>>>>>>>>>
>>>>>>>>>> Nathan Lewis wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Rufus,
>>>>>>>>>> I don't know python well enough to fix your code though it 
>>>>>>>>>> does look quite similar. I suspect Python's mechanize 
>>>>>>>>>> works differently to WWW::Mechanize in Perl. Anyway I will 
>>>>>>>>>> continue with mine since there
>>>>>>>>>>
>>>>>>>>>> I suspect so too :). Anyway I learnt plenty from porting  
>>>>>>>>>> (having never used mechanize before ...)
>>>>>>>>>>
>>>>>>>>>>> isn't much left to do. I am running mysql 5.0 here but it  
>>>>>>>>>>> should be
>>>>>>>>>>
>>>>>>>>>> Please do. You're doing a great job.
>>>>>>>>>>
>>>>>>>>>>> importable even if you are running an older version.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> A MySQL dump should be fine, though flat files might be even 
>>>>>>>>>> easier.
>>>>>>>>>>
>>>>>>>>>>> But one question, do we want to search on other years?  
>>>>>>>>>>> Assuming we do, what range? What is the most recent year  
>>>>>>>>>>> something could fall out of UK copyright?
>>>>>>>>>>
>>>>>>>>>> We want all years up until 1955 (frankly we could do with 
>>>>>>>>>> *all* of the data). However the number of works seems to grow 
>>>>>>>>>> *rapidly*, e.g. for 1954 I think there are over 130,000 
>>>>>>>>>> works. Thus for the time being I'd suggest we just practise 
>>>>>>>>>> on 1900 (or, if we want a bit more, say 1900-1910). The best 
>>>>>>>>>> thing is probably to make the script configurable (e.g. we 
>>>>>>>>>> can pass it a date range).
>>>>>>>>>>
>>>>>>>>>> Thanks for your efforts on this. We're making great progress.
>>>>>>>>>>
>>>>>>>>>> ~rufus
>>>>>>>>>>
> 



