[pdb-discuss] pdw web interface (was british library screen scraping)

Rufus Pollock rufus.pollock at okfn.org
Thu Jun 22 10:09:34 UTC 2006


Remember to cc pdb-discuss by doing reply-all ;)

Nathan Lewis wrote:
[snip]
>>> Could I pin you down on some points? What is the overall plan for   
>>> providing this data to the public? It looks like you are planning 
>>> on   using django but I am not certain. What database will you be using?
>>
>>
>> The simple answer is that this is up to us and hasn't been decided.  
>> There are several options:
>>
>> 1) Build a site using a web application framework such as django. A  
>> /very/ simple prototype that just used the as-shipped admin interface  
>> was written back in April (as I wrote to the list then)
>>
> I didn't see anything other than raw textual data. I don't know much
> about django, and I am not a python programmer, but I am willing to
> learn it.
> 
>> 2) Build something using an existing wiki system such as moinmoin.  
>> This would work either by
>>   a) using a virtual page plugin (mentioned previously) we could show
>> data from the db in the wiki very easily. However this would be
>> read-only, so a user could not edit the data.
>>   b) write a script to auto-create wiki pages using our data. The
>> advantage is that it is easy to add data and it is a familiar
>> interface. The problem is that data is then edited in the wiki (so not
>> saved back to a db) and it is hard to do data validation (foreign
>> keys, data types etc.). Perhaps this could be addressed in some manner
>> but it would be quite a bit of work and you would end up moving
>> towards (1).
> 
> 
> I don't think the data is at all useful unless it is in a database  
> where it can be queried. Therefore I would say make this a django  

Completely agree (see below).

> application rather than trying to use a wiki. If it is possible to set
> up a little mini-wiki within each django page for gathering comments
> and annotation that would be ideal, but a pure wiki would not give
> usable data in the long term.

+1. Entirely in agreement here. A big aim of this project is to provide 
open reusable data and a wiki is *not* the way to go on that front.

>>
>> 3) ... suggestions for other ideas welcome
>>
>> IMO, in the long run we need to do something like (1). However this
>> is substantially more complex. As a first iteration, and in order to
>> get something done by Saturday when Tom is going to demo at iCommons,
>> I would suggest we work on (2b).
> 
> 
> Yes, I basically agree. But I think getting something going in django
> or ruby on rails might be easier than going the wiki route. If nothing
> else there is Catalyst, Perl's equivalent, which would allow rapid
> development of a site.

I think there is agreement here that to do this properly we want a
proper db and an associated webapp providing a web interface. This
interface might have wiki-like features (versioning, easy commenting
etc.) but it would not be a wiki.
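
To make (1) a little more concrete, a django model for it might look
something like the following. This is purely a sketch: the model and
field names are invented and nothing like this is in the repository yet.

from django.db import models

class Recording(models.Model):
    # one row per recording pulled out of the scraped data
    title = models.CharField(max_length=255)
    performer = models.CharField(max_length=255, blank=True)
    year = models.IntegerField()

class Annotation(models.Model):
    # lightweight commenting hung off a recording,
    # rather than free-form wiki pages
    recording = models.ForeignKey(Recording, on_delete=models.CASCADE)
    author = models.CharField(max_length=100)
    text = models.TextField()
    added = models.DateTimeField(auto_now_add=True)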

However, in the interests of having a usable (if crude) prototype ready
by *this Saturday* for Tom to demo at iCommons, I suggest we do a 'hack'
whereby we write a script to dump the db into a set of wiki pages.
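
Roughly what I have in mind for the hack is the following. This is
untested, and the db file, table and column names are just placeholders
for whatever schema we settle on.

import os
import sqlite3

def dump_to_wiki(dbpath="pdw.db", outdir="wikipages"):
    # crude db -> wiki dump: write one MoinMoin-style text page
    # per recording, named after the title
    conn = sqlite3.connect(dbpath)
    if not os.path.exists(outdir):
        os.makedirs(outdir)
    rows = conn.execute("SELECT title, performer, year FROM recordings")
    for title, performer, year in rows:
        pagename = "".join(w.capitalize() for w in title.split())
        page = open(os.path.join(outdir, pagename + ".txt"), "w")
        page.write("= %s =\n * Performer: %s\n * Year: %s\n"
                   % (title, performer, year))
        page.close()
    conn.close()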

>>
>> Whatever the case I think we need to extract the data to a db. Once  
>> there we can decide what we do next.
>>   1. Extract data from site (DONE -- thanks to your efforts)
>>   2. Parse data from web pages (DONE -- again thanks to you)
>>   3. Put data in a db (I've nearly finished coding this)
> 
> 
> Cool. I consider 3 to be the key to making any further progress.

This is just about done. I've just committed a bunch of work including
a python version of the html parser and a basic domain model/persistence
layer:

http://project.knowledgeforge.net/pdw/trac/changeset/7

By later this morning I should have finished all the code to dump into 
the db. I will then write a script to dump from the db to a wiki.
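
For anyone wanting to follow along, the 'dump into the db' step is
roughly this shape. This is a simplified sketch rather than the actual
code in the changeset, and the field names are illustrative (the real
domain model is in the repository).

import sqlite3

def save_records(records, dbpath="pdw.db"):
    # records: an iterable of dicts coming out of the html parser,
    # e.g. {"title": ..., "performer": ..., "year": ...}
    conn = sqlite3.connect(dbpath)
    conn.execute("""CREATE TABLE IF NOT EXISTS recordings
                    (title TEXT, performer TEXT, year INTEGER)""")
    conn.executemany("INSERT INTO recordings VALUES (?, ?, ?)",
                     [(r["title"], r["performer"], r["year"])
                      for r in records])
    conn.commit()
    conn.close()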

[snip]

>>> I was ready and willing to finish the job of making the British
>>> Library screen scraper last April but I felt unable to move forward
>>> without a roadmap.
>>
>>
>> I thought you'd already finished that job :) -- all the scraping and  
>> parsing code you had written was working perfectly and all that 
>> needed  to be done in terms of data extraction was deciding on the 
>> database  structure so that we could put the data in. As you say the 
>> real  problem was deciding what interface to the data we wanted to 
>> provide  as that might influence the db structure. In any case my 
>> apologies for  not getting back to you properly in April.
> 
> 
> In my mind the bespoke screen scraper for the year 1900 was more a  
> proof of concept. I don't think the year 1900 recordings are likely to  
> be interesting to many if any artists. We need to scrape all the years  
> in which stuff could be out of copyright. Artists who are actually  

In fact we might as well try scraping /all/ years :)

> interested in a piece will probably be the ones most motivated to
> investigate the rights on it. Therefore I think we need to get it all
> into a database with a nice front end so that people can look at it
> and hopefully find something they want to use.

Completely agree. What I meant by my comments about having done the job
was that you had written a fully functioning scraper. Even if it is
currently only used for 1900, I think it can be trivially extended to
scrape everything (at least I hope so). The moment we have a functioning
system with 1900 we will start getting more data.
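
Extending the scraper should mostly be a matter of looping the existing
per-year scrape over a range of years, something like the following.
Here scrape_year and the year range are placeholders for whatever the
real scraper exposes.

import time

def scrape_all(scrape_year, start=1890, end=1955):
    # run a per-year scrape callable over a whole range of years,
    # carrying on past failures so one bad year does not kill the run
    for year in range(start, end + 1):
        try:
            scrape_year(year)
        except Exception as e:
            print("year %d failed: %s" % (year, e))
        time.sleep(1)  # be polite to the British Library servers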

[snip]

Regards,

Rufus



