[pdb-discuss] british library screen scraping

Rufus Pollock rufus.pollock at okfn.org
Mon Apr 10 15:28:46 UTC 2006


Tim Cowlishaw wrote:
>> http://en.wikipedia.org/wiki/List_of_Classical_composers
> 
> 
> 
> oops.... just realised, that's only a list of composers specifically
> from the classical era rather than more generally what we're referring
> to as 'classical' music. there are doubtless other  pages on wikipedia
> in the same format which we could also scrape, or there's this:
> 
> http://www.classical-composers.org/cgi-bin/ccd.cgi?comp=_phome
> 
> any thoughts?

I think this is a good suggestion for where to obtain info on composers, 
and if someone would like to extract this info (it can be temporarily 
stored in a file if we can't add it to the db) that would be great. 
However:

   1. I think that we want to concentrate on getting the metadata on 
recordings out first.
   2. You probably don't want to screenscrape Wikipedia. It really 
annoys them and they already provide a db dump. What is really needed is 
a script to extract the needed info back into a structured form from the 
page html.
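
A minimal sketch of the kind of extraction script item 2 describes, in 
Python using only the standard library. It rests on an unverified 
assumption about the page layout: that each composer is the first link 
inside an <li> element. The names ListLinkExtractor and 
extract_composers are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class ListLinkExtractor(HTMLParser):
    """Collect (name, href) for the first link inside each <li> element.

    Assumption: the list page puts one composer per <li>, with the
    composer's own article as the first link in that item.
    """

    def __init__(self):
        super().__init__()
        self.records = []     # extracted (name, href) pairs
        self._in_li = False   # currently inside an <li>?
        self._done = False    # already took a link from this <li>?
        self._href = None     # href of the link being read, if any
        self._text = []       # text fragments of that link

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self._done = False
        elif tag == "a" and self._in_li and not self._done:
            # Links without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                self.records.append((name, self._href))
                self._done = True
            self._href = None
        elif tag == "li":
            self._in_li = False
            self._href = None


def extract_composers(html):
    """Return a list of (name, href) tuples found in the given HTML text."""
    parser = ListLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.records
```

The function takes the HTML as a string, so it can be run on a page 
saved to a local file rather than fetched live, and the resulting 
tuples can be written to a temporary file until they can go into the db.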

~rufus



