[pdb-discuss] british library screen scraping
Rufus Pollock
rufus.pollock at okfn.org
Mon Apr 10 15:28:46 UTC 2006
Tim Cowlishaw wrote:
>> http://en.wikipedia.org/wiki/List_of_Classical_composers
>
> oops.... just realised, that's only a list of composers specifically
> from the classical era rather than more generally what we're referring
> to as 'classical' music. there are doubtless other pages on wikipedia
> in the same format which we could also scrape, or there's this:
>
> http://www.classical-composers.org/cgi-bin/ccd.cgi?comp=_phome
>
> any thoughts?
I think this is a good suggestion for where to obtain info on composers,
and if someone would like to extract this info (it can be stored
temporarily in a file if we can't add it to the db) that would be
great. However:
1. I think that we want to concentrate on getting the metadata on
recordings out first.
2. You probably don't want to screen-scrape Wikipedia. It really
annoys them, and they already provide a db dump. What is really needed is
a script to extract the needed info from the page HTML back into a
structured form.
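
To make point 2 a bit more concrete, here is a rough sketch of what such an extraction script might look like. It assumes the composer list is marked up as plain <li> items whose first link is the composer's name, which may not match the real page. The sample HTML, the ComposerListParser class and the extract_composers function are all made up for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class ComposerListParser(HTMLParser):
    """Collect the first link inside each <li> as a {name, href} record.

    Assumption: each list item starts with a link to the composer's page.
    """

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_li = False
        self._done = False   # True once the current <li> has yielded a record
        self._href = None    # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li, self._done = True, False
        elif tag == "a" and self._in_li and not self._done:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside the link we are tracking.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                self.records.append({"name": name, "href": self._href})
            self._href = None
            self._done = True
        elif tag == "li":
            self._in_li = False


def extract_composers(html):
    """Return a list of {name, href} dicts found in the given HTML string."""
    parser = ComposerListParser()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical sample input, not copied from any real page.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Joseph_Haydn">Joseph Haydn</a></li>
  <li><a href="/wiki/Wolfgang_Amadeus_Mozart">Wolfgang Amadeus Mozart</a>,
      born in <a href="/wiki/Salzburg">Salzburg</a></li>
  <li>An item with no link</li>
</ul>
"""

if __name__ == "__main__":
    for record in extract_composers(SAMPLE_HTML):
        print(record["name"], record["href"])
```

The records could then be written out to a file (CSV or similar) until we can add them to the db. The same approach should work on HTML saved locally, so nothing has to be fetched from the live site repeatedly.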
~rufus