[pdb-discuss] british library screen scraping

Dan Leech dan.t.leech at gmail.com
Mon Apr 10 18:35:03 UTC 2006


Sorry, there are several categories of composers: baroque, renaissance, etc.
I just used classical as an example.

The script I had in mind would just import the HTML pages and then
extract the data into a temporary DB.
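
As a rough sketch of that kind of script (not run against the real pages):
the Python below parses a saved HTML page with the standard library's
html.parser and loads the names it finds into an in-memory SQLite table.
The sample markup, the ComposerLinkParser class, the helper functions, and
the composer table with its columns are all made up for illustration; the
real pages would need their own extraction rules.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample standing in for a saved HTML page. The real page
# layout is not known here, so the <li><a> structure is an assumption.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Joseph_Haydn">Joseph Haydn</a> (1732-1809)</li>
  <li><a href="/wiki/Wolfgang_Amadeus_Mozart">Wolfgang Amadeus Mozart</a> (1756-1791)</li>
</ul>
"""


class ComposerLinkParser(HTMLParser):
    """Collect the text of the first <a> inside each <li>."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.in_a = False
        self.seen_link = False
        self.names = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.seen_link = False
        elif tag == "a" and self.in_li and not self.seen_link:
            self.in_a = True
            self._buf = []

    def handle_data(self, data):
        # Only keep text that sits inside the link we are tracking.
        if self.in_a:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_a:
            self.in_a = False
            self.seen_link = True
            name = "".join(self._buf).strip()
            if name:
                self.names.append(name)
        elif tag == "li":
            self.in_li = False


def extract_composers(html):
    """Return the composer names found in one page of HTML."""
    parser = ComposerLinkParser()
    parser.feed(html)
    parser.close()
    return parser.names


def load_into_temp_db(names, category):
    """Store the names in a throwaway in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE composer (name TEXT NOT NULL, category TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO composer (name, category) VALUES (?, ?)",
        [(name, category) for name in names],
    )
    conn.commit()
    return conn


names = extract_composers(SAMPLE_HTML)
conn = load_into_temp_db(names, "classical")
rows = conn.execute(
    "SELECT name, category FROM composer ORDER BY name"
).fetchall()
```

The category column is there so that baroque, renaissance, etc. pages
could go into the same table; swapping ":memory:" for a file path would
keep the data between runs.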

On 10/04/06, Rufus Pollock <rufus.pollock at okfn.org> wrote:
>
> Tim Cowlishaw wrote:
> >> http://en.wikipedia.org/wiki/List_of_Classical_composers
> >
> >
> >
> > Oops, just realised that's only a list of composers specifically
> > from the classical era rather than, more generally, what we're referring
> > to as 'classical' music. There are doubtless other pages on Wikipedia
> > in the same format which we could also scrape, or there's this:
> >
> > http://www.classical-composers.org/cgi-bin/ccd.cgi?comp=_phome
> >
> > any thoughts?
>
> I think this is a good suggestion for where to obtain info on composers,
> and if someone would like to extract this info (it can be
> temporarily stored in a file if we can't add it to the db) that would be
> great. However:
>
>    1. I think that we want to concentrate on getting the metadata on
> recordings out first
>    2. You probably don't want to screenscrape wikipedia. It really
> annoys them and they already provide a db dump. What is really needed is
> a script to extract the needed info back into a structured form from the
> page html.
>
> ~rufus
>



--
Dan Leech
Virtual Art Solutions
www.dantleech.com