[pdb-discuss] parsing the bbc data

Yves Raimond yves.raimond at gmail.com
Tue Jul 24 16:04:41 UTC 2007


Hi!

> I had a first shot at parsing the bbc data (see attached file if
> you're interested).  It seems pretty well structured, but I wonder if
> anyone can tell what the columns are?
>
> Please have a look at http://p.knowledgeforge.net/pdw/A1_parsed_20070721.csv

It looks good :-) Sorry, I have not had time to parse it yet.

>
> The first column is the title, the second is the 'pre title' (i.e. the,
> das, etc.), but what about the rest?
>
> I suppose for pdw we're only interested in a few.

I think the main problem is parsing the dates. Their serialization is
really heterogeneous, from "operetta (1907)" to "5 june 1907" and
"1st New York performance 6 January 1906", so I think we will quickly
run into real trouble :-(

Rufus, I think you mentioned some sort of manual the other day. Is it
available somewhere?

Maybe we could just look for numbers within a range that we know
covers the years "of interest" for what we are looking for (public
domain works)? That could make things much easier, and leave the
complexity to the identification of works/performances.
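
A rough sketch of the year-range idea in Python, just to show how little
code it would need. The bounds 1850 and 1957 and the name
years_of_interest are placeholders I made up for illustration; the
actual range would depend on which public domain rules we apply.

```python
import re

# Hypothetical bounds: nothing in this thread fixes the real range.
YEAR_MIN = 1850
YEAR_MAX = 1957

# Exactly four digits, not embedded in a longer run of digits.
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def years_of_interest(text, lo=YEAR_MIN, hi=YEAR_MAX):
    """Return every 4-digit number in text that lies within [lo, hi]."""
    return [int(m) for m in YEAR_RE.findall(text) if lo <= int(m) <= hi]


# The three date strings quoted above.
samples = [
    "operetta (1907)",
    "5 june 1907",
    "1st New York performance 6 January 1906",
]
for s in samples:
    print(s, "->", years_of_interest(s))
```

This ignores days and months entirely and would also pick up any
four-digit number in the range that is not a year, so it is only a
first filter, not a date parser.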

Cheers,
y



