[pdb-discuss] update on progress: db to wiki code nearly done

Rufus Pollock rufus.pollock at okfn.org
Sat Jun 24 09:27:30 UTC 2006


Nathan Lewis wrote:
> 
> Hi Rufus
> 
> On Friday, June 23, 2006, at 02:14PM, Rufus Pollock
> <rufus.pollock at okfn.org> wrote:
> 
> 
>> That's great Nathan. Perhaps you would like to take a stab at
>> parsing this file which contains a list of composers and their
>> birth/death dates:
>> 
>> http://project.knowledgeforge.net/pdw/Composers.txt
>> 
> 
> I have taken a stab at that. I have parsed it into a tab delimited
> csv file for loading into a spreadsheet or for easy transfer into a
> database.

That's perfect. I've taken a quick look at the csv and it looks great.
Now that you have svn access, you can upload your script to bin/.

> I did it in perl as it required some bits that would take me a long
> time to learn in python.

No problem.

> There are of course some errors which are easy to see if you view it
> in a spreadsheet. OpenOffice or Gnumeric should load it up without
> any trouble. The errors are some oddball little notes that some
> human(s) put in but the error rate is pretty low and I was able to
> parse ~90% of it correctly in spite of the many curve balls they

It might be good to add a column called 'error' and then put in one of 
y|n|? to indicate how likely it is that there was a problem with that 
entry, so that we can go back and fix those by hand.
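
As a rough sketch of what I mean, here is one way it could look in 
Python. The three-column layout (name, birth, death), the function names 
and the rules for choosing y|n|? are all my assumptions for illustration; 
I haven't checked them against your file, so adjust to taste.

```python
import csv
import re

# Assumption: a "clean" date is a plain four-digit year.
YEAR = re.compile(r"^\d{4}$")


def error_flag(birth, death):
    """Return 'n' if both dates are plain years, '?' if either is
    empty or contains a question mark, and 'y' for anything else."""
    fields = [birth.strip(), death.strip()]
    if all(YEAR.match(f) for f in fields):
        return "n"
    if any(f == "" or "?" in f for f in fields):
        return "?"
    return "y"


def add_error_column(src, dst):
    """Copy tab-delimited rows from src to dst, appending an error flag.

    Assumes (hypothetically) that each good row is name, birth, death.
    Rows with any other number of columns are flagged 'y'.
    """
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        if len(row) == 3:
            flag = error_flag(row[1], row[2])
        else:
            flag = "y"
        writer.writerow(row + [flag])
```

Sorting the spreadsheet on the new column would then bring all the y and 
? rows together for checking by hand.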

> threw. Places where you see question marks in the dates were because
> the question marks (and no dates) were in the original data.

Yes, parsing and cleaning data is definitely one of the big challenges 
of this project. We also need to do things like normalize names and work 
out how we deal with imprecise dates, e.g. '1900 ca.' or '1900-1905'.
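
One possible way to handle the imprecise dates, sketched in Python: store 
every date as an (earliest, latest, approximate) triple, so that an exact 
year, a 'ca.' year and a range all end up in the same shape. The function 
name, the triple and the three patterns below are only assumptions for 
illustration; the real data will certainly have more variants than these.

```python
import re

# Assumed patterns; the real file will likely need more of them.
EXACT = re.compile(r"^(\d{4})$")
CIRCA = re.compile(r"^(?:ca\.?\s*(\d{4})|(\d{4})\s*ca\.?)$")
RANGE = re.compile(r"^(\d{4})\s*-\s*(\d{4})$")


def parse_year(text):
    """Return (earliest, latest, approximate), or None if the text
    does not match any of the patterns above."""
    s = text.strip()
    m = EXACT.match(s)
    if m:
        year = int(m.group(1))
        return (year, year, False)
    m = CIRCA.match(s)
    if m:
        # Exactly one of the two groups is set, depending on whether
        # 'ca.' came before or after the year.
        year = int(m.group(1) or m.group(2))
        return (year, year, True)
    m = RANGE.match(s)
    if m:
        lo, hi = int(m.group(1)), int(m.group(2))
        if lo <= hi:
            return (lo, hi, False)
    return None
```

Anything that comes back as None would be a candidate for checking by 
hand rather than guessing.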

Regards,

Rufus

ps: if you can, do cc the list (pdb-discuss at lists.okfn.org) on these 
mails so that everyone else knows what we are up to



