[pdb-discuss] Parsing BBC CAIRNS data

Rufus Pollock rufus.pollock at okfn.org
Thu Apr 9 14:52:27 UTC 2009


Dear Nathan (and James and Yves),

To follow up on my earlier message, the raw BBC CAIRNS data has now
been uploaded to archive.org:

<http://archive.org/details/pdw_bbc_cairns>

It would be really great if we could take forward the existing (rather
basic) parser:

<http://knowledgeforge.net/pdw/hg/file/tip/pdw/getdata/cairns.py>

In particular, it would be nice to extend this so that the data is
actually loaded into the db, along the lines of the code for
other datasets; see e.g. Loader.load_to_db in

<http://knowledgeforge.net/pdw/hg/file/tip/pdw/getdata/ngcoba2.py>
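As a very rough illustration of the shape this could take, here is a
minimal sketch. It is not the existing parser or loader code. The
"key: value" record layout, the field names (title, performer), the
item table and the use of sqlite3 are all assumptions made up for the
example; only the method name load_to_db echoes the loader mentioned
above. The real CAIRNS format and the PDW domain model will differ.

```python
import sqlite3


def parse_records(lines):
    """Turn 'key: value' lines into dicts; a blank line ends a record.

    This layout is assumed for illustration; it is not the real CAIRNS format.
    """
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            if record:
                yield record
                record = {}
            continue
        key, _, value = line.partition(":")
        record[key.strip().lower()] = value.strip()
    if record:
        yield record


class Loader:
    """Hypothetical loader; the table and columns are placeholders."""

    def __init__(self, connection):
        self.connection = connection
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS item (title TEXT, performer TEXT)"
        )

    def load_to_db(self, records):
        """Insert each parsed record and return the number of rows loaded."""
        count = 0
        for rec in records:
            self.connection.execute(
                "INSERT INTO item (title, performer) VALUES (?, ?)",
                (rec.get("title"), rec.get("performer")),
            )
            count += 1
        self.connection.commit()
        return count


# Made-up sample input, not taken from the CAIRNS data.
sample = """Title: Example Work One
Performer: Example Performer A

Title: Example Work Two
Performer: Example Performer B
""".splitlines()

conn = sqlite3.connect(":memory:")
loaded = Loader(conn).load_to_db(parse_records(sample))
```

Keeping the parsing step (a generator of plain dicts) separate from
the loading step means the same parsed records could later be fed to
an exporter for another format instead of the db.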

This should yield a great deal more data which we can add to PDW (or
export to other formats -- RDF etc.).

Rufus

---------- Forwarded message ----------
From: Rufus Pollock <rufus.pollock at okfn.org>
Date: 2009/3/2
Subject: Getting Moving Again ...
To: Nathan Lewis <nathanlewis42 at googlemail.com>, James Casbon
<casbon at gmail.com>, Yves Raimond <yves.raimond at gmail.com>
Cc: pdb-discuss at lists.okfn.org


Dear All,

It's been a bit quiet here over the last year. However, things have
been happening behind the scenes and I think now is a good time to get
things going a bit more in public.

With the basic domain model now fixed up, we're in a good position
to really start getting into the data parsing and loading in a big
way. (I note that once parsed into a decent format the data could
easily be exported in a lot of different forms including RDF ...)

I know Nathan has already expressed willingness to have a go at
improving the existing BBC parsers. It would be great to have a couple
more volunteers able to spare a bit of time to help out on this -- or
on any other part of the project (e.g. providing an edit interface,
coding up the PD calculators, etc.).

More details below on what's been happening and what data we'd like to
parse and upload into the system.

Rufus


## Update

Over the last couple of months the basic domain model has been
completely rewritten along FRBR lines and should now be pretty stable:

 <http://knowledgeforge.net/pdw/hg/file/tip/pdw/model/frbr.py>

With this done we should look to start bulk loading data into it.
Suggested items are:

1. KCL CHARM Catalogue Data: <http://www.charm.kcl.ac.uk/index.html>.
I've made a start on this and you can see a rough cut of their
Schubert data in PDW right now:

 <http://publicdomainworks.net/item/>

2. Donated BBC Catalogue Data (CAIRNS). We made a start on this back
in summer 2007 thanks to the work of James and Yves. The full dataset
is now posted on archive.org and it would be great to load this into
the system:

 <http://ia331417.us.archive.org/3/items/pdw_bbc_cairns/>



