[pdb-discuss] Re: Author name lists

Mon Jun 25 09:25:53 UTC 2007

Wow, Andrew, this is fantastic. To respond:

a) This is definitely useful to us.

b) Suggestions on how to proceed: we are currently focusing on 
recordings so we are particularly interested in composers and data on 
recordings themselves.

We've already got a bit of data on composers (see recent blogpost[1]) 
but we need more. However, what we are really lacking is data on the 
recordings themselves. One possibility is to extract this kind of info 
from the Library of Congress as well (this was briefly discussed in [2]) 
-- given your expertise in this area this might be a good way to go.

Alternatively the BBC archives have been kind enough to make a whole 
bunch of recordings data available to us. If you like the challenge of 
parsing obscurish formats we've got over 1GB that we need to parse. You 
can get a sample from:

   http://p.knowledgeforge.net/pdw/A1_sample.txt

c) We are currently storing our code (though not all the data as it is 
too large ...) in a subversion repository:

   http://p.knowledgeforge.net/pdw/svn/

If you'd like to use this you'd be very welcome. Just sign up for an 
account on http://www.knowledgeforge.net/ and let me know your username 
and you can have commit access.

~rufus

[1]:<http://www.publicdomainworks.net/2007/06/10/public-domain-composers/>
[2]:<http://lists.okfn.org/pipermail/pdb-discuss/2006-August/000086.html>

Andrew Gray wrote:
> Many months ago, I went to the Open Knowledge meeting in London, and
> sometime that evening complained at Rufus for getting confused about
> library database conventions. I felt I ought to do something helpful
> rather than just barrack from the back of the room...
> 
> I've tracked down a copy of the Library of Congress authority records,
> and after some fiddling I think it's possible to extract a list of
> some four million (give or take) named authors from them. What this
> represents is everyone who is named as a primary author in a catalogue
> record for a piece of material held by the Library of Congress - which
> I guess means the authors of most non-ephemeral textual work published
> in the Anglosphere since 1800. (Journalism is pretty much out, sadly,
> and I have no idea what the sheet-music situation will be like)
> 
> [There are separate datasets for "corporate names" and "conference or
> meeting names", but these are less immediately useful for copyright
> purposes and I'll leave them out for now]
> 
> The information with them isn't very comprehensive - the basic record
> is just a name and a string of metadata on how it should be prepared,
> as this is basically a database for ensuring that the names on
> catalogue records are accurate.
> 
> There may be other data attached to that name - publication or life
> dates, full forms of initials, or a parenthetical remark like "Irish
> poet" - in order to distinguish between people. (You can imagine how
> many John Smiths have written books...) Then additional notes, maybe a
> sentence or two on some tricky detail of nomenclature or alternate
> name headings. Finally, there is a field stating where the record was
> originally taken from. This is the second little goldmine - in almost
> all cases, the record will have been based on the person's entry in
> the book being catalogued, so we have a note of one of the books they
> wrote *and, probably, its date*. Very useful for making educated
> guesses on who an author was - novelist or biblical historian?
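
The record layout described above (a name, optional qualifiers and dates, and a note of the source work) could be pulled out along these lines. This is only a sketch under assumptions: the message says just that the data is "xmlish", so the MARCXML-style element names, the field tags (100 for the personal name, 670 for the source), the subfield codes, and the sample record itself are hypothetical and would need checking against the real files. Real MARCXML also carries an XML namespace, which is left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag/subfield layout is an ASSUMPTION, and the
# person and book are invented purely for illustration.
SAMPLE = """
<collection>
  <record>
    <datafield tag="100">
      <subfield code="a">Smith, John,</subfield>
      <subfield code="c">Irish poet,</subfield>
      <subfield code="d">1790-1852</subfield>
    </datafield>
    <datafield tag="670">
      <subfield code="a">His Collected verses, 1840.</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="100">
      <subfield code="a">Smith, John</subfield>
    </datafield>
  </record>
</collection>
"""


def subfield(record, tag, code):
    """Return the text of the first matching subfield, or None if absent."""
    node = record.find(f"./datafield[@tag='{tag}']/subfield[@code='{code}']")
    return node.text.strip() if node is not None and node.text else None


def parse_records(xml_text):
    """Yield one dict per record; anything except the name may be None."""
    for record in ET.fromstring(xml_text).iter("record"):
        yield {
            "name": subfield(record, "100", "a"),
            "qualifier": subfield(record, "100", "c"),
            "dates": subfield(record, "100", "d"),
            "source": subfield(record, "670", "a"),
        }


records = list(parse_records(SAMPLE))
for r in records:
    print(r)
```

Treating every field except the name as optional matches the point that most records carry little more than a name and one source work. For a multi-gigabyte file, `ET.iterparse` would probably be a better fit than loading everything with `ET.fromstring`.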
> 
> But we can't *rely* on having anything more than a name and, mostly,
> one of their works. About a third seem to have a date field, based on
> a sample I took last night, usually a birthdate, but birth-and-death
> dates are more common the further we go back. Maybe ten to twenty
> percent of those - so five percent overall? - seem to have "reliably
> public domain" dates, born over 180 years ago.
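
The back-of-envelope estimate above can be checked, and the same cutoff applied to real date fields, with something like the following. It is a rough sketch: the date layout ("1790-1852" or "1931-"), the helper names, and the tiny sample list are assumptions for illustration, and 2007 is used as the reference year only because that is when this thread was written.

```python
import re

CUTOFF_AGE = 180        # "born over 180 years ago", as in the message
REFERENCE_YEAR = 2007   # assumed reference year (date of this thread)

# ASSUMED layout: "1790-1852" (birth and death) or "1931-" (birth only).
DATE_RE = re.compile(r"^(\d{4})-(\d{4})?$")


def birth_year(dates):
    """Return the birth year from a date field, or None if unparseable."""
    if not dates:
        return None
    match = DATE_RE.match(dates.strip())
    return int(match.group(1)) if match else None


def reliably_public_domain(dates, reference_year=REFERENCE_YEAR):
    """True if the birth year is more than CUTOFF_AGE years in the past."""
    year = birth_year(dates)
    return year is not None and reference_year - year > CUTOFF_AGE


def summarise(date_fields):
    """Count records with a usable date and those past the cutoff."""
    total = len(date_fields)
    dated = sum(1 for d in date_fields if birth_year(d) is not None)
    old = sum(1 for d in date_fields if reliably_public_domain(d))
    return {
        "total": total,
        "dated": dated,
        "old": old,
        "share_old_overall": old / total if total else 0.0,
    }


# The estimate from the message: about a third of records have dates,
# and ten to twenty percent of those are old enough.
low, high = (1 / 3) * 0.10, (1 / 3) * 0.20
print(f"estimated overall share: {low:.1%} to {high:.1%}")

# Invented sample of date fields, purely to exercise the functions.
sample = ["1790-1852", "1931-", None, None, "1826-1900", None]
summary = summarise(sample)
print(summary)
```

The estimate comes out at roughly 3.3% to 6.7% of all records, so "five percent overall" sits in the middle of that range. With around four million names, that would be somewhere in the region of 130,000 to 270,000 authors.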
> 
> It's all in an xmlish form at the moment, and totals a few GB - I'm in
> the process of writing a script to clean it up and junk some of the
> data we don't need - but would it be of any use? Even if most records
> don't actually contain dates, it does provide a framework we can use
> as the basis for people researching them.
> 
> Suggestions on how to proceed would be appreciated - what data is most
> useful for our purposes? What could we do with it?...
>