[pdb-discuss] [Fwd: Author name lists]

Rufus Pollock rufus.pollock at okfn.org
Mon Jun 25 08:55:38 UTC 2007


Forwarding as Andrew sent from his non-list address (Andrew: I think you
are subscribed under generalist.org.uk, but I've now added this email
address to the sender filters so it should work in future :) ).

~rufus

-------- Original Message --------
Subject: Author name lists
Date: Mon, 25 Jun 2007 09:49:10 +0100
From: Andrew Gray <obfuscated>
Reply-To: andrew.gray at dunelm.org.uk
To: pdb-discuss at lists.okfn.org
CC: Rufus Pollock <obfuscated>

Many months ago, I went to the Open Knowledge meeting in London, and
sometime that evening complained at Rufus for getting confused about
library database conventions. I felt I ought to do something helpful
rather than just barrack from the back of the room...

I've tracked down a copy of the Library of Congress authority records,
and after some fiddling I think it's possible to extract a list of
some four million (give or take) named authors from them. What this
represents is everyone who is named as a primary author in a catalogue
record for a piece of material held by the Library of Congress - which
I guess means the authors of most non-ephemeral textual work published
in the Anglosphere since 1800. (Journalism is pretty much out, sadly,
and I have no idea what the sheet-music situation will be like.)

[There are separate datasets for "corporate names" and "conference or
meeting names", but these are less immediately useful for copyright
purposes and I'll leave them out for now.]

The information with them isn't very comprehensive - the basic record
is just a name and a string of metadata on how it should be prepared,
as this is basically a database for ensuring that the names on
catalogue records are accurate.

There may be other data attached to that name - publication or life
dates, full forms of initials, or a parenthetical remark like "Irish
poet" - in order to distinguish between people. (You can imagine how
many John Smiths have written books...) Then additional notes, maybe a
sentence or two on some tricky detail of nomenclature or alternate
name headings. Finally, there is a field stating where the record was
originally taken from. This is the second little goldmine - in almost
all cases, the record will have been based on the person's entry in
the book being catalogued, so we have a note of one of the books they
wrote *and, probably, its date*. Very useful for making educated
guesses on who an author was - novelist or biblical historian?
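
To make the shape of a record concrete, here is a minimal sketch of
splitting one name heading into its parts. The layout "Surname,
Forenames, 1837-1902 (remark)" and the function name parse_heading are
assumptions for illustration only; the real headings may well be laid
out differently, and this is not the cleanup script itself.

```python
import re

# Assumed heading layout: "Surname, Forenames, 1837-1902 (remark)".
# Both the dates and the parenthetical remark are optional.
HEADING_RE = re.compile(
    r"^(?P<name>.*?)"
    r"(?:,\s*(?P<birth>\d{4})?-(?P<death>\d{4})?)?"
    r"(?:\s*\((?P<remark>[^)]*)\))?\s*$"
)


def parse_heading(heading):
    """Split a heading into name, birth year, death year and remark."""
    match = HEADING_RE.match(heading.strip())
    if match is None:
        # Fall back to treating the whole string as the name.
        return {"name": heading.strip(), "birth": None,
                "death": None, "remark": None}
    birth, death = match.group("birth"), match.group("death")
    return {
        "name": match.group("name").strip(),
        "birth": int(birth) if birth else None,
        "death": int(death) if death else None,
        "remark": match.group("remark"),
    }
```

A heading with no dates or remark simply comes back with those fields
set to None, which matches the point that nothing beyond the name can
be relied on.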

But we can't *rely* on having anything more than a name and, mostly,
one of their works. About a third seem to have a date field, based on
a sample I took last night. This is usually a birthdate, but
birth-and-death dates are more common the further we go back. Of the
dated records, maybe ten to twenty percent - so five percent overall? -
seem to have "reliably public domain" dates, born over 180 years ago.
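
As a back-of-envelope check on those figures, the arithmetic can be
sketched as below. The inputs are just the rough sample estimates above
(four million names, a third with dates, ten to twenty percent of those
old enough); the constant and function names are made up for the
example, and 2007 is assumed as the reference year.

```python
TOTAL_AUTHORS = 4_000_000   # "some four million (give or take)"
DATED_FRACTION = 1 / 3      # records that carry a date field
CUTOFF_YEARS = 180          # "born over 180 years ago"


def estimate_old_authors(share_of_dated):
    """Rough count of authors with 'reliably public domain' dates."""
    return round(TOTAL_AUTHORS * DATED_FRACTION * share_of_dated)


def is_reliably_public_domain(birth_year, current_year=2007):
    """True if the author was born more than CUTOFF_YEARS years ago."""
    return birth_year is not None and current_year - birth_year > CUTOFF_YEARS


low = estimate_old_authors(0.10)    # ten percent of dated records
high = estimate_old_authors(0.20)   # twenty percent of dated records
print(low, high)
```

That works out to roughly 3 to 7 percent of all records, so "five
percent overall" is a fair midpoint, or on the order of a couple of
hundred thousand names.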

It's all in an XML-ish form at the moment, and totals a few GB - I'm in
the process of writing a script to clean it up and junk some of the
data we don't need - but would it be of any use? Even if most records
don't actually contain dates, it does provide a framework we can use
as the basis for people researching them.
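
For a file of a few GB, one way such a cleanup could work is to stream
the records rather than load everything at once. The sketch below is
only an illustration under stated assumptions: the element names
"record", "heading" and "source" are hypothetical, and it assumes
well-formed XML with each record a direct child of the root element,
which an "XML-ish" dump may not satisfy.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical field names; everything else in a record is junked.
KEEP = ("heading", "source")


def iter_slim_records(stream, record_tag="record"):
    """Yield one small dict per record, keeping only the KEEP fields."""
    context = ET.iterparse(stream, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            slim = {tag: (elem.findtext(tag) or "").strip() for tag in KEEP}
            root.clear()  # drop finished records so memory use stays low
            yield slim


SAMPLE = b"""<records>
  <record>
    <heading>Smith, John, 1837-1902</heading>
    <source>His Poems, 1870</source>
    <internal>not needed</internal>
  </record>
  <record><heading>Doe, Jane</heading></record>
</records>"""

slim_records = list(iter_slim_records(io.BytesIO(SAMPLE)))
```

A missing field comes back as an empty string rather than raising an
error, since most records will lack some of the optional data.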

Suggestions on how to proceed would be appreciated - what data is most
useful for our purposes? What could we do with it?

-- 
- Andrew Gray
  andrew.gray at dunelm.org.uk
