[open-bibliography] Place of Publication data from the BL dataset

Tom Morris tfmorris at gmail.com
Fri Nov 26 16:51:58 UTC 2010


On Thu, Nov 25, 2010 at 7:26 PM, William Waites <ww at eris.okfn.org> wrote:
> * [2010-11-25 18:58:27 -0500] Tom Morris <tfmorris at gmail.com> wrote:
> ]
> ] One of the difficulties of the current dataset is that it has no URIs
> ] assigned and very few strong identifiers of any type that can be used
> ] as handles to reference things.  You could, for example, go through
> ] the extracted publication places and group duplicates together using
> ] Google Refine, but you'd have no way to use that cleaned data set to
> ] improve the original or any of the extracted copies.
>
> Indeed. One of the reasons I didn't invent unique URIs for the places
> when doing the first step of transformation for what is in
> http://bnb.bibliographica.org/ would be that in effect that would just
> be a process of skolemisation -- not very helpful.

Minting URIs for your copy of the data is too late in the process to
be helpful anyway.  What's needed is for the original data export to
include any available identifiers.

> Or even better, loop over all books with the same publisher and place
> name label and [equate the text literal with a geonames ID].

That's assuming the place of publication is stored on a per-publisher
basis, rather than a per-book basis, in the British Library database.
That kind of knowledge about the schema of the internal database will
be key to making informed decisions about the data.  Where is that
schema documented?

> If we can get there, some sort of fun game that people can play that
> creates SPARQL queries like this, we can fix the data.
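
For what it's worth, I imagine the query behind such a game would look
something like the sketch below.  The endpoint URL and the predicates
are guesses on my part, not what bnb.bibliographica.org actually
exposes:

from SPARQLWrapper import SPARQLWrapper, JSON

# Guessed endpoint URL and guessed predicates; adjust to whatever
# bnb.bibliographica.org really uses.
endpoint = SPARQLWrapper("http://bnb.bibliographica.org/sparql")
endpoint.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?book ?place WHERE {
        ?book dct:publisher ?pub .
        ?pub  dct:spatial   ?place .
        FILTER(isLiteral(?place) && str(?place) = "London")
    }
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

# Each matching book could then be tied to a geonames URI;
# http://sws.geonames.org/2643743/ is London.
for row in results["results"]["bindings"]:
    print(row["book"]["value"], "->", "http://sws.geonames.org/2643743/")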

The process you described fixes your *copy* of the data.  It doesn't
fix my copy or Karen's copy, and it certainly doesn't help the
original data.  Without identifiers in the original data, merging
everyone's independent improvements becomes much harder (as Jim
pointed out).
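
If, on the other hand, every copy of the data carried the same record
identifiers, merging our separate cleanups would be a trivial keyed
update instead of another round of string matching.  A sketch, again
with invented keys:

# Invented record identifiers; the only point is that both copies
# share the same keys.
LONDON = "http://sws.geonames.org/2643743/"

my_cleanup = {
    "BLL01012345678": {"place_of_publication": LONDON},
}
karens_cleanup = {
    "BLL01012345678": {"author": "Wilson, Angus, 1913-1991"},
}

merged = {}
for cleanup in (my_cleanup, karens_cleanup):
    for record_id, fixes in cleanup.items():
        merged.setdefault(record_id, {}).update(fixes)

print(merged)
# One record, now carrying both the geonames place and the cleaned
# author heading.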

On Fri, Nov 26, 2010 at 4:39 AM, Deliot, Corine <Corine.Deliot at bl.uk> wrote:
> The conversion currently takes the place of publication, distribution,
> etc. from the 260$a. We're considering including the 008/15-17 in future
> releases.

What does that mean in English?

Is there a listing available someplace of which fields in the dump
came from free-form text fields and which came from database records
that guarantee anything linked to the record always has the same text
value?  For example, book titles and edition statements are almost
certainly free-form text, but I'd expect authors to have individual
records, so that every book linked to the same record has the same
author name in the dump.

In short: is there a comprehensive list of which fields are free-form
and which are backed by database records?  Knowing the internal schema
would be very helpful in making use of the data.
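
Absent such a list, about the best an outsider can do is measure it:
if a field is backed by shared records, the ratio of distinct values
to uses should be low.  A rough sketch of that test, assuming the dump
has already been flattened into one dict per book (an assumption about
preprocessing on my side, not about the BL's format):

def distinctness(records, field):
    """Ratio of distinct values to non-empty uses of a field.

    A ratio near 1.0 suggests free-form text; a much lower ratio is
    at least consistent with a shared record behind the field.
    """
    values = [r[field] for r in records if r.get(field)]
    if not values:
        return None
    return len(set(values)) / len(values)

# Toy records standing in for the parsed dump; the trailing full stop
# in the third author string is all that keeps "author" from looking
# fully record-backed here.
records = [
    {"title": "Anglo-Saxon attitudes",
     "author": "Wilson, Angus, 1913-1991"},
    {"title": "Hemlock and after",
     "author": "Wilson, Angus, 1913-1991"},
    {"title": "The old men at the zoo",
     "author": "Wilson, Angus, 1913-1991."},
]

for field in ("title", "author"):
    print(field, distinctness(records, field))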

On Fri, Nov 26, 2010 at 2:45 AM, Ben O'Steen <bosteen at gmail.com> wrote:
> (And as Karen has just pointed out, the reason why I am exploring this field
> is to aid disambiguation of publishers. Having created the overview that I
> know I need,  I thought to share it here.)

That makes sense, although I'd have thought publisher data was noisy
enough, and of low enough value, that it would be pretty far down the
priority list for cleanup.

More interesting, I think, is whether these text strings represent one
author or three or ...:

  Wilson, Angus, 1913-1991
  Wilson, Angus, 1913-1991,
  Wilson, Angus, 1913-1991.
  Wilson, Angus.
  Willson, Angus.

My gut tells me that at least the first three text strings probably
represent a single author, but that's not what the database seems to
think.
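
Even a crude normalisation (strip trailing punctuation, collapse
whitespace), much like Google Refine's key-collision clustering, folds
the first three strings together while leaving the other two alone.  A
sketch; the normalisation rule here is mine, not anything the BL
documents:

import re
from collections import defaultdict

names = [
    "Wilson, Angus, 1913-1991",
    "Wilson, Angus, 1913-1991,",
    "Wilson, Angus, 1913-1991.",
    "Wilson, Angus.",
    "Willson, Angus.",
]

def key(name):
    # Strip trailing commas/full stops and collapse runs of whitespace.
    name = re.sub(r"[.,\s]+$", "", name)
    return re.sub(r"\s+", " ", name)

clusters = defaultdict(list)
for name in names:
    clusters[key(name)].append(name)

for k, members in sorted(clusters.items()):
    print(repr(k), "<-", members)
# The first three strings collapse onto one key; "Wilson, Angus" and
# "Willson, Angus" stay separate, pending a human (or an authority
# file) to say otherwise.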

Tom

p.s. Can some librarian type tell me what the trailing period (full
stop) means?  It's not used consistently, but it appears much, MUCH
more frequently in library data than anywhere else I've seen.



