[open-bibliography] Place of Publication data from the BL dataset

Tom Morris tfmorris at gmail.com
Thu Nov 25 23:58:27 UTC 2010


On Thu, Nov 25, 2010 at 10:35 AM, Jim Pitman <pitman at stat.berkeley.edu> wrote:
> Ben O'Steen <bosteen at gmail.com> wrote:
>
>> I've pulled out the place of publication(?) data (isbd:P1016) from the
>> BL BNB dataset and compiled it into a sorted spreadsheet, focussed on
>> the locations:
>> http://bit.ly/g2l2tM
>
> https://spreadsheets.google.com/pub?key=0Ai_sd71RSo30dFoteW1MSS1KRWx6dVFQTW1JdnNVYWc&hl=en

A spreadsheet instead of a word processing document would definitely
be easier to deal with.  It would also allow direct import into Google
Refine.

> http://code.google.com/p/google-refine/
> http://www.needlebase.com/
...
> Does anyone on this list have experience using these tools?

I wrote the Google Spreadsheets importer for Google Refine and was an
early beta user of Needle.  I'd be happy to answer questions on
either.  Neither of them really has any direct support for
collaborative data cleanup; pretty much all division of labor and
coordination needs to be managed externally.  One of the reasons I
added the Google Spreadsheets support was so that these cloud-based
sources could be used as a rudimentary sharing mechanism.
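
As a rough sketch of what that sharing looks like in practice, a
published spreadsheet can be pulled down as CSV and parsed locally.
The `output=csv` export parameter here is an assumption based on the
old publish-to-the-web URLs; the Refine importer handles this for you.

```python
# Sketch: pulling a published Google Spreadsheet as CSV for offline cleanup.
# The "output=csv" export parameter is an assumption, not a documented
# guarantee; the Google Refine importer hides these details.
import csv
import io

def published_csv_url(key, gid=0):
    """Build a CSV export URL for a published spreadsheet (assumed pattern)."""
    return ("https://spreadsheets.google.com/pub"
            "?key=%s&single=true&gid=%d&output=csv" % (key, gid))

def load_rows(csv_text):
    """Parse exported CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))
```

You would fetch the URL with any HTTP client and feed the response
body to `load_rows`; from there each row is a plain dict to clean up.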

One of the difficulties with the current dataset is that it has no URIs
assigned and very few strong identifiers of any type that can be used
as handles to reference things.  You could, for example, go through
the extracted publication places and group duplicates together using
Google Refine, but you'd have no way to feed that cleaned dataset back
into the original or any of the extracted copies.
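
For anyone who hasn't used Refine's clustering, the grouping step is
roughly its "fingerprint" key-collision method: normalize each value to
a key, then cluster values that collide on the same key.  A minimal
sketch of the idea:

```python
# Sketch of key-collision clustering, roughly the "fingerprint" method
# Google Refine uses: lowercase, strip punctuation, sort and dedupe the
# tokens, then group raw values whose keys collide.
import re
from collections import defaultdict

def fingerprint(value):
    """Reduce a string to a normalized key."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group raw values by fingerprint; keep only multi-variant clusters."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    # Only clusters with more than one variant need human attention.
    return {k: sorted(vs) for k, vs in groups.items() if len(vs) > 1}
```

So `cluster(["London", "London.", "LONDON", "New York", "York, New"])`
puts the three Londons in one cluster and, because tokens are sorted,
"New York" and "York, New" in another.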

As for the overall workflow, what's missing, in my opinion, is a
crowdsourcing framework that allows the creation of simple 'data
games' similar to the old Freebase games (e.g. Genderizer, Typewriter,
etc.) but without the requirement for in-depth technical expertise and
custom programming.  It should be possible for a domain expert to set
up a queue of tasks that are simple enough to be crowdsourced, and to
get the tasks into the queue and the results back out again in a
simple and straightforward manner.  The two main things that need to
be done are 1) cluster duplicate variants together and 2) reconcile
things with a source of strong identifiers such as Freebase, DBpedia,
or the Library of Congress.
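
The reconciliation half of that amounts to mapping each cleaned value
to a strong identifier, or queuing it for a human when no match is
found.  A sketch, where the authority table is a hypothetical stand-in
for a real service such as Freebase, DBpedia, or the Library of
Congress:

```python
# Sketch of the reconciliation step: map cleaned place names to strong
# identifiers.  AUTHORITY is a made-up miniature lookup table standing in
# for a real reconciliation service.
AUTHORITY = {
    "london": "http://dbpedia.org/resource/London",
    "cambridge": "http://dbpedia.org/resource/Cambridge",
}

def reconcile(name):
    """Return (identifier, matched?) for a cleaned place name."""
    key = name.strip().lower()
    if key in AUTHORITY:
        return AUTHORITY[key], True
    return None, False  # no match: queue this value for human review
```

The unmatched values are exactly the tasks you'd want to push into the
crowdsourcing queue.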

Tom

p.s. Personally, I think having the data pulled out of context
actually makes it harder to use, not easier.  If I look at the string
'Cambridge,' it could represent pretty much any place, but if I can
see what book was published there, I can probably guess whether it's
the town across the river from Boston or someplace off on an island
across the ocean.
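
To illustrate the point: even a crude rule that looks at the book's
publisher can separate the two Cambridges.  The publisher-to-place
hints below are made-up examples, not real reference data.

```python
# Illustrative sketch of the p.s.: use publication context (here, the
# publisher) to disambiguate an ambiguous place string.  The hint table
# is fabricated for the example.
PUBLISHER_HINTS = {
    "Harvard University Press": "Cambridge, Massachusetts",
    "Cambridge University Press": "Cambridge, England",
}

def disambiguate(place, publisher):
    """Prefer the candidate place suggested by the book's publisher."""
    hint = PUBLISHER_HINTS.get(publisher)
    if hint and hint.startswith(place):
        return hint
    return place  # no usable context: leave the ambiguous string as-is
```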
