[open-bibliography] [ol-discuss] Multivolume works

Wilkin, John jpwilkin at umich.edu
Fri May 4 13:50:44 UTC 2012

Thanks, John.  It's worth considering why the information in HathiTrust
tends to be fuller and more reliable.  I'll try to keep this brief, but
there are a few key pieces that we could explode into fuller explanations.
 The crux here is that those analytics come from the original work of
technical services staff--catalogers and check-in staff--rather than
digitization staff.

In library systems, the bibliographic record contains information at the
title level, both for books and serials.  Hanging off the bibliographic
record are records that we typically call item records, one for each
volume.  In many of our systems, that item record has the analytic (e.g.,
enumeration and chronology) information for multi-volume sets; it also has
a mechanism for tracking the associated volume, and that's typically a
barcode.
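
To make that structure concrete, here is a minimal sketch of a title-level
record with item records hanging off it. The class and field names
(BibRecord, ItemRecord, find_item) are illustrative assumptions, not the
schema of any particular library system, and the barcode value is made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ItemRecord:
    """One physical volume; carries the analytic information."""
    barcode: str            # tracking identifier for the physical volume
    enumeration: str = ""   # e.g. "v.2"
    chronology: str = ""    # e.g. "1887"


@dataclass
class BibRecord:
    """Title-level record for a book or serial."""
    title: str
    items: List[ItemRecord] = field(default_factory=list)

    def find_item(self, barcode: str) -> Optional[ItemRecord]:
        """Return the item record for a barcode, or None if not held."""
        for item in self.items:
            if item.barcode == barcode:
                return item
        return None


# Hypothetical two-volume set.
bib = BibRecord(
    title="Example multivolume work",
    items=[
        ItemRecord("39015000000001", "v.1", "1887"),
        ItemRecord("39015000000002", "v.2", "1888"),
    ],
)
```

The point of the shape is that volume-level detail lives on the item, so
anything that knows the barcode can recover the enumeration and chronology.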

For most of the scanning, whether done by Google, done locally at some
partner institutions, or done by vendors, the barcode is scanned during
the book scanning process, and the bibliographic information (including
the volume information) essentially travels with the book.
During the Making of America project we learned the lesson of the problems
of manually keying identifier information:  with thousands of volumes
being scanned, keyboarding mistakes were statistically unavoidable.
Wanding a barcode avoids that problem and with it come nice things like
check digits in the barcode for other forms of downstream validation.  But
more importantly, it allows us to associate all of the elements of the
cataloging and check-in process with the digitized volume.
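
As an illustration of what a check digit buys downstream, here is a sketch
that validates a numeric barcode using the Luhn (mod 10) algorithm. Which
check-digit scheme a given library's barcodes use is an assumption here;
Luhn stands in as a common example, and the function names are made up.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    # Walk right to left, doubling every second digit, starting with
    # the rightmost digit of the payload.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def is_valid_barcode(barcode: str) -> bool:
    """True if the last digit matches the check digit of the rest."""
    if not barcode.isdigit() or len(barcode) < 2:
        return False
    return luhn_check_digit(barcode[:-1]) == int(barcode[-1])
```

A single mistyped digit changes the expected check digit, so a manually
keyed identifier with that kind of error fails validation instead of
silently attaching the scan to the wrong volume.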

You'll see that in most (though not all) cases HathiTrust uses the barcode
as part of the identifier. This makes it possible for us to link back to
that item record and thus the analytic information.  It also ensures that
we can rely on systems that manage the corresponding print for correction
and amplification.  If the source library got the volume identifier wrong,
we can try to make sure the correction goes into the source library's
catalog and then the correction can make its way back to
HathiTrust.  If the series wasn't analyzed or wasn't fully analyzed by the
source library, the source library can undertake a project to do that
analysis later, adding volume information to the item record and
re-exporting it so that it can be associated with the digitized volume.
This ongoing attention to the connection between the print and the digital
is an important part of the management of the collections and can help
other libraries when, for example, they want to correlate their
collections to other libraries.
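
The link back from a digitized volume to its item record can be sketched
roughly as follows. The "namespace.localid" identifier shape, the "lib"
namespace, the barcode values, and the dictionary standing in for the
source library's catalog export are all assumptions for illustration.

```python
from typing import Dict, Optional, Tuple


def split_volume_id(volume_id: str) -> Tuple[str, str]:
    """Split an identifier of the assumed form 'namespace.localid'."""
    namespace, sep, local_id = volume_id.partition(".")
    if not sep or not namespace or not local_id:
        raise ValueError(f"unexpected identifier: {volume_id!r}")
    return namespace, local_id


def lookup_enumeration(volume_id: str,
                       item_records: Dict[str, dict]) -> Optional[str]:
    """Find the volume information for a digitized volume.

    item_records maps barcode -> item record (a plain dict here),
    standing in for an export from the source library's catalog.
    """
    _, barcode = split_volume_id(volume_id)
    record = item_records.get(barcode)
    return record["enumeration"] if record else None


# Hypothetical export; a later re-export with corrected or newly added
# volume information simply replaces these entries.
export = {"39015000000002": {"enumeration": "v.2"}}
```

Because the lookup is keyed on the barcode, a correction made in the
source catalog reaches the digitized volume without anyone re-keying it.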

So, really, HathiTrust can't take credit for getting the info right:  the
source library got it right. (If the info is wrong, it's also the case
that the source library got it wrong.)  It's a system design success,
informed by understanding collections and collection management processes.
 What's surprising, incidentally, is that, for Google-digitized content,
Google has this information and ignores it.

On 5/4/12 8:02 AM, "John Mark Ockerbloom" <ockerblo at pobox.upenn.edu> wrote:

>Volume collation can be done; HathiTrust does a fairly decent job
>of it for the volumes they collect.  (They don't always have complete
>sets, and they could do a better job at intelligently ordering
>serial volumes, but they do the best job of any of the big US
>Gallica also does some collation, though I haven't used their site enough
>to know how good it is compared to HT.
>HT's metadata is open; I think Gallica's might be too, but I haven't
>There may be some useful automated analysis one can do here, such as
>looking for the number of reported volumes in bibliographic records, and
>page counts for volumes when available, to make a good guess about
>whether a particular scan is one volume of a set or a combined scan.
>(While most mass-scanning projects do go volume by volume, some have
>combined volumes, either due to multiple volumes being bound together
>before scanning, or multiple volume scans being combined after the fact.)
>On 05/03/2012 07:16 PM, Lars Aronsson wrote:
>> On 2012-05-04 00:51, Karen Coyle wrote:
>>> The difficulty seems to arise in the process of scanning. For the
>>> purposes of scanning, each physical volume becomes a scanned file.
>> At any serious scale (e.g. Google or Internet Archive), I think
>> book scanning needs to be organized as multiple work stations,
>> each taking their portion of a day's batch of books, meaning
>> that the 10 or 20 volumes of an encyclopedia will be scanned
>> by different people, each generating a job that goes through
>> OCR and postprocessing, so each volume needs its own metadata
>> record.
>> However, with Google I often find volumes 2 and 5 being all that
>> is scanned. And at the Internet Archive I sometimes find everything
>> except volumes 2 and 7 has been scanned. So there is more chaos
>> than necessary.
>> When we're trying to use scanned books for reference and
>> for proofreading the text, we must hunt down individual parts
>> from different sources. The prime example must be the German
>> branch of Wikisource, here trying to find all 143 parts of the
>> Weimar edition (1887-1919) of Goethe's collected works,
>> Now, the structure shown on that wiki page is something that
>> should go into OpenLibrary.org, because it is open (as all
>> of Wikisource is free and open) bibliographic data.