[open-bibliography] [ol-discuss] Multivolume works
John Mark Ockerbloom
ockerblo at pobox.upenn.edu
Fri May 4 12:02:42 UTC 2012
Volume collation can be done; HathiTrust does a fairly decent job
of it for the volumes they collect. (They don't always have complete
sets, and they could do a better job at intelligently ordering
serial volumes, but they do the best job of any of the big US collections.)
Gallica also does some collation, though I haven't used their site enough
to know how good it is compared to HT.
HT's metadata is open; I think Gallica's might be too, but I haven't checked.
There may be some useful automated analysis one can do here, such as
looking for the number of reported volumes in bibliographic records, and
page counts for volumes when available, to make a good guess about
whether a particular scan is one volume of a set or a combined scan.
(While most mass-scanning projects do go volume by volume, some have
combined volumes, either due to multiple volumes being bound together
before scanning, or multiple volume scans being combined after the fact.)
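A rough sketch of that kind of guess, in Python. Everything here is an assumption for illustration: the function name, the inputs (a volume count taken from the bibliographic record, the page count of the scan, and page counts of other scans attached to the same record), and especially the 1.6 ratio, which is an arbitrary threshold and not something any of these collections uses.

```python
from statistics import median


def guess_scan_kind(reported_volumes, scan_pages, sibling_pages=(), ratio=1.6):
    """Guess whether a scan is one volume of a set or a combined scan.

    reported_volumes: volume count from the bibliographic record (hypothetical field)
    scan_pages:       page count of the scan in question
    sibling_pages:    page counts of other scans under the same record, if known
    ratio:            assumed cutoff; a scan this many times longer than the
                      typical sibling is treated as several volumes combined
    """
    # Without both basic numbers there is nothing to compare.
    if reported_volumes is None or scan_pages is None:
        return "unknown"
    # The record describes a one-volume work, so the scan is the whole thing.
    if reported_volumes <= 1:
        return "complete work"
    # A multivolume record with no sibling page counts gives no baseline.
    if not sibling_pages:
        return "unknown"
    # Compare against the typical (median) length of the other scans.
    typical = median(sibling_pages)
    if scan_pages >= ratio * typical:
        return "combined scan"
    return "single volume"
```

For example, guess_scan_kind(3, 1230, [395, 420, 405]) flags the 1230-page scan as a likely combined scan, while a 410-page scan with the same siblings comes back as a single volume. The median is used so that one unusually long or short sibling does not skew the baseline; results would still need a human check.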
John
On 05/03/2012 07:16 PM, Lars Aronsson wrote:
> On 2012-05-04 00:51, Karen Coyle wrote:
>> The difficulty seems to arise in the process of scanning. For the
>> purposes of scanning, each physical volume becomes a scanned file.
>
> At any serious scale (e.g. Google or Internet Archive), I think
> book scanning needs to be organized as multiple work stations,
> each taking their portion of a day's batch of books, meaning
> that the 10 or 20 volumes of an encyclopedia will be scanned
> by different people, each generating a job that goes through
> OCR and postprocessing, so each volume needs its own metadata
> record.
>
> However, with Google I often find volumes 2 and 5 being all that
> is scanned. And at the Internet Archive I sometimes find everything
> except volumes 2 and 7 has been scanned. So there is more chaos
> than necessary.
>
> When we're trying to use scanned books for reference and
> for proofreading the text, we must hunt down individual parts
> from different sources. The prime example must be the German
> branch of Wikisource, here trying to find all 143 parts of the
> Weimar edition (1887-1919) of Goethe's collected works,
> http://de.wikisource.org/wiki/Goethe#Sophien-_oder_Weimarer_Ausgabe_.28WA.29
>
> Now, the structure shown on that wiki page is something that
> should go into OpenLibrary.org, because it is open (as all
> of Wikisource is free and open) bibliographic data.
>