[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Tom Morris tfmorris at gmail.com
Fri Jun 21 20:45:07 UTC 2013


pps.  Below is the text for a few entries as OCR'd by the default Internet
Archive OCR (an old version of ABBYY FineReader).  This was extracted from
the ePub file, but if you wanted to work with the Internet Archive version
of the OCR, you'd want to start with the ABBYY version because it contains
more info (and perhaps convert it to hOCR as described in Rod Page's post
http://iphylo.blogspot.com/2011/07/correcting-ocr-using-hocr-firefox.html).
 To see the original page image look at column 3 here:
http://archive.org/stream/oed01arch#page/4/mode/1up

The basic quality isn't terrible, but there's lots of information encoded
in symbols, font renderings (e.g. italics, bold), layout, etc. that you'd
want to try and capture depending on what your goal is.  Fully extracting
the rich semantics would be a lot of work (but make the end result much
more valuable).
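
As a rough illustration of why the ABBYY output is the better starting
point, here is a minimal Python sketch that pulls bold/italic runs out of
ABBYY-style XML.  The element and attribute names (line, formatting,
charParams, bold, italic) and the namespace URI are my assumptions about
the format, and the sample is a hand-made fragment rather than real
Internet Archive data, so check them against the actual file first.

```python
import xml.etree.ElementTree as ET


def chars(s):
    # Hypothetical helper: wrap each character in its own element, mimicking
    # the one-character-per-element layout assumed for ABBYY output.
    return "".join(f"<charParams>{c}</charParams>" for c in s)


# Hand-made fragment; element/attribute names and namespace are assumptions.
SAMPLE = (
    '<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">'
    "<page><block><text><par><line>"
    f'<formatting bold="true">{chars("Aald")}</formatting>'
    f'<formatting>{chars(", ")}</formatting>'
    f'<formatting italic="true">{chars("obs.")}</formatting>'
    f'<formatting>{chars(" form of OLD.")}</formatting>'
    "</line></par></text></block></page></document>"
)


def local(tag):
    # Drop any "{namespace}" prefix so the code does not depend on the URI.
    return tag.rsplit("}", 1)[-1]


def styled_runs(xml_text):
    """Return one list per line, each holding (style, text) runs."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iter():
        if local(line.tag) != "line":
            continue
        runs = []
        for fmt in line:
            if local(fmt.tag) != "formatting":
                continue
            text = "".join(
                c.text or "" for c in fmt if local(c.tag) == "charParams"
            )
            if fmt.get("bold") == "true":
                style = "bold"
            elif fmt.get("italic") == "true":
                style = "italic"
            else:
                style = "plain"
            runs.append((style, text))
        lines.append(runs)
    return lines


for run in styled_runs(SAMPLE)[0]:
    print(run)
```

In a dictionary the bold run is a good candidate for the headword and the
italic runs for part-of-speech and usage labels, which is exactly the
information the plain ePub text has thrown away.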

Aa, variant of A, adv. Obs., ever.

Aac, aak, aakin ; obs. forms of OAK, OAKEN.

II Aal (al) [the Bengali and Hind, name.] A plant, a species of Morinda
allied to the madder, the roots of which yield a red dye ; also the dye
itself, used in India to colour cotton fabrics.

1875 URE Diet. Arts I. I Has obtained from the aal root a pale yellow
substance which he calls morindin.

Aald, obs. form of OLD.

II Aam (am, gm). Forms: 5-7 alm(e; 7 awme, aume; 7-8 ame, awm, aum. [Du.
aam (pi. amen) ; cogn. w. mod. G. ahm, ohm ; MHG. ante, dine; OHG. ama, 6ma
a cask; ON. a ma a tub; a. L. ama, hama ; ad. Gr. <J/«) a water-bucket. Aam
is the mod. Du. spelling, the Eng. forms being only historical.] A Dutch
and German liquid measure, formerly used in England for Rhenish wine ; a
cask. It varied in different continental cities from 37 to 41 gallons.

1526 Ord. for Royal Hotiseh. Henry VIII, 195 Renish wine 4 fatts, every fat
containing 3 Almez, at 30$. the Alme. 1604 Act I James I, c. xxxii granting
Tonnage and Poundage) Of euery Awme of Rhenish Wine, that is, or shall so
come in, twelue pence. 1696 PHILLIPS, Auln or Aum of Renish Wine, a measure
containing 40 Gallons, and as many pints over and above. 1717 BLOUNT Law
Diet., I find in a very old printed Book thus :—The Rood of Rhenish-wine of
Dor-dreight is ten Awames, and every Awame is fifty Gallons; item the Rood
of Antwarp is xliij Awames, and every Awame is xxxv Gallons. 1721 BAILEY,
Avlne Of Rhenish Wine, a Vessel that contains 40Gallons. 1731 Ibid. vol. II
Aine (of Antwerp) a vessel containing 50 stoops, each stoop 7 pints English
measure.

Aan, -e, obs. forms of ON, and ONE.

Aane, obs. form of AWN.

Aar, obs. northern form of ERE.

II Aard-vark (audvajk). [Adopted from the Dutch Colonists in South Africa,
who have so named it from Du. aarde, in comp. aard- earth + vark=(X..
fearh, OHG. farh, L. forc-us pig.] A South-African quadruped (OrycterSpus
capensis Cuv.1, about the size of the badger, belonging to the
insectivorous division of the Edentata, where it occupies an intermediate
position between the Armadillos and Ant-eaters.

1833 Penny Cyc. I. 3 The aard-vark is in all respects ad-mirably fitted for
the station which Nature has assigned to it. 1834 PRINGLE African Sketches
iv. 176 Such ant-hills as have been broken up and plundered by the
aard-vark, or ant-eater. 1847 CARPENTER Zoology 281 The Aard-vark .. forms
very extensive burrows at a little distance beneath the surface of the
ground, which are sometimes so numer. ous, as to become sources of danger
to horses and waggons traversing the country.

II Aard-wolf (audwulf). [a. Du. aard-wolf, applied to this animal in S.
Africa, f. aarde earth + WOLF.] A South-African carnivorous quadruped
(Protcles Lalandii St. Hil.), of the size of a fox, occupying an
intermediate position between the dogs, hyenas, and civets.

1833 Penny Cyc. I. 4 The genus Proteles contains but a single species, the
Aard-wolf or earth-wolf, so called by the European colonists in the
neighbourhood of Algoa Bay in South Africa. 1847 CARPENTER Zool. 198 The
Aard-wolf (earth-wolf) is evidently the connecting link between the Hyznas
and the Civets.

AARON.

Aaron 1 (e'»ran). Proper name of the patriarch of the Jewish priesthood ;
hence used of a leader of the church. (Rare, and perh. only in loc. cit.)
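
To give a feel for what a first pass at structure extraction from text
like the above might look like, here is a small Python sketch that splits
it into entries.  The heuristics are my own guesses from this one sample:
a paragraph starting with a capitalized word opens an entry, a paragraph
starting with a four-digit year is a block of quotations, and a leading
"II" looks like the OCR's rendering of the dictionary's parallel-bars mark
on some headwords (worth checking against the page image).  The field
names are made up for illustration.

```python
import re

# Shortened excerpt in the same shape as the OCR text above.
SAMPLE = """\
Aald, obs. form of OLD.

II Aam (am, gm). Forms: 5-7 alm(e; 7 awme, aume. A Dutch
and German liquid measure.

1526 Ord. for Royal Hotiseh. Henry VIII, 195 Renish wine 4 fatts.

Aane, obs. form of AWN.
"""

# Optional "II " marker, then a capitalized (possibly hyphenated) headword.
ENTRY_START = re.compile(
    r"^(?P<marked>II )?(?P<headword>[A-Z][a-z]+(?:-[a-z]+)*)\b"
)
# Quotation paragraphs begin with a four-digit year.
QUOTE_START = re.compile(r"^\d{4} ")


def split_entries(text):
    """Group blank-line-separated paragraphs into entry dicts."""
    entries = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # undo the hard line wrapping
        match = ENTRY_START.match(para)
        if QUOTE_START.match(para) and entries:
            entries[-1]["quotation_blocks"].append(para)
        elif match:
            entries.append({
                "headword": match.group("headword"),
                "marked": bool(match.group("marked")),
                "body": para[match.end():].lstrip(" ,"),
                "quotation_blocks": [],
            })
        elif entries:
            # Anything unrecognized is kept with the previous entry.
            entries[-1]["body"] += " " + para
    return entries


for entry in split_entries(SAMPLE):
    print(entry["headword"], entry["marked"], len(entry["quotation_blocks"]))
```

This only gets you headword boundaries; etymologies, pronunciations and
individual dated quotations would each need their own handling, and lines
like the all-caps "AARON." running head fall through to the catch-all
branch.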


On Fri, Jun 21, 2013 at 3:22 PM, Tom Morris <tfmorris at gmail.com> wrote:

> p.s.  The other big difference between OCRopus and Tesseract is the size
> and activity of the development community.  The primary developers for
> both are pretty unresponsive, but Tesseract has a much more active
> community of people who use it, train it for different languages & scripts
> (Ancient Greek, Fraktur, etc), write tools to make training easier, etc.
>
> Tom
>
>
> On Fri, Jun 21, 2013 at 3:19 PM, Tom Morris <tfmorris at gmail.com> wrote:
>
>> Everything that gets uploaded to the Internet Archive (should) get OCR'd
>> automatically.  You can see all the different file formats here:
>> https://ia600401.us.archive.org/7/items/oed01arch/
>>
>> The PDF, ePUB, and DjVu formats should all have text in some form or
>> another.
>>
>> For open source OCR, I think Tesseract leads the bunch with the other
>> main contender being OCRopus.  As Tim mentioned, you can mix and match
>> components from the two.  OCRopus used to use Tesseract, but now has its own
>> recognition engine.
>>
>> For a dictionary, you'll probably want to re-train to get support for IPA
>> (or whatever phonetic alphabet they use).  Both Tess & OCRopus support
>> training, but, unfortunately, the best recognition mode in Tess (Cube)
>> doesn't yet have support for training (Google trained all the distributed
>> models themselves internally, but hasn't released the training tools).
>>
>> For proofreading/correction, the two best options which come to mind are
>> the Distributed Proofreaders project, which is part of Project Gutenberg,
>> and Wikisource.
>>
>> Tom
>>
>>
>> On Thu, Jun 20, 2013 at 3:34 PM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
>>
>>> Hi All,
>>>
>>> I'm writing to ask folks to share their recommendations on *workflows
>>> and open tools for extracting (machine-readable) text* from *(bulk)
>>> scanned text* (i.e. OCR etc).
>>>
>>> (Note I'm also, like many other folks, interested in extracting
>>> structured data (e.g. tables) from normal or scanned PDF but I'm *not* asking
>>> about that in this thread ...)
>>>
>>>  *Context - or the problems I'm interested in*
>>>
>>> I've been interested in making the 1st edition of the Oxford English
>>> Dictionary (now in public domain) available online in an open form for a
>>> while - here's the entry in the Labs Ideas tracker<https://github.com/okfn/ideas/issues/50> about
>>> this which details some of the history [1].
>>>
>>> Key point is that thanks to Kragen Sitaker's work in 2005/2006 the whole
>>> text got scanned and is uploaded to archive.org<http://archive.org/details/oed01arch>.
>>> However, it needed (and needs AFAIK) OCR'ing and then proofing.
>>>
>>> Back in the day I took a stab at this using tesseract plus shell scripts
>>> (code now here https://github.com/okfn/oed) but it wasn't great:
>>>
>>> - Tesseract quality on non-standard dictionary text was poor
>>> - Chopping up pages (both individual and columns from pages) needed
>>> bespoke automation and was error-prone
>>> - Not clear what best way was to do proofing once done (for the work for
>>> Open Shakespeare and Encyclopaedia Britannica we just used a wiki)
>>>
>>> Things have obviously moved on in the last 5 years and I was wondering
>>> what are the *best tools to use for this today (e.g. is tesseract still
>>> the best open-source option)?*
>>>
>>> Rufus
>>>
>>> PS: if you're also interested in this project please let me know :-)
>>>
>>> [1]: https://github.com/okfn/ideas/issues/50
>>>
>>> _______________________________________________
>>> okfn-labs mailing list
>>> okfn-labs at lists.okfn.org
>>> http://lists.okfn.org/mailman/listinfo/okfn-labs
>>> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>>>
>>>
>>
>