[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Mon Jun 24 11:36:58 UTC 2013

On 21 June 2013 21:45, Tom Morris <tfmorris at gmail.com> wrote:

This is fantastic Tom - really useful, thanks so much!

> pps.  Below is the text for a few entries as OCR'd by the default Internet Archive OCR (an old version of ABBYY FineReader).  This was extracted from the ePub file, but if you wanted to work with the Internet Archive version of the OCR, you'd want to start with the ABBYY version because it contains more info (and perhaps convert it to hOCR as described in Rod Page's post http://iphylo.blogspot.com/2011/07/correcting-ocr-using-hocr-firefox.html).  To see the original page image, look at column 3 here: http://archive.org/stream/oed01arch#page/4/mode/1up

Is there a way to get the ABBYY version directly from the Archive online,
or would one need to ask them specially?

Worst case, I guess we automate the extraction from the ePub and run with that.
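
For what it's worth, here is a rough sketch of what that fallback could look like. It relies only on the fact that an ePub is a zip archive of (X)HTML files; the function name epub_to_text and the file name in the usage note are made up for illustration, and I have not checked how the Archive's ePubs are laid out internally, so treat it as a starting point rather than a working pipeline.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores all markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags. Note that this also
        # keeps text inside <title>, <style> and <script> elements.
        self.chunks.append(data)


def epub_to_text(epub_path):
    """Return {member_name: plain_text} for each (X)HTML file in an ePub.

    Hypothetical helper. Members are visited in sorted name order, which
    is an assumption: the true reading order is defined by the ePub's
    package document, which this sketch does not parse.
    """
    pages = {}
    with zipfile.ZipFile(epub_path) as zf:
        for name in sorted(zf.namelist()):
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            parser = _TextCollector()
            parser.feed(zf.read(name).decode("utf-8", errors="replace"))
            parser.close()
            # Collapse runs of whitespace so each file becomes one clean string.
            pages[name] = " ".join("".join(parser.chunks).split())
    return pages
```

Usage would be something like pages = epub_to_text("oed01arch.epub"), where the file name is a guess at what the download is called. This throws away italics, bold and layout, so it only gets us the raw text.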

> The basic quality isn't terrible, but there's lots of information encoded in symbols, font renderings (e.g. italics, bold), layout, etc. that you'd want to try to capture, depending on what your goal is.  Fully extracting the rich semantics would be a lot of work (but would make the end result much more valuable).

I agree. I think a fully faithful conversion would be the ultimate goal.

[...]

Rufus