[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Rufus Pollock rufus.pollock at okfn.org
Thu Jun 20 19:34:28 UTC 2013


Hi All,

I'm writing to ask folks to share their recommendations on *workflows and
open tools for extracting (machine-readable) text* from *(bulk) scanned
text* (i.e. OCR etc).

(Note I'm also, like many other folks, interested in extracting structured
data (e.g. tables) from normal or scanned PDFs, but I'm *not* asking about
that in this thread ...)

*Context - or the problems I'm interested in*

I've been interested for a while in making the 1st edition of the Oxford
English Dictionary (now in the public domain) available online in an open
form - here's the entry in the Labs Ideas tracker
<https://github.com/okfn/ideas/issues/50>, which details some of the
history [1].

The key point is that, thanks to Kragen Sitaker's work in 2005/2006, the
whole text got scanned and uploaded to archive.org
<http://archive.org/details/oed01arch>. However, it needed (and AFAIK
still needs) OCR'ing and then proofing.

Back in the day I took a stab at this using Tesseract plus shell scripts
(code now here: https://github.com/okfn/oed) but it wasn't great:

- Tesseract's quality on the non-standard dictionary text was poor
- Chopping up pages (both individual pages and the columns within pages)
needed bespoke automation and was error-prone - see the sketch after this
list
- It wasn't clear what the best way to do proofing was once the OCR was
done (for the work on Open Shakespeare and the Encyclopaedia Britannica we
just used a wiki)
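
For illustration, here's a minimal sketch (in Python, rather than the
original shell scripts) of that column-splitting-plus-OCR step. It assumes
Pillow is installed, the `tesseract` CLI is on the PATH, and evenly spaced
columns; the pages/ and ocr_out/ directory names are placeholders, and
this is *not* how the code in the repo above worked:

#!/usr/bin/env python3
"""Naive column-split + OCR pass over a directory of page scans."""
import subprocess
from pathlib import Path

from PIL import Image  # Pillow

def ocr_page(page_path: Path, out_dir: Path, n_cols: int = 2) -> None:
    img = Image.open(page_path)
    w, h = img.size
    for i in range(n_cols):
        # Equal-width vertical slices; real scans would need gutter
        # detection, since column boundaries drift from page to page.
        box = (i * w // n_cols, 0, (i + 1) * w // n_cols, h)
        col_path = out_dir / f"{page_path.stem}_col{i}.png"
        img.crop(box).save(col_path)
        # `tesseract <image> <outputbase>` writes <outputbase>.txt
        subprocess.run(
            ["tesseract", str(col_path), str(col_path.with_suffix(""))],
            check=True,
        )

if __name__ == "__main__":
    out = Path("ocr_out")
    out.mkdir(exist_ok=True)
    for page in sorted(Path("pages").glob("*.png")):
        ocr_page(page, out)

Even something this simple makes the failure modes obvious: the split
points go wrong whenever a scan is skewed, and Tesseract knows nothing
about the dictionary's typography.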

Things have obviously moved on in the last 5 years, so I was wondering
what the *best tools to use for this today* are (e.g. is Tesseract still
the best open-source option?).

Rufus

PS: if you're also interested in this project please let me know :-)

[1]: https://github.com/okfn/ideas/issues/50

