[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Sun Aug 25 03:34:05 UTC 2013

On 24/08/13 03:28 AM, Jonathan Gray wrote:
> 
> What do we need to move forward with an Open OED project [1]? It would
> be really cool if there were any way to break the dictionary down into
> entries that people could help to proofread and correct. Any thoughts on
> that front? Anyone else interested in helping?

When I took a look at this before, what struck me was the range of
characters used in the dictionary, including both phonetic and
typographic symbols.

On the OCR front, this would require training Tesseract to recognise them.

On the crowdsourced editing front, I think an on-screen keyboard with
those characters would really help input.
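
As a rough sketch of how one might work out which symbols matter for both
the Tesseract training and the on-screen keyboard, the snippet below counts
the non-ASCII characters in a piece of text. It assumes a hand-corrected
sample page is available as a Unicode string (raw OCR output would have
mangled exactly the characters of interest), and the function name
special_characters is made up for illustration.

```python
import unicodedata
from collections import Counter


def special_characters(text):
    """Return (character, Unicode name, count) for each non-ASCII
    character in text, most frequent first."""
    counts = Counter(
        ch for ch in text
        if ord(ch) > 127 and not ch.isspace()
    )
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    # Hypothetical sample; a real run would read a corrected page from disk.
    sample = "na\u00efve \u0259 \u0259"
    for ch, name, n in special_characters(sample):
        print(f"{n:4d}  U+{ord(ch):04X}  {name}")
```

The most frequent entries would be the first candidates for keys on the
keyboard and for inclusion in any training data.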

I've just spent a couple of days correcting the OCR of (just) the term
names from an old art dictionary OCR-ed on archive.org:

https://gitorious.org/robmyers/art-word-lists/trees/master/fairholt-dictionary

I had to change so much that I was left wondering whether the OED's own
digitization technique (get multiple people to transcribe the print copy
by hand, then compare the transcriptions for errors) wouldn't be a better
approach for documents where OCR is going to be anything less than
nearly perfect.
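
The comparison step of that approach can be sketched with Python's
standard difflib module. This is only an illustration of the general idea,
not the OED's actual process: it assumes two independent transcriptions of
the same passage are available as plain strings, and the function name
disagreements is made up for the example.

```python
import difflib


def disagreements(transcript_a, transcript_b):
    """Return (words_in_a, words_in_b) pairs for every place where two
    transcriptions of the same passage differ, compared word by word."""
    words_a = transcript_a.split()
    words_b = transcript_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, inserted or deleted runs
    ]


if __name__ == "__main__":
    # Hypothetical transcriptions by two different volunteers.
    first = "the quick brown fox"
    second = "the quick brovvn fox"
    for a, b in disagreements(first, second):
        print(f"A: {a!r}  B: {b!r}")
```

Each reported pair is a spot where at least one transcriber slipped, so
only those spots need checking against the page image.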

- Rob.