[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Tom Morris tfmorris at gmail.com
Fri Jun 21 19:19:07 UTC 2013


Everything uploaded to the Internet Archive gets (or at least should get) OCR'd
automatically.  You can see all the different file formats here:
https://ia600401.us.archive.org/7/items/oed01arch/

The PDF, ePub, and DjVu formats should all contain text in some form or
another.
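If you just want the raw text the Archive's own OCR produced, you can grab it
directly.  Here's a rough sketch in Python (the "<identifier>_djvu.txt" filename
is my assumption based on the Archive's usual naming convention -- check the
item's file listing above if it differs):

    # Fetch the auto-generated plain-text OCR for the oed01arch item.
    import urllib.request

    identifier = "oed01arch"
    url = "https://archive.org/download/%s/%s_djvu.txt" % (identifier, identifier)

    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8", errors="replace")

    print(text[:500])  # peek at the first few hundred characters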

For open source OCR, I think Tesseract leads the pack, with the other main
contender being OCRopus.  As Tim mentioned, you can mix and match components
from the two.  OCRopus used to use Tesseract, but now has its own
recognition engine.
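To give a feel for the basic workflow, here's a minimal sketch of driving the
Tesseract command-line tool from Python on a single page image (it assumes the
`tesseract` binary is installed and on your PATH; "page-0001.png" is just a
placeholder filename):

    # Run Tesseract on one scanned page; it writes page-0001.txt alongside it.
    import subprocess

    page = "page-0001.png"     # placeholder: one scanned page image
    out_base = "page-0001"     # Tesseract appends ".txt" to this

    subprocess.check_call(["tesseract", page, out_base, "-l", "eng"])

    with open(out_base + ".txt") as f:
        print(f.read()[:200])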

For a dictionary, you'll probably want to re-train to get support for IPA
(or whatever phonetic alphabet they use).  Both Tesseract and OCRopus support
training, but, unfortunately, the best recognition mode in Tesseract (Cube)
doesn't yet support training (Google trained all the distributed
models internally and hasn't released the training tools).
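Once you do have a retrained language pack, pointing Tesseract at it is
straightforward.  A sketch (the "oed" language name and the tessdata path are
hypothetical, and exactly where TESSDATA_PREFIX should point varies between
Tesseract versions, so check your install's docs):

    # Use a custom traineddata file (e.g. one retrained to cover IPA characters).
    import os
    import subprocess

    env = dict(os.environ, TESSDATA_PREFIX="/path/to/custom/")  # hypothetical path
    subprocess.check_call(
        ["tesseract", "page-0001.png", "page-0001-ipa", "-l", "oed"],  # "oed" = hypothetical pack
        env=env,
    )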

For proofreading/correction, the two best options that come to mind are
Distributed Proofreaders, which is part of Project Gutenberg, and
Wikisource.

Tom


On Thu, Jun 20, 2013 at 3:34 PM, Rufus Pollock <rufus.pollock at okfn.org>wrote:

> Hi All,
>
> I'm writing to ask folks to share their recommendations on *workflows and
> open tools for extracting (machine-readable) text* from *(bulk) scanned
> text* (i.e. OCR etc.).
>
> (Note I'm also, like many other folks, interested in extracting structured
> data (e.g. tables) from normal or scanned PDF, but I'm *not* asking about
> that in this thread ...)
>
> *Context - or the problems I'm interested in*
>
> I've been interested for a while in making the 1st edition of the Oxford
> English Dictionary (now in the public domain) available online in an open
> form - here's the entry in the Labs Ideas tracker
> <https://github.com/okfn/ideas/issues/50>, which details some of the history [1].
>
> The key point is that, thanks to Kragen Sitaker's work in 2005/2006, the whole
> text got scanned and uploaded to archive.org <http://archive.org/details/oed01arch>.
> However, it needed (and AFAIK still needs) OCR'ing and then proofing.
>
> Back in the day I took a stab at this using Tesseract plus shell scripts
> (code now here: https://github.com/okfn/oed), but it wasn't great:
>
> - Tesseract quality on the non-standard dictionary text was poor
> - Chopping up pages (both individual pages and the columns within them)
> needed bespoke automation and was error-prone
> - It wasn't clear what the best way to do proofing was once the OCR was done
> (for the Open Shakespeare and Encyclopaedia Britannica work we just used a wiki)
>
> Things have obviously moved on in the last 5 years and I was wondering
> *what the best tools are for this today (e.g. is Tesseract still the
> best open-source option)*.
>
> Rufus
>
> PS: if you're also interested in this project please let me know :-)
>
> [1]: https://github.com/okfn/ideas/issues/50
>
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>
>