[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Vitor Baptista vitor at vitorbaptista.com
Fri Jun 21 03:57:07 UTC 2013


Hi Rufus,

I've used OCR for a very different purpose, breaking CAPTCHAs, but it
might be useful to share experiences. I had to do it to get public data
out of the Brazilian Federal Senate website. I tried both gocr and
Tesseract. Untrained, both performed terribly (as expected). After some
basic tweaking of the images, like converting them to monochrome,
removing noise, etc., I got a 14% success rate with gocr and ~12% with
Tesseract. Then I started trying to train them.
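
For reference, this is roughly the kind of preprocessing I mean. A
minimal sketch using Pillow plus the tesseract CLI; the filenames, the
threshold value and the filter size are assumptions you'd tune for your
own image set:

    from PIL import Image, ImageFilter
    import subprocess

    def preprocess(path, out_path, threshold=128):
        """Clean up a scanned image before handing it to an OCR engine."""
        img = Image.open(path).convert("L")            # to grayscale
        img = img.filter(ImageFilter.MedianFilter(3))  # remove speckle noise
        # Binarize: pixels above the threshold become white, the rest black
        img = img.point(lambda p: 255 if p > threshold else 0).convert("1")
        img.save(out_path)

    preprocess("captcha.png", "captcha-clean.tif")
    # Tesseract writes its output to captcha-clean.txt
    subprocess.check_call(["tesseract", "captcha-clean.tif", "captcha-clean"])
    print(open("captcha-clean.txt").read())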

Training gocr is much easier than training Tesseract, so I started
there. Oddly enough, I couldn't get past the 14%. But with Tesseract I
got to 49%, which was good enough for my needs.

FYI, I used jTessBoxEditor to edit the .box files for Tesseract's training.
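
In case it helps, the Tesseract 3.x training loop looks roughly like
this. A sketch only: the "eng.captcha.exp0" naming is a made-up example
of the lang.font.expN convention Tesseract expects, the exact output
filenames vary between 3.x versions, and mftraining needs a
font_properties file you write by hand:

    import os
    import subprocess

    base = "eng.captcha.exp0"  # hypothetical lang.font.expN training image

    def run(*cmd):
        print("$", " ".join(cmd))
        subprocess.check_call(cmd)

    # 1. Generate an initial .box file from the training image, then
    #    hand-correct it (e.g. in jTessBoxEditor) before continuing.
    run("tesseract", base + ".tif", base, "batch.nochop", "makebox")
    input("Hand-correct %s.box in jTessBoxEditor, then press Enter " % base)

    # 2. Produce the .tr feature file and cluster the features
    #    (assumes a hand-written font_properties file in the cwd).
    run("tesseract", base + ".tif", base, "box.train")
    run("unicharset_extractor", base + ".box")
    run("mftraining", "-F", "font_properties", "-U", "unicharset",
        "-O", "eng.unicharset", base + ".tr")
    run("cntraining", base + ".tr")

    # 3. Prefix the outputs with the language code and bundle them
    #    into eng.traineddata.
    for f in ("inttemp", "pffmtable", "normproto", "shapetable"):
        if os.path.exists(f):
            os.rename(f, "eng." + f)
    run("combine_tessdata", "eng.")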

Cheers,

Vítor Baptista

Developer | http://vitorbaptista.com |
LinkedIn <http://www.linkedin.com/in/vitorbaptista> |
@vitorbaptista <http://twitter.com/vitorbaptista>

The Open Knowledge Foundation <http://okfn.org>

*Empowering through Open Knowledge*

http://okfn.org/ | @okfn <http://twitter.com/okfn> |
OKF on Facebook <https://www.facebook.com/OKFNetwork> |
Blog <http://blog.okfn.org/> | Newsletter <http://okfn.org/about/newsletter/>



2013/6/20 Rufus Pollock <rufus.pollock at okfn.org>

> Hi All,
>
> I'm writing to ask folks to share their recommendations on *workflows
> and open tools for extracting (machine-readable) text* from *(bulk)
> scanned text* (i.e. OCR etc.).
>
> (Note I'm also, like many other folks, interested in extracting
> structured data (e.g. tables) from normal or scanned PDFs, but I'm
> *not* asking about that in this thread ...)
>
> *Context - or the problems I'm interested in*
>
> I've been interested for a while in making the 1st edition of the
> Oxford English Dictionary (now in the public domain) available online
> in an open form - here's the entry in the Labs Ideas tracker
> <https://github.com/okfn/ideas/issues/50>, which details some of the
> history [1].
>
> The key point is that, thanks to Kragen Sitaker's work in 2005/2006,
> the whole text got scanned and uploaded to archive.org
> <http://archive.org/details/oed01arch>. However, it needed (and still
> needs, AFAIK) OCR'ing and then proofing.
>
> Back in the day I took a stab at this using Tesseract plus shell
> scripts (code now here: https://github.com/okfn/oed) but it wasn't
> great:
>
> - Tesseract's quality on the non-standard dictionary text was poor
> - Chopping up pages (both individual pages and columns from pages)
> needed bespoke automation and was error-prone (see the sketch after
> this list)
> - It wasn't clear what the best way to do the proofing was once the
> OCR was done (for the Open Shakespeare and Encyclopaedia Britannica
> work we just used a wiki)
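>
> By way of illustration, a minimal sketch of the kind of chopping
> automation I mean - the paths, the three-columns-per-page assumption
> and the fixed equal-width split are illustrative only; real pages
> need proper gutter detection:
>
>     from PIL import Image
>     import glob
>     import subprocess
>
>     def split_columns(path, n_cols):
>         """Naively split a scanned page into equal-width columns and
>         OCR each one. A real pipeline would locate the gutters by
>         scanning for near-white vertical bands instead of cutting at
>         fixed offsets."""
>         page = Image.open(path)
>         w, h = page.size
>         for i in range(n_cols):
>             col = page.crop((i * w // n_cols, 0,
>                              (i + 1) * w // n_cols, h))
>             out_base = "%s-col%d" % (path.rsplit(".", 1)[0], i)
>             col.save(out_base + ".tif")
>             # Tesseract writes its output to <out_base>.txt
>             subprocess.check_call(
>                 ["tesseract", out_base + ".tif", out_base])
>
>     for page_path in sorted(glob.glob("pages/*.png")):
>         split_columns(page_path, n_cols=3)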
>
> Things have obviously moved on in the last 5 years, and I was
> wondering what the *best tools to use for this today* are (e.g. is
> Tesseract still the best open-source option?).
>
> Rufus
>
> PS: if you're also interested in this project please let me know :-)
>
> [1]: https://github.com/okfn/ideas/issues/50
>