[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Fri Jun 21 04:45:37 UTC 2013

Hi Todd,

I don't know. I've used it once or twice, but that was all. I suspect the
free software devs' efforts are focused on Tesseract now (simplifying the
training process would be great).

Cheers,

Vítor Baptista

Developer  |  http://vitorbaptista.com |
LinkedIn<http://www.linkedin.com/in/vitorbaptista>|
@vitorbaptista <http://twitter.com/vitorbaptista>

The Open Knowledge Foundation <http://okfn.org>

*Empowering through Open Knowledge*

http://okfn.org/  |  @okfn <http://twitter.com/okfn>  |  OKF on
Facebook<https://www.facebook.com/OKFNetwork> |
Blog <http://blog.okfn.org/>  |  Newsletter<http://okfn.org/about/newsletter/>

2013/6/21 todd.d.robbins at gmail.com <todd.d.robbins at gmail.com>

> Vitor,
>
> Is gocr still being actively developed? The mailing list archive seems to
> end in 2011. I'd love to get involved in the gocr developer community as I
> play around with it.
>
> Thanks!
>
>
>
> On Thu, Jun 20, 2013 at 9:57 PM, Vitor Baptista <vitor at vitorbaptista.com> wrote:
>
>> Hi Rufus,
>>
>> I've used OCR for a very different purpose, breaking CAPTCHAs, but it
>> might be useful to share experiences. I had to do it to get public data
>> out of the Brazilian Federal Senate website. I tried both gocr and
>> Tesseract. Untrained, both performed terribly (as expected). After I did
>> some basic tweaking on the images, like converting to monochrome, removing
>> noise, etc., I got 14% success with gocr, and ~12% with Tesseract. Then I
>> started trying to train them.
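
The clean-up steps mentioned above (converting to monochrome, removing noise) can be sketched roughly as follows. This is only an illustration, not the code used for the Senate CAPTCHAs: the function names, the fixed threshold of 128 and the "no ink neighbour" noise rule are all assumptions, and the image is taken to be a plain list of rows of 0-255 grayscale values.

```python
def to_monochrome(gray, threshold=128):
    """Map 0-255 grayscale values to 1 (ink) or 0 (background).

    The fixed threshold is an assumption; real images may need a
    per-image or adaptive threshold.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def remove_specks(mono):
    """Clear ink pixels that have no ink pixel among their 8 neighbours."""
    height = len(mono)
    width = len(mono[0]) if height else 0
    out = [row[:] for row in mono]  # copy so the input is left untouched
    for y in range(height):
        for x in range(width):
            if not mono[y][x]:
                continue
            has_neighbour = any(
                mono[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                out[y][x] = 0
    return out


# Tiny example: one isolated dark pixel and one two-pixel stroke.
gray = [
    [255, 255, 255, 255, 255],
    [255,   0, 255, 255, 255],
    [255, 255, 255,  10,  20],
    [255, 255, 255, 255, 255],
]
cleaned = remove_specks(to_monochrome(gray))
```

Here the isolated pixel is dropped and the two-pixel stroke survives. The cleaned grid would then be written back to an image file before being handed to gocr or Tesseract.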
>>
>> Training gocr is much easier than Tesseract, so I started there. Oddly
>> enough, I couldn't move past the 14%. But, with Tesseract, I got to 49%.
>> Good enough for my needs.
>>
>> FYI, I've used jTessBoxEditor to change the .box files for Tesseract.
>>
>> Cheers,
>>
>> Vítor Baptista
>>
>> 2013/6/20 Rufus Pollock <rufus.pollock at okfn.org>
>>
>>>  Hi All,
>>>
>>> I'm writing to ask folks to share their recommendations on *workflows
>>> and open tools for extracting (machine-readable) text from (bulk)
>>> scanned text* (i.e. OCR etc).
>>>
>>> (Note I'm also, like many other folks, interested in extracting
>>> structured data (e.g. tables) from normal or scanned PDF but I'm *not*
>>> asking about that in this thread ...)
>>>
>>>  *Context - or the problems I'm interested in*
>>>
>>> I've been interested in making the 1st edition of the Oxford English
>>> Dictionary (now in the public domain) available online in an open form
>>> for a while - here's the entry in the Labs Ideas tracker
>>> <https://github.com/okfn/ideas/issues/50> about this, which details
>>> some of the history [1].
>>>
>>> Key point is that thanks to Kragen Sitaker's work in 2005/2006 the whole
>>> text was scanned and uploaded to archive.org
>>> <http://archive.org/details/oed01arch>.
>>> However, it needed (and needs AFAIK) OCR'ing and then proofing.
>>>
>>> Back in the day I took a stab at this using tesseract plus shell scripts
>>> (code now here https://github.com/okfn/oed) but it wasn't great:
>>>
>>> - Tesseract quality on non-standard dictionary text was poor
>>> - Chopping up pages (both individual and columns from pages) needed
>>> bespoke automation and was error-prone
>>> - It was not clear what the best way to do proofing was once the OCR
>>> was done (for the work for Open Shakespeare and Encyclopaedia
>>> Britannica we just used a wiki)
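
One common way to automate the column chopping mentioned in the list above is a vertical projection profile: count the ink in each pixel column and split wherever a wide enough blank gutter appears. The sketch below is not taken from the okfn/oed scripts; the names `column_ranges` and `crop`, the `min_gap` parameter and the list-of-rows page representation (1 = ink, 0 = background) are assumptions made for illustration.

```python
def column_ranges(mono, min_gap=2):
    """Return (start, end) x-ranges of text columns, end exclusive.

    A run of at least `min_gap` ink-free pixel columns is treated as a
    gutter between text columns; narrower gaps stay inside a column.
    """
    if not mono:
        return []
    width = len(mono[0])
    # Vertical projection: amount of ink in each pixel column.
    ink = [sum(row[x] for row in mono) for x in range(width)]
    ranges = []
    start = None  # x where the current text column began
    gap = 0       # length of the current blank run inside it
    for x in range(width):
        if ink[x] > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                ranges.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        ranges.append((start, width - gap))
    return ranges


def crop(mono, start, end):
    """Cut the x-range [start, end) out of every row."""
    return [row[start:end] for row in mono]


# Toy page: two text columns separated by a three-pixel gutter.
page = [[0, 1, 1, 0, 0, 0, 1, 1, 1, 0]] * 4
columns = [crop(page, s, e) for s, e in column_ranges(page)]
```

On real scans this only works after deskewing and noise removal, since a slanted page or stray specks fill in the gutter, which is part of why the step tends to be error-prone.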
>>>
>>> Things have obviously moved on in the last 5 years and I was wondering
>>> what the *best tools to use for this today* are (e.g. is Tesseract
>>> still the best open-source option?).
>>>
>>> Rufus
>>>
>>> PS: if you're also interested in this project please let me know :-)
>>>
>>> [1]: https://github.com/okfn/ideas/issues/50
>>>
>>> _______________________________________________
>>> okfn-labs mailing list
>>> okfn-labs at lists.okfn.org
>>> http://lists.okfn.org/mailman/listinfo/okfn-labs
>>> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>>>
>>>
>
>
> --
> Tod Robbins
> Digital Collections Librarian, MLIS
> todrobbins.com | @todrobbins <http://www.twitter.com/#!/todrobbins>
>