[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Fri Jun 21 04:24:24 UTC 2013

Vitor,

Is gocr still being actively developed? The mailing list archive seems to
end in 2011. I'd love to get involved in the gocr developer community as I
play around with it.

Thanks!

On Thu, Jun 20, 2013 at 9:57 PM, Vitor Baptista <vitor at vitorbaptista.com> wrote:

> Hi Rufus,
>
> I've used OCR for a very different purpose, breaking CAPTCHAs, but it
> might be useful to share experiences. I've had to do it to get public data
> out of the Brazilian Federal Senate website. I tried both gocr and
> Tesseract. Untrained, both performed terribly (as expected). After I did
> some basic tweaking on the images, like converting to monochrome, removing
> noise, etc., I got 14% success with gocr, and ~12% with Tesseract. Then I
> started trying to train them.
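For anyone wanting to try the kind of image tweaking described above, here is a minimal sketch in plain Python. It is not Vitor's actual code: the function names, the threshold of 128 and the "no inked neighbours" noise rule are all assumptions made for illustration, and a real pipeline would more likely use an imaging library.

```python
from typing import List

# A grayscale image as rows of pixel values, 0 (black) to 255 (white).
Image = List[List[int]]


def to_monochrome(gray: Image, threshold: int = 128) -> Image:
    """Map each pixel to 1 (ink) if darker than the threshold, else 0.

    The default threshold of 128 is an assumed value, not a tuned one.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def remove_isolated_pixels(mono: Image) -> Image:
    """Clear ink pixels that have no ink among their 8 neighbours.

    This is one simple (assumed) notion of "noise": single stray dots.
    """
    height = len(mono)
    width = len(mono[0]) if height else 0
    cleaned = [row[:] for row in mono]  # copy so the input is untouched
    for y in range(height):
        for x in range(width):
            if mono[y][x] == 0:
                continue
            has_neighbour = any(
                mono[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned
```

Calling remove_isolated_pixels(to_monochrome(pixels)) gives a cleaned 0/1 grid that could then be written out as an image and handed to gocr or Tesseract.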
>
> Training gocr is much easier than Tesseract, so I started there. Oddly
> enough, I couldn't move past the 14%. But, with Tesseract, I got to 49%.
> Good enough for my needs.
>
> FYI, I've used jTessBoxEditor to change the .box files for Tesseract.
>
> Cheers,
>
> Vítor Baptista
>
> Developer | http://vitorbaptista.com | LinkedIn <http://www.linkedin.com/in/vitorbaptista> |
> @vitorbaptista <http://twitter.com/vitorbaptista>
>
> The Open Knowledge Foundation <http://okfn.org>
>
> *Empowering through Open Knowledge*
>
> http://okfn.org/ | @okfn <http://twitter.com/okfn> | OKF on Facebook <https://www.facebook.com/OKFNetwork> |
> Blog <http://blog.okfn.org/> | Newsletter <http://okfn.org/about/newsletter/>
>
>
>
> 2013/6/20 Rufus Pollock <rufus.pollock at okfn.org>
>
>> Hi All,
>>
>> I'm writing to ask folks to share their recommendations on *workflows
>> and open tools for extracting (machine-readable) text* from *(bulk)
>> scanned text* (i.e. OCR etc).
>>
>> (Note I'm also, like many other folks, interested in extracting
>> structured data (e.g. tables) from normal or scanned PDFs but I'm *not*
>> asking about that in this thread ...)
>>
>> *Context - or the problems I'm interested in*
>>
>> I've been interested in making the 1st edition of the Oxford English
>> Dictionary (now in the public domain) available online in an open form
>> for a while - here's the entry in the Labs Ideas tracker
>> <https://github.com/okfn/ideas/issues/50> about this, which details some
>> of the history [1].
>>
>> The key point is that, thanks to Kragen Sitaker's work in 2005/2006, the
>> whole text got scanned and is uploaded to archive.org
>> <http://archive.org/details/oed01arch>.
>> However, it needed (and AFAIK still needs) OCR'ing and then proofing.
>>
>> Back in the day I took a stab at this using tesseract plus shell scripts
>> (code now here https://github.com/okfn/oed) but it wasn't great:
>>
>> - Tesseract quality on non-standard dictionary text was poor
>> - Chopping up pages (both individual pages and columns from pages) needed
>> bespoke automation and was error-prone
>> - It was not clear what the best way to do proofing was once the OCR was
>> done (for the work for Open Shakespeare and Encyclopaedia Britannica we
>> just used a wiki)
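On the column-chopping point in the list above, one common approach is a vertical projection profile: count the ink in each pixel column and treat runs of empty columns as gutters. The sketch below is not the code in the okfn/oed repository; the function name, the 0/1 grid input and the min_gap parameter are assumptions for illustration, and real scans with skew or printed column rules would need extra handling.

```python
from typing import List, Tuple


def find_column_bounds(mono: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of text columns, with end exclusive.

    `mono` is a grid of 0 (background) and 1 (ink). A run of at least
    `min_gap` ink-free pixel columns is treated as a gutter; shorter blank
    runs are kept inside the current text column.
    """
    width = len(mono[0]) if mono else 0
    # Vertical projection: how much ink each pixel column contains.
    ink_per_x = [sum(row[x] for row in mono) for x in range(width)]

    bounds = []
    start = None  # x where the current text column began, if any
    gap = 0       # length of the current run of blank pixel columns
    for x, ink in enumerate(ink_per_x):
        if ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the text column at the first blank x of this gap.
                bounds.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Text ran to the right edge, minus any short trailing blank run.
        bounds.append((start, width - gap))
    return bounds
```

Each (start, end) pair can then be used to crop one column out of every row before passing it to the OCR engine.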
>>
>> Things have obviously moved on in the last 5 years and I was wondering
>> what the *best tools to use for this are today (e.g. is tesseract still
>> the best open-source option?).*
>>
>> Rufus
>>
>> PS: if you're also interested in this project please let me know :-)
>>
>> [1]: https://github.com/okfn/ideas/issues/50
>>
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>>
>>
>

-- 
Tod Robbins
Digital Collections Librarian, MLIS
todrobbins.com | @todrobbins <http://www.twitter.com/#!/todrobbins>