[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Rufus Pollock rufus.pollock at okfn.org
Fri Jun 21 15:49:20 UTC 2013


[cc'ing back on list]

@Tim (and others): what's the difference between ocropus and tesseract? My
understanding was ocropus uses tesseract?

Has anyone used both, and do you have any experiences to report?

Rufus


On 20 June 2013 21:33, Tim McNamara <paperless at timmcnamara.co.nz> wrote:

> Tesseract can do better if you give it sections which have already been
> segmented into paragraphs by Ocropus. I forget the correct terminology.
> Ocropus is pretty good at finding columns, paragraphs & word boundaries
> etc. You can then experiment with either tesseract or ocropus (or insert
> any other tool here) on these much smaller pieces. Recombining things is a
> pain, but isn't too hard IIRC as the location data is stored in the file
> name.
>
> On 21 June 2013 07:34, Rufus Pollock <rufus.pollock at okfn.org> wrote:
>
>> Hi All,
>>
>> I'm writing to ask folks to share their recommendations on *workflows
>> and open tools for extracting (machine-readable) text* from *(bulk)
>> scanned text* (i.e. OCR etc).
>>
>> (Note I'm also, like many other folks, interested in extracting
>> structured data (e.g. tables) from normal or scanned PDF but I'm *not*
>> asking about that in this thread ...)
>>
>> *Context - or the problems I'm interested in*
>>
>> I've been interested in making the 1st edition of the Oxford English
>> Dictionary (now in public domain) available online in an open form for a
>> while - here's the entry in the Labs Ideas tracker
>> <https://github.com/okfn/ideas/issues/50> about this, which details some
>> of the history [1].
>>
>> Key point is that thanks to Kragen Sitaker's work in 2005/2006 the whole
>> text got scanned and is uploaded to archive.org
>> <http://archive.org/details/oed01arch>.
>> However, it needed (and needs AFAIK) OCR'ing and then proofing.
>>
>> Back in the day I took a stab at this using tesseract plus shell scripts
>> (code now here https://github.com/okfn/oed) but it wasn't great:
>>
>> - Tesseract quality on non-standard dictionary text was poor
>> - Chopping up pages (both individual and columns from pages) needed
>> bespoke automation and was error-prone
>> - Not clear what the best way to do proofing was once the OCR was done
>> (for the work for Open Shakespeare and Encyclopaedia Britannica we just
>> used a wiki)
>>
>> Things have obviously moved on in the last 5 years and I was wondering
>> what the *best tools to use for this today* are (e.g. is tesseract still
>> the best open-source option?).
>>
>> Rufus
>>
>> PS: if you're also interested in this project please let me know :-)
>>
>> [1]: https://github.com/okfn/ideas/issues/50
>>
>

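For anyone trying the segment-then-recombine approach Tim describes above, here is a minimal Python sketch of the recombination step only. It assumes each segment's OCR output has been saved under a name that encodes page, column and line. The naming scheme (`page0012_col02_line0007.txt`) and the helpers `segment_key` and `recombine` are hypothetical, made up for illustration; they are not Ocropus's actual output layout, so the pattern would need adapting to whatever names your segmenter produces.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical naming scheme (an assumption, not Ocropus's real layout):
#   page0012_col02_line0007.txt
SEGMENT_NAME = re.compile(r"page(\d+)_col(\d+)_line(\d+)\.txt$")


def segment_key(filename: str) -> Tuple[int, int, int]:
    """Return (page, column, line) parsed from a segment file name."""
    match = SEGMENT_NAME.search(filename)
    if match is None:
        raise ValueError(f"unrecognised segment name: {filename!r}")
    page, col, line = (int(group) for group in match.groups())
    return page, col, line


def recombine(segments: Dict[str, str]) -> Dict[int, str]:
    """Join per-segment OCR text into one string per page.

    Segments are ordered by page, then column, then line, which is the
    reading order for a multi-column dictionary page.
    """
    pages: Dict[int, List[str]] = {}
    for name in sorted(segments, key=segment_key):
        page, _col, _line = segment_key(name)
        pages.setdefault(page, []).append(segments[name].strip())
    return {page: "\n".join(lines) for page, lines in pages.items()}
```

Since the ordering comes entirely from the file names, the OCR engine used per segment (tesseract, ocropus, or anything else) can be swapped without touching this step.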

-- 
Rufus Pollock

Founder and Co-Director | skype: rufuspollock | @rufuspollock <https://twitter.com/rufuspollock>

The Open Knowledge Foundation <http://okfn.org/>

Empowering through Open Knowledge
http://okfn.org/ | @okfn <http://twitter.com/OKFN> | OKF on Facebook <https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/> | Newsletter <http://okfn.org/about/newsletter>