[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Fri Jun 21 19:22:55 UTC 2013

p.s.  The other big difference between OCRopus and Tesseract is the size and
activity of the development community.  The primary developers for both
are pretty unresponsive, but Tesseract has a much more active community of
people who use it, train it for different languages & scripts (Ancient
Greek, Fraktur, etc.), and write tools to make training easier.

Tom

On Fri, Jun 21, 2013 at 3:19 PM, Tom Morris <tfmorris at gmail.com> wrote:

> Everything that gets uploaded to the Internet Archive (should) get OCR'd
> automatically.  You can see all the different file formats here:
> https://ia600401.us.archive.org/7/items/oed01arch/
>
> The PDF, EPUB, and DjVu formats should all have text in some form or
> another.
>
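For anyone who wants to check those formats programmatically, here is a
minimal sketch. It assumes the Internet Archive metadata endpoint
(https://archive.org/metadata/<identifier>) returns JSON with a "files" list
whose entries carry "name" and "format" keys. The helper names and the sample
record are hypothetical and only illustrate the shape of the data.

```python
import json
from urllib.request import urlopen

# Assumed endpoint pattern for the Internet Archive metadata API.
METADATA_URL = "https://archive.org/metadata/{identifier}"


def metadata_url(identifier):
    """Build the metadata URL for an item identifier such as 'oed01arch'."""
    return METADATA_URL.format(identifier=identifier)


def fetch_metadata(identifier):
    """Download and decode the item's metadata (performs a network request)."""
    with urlopen(metadata_url(identifier)) as response:
        return json.load(response)


def files_by_format(metadata):
    """Group file names by their 'format' field."""
    grouped = {}
    for entry in metadata.get("files", []):
        fmt = entry.get("format", "unknown")
        grouped.setdefault(fmt, []).append(entry["name"])
    return grouped


# Hypothetical sample record, not real data from the archive.
sample = {
    "files": [
        {"name": "example.pdf", "format": "Text PDF"},
        {"name": "example_djvu.txt", "format": "DjVuTXT"},
        {"name": "example.djvu", "format": "DjVu"},
    ]
}
```

Calling files_by_format(fetch_metadata("oed01arch")) should then show which
derived text formats, if any, exist for the item.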
> For open source OCR, I think Tesseract leads the bunch with the other main
> contender being OCRopus.  As Tim mentioned, you can mix and match components
> from the two.  OCRopus used to use Tesseract, but now has its own
> recognition engine.
>
> For a dictionary, you'll probably want to re-train to get support for IPA
> (or whatever phonetic alphabet they use).  Both Tess & OCRopus support
> training, but, unfortunately, the best recognition mode in Tess (Cube)
> doesn't yet have support for training (Google trained all the distributed
> models themselves internally, but hasn't released the training tools).
>
> For proofreading/correction, the two best options which come to mind are
> the Distributed Proofreaders project, which is part of Project Gutenberg,
> and Wikisource.
>
> Tom
>
>
> On Thu, Jun 20, 2013 at 3:34 PM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
>
>> Hi All,
>>
>> I'm writing to ask folks to share their recommendations on *workflows
>> and open tools for extracting (machine-readable) text* from *(bulk)
>> scanned text* (i.e. OCR etc).
>>
>> (Note I'm also, like many other folks, interested in extracting
>> structured data (e.g. tables) from normal or scanned PDF but I'm *not* asking
>> about that in this thread ...)
>>
>> *Context - or the problems I'm interested in*
>>
>> I've been interested in making the 1st edition of the Oxford English
>> Dictionary (now in public domain) available online in an open form for a
>> while - here's the entry in the Labs Ideas tracker <https://github.com/okfn/ideas/issues/50> about
>> this which details some of the history [1].
>>
>> Key point is that thanks to Kragen Sitaker's work in 2005/2006 the whole
>> text got scanned and is uploaded to archive.org <http://archive.org/details/oed01arch>.
>> However, it needed (and needs AFAIK) OCR'ing and then proofing.
>>
>> Back in the day I took a stab at this using tesseract plus shell scripts
>> (code now here https://github.com/okfn/oed) but it wasn't great:
>>
>> - Tesseract quality on non-standard dictionary text was poor
>> - Chopping up pages (both individual and columns from pages) needed
>> bespoke automation and was error-prone
>> - Not clear what best way was to do proofing once done (for the work for
>> Open Shakespeare and Encyclopaedia Britannica we just used a wiki)
>>
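On the column-chopping point, one common approach is a vertical projection
profile: count the ink in each pixel column and cut wherever there is a wide
enough blank gutter. The following is only a sketch under strong assumptions:
the page is already binarised (1 = ink, 0 = blank) and deskewed, and the
gutters are completely clean. The function names and the min_gap parameter are
hypothetical; real scans would need thresholding and some noise tolerance.

```python
def column_ink_profile(page):
    """Count ink pixels in each pixel column of a binarised page."""
    return [sum(column) for column in zip(*page)]


def find_column_spans(page, min_gap=2):
    """Return (start, end) x-ranges of text columns, end exclusive.

    Text columns are separated by gutters: runs of at least `min_gap`
    consecutive pixel columns that contain no ink at all.
    """
    profile = column_ink_profile(page)
    spans = []
    start = None  # x where the current text column began
    gap = 0       # length of the current run of blank pixel columns
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The text column ended where this blank run started.
                spans.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close the last column, trimming any trailing blank pixels.
        spans.append((start, len(profile) - gap))
    return spans


def crop_columns(page, spans):
    """Cut the page into one sub-image per (start, end) span."""
    return [[row[a:b] for row in page] for a, b in spans]
```

Each cropped column could then be passed to the OCR engine separately, which
avoids lines from neighbouring columns being read as one.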
>> Things have obviously moved on in the last 5 years and I was wondering
>> what the *best tools to use for this today* are (e.g. is Tesseract still
>> the best open-source option?).
>>
>> Rufus
>>
>> PS: if you're also interested in this project please let me know :-)
>>
>> [1]: https://github.com/okfn/ideas/issues/50
>>
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>>
>>
>