[open-science] software to extract text from pdf

Sun Jun 23 12:35:28 UTC 2013

On Thu, Jun 20, 2013 at 3:11 PM, Christian Pietsch <chr.pietsch+okfn at gmail.com> wrote:

> On Thu, Jun 20, 2013 at 01:28:02PM -0500, Bryan Bishop wrote:
> > Most of the time, tesseract fails out of the box.
>
> I am not sure in what way tesseract failed for you. The way it
> processes command line options is rather finicky, and its error
> messages are not very helpful. That is why I patched a friendly little
> Ruby script called pdfocr to make it work with tesseract. It takes a
> PDF containing only scanned page images, runs an OCR engine on it, and
> adds the recognized text as a background layer to the PDF. My patch
> was integrated into the main line of development three months ago:
> https://github.com/gkovacs/pdfocr
>
> > The tesseract documentation recommends running lots of training on
> > the specific types of data you are scanning.
>
> I never needed to train tesseract. Models for lots of languages are
> freely available.
>

I agree with Christian. Bryan's very broad statements don't appear to be
supported by the results people report on the Tesseract mailing list.
People who train Tesseract are doing so to support new languages, such as
Ancient Greek, or exotic fonts. Tesseract supports 63 languages out of the
box, including Chinese, Telugu, and a whole bunch of other non-European
languages.
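
For anyone who wants to try one of the stock language models without any
training, here is a minimal sketch of the basic command line, wrapped in
Python. The helper names (tesseract_command, run_ocr) and the file names are
hypothetical, made up for illustration. It assumes the tesseract binary is on
the PATH and that the language data for the code passed to -l (for example
"deu" for German) is installed.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run tesseract on one page image and return the text file it writes.

    Hypothetical helper: assumes tesseract is installed and on the PATH.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    # By default tesseract appends ".txt" to the output base name.
    return output_base + ".txt"
```

Calling run_ocr("page.png", "page", lang="deu") would run
"tesseract page.png page -l deu" and, if it succeeds, leave the recognized
text in page.txt.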

Tom