[open-science] software to extract text from pdf

Christian Pietsch chr.pietsch+okfn at gmail.com
Thu Jun 20 19:11:07 UTC 2013


On Thu, Jun 20, 2013 at 01:28:02PM -0500, Bryan Bishop wrote:
> Most of the time, tesseract fails out of the box.

I am not sure in what way tesseract failed for you. The way it
processes command line options is rather finicky, and its error
messages are not very helpful. That is why I patched a friendly little
Ruby script called pdfocr to make it work with tesseract. It takes a
PDF containing only scanned page images, runs an OCR engine on it, and
adds the recognized text as a background layer to the PDF. My patch
was merged into the main line of development three months ago:
https://github.com/gkovacs/pdfocr
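
To make the workflow concrete, here is a rough Python sketch of the same
idea. It is not how pdfocr itself is implemented. It assumes that
pdftoppm (from poppler-utils) and a tesseract build that supports the
"pdf" output config are installed. The function names and the "page"
file prefix are made up for this illustration.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <prefix>-<n>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, lang="eng"):
    # tesseract takes the image, then an output base name without extension.
    # The trailing "pdf" config asks for a PDF with a text layer
    # (assumption: the installed tesseract version ships this config).
    image = Path(image)
    outbase = image.with_suffix("")
    return ["tesseract", str(image), str(outbase), "-l", lang, "pdf"]


def ocr_pdf(pdf, workdir, lang="eng"):
    # Rasterize every page, then OCR each page image in order.
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / "page"), check=True)
    pages = sorted(workdir.glob("page-*.png"))
    for image in pages:
        subprocess.run(ocr_cmd(image, lang), check=True)
    # One searchable PDF per page; merging them is left to another tool.
    return [p.with_suffix(".pdf") for p in pages]
```

Note the argument order in ocr_cmd: image first, output base second,
options after that. Getting this order wrong is one of the easier ways
to run into the unhelpful error messages mentioned above.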

> The tesseract documentation recommends running lots of training on
> the specific types of data you are scanning.

I have never needed to train tesseract. Pre-trained models for many
languages are freely available.

Cheers,
Christian

-- 
  Christian Pietsch · http://purl.org/net/pietsch
  LibTec · Library Technology and Knowledge Management
  Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany
