[open-science] software to extract text from pdf

Bryan Bishop kanzure at gmail.com
Thu Jun 20 18:28:02 UTC 2013

On Thu, Jun 20, 2013 at 1:13 PM, sheila miguez <shekay at pobox.com> wrote:

> It uses tesseract, and I don't know if they do more or less than what I
> got trying to use tesseract by hand -- and I wasn't trying to scan tables,
> I was just scanning citations.

Most of the time, tesseract fails out of the box. The tesseract
documentation recommends running lots of training on the specific types of
data you are scanning. I haven't put the time into doing this yet. It would
be nice to eventually scan some of the CRC handbook data.

- Bryan
1 512 203 0507