[open-science] software to extract text from pdf

Bryan Bishop kanzure at gmail.com
Thu Jun 20 18:28:02 UTC 2013


On Thu, Jun 20, 2013 at 1:13 PM, sheila miguez <shekay at pobox.com> wrote:

> It uses tesseract, and I don't know if they do more or less than what I
> got trying to use tesseract by hand -- and I wasn't trying to scan tables,
> I was just scanning citations.
>

Most of the time, tesseract fails out of the box. The tesseract
documentation recommends training it extensively on the specific types of
data you are scanning. I haven't put the time into doing this yet. It would
be nice to eventually scan some of the CRC handbook data.
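
For reference, here is a rough sketch of what running stock tesseract "by
hand" on a single page image looks like when wrapped in Python. The helper
names (build_tesseract_command, ocr_image) and the file name in the usage
note are made up for illustration, and the sketch assumes the tesseract
executable is on your PATH with the default English data installed. It does
no training at all, so expect the untrained results described above.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=6):
    """Return the argument list for one tesseract run that prints to stdout."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing a .txt file. Page segmentation mode 6 treats the
    # image as a single uniform block of text.
    # Assumption: a tesseract version that accepts "--psm"; older 3.x
    # releases spell this option "-psm".
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path, lang="eng", psm=6):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout
```

Something like ocr_image("citations.png") would then return whatever text
tesseract recognized in that (hypothetical) image.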

- Bryan
http://heybryan.org/
1 512 203 0507