[open-science] software to extract text from pdf

Thu Jun 20 18:13:21 UTC 2013

On Thu, Jun 20, 2013 at 12:54 PM, Bryan Bishop <kanzure at gmail.com> wrote:

> On Thu, Jun 20, 2013 at 11:47 AM, sheila miguez <shekay at pobox.com> wrote:
>
>> For getting tables from pdfs there is tabula running as a web service.
>>
>> http://source.mozillaopennews.org/en-US/articles/introducing-tabula/
>>
>
> Is there one that works on scanned pdfs?
>
> - Bryan
> http://heybryan.org/
> 1 512 203 0507

The article linked there suggests trying DocHive, but I can't report on
it. I haven't tried it.

"Tabula only works on text-based PDFs only, so you’re still stuck with
manual labor if you have scanned PDFs. Free OCR technology is not quite to
the point where we’d trust automating it with many pages of data. For those
files, Raleigh Public Record’s DocHive is worth a look."
http://www.raleighpublicrecord.org/
https://github.com/raleighpublicrecord/dochive/tree/master/dochive

It uses tesseract, and I don't know whether it does more or less than what
I got from running tesseract by hand. I wasn't trying to scan tables,
though; I was just scanning citations.
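
For anyone who wants to try the by-hand route, here is a minimal sketch of
calling the tesseract command line from Python. It assumes the scanned page
has already been converted to an image (page.png is a made-up name), that
tesseract is installed and on the PATH, and that English is the right
language. The function names are hypothetical and have nothing to do with
DocHive's own code.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, output_base, lang="eng"):
    """Run tesseract on one page image and return the recognized text.

    tesseract writes its plain-text result to OUTPUTBASE.txt.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical input file; replace with a real scanned page image.
    print(ocr_page("page.png", "page"))
```

This only gets raw text out of a page. It does nothing to recover table
structure, so the output would still need cleanup by hand.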
-- 
sheila