[open-science] software to extract text from pdf

Thu Jun 20 16:47:44 UTC 2013

Hi,

How do you want to use the tools? Do you want to incorporate them into some
software you are building, or would you just like to have something that
you can use without worrying about adding it to something else?

For getting tables from PDFs there is Tabula, which runs as a web service.

http://source.mozillaopennews.org/en-US/articles/introducing-tabula/

I used it by starting up an instance of their AMI on Amazon, and then had
to edit some settings to make it accept PDFs above the default size limit,
since the PDF I wanted to use was larger than that. I don't know how
accessible this is to you; does it sound reasonably doable?

For text from images, I tried tesseract-ocr with ImageMagick and bash. I
am definitely not an expert and did not finesse anything in this case. I
did this from an Ubuntu machine, and found an Ubuntu page on OCR:

https://help.ubuntu.com/community/OCR

The page had a bash script, which I destroyed until it looked like the one
below, since I didn't need everything in their script. It is embarrassing
really. I was lazy since I didn't want to type everything at the command
line. It is really dumb: it is not intelligent about checking args, it does
not let you have more options, and it clobbers files and so on.

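#!/bin/bash

# Require the PDF name as the first argument.
if [ -z "$1" ]; then
    echo "usage: $0 pdfname" >&2
    exit 1
fi

# Rasterize the PDF at 300 dpi in black and white (overwrites pdf.tif).
convert -density 300 "$1" -monochrome pdf.tif
# OCR the image; tesseract writes the text to outputfile.txt.
tesseract pdf.tif outputfile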
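
If the clobbering is a problem, here is a rough sketch of a variant that takes several PDFs and names each output after its input. It is only a sketch and I have not tuned it: the function names out_base and ocr_pdf are made up for illustration, and it assumes the same convert and tesseract commands as above are installed.

```shell
#!/bin/bash
# Sketch: OCR each PDF given on the command line into its own text file.
# Assumes ImageMagick (convert) and tesseract are on the PATH.

# Derive an output base name from a PDF path,
# e.g. "scans/My Paper.pdf" gives "My Paper".
out_base() {
    basename "$1" .pdf
}

# Rasterize one PDF into a temporary TIFF, OCR it, then clean up.
ocr_pdf() {
    local pdf="$1"
    local base tmpdir status
    base="$(out_base "$pdf")"
    tmpdir="$(mktemp -d)" || return 1
    convert -density 300 "$pdf" -monochrome "$tmpdir/pages.tif" &&
        tesseract "$tmpdir/pages.tif" "$base"   # writes "$base.txt"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}

# With no arguments this loop simply does nothing.
for pdf in "$@"; do
    out="$(out_base "$pdf").txt"
    if [ -e "$out" ]; then
        # Refuse to overwrite an earlier result.
        echo "skipping $pdf: $out already exists" >&2
        continue
    fi
    ocr_pdf "$pdf" || echo "failed: $pdf" >&2
done
```

Saved as, say, ocr-all.sh (a hypothetical name), it would be run as `./ocr-all.sh *.pdf` and leave one .txt file per PDF in the current directory.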

On Thu, Jun 20, 2013 at 1:39 AM, Donat Agosti <agosti at amnh.org> wrote:

> Dear all
>
> We work on a project to convert taxonomic publications into semantically
> enhanced linked xml documents (and ultimately to add them to our databases
> at http://plazi.org)
>
> There are principally two import formats, pdf with images (or scanned
> images of text) and born-digital pdfs.
>
> For the latter, we would like to find an open source tool that allows the
> extraction of text from the pdf, with some constraints:
>
> We need the page numbers and breaks (to be able to define the position of
> text blocks since in our world the citations lead back to the page where a
> taxon is being described, that is the treatment), and thus the output has
> to include this.
>
> We deal with many tables. They have to be dealt with. We have not found
> tools that deliver this.
>
> We have many special symbols ☿ ♀ ♂ ♃ É Î æ Æ
>
> We would like to run batch files to produce automated output.
>
> Has anybody an idea where we could find this?
>
> The focus on treatment is, because this is the taxonomist's currency, and
> because we can deal with them due to their nature that does not qualify
> them as work (in a legal sense) and thus are out of copyright.
>
> Best thanks for a hint
>
> Donat
>
> Plazi
>

-- 
sheila