[open-science] software to extract text from pdf

Thu Jun 20 06:39:19 UTC 2013

Dear all

We are working on a project to convert taxonomic publications into semantically enhanced, linked XML documents (and ultimately to add them to our databases at http://plazi.org).

There are essentially two formats we import: PDFs that consist of images (scanned images of text) and born-digital PDFs.

For the latter, we would like to find an open source tool that extracts the text from the PDF, with a few constraints:

We need the page numbers and page breaks, so the output has to include them. They let us define the position of text blocks: in our world, citations lead back to the page where a taxon is described (that is, the treatment).

Our publications contain many tables, and these have to be handled as well. We have not found tools that deliver this.

We have many special symbols, such as ☿ ♀ ♂ ♃ É Î æ Æ, which must be preserved.

We would like to run the tool in batch mode to produce the output automatically.
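
To make the page-break and batch requirements concrete, here is a minimal sketch of the kind of output we have in mind. It assumes a text extractor that marks each page boundary with a form feed character, as pdftotext from Poppler does, and that this tool is installed; the function names and the directory layout are hypothetical, and the sketch does nothing about tables.

```python
import subprocess
from pathlib import Path

FORM_FEED = "\f"  # assumed page separator (pdftotext emits one after each page)


def split_pages(text):
    """Split extractor output into (page_number, page_text) pairs."""
    pages = text.split(FORM_FEED)
    # A form feed after the last page leaves an empty trailing chunk; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return list(enumerate(pages, start=1))


def extract_pdf(pdf_path):
    """Run pdftotext on one file and return its pages (assumes Poppler is installed)."""
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return split_pages(result.stdout.decode("utf-8"))


def batch_extract(input_dir, output_dir):
    """Write one UTF-8 text file per page for every PDF in input_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        for number, page_text in extract_pdf(pdf):
            target = out / f"{pdf.stem}_p{number:04d}.txt"
            target.write_text(page_text, encoding="utf-8")
```

Calling batch_extract("pdfs", "pages") would then give one text file per page, named after the source PDF and the page number, so that every text block can be traced back to its page.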

Does anybody have an idea where we could find such a tool?

We focus on treatments because they are the taxonomist's currency, and because their nature does not qualify them as works (in a legal sense), so they fall outside copyright and we are free to work with them.

Many thanks for any hints

Donat

Plazi
