[open-science] software to extract text from pdf

Thu Jun 20 07:20:56 UTC 2013

Dear Donat

I was very interested in your query and asked a colleague who has recently been engaged in this exact process (for a similar reason). She utilised a proprietary software package called OmniPage … and offered the comment: "No good open source alternative that can give the same level of accuracy and conversion power that I know of."

We would be very interested to know if anyone has better (free, open) suggestions.

Michelle

--
Michelle Willmers
Project Manager
OpenUCT Initiative
University of Cape Town
South Africa
Tel:+27(21) 650 5061
Cell: 082 229 4262
http://openuct.uct.ac.za/
Twitter: @SCAprogramme

From: Donat Agosti <agosti at amnh.org>
Date: Thu, 20 Jun 2013 11:09:19 +0430
To: <open-science at lists.okfn.org>
Cc: Terry Catapano <thc4ster at gmail.com>
Subject: [open-science] software to extract text from pdf

Dear all

We work on a project to convert taxonomic publications into semantically enhanced linked xml documents (and ultimately to add them to our databases at http://plazi.org)

There are principally two import formats: pdfs with images (or scanned images of text) and born-digital pdfs.

For the latter, we would like to find an open source tool that extracts text from the pdf, subject to the following constraints:
We need the page numbers and page breaks in the output, so that we can locate text blocks: in our world, citations lead back to the page where a taxon is described (that is, the treatment).
We deal with many tables, and they have to be handled. We have not found tools that do this.
We have many special symbols: ☿ ♀ ♂ ♃ É Î æ Æ
We would like to run batch files to produce automated output.

Has anybody an idea where we could find this?

We focus on treatments because they are the taxonomist's currency, and because we can work with them: by their nature they do not qualify as works (in a legal sense) and are thus out of copyright.

Many thanks for any hints

Donat
Plazi

_______________________________________________
open-science mailing list
open-science at lists.okfn.org
http://lists.okfn.org/mailman/listinfo/open-science
Unsubscribe: http://lists.okfn.org/mailman/options/open-science