[open-science] software to extract text from pdf

Thu Jun 20 07:25:25 UTC 2013

Dear Michelle

Thanks for this. So far we have used ABBYY (also proprietary software), but it does not offer text extraction from PDFs, and there is a limitation on adding page breaks (that worked up to version 8, which is why we decided to use ABBYY and not Omnipage).

Do you know whether Omnipage allows batch processing? Any idea how well it extracts tables etc.?

Still, an open source tool would be welcome…

Cheers

Donat

From: Michelle Willmers [mailto:michelle.willmers at uct.ac.za] 
Sent: Thursday, June 20, 2013 11:51 AM
To: Donat Agosti; open-science at lists.okfn.org
Cc: Terry Catapano
Subject: Re: [open-science] software to extract text from pdf

Dear Donat

I was very interested in your query and asked a colleague who has recently been engaged in this exact process (for similar reasons). She used a proprietary software package called Omnipage … and offered the comment: "No good open source alternative that can give the same level of accuracy and conversion power that I know of."

We would be very interested to know if anyone has better (free, open) suggestions.

Michelle

-- 

Michelle Willmers

Project Manager

OpenUCT Initiative

University of Cape Town

South Africa

Tel:+27(21) 650 5061

Cell: 082 229 4262

http://openuct.uct.ac.za/ <http://www.scaprogramme.org.za/> 

Twitter: @SCAprogramme

From: Donat Agosti <agosti at amnh.org>
Date: Thu, 20 Jun 2013 11:09:19 +0430
To: <open-science at lists.okfn.org>
Cc: Terry Catapano <thc4ster at gmail.com>
Subject: [open-science] software to extract text from pdf

Dear all

We are working on a project to convert taxonomic publications into semantically enhanced, linked XML documents (and ultimately to add them to our databases at http://plazi.org).

There are principally two import formats: PDFs with images (or scanned images of text) and born-digital PDFs.

For the latter, we would like to find an open source tool that allows the extraction of text from the PDF, with a few constraints:

We need the page numbers and page breaks (to be able to define the position of text blocks, since in our world the citations lead back to the page where a taxon is described, that is, the treatment), and thus the output has to include them.

We deal with many tables. They have to be dealt with. We have not found tools that deliver this.

We have many special symbols: ☿ ♀ ♂ ♃ É Î æ Æ

We would like to run batch files to produce automated output.

Has anybody an idea where we could find this?

The focus is on treatments because they are the taxonomist's currency, and because we can deal with them: by their nature they do not qualify as works (in a legal sense) and are thus out of copyright.

Many thanks for any hints

Donat

Plazi
