[open-science] software to extract text from pdf

Thu Jun 20 13:07:37 UTC 2013

I haven't used it myself, but I asked around and got a recommendation for
http://www.pdflib.com/download/free-software/.

On Thu, Jun 20, 2013 at 3:25 AM, Donat Agosti <agosti at amnh.org> wrote:

> Dear Michelle
>
> Thanks for this. So far we have used ABBYY (also proprietary software), but
> they do not offer text extraction from PDFs, and there is a limitation on
> adding page breaks (that worked up to their version 8, and that's why we
> decided to use ABBYY and not Omnipage).
>
> Do you know whether Omnipage allows batch processing? Any idea how well
> it extracts tables etc.?
>
> Still, an open source tool would be welcome…
>
> Cheers
>
> Donat
>
> *From:* Michelle Willmers [mailto:michelle.willmers at uct.ac.za]
> *Sent:* Thursday, June 20, 2013 11:51 AM
> *To:* Donat Agosti; open-science at lists.okfn.org
> *Cc:* Terry Catapano
> *Subject:* Re: [open-science] software to extract text from pdf
>
> Dear Donat
>
> I was very interested in your query and asked a colleague who has recently
> been engaged in this exact process (for similar reasons). She used a
> proprietary software package called Omnipage … and offered the comment: "No
> good open source alternative that can give the same level of accuracy and
> conversion power that I know of."
>
> We would be very interested to know if anyone has better (free, open)
> suggestions.
>
> Michelle
>
> -- 
> Michelle Willmers
> Project Manager
> OpenUCT Initiative
> University of Cape Town
> South Africa
> Tel:+27(21) 650 5061
> Cell: 082 229 4262
> http://openuct.uct.ac.za/ <http://www.scaprogramme.org.za/>
> Twitter: @SCAprogramme
>
> *From: *Donat Agosti <agosti at amnh.org>
> *Date: *Thu, 20 Jun 2013 11:09:19 +0430
> *To: *<open-science at lists.okfn.org>
> *Cc: *Terry Catapano <thc4ster at gmail.com>
> *Subject: *[open-science] software to extract text from pdf
>
> Dear all
>
> We work on a project to convert taxonomic publications into semantically
> enhanced, linked XML documents (and ultimately to add them to our databases
> at http://plazi.org).
>
> There are principally two import formats: PDFs with images (or scanned
> images of text) and born-digital PDFs.
>
> For the latter, we would like to find an open source tool that allows the
> extraction of text from the PDF, with the following constraints:
>
> We need the page numbers and breaks (to be able to define the position of
> text blocks, since in our world the citations lead back to the page where a
> taxon is being described, that is, the treatment), and thus the output has
> to include them.
>
> We deal with many tables. They have to be handled. We have not found
> tools that deliver this.
>
> We have many special symbols: ☿ ♀ ♂ ♃ É Î æ Æ
>
> We would like to run batch files to produce automated output.
>
> Does anybody have an idea where we could find this?
>
> The focus on treatments is because they are the taxonomist's currency, and
> because we can deal with them: by their nature they do not qualify
> as works (in a legal sense) and are thus out of copyright.
>
> Many thanks for any hint
>
> Donat
> Plazi
>
> _______________________________________________
> open-science mailing list
> open-science at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-science
> Unsubscribe: http://lists.okfn.org/mailman/options/open-science