[open-science] software to extract text from pdf

Peter Murray-Rust pm286 at cam.ac.uk
Thu Jun 20 17:14:43 UTC 2013


A number of us have been developing Open Source software for exactly this
purpose. There is a general project, "Jailbreaking the PDF", which has
brought together about 5-7 groups, all of whom have different things to
contribute. Here's my account:
http://blogs.ch.cam.ac.uk/pmr/2013/05/28/jailbreaking-the-pdf-a-wonderful-hackathon-and-a-community-leap-forward-for-freedom-1/

This community effort is tackling PDFs for science, but in addition Ross
Mounce and I are specifically targeting phylogenetics and (by implication)
taxonomy. We have prototyped methods to extract trees from diagrams and
species from text.

Note that the PDF quality of scholarly articles is often very non-standard.
You mention a range of symbols. It is highly unlikely that academic
publishers use Unicode for these, so they need heuristic translation, and I
would not expect commercial packages to do a good job on some academic
publishers' symbols. Our community, which anyone can join, is growing and
is probably the best place to develop solutions.

Our own contribution, which deals with character conversion, is
http://www.bitbucket.org/petermr/pdf2svg-dev. This converts the raw PDF to
Unicode as far as is possible (we often require per-publisher tables).
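
To illustrate what a per-publisher table might look like, here is a minimal
Python sketch. It is not taken from pdf2svg: the publisher key, the font name,
the character codes and the function name are all invented for illustration,
and a real table would have to be built by inspecting each publisher's fonts.

```python
# Hypothetical per-publisher tables mapping (font name, character code)
# to a Unicode character. All keys and codes here are invented examples.
PUBLISHER_TABLES = {
    "example-publisher": {
        ("ExampleSymbolFont", 0x31): "\u2642",  # MALE SIGN
        ("ExampleSymbolFont", 0x32): "\u2640",  # FEMALE SIGN
        ("ExampleSymbolFont", 0x33): "\u263F",  # MERCURY
    },
}

REPLACEMENT = "\uFFFD"  # marks glyphs from a symbol font that have no entry


def to_unicode(glyphs, publisher):
    """Translate a sequence of (font, code) pairs into a Unicode string.

    Assumption: any font that appears in the publisher's table is a
    non-standard symbol font, so its unmapped codes are flagged rather
    than guessed. Codes from all other fonts are taken at face value.
    """
    table = PUBLISHER_TABLES.get(publisher, {})
    symbol_fonts = {font for font, _ in table}
    out = []
    for font, code in glyphs:
        if (font, code) in table:
            out.append(table[(font, code)])
        elif font in symbol_fonts:
            out.append(REPLACEMENT)
        else:
            out.append(chr(code))
    return "".join(out)


glyphs = [("Times", 0x41), ("ExampleSymbolFont", 0x31), ("ExampleSymbolFont", 0x7A)]
print(to_unicode(glyphs, "example-publisher"))
```

Flagging unmapped symbol-font codes, instead of passing them through, makes
the gaps in a table visible so that they can be filled in by hand.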

I would be very happy to look at your examples, but am away from the
internet for a few days.

P.



On Thu, Jun 20, 2013 at 2:07 PM, Shauna Gordon-McKeon <shaunagm at gmail.com> wrote:

> I haven't used it, but asked and got a recommendation for
> http://www.pdflib.com/download/free-software/.
>
>
> On Thu, Jun 20, 2013 at 3:25 AM, Donat Agosti <agosti at amnh.org> wrote:
>
>> Dear Michelle
>>
>> Thanks for this. So far we have used ABBYY (also proprietary software), but
>> they do not offer text extraction from PDFs, and there is a limitation on
>> adding page breaks (that worked up to their version 8, and that's why we
>> decided to use ABBYY and not Omnipage).
>>
>> Do you know whether Omnipage allows batch processing? Any idea how well
>> it extracts tables etc.?
>>
>> Still, an open source tool would be welcome…
>>
>> Cheers
>>
>> Donat
>>
>> *From:* Michelle Willmers [mailto:michelle.willmers at uct.ac.za]
>> *Sent:* Thursday, June 20, 2013 11:51 AM
>> *To:* Donat Agosti; open-science at lists.okfn.org
>> *Cc:* Terry Catapano
>> *Subject:* Re: [open-science] software to extract text from pdf
>>
>> Dear Donat
>>
>> I was very interested in your query and asked a colleague who has
>> recently been engaged in this exact process (for similar reasons). She
>> used a proprietary software package called Omnipage … and offered the
>> comment: "No good open source alternative that can give the same level of
>> accuracy and conversion power that I know of."
>>
>> We would be very interested to know if anyone has better (free, open)
>> suggestions.
>>
>> Michelle
>>
>> --
>>
>> Michelle Willmers
>> Project Manager
>> OpenUCT Initiative
>> University of Cape Town
>> South Africa
>> Tel: +27(21) 650 5061
>> Cell: 082 229 4262
>> http://openuct.uct.ac.za/ <http://www.scaprogramme.org.za/>
>> Twitter: @SCAprogramme
>>
>> *From: *Donat Agosti <agosti at amnh.org>
>> *Date: *Thu, 20 Jun 2013 11:09:19 +0430
>> *To: *<open-science at lists.okfn.org>
>> *Cc: *Terry Catapano <thc4ster at gmail.com>
>> *Subject: *[open-science] software to extract text from pdf
>>
>> Dear all
>>
>> We work on a project to convert taxonomic publications into semantically
>> enhanced, linked XML documents (and ultimately to add them to our databases
>> at http://plazi.org).
>>
>> There are principally two import formats: PDFs with images (or scanned
>> images of text) and born-digital PDFs.
>>
>> For the latter, we would like to find an open source tool that allows the
>> extraction of text from the PDF, with the following constraints:
>>
>> - We need the page numbers and breaks (to be able to define the position of
>>   text blocks, since in our world the citations lead back to the page where a
>>   taxon is being described, that is, the treatment), and thus the output has
>>   to include this.
>> - We deal with many tables. They have to be dealt with. We have not found
>>   tools that deliver this.
>> - We have many special symbols: ☿ ♀ ♂ ♃ É Î æ Æ
>> - We would like to run batch files to produce automated output.
>>
>> Has anybody an idea where we could find this?
>>
>> The focus on treatments is because this is the taxonomist's currency, and
>> because we can deal with them, since by their nature they do not qualify
>> as works (in a legal sense) and thus are out of copyright.
>>
>> Many thanks for any hints
>>
>> Donat
>> Plazi
>>
>> _______________________________________________
>> open-science mailing list
>> open-science at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-science
>> Unsubscribe: http://lists.okfn.org/mailman/options/open-science
>>
>
>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069