[open-science] software to extract text

Thu Jun 20 13:19:19 UTC 2013

Dear Donat,

On Thu, Jun 20, 2013 at 11:09:19 +0430, Donat Agosti wrote:
> There are principally two import formats, pdf with images (or scanned
> images of text) and born-digital pdfs.
>
> For the latter, we would like to find an open source tool that allows the
> extraction of text from the pdf, with one constraint:
>
> We need the page numbers and breaks (to be able to define the position of
> text blocks since in our world the citations lead back to the page where a
> taxon is being described, that is the treatment), and thus the output has
> to include this.

Have you looked at pdftotext? This seems to be the open source tool
most people use on Linux/Unix. On Debian-based systems, it is included
in the poppler-utils package. A Windows version can be downloaded here:
http://www.foolabs.com/xpdf/download.html

By default, pdftotext inserts an ASCII formfeed character (decimal 12,
hexadecimal 0xC, sometimes written as ^L) to indicate page breaks. If
you know the page number of the first page, you can use this character
to determine all subsequent page numbers.
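
Here is a minimal sketch of that bookkeeping in Python. It assumes
pdftotext is on your PATH and accepts "-" as the output file (to write
to stdout) and the -enc option. The function names and the first_page
parameter are my own invention, not part of any tool:

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return its output as one string.

    Assumption: pdftotext is installed and writes to stdout when the
    output file is given as "-".
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


def split_pages(text, first_page=1):
    """Split pdftotext output on form feeds and number the pages.

    Returns a list of (page_number, page_text) tuples. first_page is
    the printed number of the first page in the PDF.
    """
    pages = text.split("\f")
    # Each page is terminated by a form feed, so the split usually
    # leaves one empty string at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return list(enumerate(pages, start=first_page))
```

For an article that starts on page 117, you would call
split_pages(extract_text("article.pdf"), first_page=117). This only
works if the printed page numbers are consecutive, with no unnumbered
plates or inserts in between.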

> We deal with many tables. They have to be dealt with. We have not found
> tools that deliver this.

Tables are beyond the capabilities of pdftotext. It puts each table
cell on a new line, it seems, and it is hard to spot where a row
started or ended.
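
One thing that might be worth trying is the -layout option of
pdftotext, which tries to keep the physical layout of the page, so
that the cells of a row may end up on the same line. I have not tested
this on your documents. The sketch below is only a heuristic under
that assumption: it treats runs of two or more spaces as column
separators. The function names are made up for illustration:

```python
import re


def split_row(line):
    """Split one layout-preserved line into cells.

    Heuristic: two or more consecutive spaces separate columns, while
    single spaces stay inside a cell.
    """
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]


def rows_from_layout(text):
    """Turn a block of layout-preserved text into a list of rows,
    skipping blank lines."""
    return [split_row(line) for line in text.splitlines() if line.strip()]
```

This will break on empty cells, on cells that wrap over several lines,
and on columns separated by a single space, so it is no substitute for
a tool that really understands tables.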

PDFX might be able to do more: http://pdfx.cs.man.ac.uk/
Unfortunately, I could not find a download link there, and the web
service is limited to documents of at most 100 pages.

Luckily, there is an open source alternative: LA-PDFText
https://code.google.com/p/lapdftext/

These and other tools relevant to scientific communication are listed
here: http://www.force11.org/tools

I hope this helps.
Christian

-- 
  Christian Pietsch · http://purl.org/net/pietsch
  LibTec · Library Technology and Knowledge Management
  Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany