[open-science] software to extract text
chr.pietsch+okfn at gmail.com
Thu Jun 20 13:19:19 UTC 2013
On Thu, Jun 20, 2013 at 11:09:19 +0430, Donat Agosti wrote:
> There are principally two import formats: PDFs with images (or scans
> of text) and born-digital PDFs.
> For the latter, we would like to find an open source tool that allows the
> extraction of text from the pdf, with one constraint:
> We need the page numbers and breaks (to be able to define the position of
> text blocks since in our world the citations lead back to the page where a
> taxon is being described, that is, the treatment), and thus the output has
> to include this.
Have you looked at pdftotext? This seems to be the open source tool
most people use on Linux/Unix. On Debian-based systems, it is included
in the poppler-utils package. A Windows version can be downloaded here:
By default, pdftotext inserts an ASCII form feed character (decimal 12,
hexadecimal 0x0C, sometimes written as ^L) to indicate page breaks. If
you know the page number of the first page, you can use this character
to determine all subsequent page numbers.
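To illustrate, here is a minimal Python sketch of that idea. It assumes
pdftotext is installed and on the PATH, and that its output is UTF-8 (the
default, as far as I know). The function names pdf_to_text and split_pages
are made up for this example; they are not part of any library.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted text.

    Assumes the pdftotext binary is on the PATH. Passing "-" as the
    output file name makes pdftotext write to standard output.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_pages(text, first_page=1):
    """Split pdftotext output on form feeds into (page_number, page_text) pairs.

    first_page is the printed page number of the first PDF page,
    which has to be known in advance.
    """
    pages = text.split("\f")
    # A form feed terminates each page, so the split leaves one empty
    # trailing piece that is not a real page.
    if pages and pages[-1] == "":
        pages.pop()
    return [(first_page + offset, page) for offset, page in enumerate(pages)]
```

Usage would look something like split_pages(pdf_to_text("article.pdf"),
first_page=117), where the file name and the starting page number are
placeholders. This only works if the printed page numbers run consecutively
through the PDF.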
> We deal with many tables. They have to be dealt with. We have not found
> tools that deliver this.
Tables are beyond the capabilities of pdftotext. It puts each table
cell on a new line, it seems, and it is hard to spot where a row
started or ended.
PDFX might be able to do more: http://pdfx.cs.man.ac.uk/
Unfortunately, I could not find a download link there, and the web
service is limited to documents of at most 100 pages.
Luckily, there is an open source alternative: LA-PDFText
These and other tools relevant to scientific communication are listed
I hope this helps.
Christian Pietsch · http://purl.org/net/pietsch
LibTec · Library Technology and Knowledge Management
Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany