[School-of-data] PDF Extraction Tools

Thu Dec 20 14:04:03 UTC 2012

On Thu, Dec 20, 2012 at 7:12 AM, Ola Løvholm <ola at lovholm.net> wrote:

> I'm currently working on a PDF extraction project for the NIME conference
> series. I use a combination of Python, Tika and Bibtex. Tika
> (http://tika.apache.org/) is a great tool: calling tika from the terminal
> with the --text argument returns the text of the PDF file. The
> work-in-progress code is available on
> GitHub: https://github.com/olovholm/NIME. The main part of the code is in
> the pdfextractor.py script.

In this scenario I think Tika is just a wrapper for PDFBox (also from
Apache) and adds no functionality of its own, so you could probably
use PDFBox directly.
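
As a rough sketch of what calling the two tools from Python could look
like: the jar file names below are placeholders, the helper names are
made up for illustration, and the PDFBox invocation assumes the 1.x/2.x
command-line app, where the tool is called ExtractText and accepts
-console and -html (newer releases may name things differently).

```python
import subprocess


def tika_text_command(tika_jar, pdf_path):
    """Build the command line for the Tika app jar with --text."""
    return ["java", "-jar", tika_jar, "--text", pdf_path]


def pdfbox_text_command(pdfbox_jar, pdf_path, html=False):
    """Build the command line for PDFBox's ExtractText tool.

    Assumes the 1.x/2.x app jar; -console sends the result to stdout
    and -html switches from plain text to HTML output.
    """
    cmd = ["java", "-jar", pdfbox_jar, "ExtractText", "-console"]
    if html:
        cmd.append("-html")
    cmd.append(pdf_path)
    return cmd


def run_extractor(cmd):
    """Run an extractor command and return whatever it wrote to stdout."""
    result = subprocess.run(cmd, capture_output=True, check=True)
    # The output encoding depends on the platform and tool settings, so
    # undecodable bytes are replaced rather than raising an error.
    return result.stdout.decode("utf-8", errors="replace")
```

With Java and the jars in place, something like
run_extractor(pdfbox_text_command("pdfbox-app.jar", "paper.pdf"))
should give much the same text as the Tika call, if Tika really is
only delegating to PDFBox here.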

For your metadata extraction task I think you lose valuable
information when the formatting and layout are discarded. You might
want to look at either PDFBox's HTML output or the XML output from
pdftohtml to keep more of that information.
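
To illustrate why the layout helps, here is a minimal sketch that
guesses a paper's title by taking the text set in the largest font on
the first page. It assumes the XML produced by "pdftohtml -xml", which
has <fontspec> elements (id, size) and <text> elements (top, left,
font) inside each <page>. The sample document is hand-written for this
example, not real tool output, and "largest font = title" is only a
heuristic.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of "pdftohtml -xml" output.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="24" family="Times" color="#000000"/>
<fontspec id="1" size="10" family="Times" color="#000000"/>
<text top="100" left="150" width="600" height="28" font="0"><b>An Example Title</b></text>
<text top="140" left="300" width="200" height="12" font="1">A. Author</text>
<text top="200" left="80" width="350" height="12" font="1">Body text starts here.</text>
</page>
</pdf2xml>"""


def largest_font_lines(xml_text):
    """Return first-page text lines set in the largest font, top to bottom."""
    root = ET.fromstring(xml_text)
    page = root.find("page")
    if page is None:
        return []
    # Map each font id to its point size.
    sizes = {f.get("id"): float(f.get("size")) for f in root.iter("fontspec")}
    lines = []
    for text in page.iter("text"):
        size = sizes.get(text.get("font"))
        content = "".join(text.itertext()).strip()
        if size is not None and content:
            lines.append((size, int(text.get("top")), content))
    if not lines:
        return []
    biggest = max(size for size, _, _ in lines)
    ordered = sorted(lines, key=lambda item: item[1])
    return [content for size, _, content in ordered if size == biggest]


print(largest_font_lines(SAMPLE_XML))
```

Plain text output would give all three lines with nothing to tell
them apart, which is the information loss I mean.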

There's also a pure Python solution in PDFMiner
(http://www.unixuser.org/~euske/python/pdfminer/index.html). Its
documentation says it's 20x slower than the C/Java solutions, but for a
one-time extraction that's probably not a big deal.

Tom