[School-of-data] PDF Extraction Tools
Tom Morris
tfmorris at gmail.com
Thu Dec 20 14:04:03 UTC 2012
On Thu, Dec 20, 2012 at 7:12 AM, Ola Løvholm <ola at lovholm.net> wrote:
> I'm currently working on a PDF extraction project for the NIME conference
> series. I use a combination of Python, Tika and BibTeX. Tika
> (http://tika.apache.org/) is a great tool: calling tika from the terminal
> with the --text argument returns the text of the PDF file. The
> work-in-progress code is available on GitHub:
> https://github.com/olovholm/NIME. The main part of the code is in the
> pdfextractor.py script.
In this scenario I think Tika is just a wrapper for PDFBox (also from
Apache) and adds no functionality of its own, so you could probably use
PDFBox directly.
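
As a rough sketch of what the two invocations might look like from
Python: the jar file names below are assumptions (use whatever paths you
have locally), and the PDFBox form assumes the 1.x/2.x command-line app,
where the tool is called ExtractText; newer releases may name it
differently.

```python
import subprocess


def tika_text_command(pdf_path, tika_jar="tika-app.jar"):
    """Build the Tika command line (the jar name is an assumed local path)."""
    return ["java", "-jar", tika_jar, "--text", pdf_path]


def pdfbox_text_command(pdf_path, pdfbox_jar="pdfbox-app.jar", html=False):
    """Build the PDFBox ExtractText command line (jar name is assumed).

    -console sends the result to stdout instead of a file;
    -html asks for HTML instead of plain text.
    """
    cmd = ["java", "-jar", pdfbox_jar, "ExtractText", "-console"]
    if html:
        cmd.append("-html")
    cmd.append(pdf_path)
    return cmd


def run_extractor(cmd):
    """Run an extractor command and return whatever it printed to stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Something like run_extractor(pdfbox_text_command("paper.pdf")) should then
give roughly the same text as the Tika call, minus one layer of wrapping.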
For your metadata extraction task I think you're losing valuable
information when the formatting and layout are lost. You might want to
look at either using PDFBox's HTML output or the XML output from
pdftohtml to get more information.
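
To illustrate why the layout helps, here is a minimal sketch that picks
title candidates out of pdftohtml's XML by font size. The SAMPLE string is
made up to resemble what "pdftohtml -xml" typically produces (page,
fontspec and text elements); it is not output from a real NIME paper, and
the "largest font on page one is the title" rule is only a heuristic.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like "pdftohtml -xml" output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="22" family="Times" color="#000000"/>
<fontspec id="1" size="12" family="Times" color="#000000"/>
<text top="110" left="150" width="600" height="26" font="0">A Hypothetical Paper Title</text>
<text top="160" left="300" width="300" height="15" font="1">Some Author</text>
<text top="300" left="100" width="700" height="15" font="1">Body text starts here.</text>
</page>
</pdf2xml>"""


def title_candidates(xml_text):
    """Return the first page's text lines set in its largest font, top to bottom."""
    root = ET.fromstring(xml_text)
    page = root.find("page")
    if page is None:
        return []
    # Map each font id to its point size.
    sizes = {f.get("id"): int(f.get("size")) for f in page.iter("fontspec")}
    lines = []
    for t in page.iter("text"):
        content = "".join(t.itertext()).strip()  # also catches <b>/<i> children
        if content:
            lines.append((sizes.get(t.get("font"), 0), int(t.get("top")), content))
    if not lines:
        return []
    biggest = max(size for size, _, _ in lines)
    ordered = sorted(lines, key=lambda item: item[1])
    return [content for size, _, content in ordered if size == biggest]


print(title_candidates(SAMPLE))
```

With plain text output the title and the author line would be
indistinguishable from body text; here the font size and position are
still available to work with.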
There's also a pure Python solution in PDFMiner
(http://www.unixuser.org/~euske/python/pdfminer/index.html). It says
it's 20x slower than the C/Java solutions, but for a one-time
extraction, that's probably not a big deal.
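
A minimal sketch of the pure-Python route, assuming the maintained
pdfminer.six fork is installed (its high-level extract_text helper is not
part of the original PDFMiner release linked above). The
first_nonblank_lines helper is just a hypothetical convenience for
eyeballing the top of a paper.

```python
def pdf_text(pdf_path):
    """Extract all text from a PDF with pdfminer.six (assumed to be installed)."""
    # Imported lazily so the helper below works without pdfminer.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def first_nonblank_lines(text, limit=5):
    """Return the first few non-empty lines, e.g. to look for title and authors."""
    lines = [line.strip() for line in text.splitlines()]
    return [line for line in lines if line][:limit]
```

For example, first_nonblank_lines(pdf_text("paper.pdf")) would give the
first five non-blank lines of a local file named paper.pdf.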
Tom