[School-of-data] PDF Extraction Tools

Thu Dec 20 13:53:51 UTC 2012

On Thu, Dec 20, 2012 at 5:54 AM, Michael Bauer <michael.bauer at okfn.org> wrote:

> For the next School of Data tutorial I would like to cover Data extraction
> from PDFs (and text) as well as OCR.
>
> Does anyone here have experience with Tools that do not require coding
> skills to extract data and text from PDFs?

The two main ones I'm familiar with for extracting text are Apache
PDFBox and pdftotext, but depending on what you want to do, you may
be better off converting to HTML (pdftohtml) or Excel.  There's also
PDF2Table, which takes the output of pdftohtml and does table
extraction on it for table-heavy documents.  You can also upload the
PDF to Google Docs and ask it to convert it for you.  Of course, Adobe
and others have commercial products which will do the same conversions.
If the PDF consists of scanned images, you'll need to OCR the images
to get the text/data back.
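
For anyone who does end up scripting this, here is a minimal sketch of
driving pdftotext from Python.  It assumes the pdftotext binary (from
Xpdf or Poppler) is installed and on the PATH.  The function names and
the 20-character threshold are my own illustrative choices, not part of
any of the tools above, and the "looks scanned" check is only a rough
heuristic for spotting PDFs that will need OCR.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None, layout=True):
    """Build the argument list for the pdftotext command-line tool."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")           # keep the physical layout (helps with tables)
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    cmd += [pdf_path, "-"]              # "-" sends the text to stdout
    return cmd


def looks_scanned(text, min_chars=20):
    """Rough heuristic (an assumption): almost no extractable text
    suggests the pages are scanned images that need OCR."""
    return len(text.strip()) < min_chars


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") on a hypothetical report.pdf and
passing the result to looks_scanned() gives a quick hint as to whether
the file has a text layer at all before you reach for an OCR tool.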

I haven't tried it yet, but Adobe recently introduced an online
service that will do all this, including OCR of images, for $20/yr.
http://blogs.adobe.com/acom/2011/05/introducing-adobe-export-pdf.html
I'd be very interested to hear what results people have had with it.

On Thu, Dec 20, 2012 at 6:27 AM, Tom Longley <tom at tacticaltech.org> wrote:

> I wrote this earlier in the year:
> http://drawingbynumbers.org/data-design-basics/note-3-opening-open-data#anchor-5
> There's some stuff about different tools and not much complimentary to
> say about OCR.

I enjoy much of Drawing by Numbers, but that page's treatment of OCR
and PDF could be improved. The section on OCR extrapolates from a
single data point (a skewed, poorly scanned, multi-generation copy of
an invoice, no less) to dismiss OCR as a whole.

The section on PDFs seems to imply a programmer is always required
(and doesn't mention what libraries the programmer would use).

Tom