[School-of-data] PDF Extraction Tools

Ola Løvholm ola at lovholm.net
Thu Dec 20 12:12:04 UTC 2012


I'm currently working on a PDF extraction project for the NIME conference
series, using a combination of Python, Tika and BibTeX. Tika
(http://tika.apache.org/) is a great tool: calling it from the terminal
with the --text argument returns the text of the PDF file. The
work-in-progress code is available on GitHub:
https://github.com/olovholm/NIME. The main part of the code is in the
pdfextractor.py script.
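
For anyone who wants to try the Tika call from Python, here is a minimal sketch of how it might be wrapped. It is not taken from the NIME repository: the jar location (tika-app.jar), the helper names and the UTF-8 output encoding are all assumptions for illustration, and it presumes Java is installed and the Tika app jar has been downloaded.

```python
import subprocess
from pathlib import Path

# Hypothetical location of the downloaded Tika app jar; adjust as needed.
TIKA_JAR = Path("tika-app.jar")


def build_tika_command(pdf_path, jar_path=TIKA_JAR):
    """Return the argument list that runs Tika in plain-text mode."""
    return ["java", "-jar", str(jar_path), "--text", str(pdf_path)]


def extract_text(pdf_path, jar_path=TIKA_JAR):
    """Run Tika on one PDF and return the extracted text as a string."""
    result = subprocess.run(
        build_tika_command(pdf_path, jar_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if Tika exits non-zero
    )
    # Assumption: Tika wrote UTF-8; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("paper.pdf") would then return the text of that file, which can be passed on to whatever parsing comes next.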

- Ola Løvholm

2012/12/20 Michael Bauer <michael.bauer at okfn.org>

> Hi,
> For the next School of Data tutorial I would like to cover data extraction
> from PDFs (and text) as well as OCR.
> Does anyone here have experience with tools that do not require coding
> skills to extract data and text from PDFs?
> Which tools do you use?
> Michael
> --
> Data Wrangler with the Open Knowledge Foundation (OKFN.org)
> GPG/PGP key: http://tentacleriot.eu/mihi.asc
> Twitter: @mihi_tr Skype: mihi_tr

Ola Løvholm
MSc by Research in Digital Media and Culture

Webpage: http://www.lovholm.net
Skype/Twitter/Facebook: olovholm
Mobile (NO): (+47) 41925090