[okfn-labs] scraping PDF files

Enric Garcia Torrents enricgarcia at uoc.edu
Fri Dec 6 04:20:57 UTC 2013

You can also use a bash (Unix shell) script if you are on Linux or OS X:

for file in *.pdf ; do pdftotext -layout "$file" ; done

Enric G. Torrents
Email: e.g.cn at ieee.org
Tel.: +8613122141470
Skype: torrents.enric

--- Original message from Thomas Levine to Seth Woodworth, Tarek Amr, with a copy to okfn-labs, sent on 06.12.2013 04:43

Here's how I do it.

It's in shell, but you could call the shell bits from Python with os.system.

The PDF parsing in scraperwiki-python is just Python wrappers for pdftotext and pdftohtml.
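
As a rough illustration of calling the shell bits from Python, here is a minimal sketch. It assumes pdftotext (from poppler-utils or xpdf) is installed and on the PATH. The helper names pdftotext_command and convert_directory are made up for this example, and it uses subprocess.run rather than os.system so that file names with spaces are passed through safely.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for one pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout
        cmd.append("-layout")
    cmd.append(str(pdf_path))
    return cmd


def convert_directory(directory="."):
    """Run pdftotext on every *.pdf in `directory`.

    Returns the paths where pdftotext is expected to write its output:
    with no explicit output argument it writes a .txt file next to each PDF.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    outputs = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext exits non-zero
        subprocess.run(pdftotext_command(pdf), check=True)
        outputs.append(pdf.with_suffix(".txt"))
    return outputs
```

Calling convert_directory(".") should behave much like the shell loop earlier in this thread. Passing an argument list instead of a single command string avoids shell quoting problems.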

Seth Woodworth wrote: If you're just looking to get a text dump of the PDF, I've had good results with uploading the document to Google Drive and then downloading the .txt version. Google Drive will also OCR embedded images in documents and do some other nifty stuff. It's not great, but it's usually good enough to use as a search index of PDFs. http://github.com/finalsclub/karmaworld contains the Google Drive code deep in its depths. I'd be happy to walk you through the hairy parts if you are interested.

On Thu, Dec 5, 2013 at 4:12 PM, Tarek Amr wrote:
I can see some tools mentioned here, one of which is the Pythonic PDFMiner: https://github.com/okfn/ideas/issues/52

On 5 December 2013 21:53, Alioune Dia wrote:
I am looking for the best Python library for scraping a bunch of
PDF files. I am currently focusing on the
https://github.com/scraperwiki/scraperwiki-python library. Has anyone
already tried it? Is there a better library? Any help will be
appreciated.
okfn-labs mailing list
okfn-labs at lists.okfn.org
Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs

Best Regards
Tarek Amr
