[okfn-labs] scraping PDF files
Enric Garcia Torrents
enricgarcia at uoc.edu
Fri Dec 6 04:20:57 UTC 2013
You can also use a bash (Unix shell) script if you are on Linux or OS X:
for file in *.pdf ; do pdftotext -layout "$file" ; done
Enric G. Torrents
Email: e.g.cn at ieee.org
Tel.: +8613122141470
Skype: torrents.enric
cn.linkedin.com/in/enrictorrents/
--- Original message from Thomas Levine to Seth Woodworth, Tarek Amr, with a copy to okfn-labs, sent on 06.12.2013 04:43
Here's how I do it.
http://thomaslevine.com/!/parsing-pdfs
It's in shell, but you could call the shell bits from Python with os.system.
The PDF parsing in scraperwiki-python is just Python wrappers for pdftotext and pdftohtml.
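As a rough sketch of calling the shell bits from Python: the snippet below runs pdftotext -layout on every PDF in a directory. It uses subprocess.run instead of os.system so that filenames with spaces need no quoting. The function names pdftotext_command and convert_all are made up for illustration, and the sketch assumes pdftotext is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for one pdftotext call (hypothetical helper)."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")  # keep the physical layout of the page
    command.append(str(pdf_path))
    return command


def convert_all(directory="."):
    """Run pdftotext on every *.pdf in `directory`; return the expected .txt paths."""
    outputs = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext exits with an error
        subprocess.run(pdftotext_command(pdf_path), check=True)
        # With no output name given, pdftotext writes file.txt next to file.pdf
        outputs.append(pdf_path.with_suffix(".txt"))
    return outputs
```

Calling convert_all("some/folder") should leave one .txt file beside each PDF, much like a shell loop over *.pdf would.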
Seth Woodworth wrote: If you're just looking to get a text dump of the PDF, I've had good results with uploading the document to Google Drive and then downloading the .txt version. Google Drive will also OCR embedded images in documents and do some other nifty stuff. It's not great, but it's usually good enough to use as a search index of PDFs. http://github.com/finalsclub/karmaworld contains the Google Drive code deep in its depths. I'd be happy to walk you through the hairy parts if you are interested.
On Thu, Dec 5, 2013 at 4:12 PM, Tarek Amr wrote:
I can see some tools mentioned here; one of them is the Pythonic PDFMiner: https://github.com/okfn/ideas/issues/52
On 5 December 2013 21:53, Alioune Dia wrote:
I am looking for the best Python library for scraping a bunch of
PDF files. I am currently focusing on the
https://github.com/scraperwiki/scraperwiki-python library. Has anyone
already experimented with it? Is there a more interesting library? Any
help will be appreciated.
--Ad
_______________________________________________
okfn-labs mailing list
okfn-labs at lists.okfn.org
http://lists.okfn.org/mailman/listinfo/okfn-labs
Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
--
Best Regards
Tarek Amr
http://tarekamr.appspot.com/