[okfn-labs] scraping PDF files

Thomas Levine . at thomaslevine.com
Fri Dec 6 03:43:38 UTC 2013


Here's how I do it.
http://thomaslevine.com/!/parsing-pdfs

It's in shell, but you could call the shell bits from Python with os.system.

The PDF parsing in scraperwiki-python is just Python wrappers for pdftotext and pdftohtml.
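
A minimal sketch of that wrapping idea, assuming pdftotext (from poppler-utils) is installed and on the PATH. The function names are made up for illustration, and I've used subprocess rather than os.system so the text can be captured from stdout instead of going through a temporary file.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=True):
    """Build the argument list for a pdftotext call that writes to stdout.

    Hypothetical helper: "-layout" asks pdftotext to keep the physical
    layout of the page, and "-" as the output file means stdout.
    """
    command = ["pdftotext"]
    if layout:
        command.append("-layout")
    command.extend([pdf_path, "-"])
    return command


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext on one PDF and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH (install poppler-utils)")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        check=True,            # raise CalledProcessError on a non-zero exit
        capture_output=True,   # collect stdout/stderr instead of printing them
    )
    # Decode leniently: extracted PDF text is not always clean UTF-8.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling pdf_to_text("some-file.pdf") should then give you the same text that running pdftotext in the shell would, ready for whatever parsing comes next.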

Seth Woodworth <seth at sethish.com> wrote:
>If you're just looking to get a text dump of the PDF, I've had good
>results with uploading the document to Google Drive and then
>downloading the .txt version. Google Drive will also OCR embedded
>images in documents and do some other nifty stuff. It's not great, but
>it's usually good enough to use as a search index of PDFs.
>
>http://github.com/finalsclub/karmaworld contains the Google Drive code
>deep in its depths. I'd be happy to walk you through the hairy parts
>if you are interested.
>
>
>On Thu, Dec 5, 2013 at 4:12 PM, Tarek Amr <tarekamr at gmail.com> wrote:
>
>> I can see some tools mentioned here; one of them is the Pythonic PDFMiner:
>> https://github.com/okfn/ideas/issues/52
>>
>>
>>
>> On 5 December 2013 21:53, Alioune Dia <dia.aliounes at gmail.com> wrote:
>>
>>> I am looking for the best Python library for scraping a bunch of
>>> PDF files. I am currently focusing on the
>>> https://github.com/scraperwiki/scraperwiki-python library. Has
>>> anyone already experimented with it? Is there a more interesting
>>> library? Any help will be appreciated.
>>> --Ad
>>> _______________________________________________
>>> okfn-labs mailing list
>>> okfn-labs at lists.okfn.org
>>> http://lists.okfn.org/mailman/listinfo/okfn-labs
>>> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>>>
>>
>>
>>
>> --
>> Best Regards
>> Tarek Amr
>>
>> http://tarekamr.appspot.com/
>>
>>