[okfn-labs] scraping PDF files

Seth Woodworth seth at sethish.com
Thu Dec 5 21:30:27 UTC 2013


If you're just looking to get a text dump of the PDF, I've had good results
with uploading the document to Google Drive and then downloading the .txt
version.  Google Drive will also OCR embedded images in documents and do some
other nifty stuff.  The output isn't great, but it's usually good enough to
use as a search index of PDFs.
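
To illustrate the search-index idea, here is a minimal Python sketch that
assumes the .txt dumps have already been downloaded.  The function names
(tokenize, build_index, search) and the sample documents are made up for
illustration; this is not the karmaworld code, just a plain inverted index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document names that contain it.

    `docs` is a dict of {name: text}, e.g. the .txt dumps read from disk.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical text dumps; in practice, read the downloaded .txt files.
docs = {
    "budget.txt": "Annual budget report for the city council",
    "minutes.txt": "Minutes of the city council meeting",
}
index = build_index(docs)
print(search(index, "city council"))  # both documents mention these words
print(search(index, "budget"))        # only budget.txt mentions this word
```

Since OCR output tends to be noisy, exact word matching like this will miss
misrecognized words, but it is often enough to find the right document.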

http://github.com/finalsclub/karmaworld contains the Google Drive code deep
in its depths.  I'd be happy to walk you through the hairy parts if you
are interested.


On Thu, Dec 5, 2013 at 4:12 PM, Tarek Amr <tarekamr at gmail.com> wrote:

> I can see some tools mentioned here; one of them is the Pythonic PDFMiner:
> https://github.com/okfn/ideas/issues/52
>
>
>
> On 5 December 2013 21:53, Alioune Dia <dia.aliounes at gmail.com> wrote:
>
>> I am looking for the best Python library for scraping a bunch of PDF
>> files. I am currently focused on the
>> https://github.com/scraperwiki/scraperwiki-python library. Has anyone
>> already experimented with it? Is there a more interesting library? Any
>> help would be appreciated.
>> --Ad
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>>
>
>
>
> --
> Best Regards
> Tarek Amr
>
> http://tarekamr.appspot.com/