[okfn-labs] scraping PDF files

Michael Bauer michael.bauer at okfn.org
Fri Dec 6 08:51:29 UTC 2013


Hi,

I've done scraping with ScraperWiki's pdftoxml. It works quite well for what
it is (a workaround that converts the PDF to another format that's easier to
parse).

What I generally do is convert the PDF to XML, then figure out where the
text I need is located and write XPath expressions to select it.
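
A minimal sketch of that workflow, under some assumptions: the sample XML
below is made up, but modelled on the pdftohtml-style layout (page elements
containing text elements with top/left/font attributes) that I'd expect
pdftoxml to return. The column positions, font ids and the column_text
helper are hypothetical. In practice you would get the XML string from
something like scraperwiki.pdftoxml(pdfdata) and read the real attribute
values off that output first.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the XML string produced from a PDF.
# Real output will have different coordinates and font ids.
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="50" width="80" height="12" font="0">Name</text>
    <text top="100" left="300" width="60" height="12" font="0">Amount</text>
    <text top="120" left="50" width="80" height="12" font="1">Alice</text>
    <text top="120" left="300" width="60" height="12" font="1">42</text>
    <text top="140" left="50" width="80" height="12" font="1">Bob</text>
    <text top="140" left="300" width="60" height="12" font="1">17</text>
  </page>
</pdf2xml>"""


def column_text(xml_string, left, font):
    """Return the text of every <text> node at a given left offset and font."""
    root = ET.fromstring(xml_string)
    # ElementTree supports a subset of XPath, including attribute predicates.
    xpath = ".//page/text[@left='%s'][@font='%s']" % (left, font)
    # itertext() also picks up text nested in inline tags such as <b>.
    return ["".join(node.itertext()).strip() for node in root.findall(xpath)]


# Body rows use font 1 in this sample; the header row uses font 0.
names = column_text(SAMPLE_XML, left="50", font="1")
amounts = column_text(SAMPLE_XML, left="300", font="1")
rows = list(zip(names, amounts))
print(rows)
```

Selecting on position and font attributes tends to work when the PDF has a
regular layout. For messier documents the exact-match predicates would
probably need to be replaced by a range check done in Python.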

Happy to give more detailed instructions if needed.

Michael

On Thu, Dec 05, 2013 at 08:53:39PM +0100, Alioune Dia wrote:
> I am looking for the best Python library for scraping a bunch of
> PDF files. I am currently focusing on the
> https://github.com/scraperwiki/scraperwiki-python library. Has anyone
> already tried it? Is there a more interesting library? Any
> help will be appreciated.
> --Ad

-- 
Data Diva | skype: mihi_tr | @mihi_tr
The Open Knowledge Foundation | School of Data
http://okfn.org | http://schoolofdata.org 
GPG/PGP key: http://tentacleriot.eu/mihi.asc


