[okfn-labs] scraping PDF files

Alioune Dia dia.aliounes at gmail.com
Fri Dec 6 18:30:23 UTC 2013


Hi all,

For plain-text scraping, it seems that many tools give good results.
I tried PDFMiner in the past and it worked well. The problem is
table scraping: with many tools I ran into problems such as
multi-line rows, useless information, empty lines, etc. With
ScraperWiki I am facing the same problems, so I am wondering whether
the problems are related to my PDF files.
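
To make the table problems concrete, here is a minimal sketch of the
kind of cleanup I mean, assuming the table has already been extracted
as a list of rows (lists of cell strings). The function name, the
sample rows, and the rule "a row with an empty first cell continues
the row above" are my own assumptions, not something any of the tools
provides:

```python
def clean_table_rows(rows):
    """Drop empty rows and fold continuation rows into the row above."""
    cleaned = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip completely empty lines
        # Assumption: an empty first cell marks a continuation of the
        # previous row (a "multi-line row").
        if cleaned and cells[0] == "":
            previous = cleaned[-1]
            while len(previous) < len(cells):
                previous.append("")
            for i, cell in enumerate(cells):
                if cell:
                    previous[i] = (previous[i] + " " + cell).strip()
            continue
        cleaned.append(cells)
    return cleaned


# Made-up sample rows, only to illustrate the two problems.
rows = [
    ["Name", "Notes"],
    ["A", "first line"],
    ["", "second line"],
    ["", ""],
    ["B", "single"],
]
print(clean_table_rows(rows))
```

The continuation rule is only a heuristic; a table whose first column
is legitimately empty in some rows would need a different test.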

Also, thanks for putting up this wiki
https://github.com/okfn/ideas/issues/52 ; I found it very
interesting. I will test all the libraries listed there.

--Ad

2013/12/6 Michael Bauer <michael.bauer at okfn.org>:
> Hi,
>
> I've done scraping with ScraperWiki's pdftoxml. It works quite well for what
> it is (a workaround that converts PDF to another format that's easier to parse).
>
> What I generally do is convert it to XML, then figure out where the text
> I need is and write XPath expressions for that.
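
If I understand the approach correctly, it could be sketched like
this. The sample XML is made up; I am assuming the converted output
is a list of <text> elements carrying "top" and "left" position
attributes, as pdftohtml -xml produces. The function names are
hypothetical, and the XPath subset is the one supported by Python's
xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed pdftohtml -xml layout.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="80" height="12" font="0">Name</text>
<text top="100" left="300" width="60" height="12" font="0">Value</text>
<text top="120" left="50" width="40" height="12" font="0">alpha</text>
<text top="120" left="300" width="20" height="12" font="0">1</text>
</page>
</pdf2xml>"""


def texts_in_column(xml_string, left):
    """Select every text node that starts at the given left offset."""
    root = ET.fromstring(xml_string)
    nodes = root.findall(".//text[@left='%s']" % left)
    return ["".join(node.itertext()).strip() for node in nodes]


def rows_by_top(xml_string):
    """Group text nodes sharing the same top offset into table rows."""
    root = ET.fromstring(xml_string)
    rows = {}
    for node in root.iter("text"):
        cell = (int(node.get("left")), "".join(node.itertext()).strip())
        rows.setdefault(int(node.get("top")), []).append(cell)
    # Sort rows top to bottom, and cells left to right within a row.
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items())]


print(texts_in_column(SAMPLE_XML, 50))
print(rows_by_top(SAMPLE_XML))
```

Grouping on exact "top" values is fragile: real PDFs often place the
cells of one row a pixel or two apart, so a tolerance may be needed.
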
>
> Happy to give more detailed instructions if needed.
>
> Michael
>
> On Thu, Dec 05, 2013 at 08:53:39PM +0100, Alioune Dia wrote:
>> I am looking for the best Python library for scraping a bunch of
>> PDF files. I am currently focusing on the
>> https://github.com/scraperwiki/scraperwiki-python library. Has anyone
>> already tried it? Is there a more interesting library? Any
>> help will be appreciated.
>> --Ad
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>
> --
> Data Diva | skype: mihi_tr | @mihi_tr
> The Open Knowledge Foundation | School of Data
> http://okfn.org | http://schoolofdata.org
> GPG/PGP key: http://tentacleriot.eu/mihi.asc
