[okfn-labs] scrapping PDF files

Matthew Fullerton matt.fullerton at gmail.com
Sun Dec 8 20:26:34 UTC 2013


Hi,
I've only used scraperwiki, with good results for what I was doing. Please
report back on which tool works best for you. I'm sure I'm not the only one
who would be interested to hear about your experience!

-Matt


On 6 December 2013 19:30, Alioune Dia <dia.aliounes at gmail.com> wrote:

> Hi all,
>
> For raw text scraping, it seems that many tools give good results. I
> have tried PDFMiner in the past and got good results. The problem is
> table scraping: with many tools I ran into problems such as
> multi-line rows, useless information, empty lines, etc. With
> ScraperWiki I am facing the same problems, so I am wondering whether
> the problem lies in my PDF files.
>
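
The multi-line rows and empty lines mentioned above can often be cleaned up
after extraction. Below is a minimal sketch, assuming the table has already
been extracted as a list of rows (lists of cell strings) and that a
continuation line can be recognised by an empty first cell. Both the function
name clean_rows and that heuristic are assumptions for illustration; real PDFs
may need a different rule.

```python
def clean_rows(rows):
    """Drop empty rows and fold continuation rows into the row above.

    Assumption: a row whose first cell is blank continues the previous row.
    """
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip completely empty lines
        if cleaned and cells[0] == "":
            # Continuation line: append its text to the matching cells
            # of the previous row.
            prev = cleaned[-1]
            for i, cell in enumerate(cells):
                if cell and i < len(prev):
                    prev[i] = (prev[i] + " " + cell).strip()
            continue
        cleaned.append(cells)
    return cleaned


# Hypothetical extracted rows, including a wrapped cell and an empty line.
rows = [
    ["Alice", "long description"],
    ["", "continued"],
    ["", ""],
    ["Bob", "short"],
]
result = clean_rows(rows)
```

Rows that break the "empty first cell" assumption (for example a wrapped first
column) would need a rule based on cell positions instead.
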
> Also, thanks for putting up this wiki,
> https://github.com/okfn/ideas/issues/52. I found it very
> interesting. I will test all the libraries listed there.
>
> --Ad
>
> 2013/12/6 Michael Bauer <michael.bauer at okfn.org>:
> > Hi,
> >
> > I've done scraping with scraperwiki's pdftoxml. It works quite well for
> > what it is (a workaround that converts PDF to another format that's
> > easier to parse).
> >
> > What I generally do is convert the PDF to XML, then figure out where the
> > text I need is and write XPath expressions for it.
> >
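
A rough sketch of this convert-then-XPath workflow, using only the Python
standard library. The sample XML is made up to resemble the kind of output a
PDF-to-XML converter produces (page elements containing text elements with
top/left coordinates). The function name texts_in_column and the coordinate
values are assumptions for illustration. A real document would need its own
XPath expressions, worked out by inspecting the converted XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for converter output. With a real file,
# the XML string would come from the converter instead (for example
# scraperwiki.pdftoxml(pdf_bytes), assuming that library is installed).
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="50" width="80" height="12" font="0">Name</text>
<text top="100" left="300" width="60" height="12" font="0">Amount</text>
<text top="120" left="50" width="80" height="12" font="1">Alice</text>
<text top="120" left="300" width="60" height="12" font="1"><b>42</b></text>
</page>
</pdf2xml>"""


def texts_in_column(xml_string, left):
    """Return the text of every <text> element at the given left offset."""
    root = ET.fromstring(xml_string)
    # ElementTree supports a limited XPath subset, including
    # attribute-equality predicates like the one used here.
    nodes = root.findall(".//page/text[@left='%s']" % left)
    # itertext() also picks up text nested in inline tags such as <b>.
    return ["".join(node.itertext()).strip() for node in nodes]


amounts = texts_in_column(SAMPLE_XML, 300)
```

Matching on an exact left offset is brittle. A library with full XPath
support, or a range check on the coordinate, may work better for real files.
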
> > Happy to give more detailed instructions if needed.
> >
> > Michael
> >
> > On Thu, Dec 05, 2013 at 08:53:39PM +0100, Alioune Dia wrote:
> >> I am looking for the best Python library for scraping a bunch of
> >> PDF files. I am currently focusing on the
> >> https://github.com/scraperwiki/scraperwiki-python library. Has anyone
> >> already tried it? Is there a better library? Any
> >> help will be appreciated.
> >> --Ad
> >> _______________________________________________
> >> okfn-labs mailing list
> >> okfn-labs at lists.okfn.org
> >> http://lists.okfn.org/mailman/listinfo/okfn-labs
> >> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
> >
> > --
> > Data Diva | skype: mihi_tr | @mihi_tr
> > The Open Knowledge Foundation | School of Data
> > http://okfn.org | http://schoolofdata.org
> > GPG/PGP key: http://tentacleriot.eu/mihi.asc