[okfn-labs] scrapping PDF files

Alioune Dia dia.aliounes at gmail.com
Mon Dec 9 12:05:43 UTC 2013


Data wrangling is a hard job :)

2013/12/9 Michael Bauer <michael.bauer at okfn.org>:
> Alioune,
>
> On Fri, Dec 06, 2013 at 07:30:23PM +0100, Alioune Dia wrote:
>> Hi all,
>>
>> For plain-text scraping, it seems that many tools give good
>> results. I experimented with PDFMiner in the past and got good
>> results. The real problem is table scraping: with many tools I
>> ran into problems such as multi-line rows, non-useful
>> information, empty lines, etc. With ScraperWiki, I am also
>
> Yes, this is always a problem. When doing PDF scraping I tend to do a
> clean-up of the data afterwards. It is always messy, no matter what tool
> you use.
>
> Michael
>
> --
> Data Diva | skype: mihi_tr | @mihi_tr
> The Open Knowledge Foundation | School of Data
> http://okfn.org | http://schoolofdata.org
> GPG/PGP key: http://tentacleriot.eu/mihi.asc
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
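
The clean-up pass mentioned above might look roughly like the sketch
below. It is only an illustration, not anyone's actual script: the
function name clean_rows is hypothetical, and it assumes the table has
already been extracted into a list of rows (lists of cell strings) and
that a row with a blank first cell continues the row above it. It
handles empty lines and multi-line rows only; filtering out non-useful
information depends too much on the specific document to sketch here.

```python
def clean_rows(rows):
    """Drop empty rows and fold continuation rows into the row above.

    Assumption (not from the thread): a row whose first cell is blank
    is the continuation of the previous row.
    """
    cleaned = []
    for row in rows:
        # Normalise cells: treat None as empty, trim whitespace.
        cells = [(cell or "").strip() for cell in row]

        # Skip rows where every cell is empty.
        if not any(cells):
            continue

        # A blank first cell marks a continuation of the previous row.
        if cleaned and not cells[0]:
            prev = cleaned[-1]
            # Pad both rows to the same width before merging.
            width = max(len(prev), len(cells))
            prev.extend([""] * (width - len(prev)))
            cells.extend([""] * (width - len(cells)))
            for i, cell in enumerate(cells):
                if cell:
                    prev[i] = (prev[i] + " " + cell).strip()
            continue

        cleaned.append(cells)
    return cleaned


if __name__ == "__main__":
    raw = [
        ["Name", "Description"],
        ["", ""],
        ["A", "first line"],
        ["", "second line"],
        ["B", "x"],
    ]
    for row in clean_rows(raw):
        print(row)
```

The continuation rule is the part most likely to need changing for a
given PDF; some tables mark wrapped rows differently, so treat it as a
starting point.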


