[OpenSpending] Extracting data from PDFs

Nuno Moniz nunompmoniz at gmail.com
Thu Dec 20 14:52:06 UTC 2012


PDF tables are a nightmare to parse.

I haven't worked with OCR, but if there's room for input on parsing PDFs:
a large part of my master's thesis was developing a system capable of
extracting structure, text and entities from Portuguese legislation
(example: http://dre.pt/pdfgratis/2012/12/24600.pdf).

Cheers.
Nuno

2012/12/20 Lucia Mazzoni <lucia at spippola.it>

> On 20 December 2012 12:17, Lucy Chambers <lucy.chambers at okfn.org> wrote:
>
>> Hi all,
>>
>> I figured you might be able to help. My colleague, Michael, is writing
>> a course on Optical Character Recognition for the School of Data
>> project.
>>
>> He's done the easy, nicely formatted PDFs. Now he's looking for some
>> real-life, nasty examples of PDFs that people have to deal with.
>> Probably scanned / photographed PDFs, or just really tricky PDFs so
>> that we get a good difficulty scale across the course.
>>
>> Any pointers would be very helpful; it's really nice to base these
>> courses on real data that people have actually been grappling with!
>>
>>
> Hi,
> these are just two small examples.
>
> In Italy, our public institutions usually publish the results of tenders
> this way:
>
> http://www.ponrec.it/media/137519/585-ric_28set12_graduatoria-smart-cities.pdf (the
> worse one)
> or this way
>
> http://www.ponrec.it/media/91323/elenco_idee_progettuali_approvate__d.d.84_ric._del_2marzo2012.pdf (the
> better one)
>
> Both are terrible if I need to manage the data.
> Hope this helps
> Lucia