[OpenSpending] Extracting data from PDFs

Anders Pedersen anderspeders at gmail.com
Thu Dec 20 16:40:23 UTC 2012


Hi all,

The transparency group AlterEU has analysed the poorly handwritten
"Declaration of Financial Interest" forms filed by all MEPs in the European
Parliament. Their report from last July even includes screenshots of some
of the worst filings:
http://www.alter-eu.org/sites/default/files/documents/transparency_in_the_european_parliament_july2012.pdf

Another, friendlier example is the Belgian state register, which offers
annual reports as PDFs. Below is the 2010 annual report from Exxon Belgium:
https://www.dropbox.com/s/q27xp6tr4va6bt7/Exxon.pdf

Cheers,
Anders

On Thu, Dec 20, 2012 at 3:52 PM, Nuno Moniz <nunompmoniz at gmail.com> wrote:

> PDF tables are a nightmare to parse.
>
> I haven't worked with OCR, but if there's room for input on parsing
> PDFs: a large part of my master's thesis was developing a system capable
> of extracting structure, text and entities from Portuguese legislation
> (example: http://dre.pt/pdfgratis/2012/12/24600.pdf).
>
> Cheers.
> Nuno
>
> 2012/12/20 Lucia Mazzoni <lucia at spippola.it>
>
>>  On 20 December 2012 12:17, Lucy Chambers <lucy.chambers at okfn.org> wrote:
>>
>>> Hi all,
>>>
>>> I figured you might be able to help. My colleague, Michael, is writing
>>> a course on Optical Character Recognition for the School of Data
>>> project.
>>>
>>> He's done the easy, nicely formatted PDFs. Now he's looking for some
>>> real-life, nasty examples of PDFs that people have to deal with.
>>> Probably scanned / photographed PDFs, or just really tricky PDFs so
>>> that we get a good difficulty scale across the course.
>>>
>>> Any pointers would be very helpful. It's really nice to base these
>>> courses on real data that people have actually been grappling with!
>>>
>>>
>> Hi,
>> these are just two small examples.
>>
>> In Italy, our public institutions usually publish the results of tenders
>> this way:
>>
>> http://www.ponrec.it/media/137519/585-ric_28set12_graduatoria-smart-cities.pdf (the
>> worst one)
>> or this way
>>
>> http://www.ponrec.it/media/91323/elenco_idee_progettuali_approvate__d.d.84_ric._del_2marzo2012.pdf (the
>> better one)
>>
>> Both are terrible if I need to work with the data.
>> Hope this helps
>> Lucia
>>
>> _______________________________________________
>> openspending mailing list
>> openspending at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/openspending
>> Unsubscribe: http://lists.okfn.org/mailman/options/openspending
>>
>>