[OpenSpending] Extracting data from PDFs

Diego de la Mora diego at fundar.org.mx
Thu Dec 20 18:42:18 UTC 2012


Mexico's federal budgets are published as PDFs structured in ways that make
them difficult to work with.

Here are some examples:
Health budget:
http://www.apartados.hacienda.gob.mx/presupuesto/temas/ppef/2013/temas/tomos/12/r12_afpe.pdf
Education budget:
http://www.apartados.hacienda.gob.mx/presupuesto/temas/ppef/2013/temas/tomos/11/r11_afpe.pdf

The Budget Decree is a scanned PDF (we are waiting for the 2013 Decree to
be published tomorrow or on Saturday, and it would be fantastic to have it in
a usable format).

Best,
D


2012/12/20 Nuno Moniz <nunompmoniz at gmail.com>

> PDF tables are a nightmare to parse.
>
> I haven't worked with OCR, but if there's room for input on parsing PDFs,
> a large part of my master's thesis was developing a system capable of
> extracting structure, text and entities from Portuguese legislation
> (example: http://dre.pt/pdfgratis/2012/12/24600.pdf)
>
> Cheers.
> Nuno
>
> 2012/12/20 Lucia Mazzoni <lucia at spippola.it>
>
>> On 20 December 2012 12:17, Lucy Chambers <lucy.chambers at okfn.org> wrote:
>>
>>> Hi all,
>>>
>>> I figured you might be able to help. My colleague, Michael, is writing
>>> a course on Optical Character Recognition for the School of Data
>>> project.
>>>
>>> He's done the easy, nicely formatted PDFs. Now he's looking for some
>>> real-life, nasty examples of PDFs that people have to deal with.
>>> Probably scanned / photographed PDFs, or just really tricky PDFs so
>>> that we get a good difficulty scale across the course.
>>>
>>> Any pointers would be very helpful. It's really nice to base these
>>> courses on real data that people have actually been grappling with!
>>>
>>>
>> Hi,
>> these are just two small examples.
>>
>> In Italy our public institutions usually publish the results of tenders
>> this way:
>>
>> http://www.ponrec.it/media/137519/585-ric_28set12_graduatoria-smart-cities.pdf (the
>> worst one)
>> or this way
>>
>> http://www.ponrec.it/media/91323/elenco_idee_progettuali_approvate__d.d.84_ric._del_2marzo2012.pdf (the
>> better one)
>>
>> Both are terrible if I need to work with the data.
>> Hope this helps
>> Lucia
>>
>> _______________________________________________
>> openspending mailing list
>> openspending at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/openspending
>> Unsubscribe: http://lists.okfn.org/mailman/options/openspending
>>
>>
>


-- 
Diego de la Mora Maurer
Área de Presupuestos y Políticas Públicas
Fundar, Centro de Análisis e Investigación
www.fundar.org.mx
Tel. + 52 (55) 55543001 x 119
Cel. 04455 3223 2797
@diegodelam