[OpenSpending] Extracting data from PDFs

Diego de la Mora diego at fundar.org.mx
Thu Dec 20 23:05:41 UTC 2012


And the Budget Decree:


First part: http://gaceta.diputados.gob.mx/PDF/62/2012/dic/20121220-A.pdf

Second: http://gaceta.diputados.gob.mx/PDF/62/2012/dic/20121220-B.pdf


Best,

D

2012/12/20 Diego de la Mora <diego at fundar.org.mx>

> Mexico Federal Budgets are presented in PDFs structured in ways that make
> them difficult to work with.
>
> Here are some examples:
> Health Budget
>
> http://www.apartados.hacienda.gob.mx/presupuesto/temas/ppef/2013/temas/tomos/12/r12_afpe.pdf
> Education:
>
> http://www.apartados.hacienda.gob.mx/presupuesto/temas/ppef/2013/temas/tomos/11/r11_afpe.pdf
>
> The Budget Decree is a scanned PDF (we are waiting for the 2013 Decree to
> be published tomorrow or on Saturday, and it would be fantastic to have it
> in a usable format).
>
> Best,
> D
>
>
> 2012/12/20 Nuno Moniz <nunompmoniz at gmail.com>
>
>> PDF tables are a nightmare to parse.
>>
>> I haven't worked with OCR, but if there's room for input on parsing
>> PDFs: a large part of my master's thesis was developing a system capable
>> of extracting structure, text and entities from Portuguese legislation
>> (example: http://dre.pt/pdfgratis/2012/12/24600.pdf)
>>
>> Cheers.
>> Nuno
>>
>> 2012/12/20 Lucia Mazzoni <lucia at spippola.it>
>>
>>> On 20 December 2012 12:17, Lucy Chambers <lucy.chambers at okfn.org> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I figured you might be able to help. My colleague, Michael, is writing
>>>> a course on Optical Character Recognition for the School of Data
>>>> project.
>>>>
>>>> He's done the easy, nicely formatted PDFs. Now he's looking for some
>>>> real-life, nasty examples of PDFs that people have to deal with.
>>>> Probably scanned / photographed PDFs, or just really tricky PDFs so
>>>> that we get a good difficulty scale across the course.
>>>>
>>>> Any pointers would be very helpful; it's really nice to base these
>>>> courses on real data that people have actually been grappling with!
>>>>
>>>>
>>> Hi,
>>> these are just two small examples.
>>>
>>> In Italy, our public institutions usually publish the results of tenders
>>> this way:
>>>
>>> http://www.ponrec.it/media/137519/585-ric_28set12_graduatoria-smart-cities.pdf (the
>>> worse one)
>>> or this way
>>>
>>> http://www.ponrec.it/media/91323/elenco_idee_progettuali_approvate__d.d.84_ric._del_2marzo2012.pdf (the
>>> better one)
>>>
>>> Both are terrible if I need to work with the data.
>>> Hope this helps
>>> Lucia
>>>
>>> _______________________________________________
>>> openspending mailing list
>>> openspending at lists.okfn.org
>>> http://lists.okfn.org/mailman/listinfo/openspending
>>> Unsubscribe: http://lists.okfn.org/mailman/options/openspending
>>>
>>>
>>
>>
>>
>
>
> --
> Diego de la Mora Maurer
> Área de Presupuestos y Políticas Públicas
> Fundar, Centro de Análisis e Investigación
> www.fundar.org.mx
> Tel. + 52 (55) 55543001 x 119
> Cel. 04455 3223 2797
> @diegodelam
>


