[OpenSpending] Extracting data from PDFs

David Cabo david.cabo at gmail.com
Thu Dec 20 14:04:25 UTC 2012


 Hi Lucy, Michael,

 Last year the Spanish Congress published, for the first time, the asset declarations of its members. They are scanned forms of varying quality, and extracting the data is not easy because of the complex structure of the documents. Two random examples (from a set of 350):

http://www.congreso.es/docbienes/leg10/000331/000331_000_e_0004777_20120209.pdf
http://www.congreso.es/docbienes/leg10/000068/000068_000_e_0001434_20111223.pdf

 At the time, we crowdsourced the manual parsing of the PDFs via Google Docs, so we have both the original PDFs and the extracted data, in case that is useful. (We had elections in December, so we have structured data for the old members, not the current ones.)

 regards,

/david 

On Thursday, December 20, 2012 at 12:17 PM, Lucy Chambers wrote:

> Hi all,
> 
> I figured you might be able to help. My colleague, Michael, is writing
> a course on Optical Character Recognition for the School of Data
> project.
> 
> He's done the easy, nicely formatted PDFs. Now he's looking for some
> real-life, nasty examples of PDFs that people have to deal with.
> Probably scanned / photographed PDFs, or just really tricky PDFs so
> that we get a good difficulty scale across the course.
> 
> Any pointers would be very helpful; it's really nice to base these courses on
> real data that people have actually been grappling with!
> 
> Lucy
> 
> -- 
> Lucy Chambers
> Project Coordinator,
> School of Data & OpenSpending
> Open Knowledge Foundation
> Skype: lucyfediachambers
> Twitter: @lucyfedia
> 

