[OpenSpending] Extracting data from PDFs

Lucy Chambers lucy.chambers at okfn.org
Thu Dec 20 17:31:31 UTC 2012


Thank you to all for this - these are really useful examples and it is
good to make sure we check in with reality and see the different areas
in which these skills could be useful - so we'll keep these in our
back pocket as ammunition.

One question for David Cabo: how did you phrase / break down the task
for the crowdsourcing of the extraction from PDFs (+ fact-checking
afterwards)? This is something we should explore as a possible method
if people are unable to code or the quality of the docs is too poor.
Just very keen to know how you framed it and how you fact-checked it!

Thanks once again to all :)

Lucy

On Thu, Dec 20, 2012 at 2:04 PM, David Cabo <david.cabo at gmail.com> wrote:
>  Hi Lucy, Michael,
>
>  The Spanish Congress published last year for the first time the asset
> declarations for its members. They are forms scanned with varying levels of
> quality, and extracting the data is not easy because of the complex
> structure of the documents. Two random examples (from a set of 350):
>
> http://www.congreso.es/docbienes/leg10/000331/000331_000_e_0004777_20120209.pdf
> http://www.congreso.es/docbienes/leg10/000068/000068_000_e_0001434_20111223.pdf
>
>  At the time, we crowdsourced the manual parsing of the PDFs via Google
> Docs, so we have both the original PDFs and the extracted data, in case
> it is useful. (We had elections in December, so we have structured data
> for the old members, not the current ones.)
>
>  regards,
>
> /david
>
> On Thursday, December 20, 2012 at 12:17 PM, Lucy Chambers wrote:
>
> Hi all,
>
> I figured you might be able to help. My colleague, Michael, is writing
> a course on Optical Character Recognition for the School of Data
> project.
>
> He's done the easy, nicely formatted PDFs. Now he's looking for some
> real-life, nasty examples of PDFs that people have to deal with.
> Probably scanned / photographed PDFs, or just really tricky PDFs so
> that we get a good difficulty scale across the course.
>
> Any pointers would be very helpful - it's really nice to base these
> courses on real data that people have actually been grappling with!
>
> Lucy
>
> --
> Lucy Chambers
> Project Coordinator,
> School of Data & OpenSpending
> Open Knowledge Foundation
> Skype: lucyfediachambers
> Twitter: @lucyfedia
>
> _______________________________________________
> openspending mailing list
> openspending at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/openspending
> Unsubscribe: http://lists.okfn.org/mailman/options/openspending
>



-- 
Lucy Chambers
Project Coordinator,
School of Data & OpenSpending
Open Knowledge Foundation
Skype: lucyfediachambers
Twitter: @lucyfedia



