[OpenSpending] Extracting data from PDFs

Thu Dec 20 11:59:18 UTC 2012

The National Library of Sweden is in the process of digitizing
government documents that are part of the legislative process (the
"SOU documents"). The raw PDF output is here:

http://regina.kb.se/sou/

I believe they have been OCRed but carry no structure. It would be of
value to a lot of people (from both an open data and an accessibility
perspective) if these documents could be enhanced with metadata and
structure (what is the title and other metadata? what are the headings
in the doc?).

Regards,

Peter

2012/12/20 Lucy Chambers <lucy.chambers at okfn.org>:
> Hi all,
>
> I figured you might be able to help. My colleague, Michael, is writing
> a course on Optical Character Recognition for the School of Data
> project.
>
> He's done the easy, nicely formatted PDFs. Now he's looking for some
> real-life, nasty examples of PDFs that people have to deal with.
> Probably scanned / photographed PDFs, or just really tricky PDFs so
> that we get a good difficulty scale across the course.
>
> Any pointers - very helpful, it's really nice to base these courses on
> real data that people have actually been grappling with!
>
> Lucy
>
> --
> Lucy Chambers
> Project Coordinator,
> School of Data & OpenSpending
> Open Knowledge Foundation
> Skype: lucyfediachambers
> Twitter: @lucyfedia