[OpenSpending] Extracting data from PDFs

Ivan Begtin ibegtin at gmail.com
Thu Dec 20 17:23:58 UTC 2012


Hi Lucy,
  we have lots of nasty old Russian texts inside PDFs, such as
http://istmat.info/node/18484 "Russian Empire statistical calendar".
They are really hard and nasty to work with.

Best Regards,
  Ivan Begtin


2012/12/20 Lucy Chambers <lucy.chambers at okfn.org>:
> Hi all,
>
> I figured you might be able to help. My colleague, Michael, is writing
> a course on Optical Character Recognition for the School of Data
> project.
>
> He's done the easy, nicely formatted PDFs. Now he's looking for some
> real-life, nasty examples of PDFs that people have to deal with.
> Probably scanned / photographed PDFs, or just really tricky PDFs so
> that we get a good difficulty scale across the course.
>
> Any pointers - very helpful, it's really nice to base these courses on
> real data that people have actually been grappling with!
>
> Lucy
>
> --
> Lucy Chambers
> Project Coordinator,
> School of Data & OpenSpending
> Open Knowledge Foundation
> Skype: lucyfediachambers
> Twitter: @lucyfedia
>
> _______________________________________________
> openspending mailing list
> openspending at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/openspending
> Unsubscribe: http://lists.okfn.org/mailman/options/openspending






