[okfn-labs] Questions on data scraping tables within "picture" pdfs.

Marian Steinbach marian at sendung.de
Thu Feb 7 15:21:14 UTC 2013


Hi Hans!

This doesn't really answer your actual questions, but are you aware of this
article?

http://www.propublica.org/nerds/item/image-to-text-ocr-and-imagemagick

In practice it might be more difficult, since scanned documents tend to be
slightly rotated.

Marian



2013/2/7 Hans Thompson <hans.thompson1 at gmail.com>

>
> Hello open data crusaders. I hope I am properly following the mailing list
> rules as a newcomer and programming neophyte (some conversational R and
> learning python at the moment).
>
> I want to build a microtasking project to take pdf "pictures" of tables
> and break them into rows and columns.  This way each cell can be a
> transcription task with a cell identity.
>
> I've thought a lot about how to do this with R (because, in my personal
> experience, a superior QC process could be implemented more easily) but it
> lacks the kind of picture manipulation tools that I suppose already exist
> for python etc.
>
> My question:  could pybossa be used to return the row and column positions
> in an image array from a user's click? The user could click on each
> boundary between rows and between columns, and so split the table picture
> into a table of pictures.
>
> Does a better tool exist for this type of task?
>
> Thanks.
> Hans Thompson
>
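
For illustration, the splitting step Hans describes can be sketched in plain
Python. This is only a sketch under assumptions: the function name
cell_boxes and its parameters are hypothetical, the click positions are
assumed to arrive as pixel coordinates of the column and row boundaries, and
nothing here is part of pybossa. Each box uses the (left, upper, right,
lower) order, which is the order an image library such as Pillow expects for
cropping.

```python
def cell_boxes(width, height, x_splits, y_splits):
    """Map (row, col) to a crop box (left, upper, right, lower).

    x_splits / y_splits are hypothetical inputs: the pixel positions a
    user clicked to mark column and row boundaries. Duplicates and
    clicks on or outside the image edge are ignored.
    """
    xs = [0] + sorted({x for x in x_splits if 0 < x < width}) + [width]
    ys = [0] + sorted({y for y in y_splits if 0 < y < height}) + [height]

    boxes = {}
    for row in range(len(ys) - 1):
        for col in range(len(xs) - 1):
            # Each cell spans from one boundary to the next on both axes.
            boxes[(row, col)] = (xs[col], ys[row], xs[col + 1], ys[row + 1])
    return boxes


# Example: a 300x200 image with two column boundaries and one row boundary.
boxes = cell_boxes(300, 200, x_splits=[100, 200], y_splits=[50])
```

Every (row, col) key could then serve as the cell identity of one
transcription task, with the box used to cut that cell out of the scan.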