[okfn-labs] Questions on data scraping tables within "picture" pdfs.

Tom Morris tfmorris at gmail.com
Thu Feb 7 16:12:48 UTC 2013


If your goal is to get the data, services like Captricity have this as a
standard offering.

If you actually want to build software to do this, I'd recommend using
something like OpenCV to generate a guess at segmentation and then have the
users either approve or correct it. There's a working code example for a
similar thing here:
http://stackoverflow.com/questions/10196198/how-to-remove-convexity-defects-in-sudoku-square/10226971#10226971
(scroll down for the Python version)
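
To give a feel for the "guess" step, here is a minimal sketch. It is not the
code from the linked answer and it does not use OpenCV: it is plain Python
standing in for the idea, and the names find_gaps and guess_grid are made up
for illustration. It assumes the page has already been thresholded to a grid
of 0 (blank) and 1 (ink), that the table is not skewed, and that rows and
columns are separated by blank space rather than ruled lines.

```python
def find_gaps(profile, min_gap=1):
    """Return the midpoint of each interior run of zeros in a profile.

    A run of zeros is a blank band between two rows (or columns).
    Leading and trailing blank margins are skipped.
    """
    cuts = []
    start = None
    for i, value in enumerate(profile):
        if value == 0 and start is None:
            start = i                      # a blank run begins
        elif value != 0 and start is not None:
            # The run ended at i - 1; keep it only if it is not the top/left margin.
            if start > 0 and i - start >= min_gap:
                cuts.append((start + i - 1) // 2)
            start = None
    return cuts                            # a trailing margin never closes, so it is ignored


def guess_grid(binary):
    """Guess row and column cut positions from ink counts per row and column."""
    row_profile = [sum(row) for row in binary]
    col_profile = [sum(col) for col in zip(*binary)]
    return find_gaps(row_profile), find_gaps(col_profile)


# Tiny synthetic "page": a 2 x 2 table of ink blobs.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
row_cuts, col_cuts = guess_grid(page)
print(row_cuts, col_cuts)
```

The cut positions are only a proposal. The point of showing them to users is
that scans with skew, ruled lines or merged cells will break a guess this
naive, and a person can approve or move the cuts.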

OpenCV has Python bindings and, yes, you could use PyBossa to build this
type of service (although you'd probably have to host it yourself if you
wanted to make use of third party libraries such as OpenCV).
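
Once the cuts are approved, turning them into one transcription task per
cell is mostly bookkeeping. A rough sketch, assuming pixel coordinates with
the origin at the top left; the function name cells_from_cuts and the
"cell_id" and "box" keys are invented here and are not part of PyBossa:

```python
def cells_from_cuts(width, height, row_cuts, col_cuts):
    """Build one task description per table cell from approved cut positions.

    Each task carries a cell identity and a (left, top, right, bottom) box
    that could later be used to crop the cell out of the page image.
    """
    ys = [0] + sorted(row_cuts) + [height]
    xs = [0] + sorted(col_cuts) + [width]
    tasks = []
    for r in range(len(ys) - 1):
        for c in range(len(xs) - 1):
            tasks.append({
                "cell_id": "r%dc%d" % (r, c),
                "box": (xs[c], ys[r], xs[c + 1], ys[r + 1]),
            })
    return tasks


# Example: a 9 x 6 pixel page with one row cut and one column cut.
tasks = cells_from_cuts(9, 6, [3], [4])
for task in tasks:
    print(task["cell_id"], task["box"])
```

Keeping the row and column index in the cell identity means the transcribed
values can be reassembled into a table afterwards.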

Tom

On Thu, Feb 7, 2013 at 10:04 AM, Hans Thompson <hans.thompson1 at gmail.com> wrote:

>
> Hello open data crusaders. I hope I am properly following the mailing list
> rules as a newcomer and programming neophyte (some conversational R and
> learning python at the moment).
>
> I want to build a microtasking project to take pdf "pictures" of tables
> and break them into rows and columns.  This way each cell can be a
> transcription task with a cell identity.
>
> I've thought a lot about how to do this with R (because, in my personal
> experience, a superior QC process could be implemented more easily) but it
> lacks the kind of picture manipulation tools that I suppose already exist
> for Python etc.
>
> My question: could PyBossa be used to return the row and column position
> in an image array where a user clicks? The user could then click on each
> gap between rows and columns, splitting the table picture into a table of
> pictures.
>
> Does a better tool exist for this type of task?
>
> Thanks.
> Hans Thompson