[okfn-labs] Questions on data scraping tables within "picture" pdfs.

Michael Bauer michael.bauer at okfn.org
Thu Feb 7 16:17:43 UTC 2013


Hans, Tom,

I've recently thought about similar problems, and I would approach this
similarly: have users in PyBossa mark the tables, columns and rows, split
out the individual cells, OCR them, and have the OCR output corrected in a
separate PyBossa app.
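
To make the splitting step concrete, here is a minimal sketch in Python. It
assumes the image is already loaded as a plain list of pixel rows and that
the volunteers' marks have been reduced to sorted lists of row and column
boundaries (outer edges included). The names split_into_cells, row_cuts and
col_cuts are made up for illustration; they are not part of PyBossa or any
other library.

```python
def split_into_cells(image, row_cuts, col_cuts):
    """Split a 2D image (list of pixel rows) into a dict of cell images.

    row_cuts and col_cuts are sorted pixel positions of the table
    boundaries, including the outer edges, e.g. as marked by a volunteer.
    """
    cells = {}
    for r, (top, bottom) in enumerate(zip(row_cuts, row_cuts[1:])):
        for c, (left, right) in enumerate(zip(col_cuts, col_cuts[1:])):
            # The (row, column) key is the cell identity for the
            # transcription task.
            cells[(r, c)] = [line[left:right] for line in image[top:bottom]]
    return cells


# Toy 4x6 "image": each pixel holds its own (y, x) so crops are easy to check.
image = [[(y, x) for x in range(6)] for y in range(4)]
cells = split_into_cells(image, row_cuts=[0, 2, 4], col_cuts=[0, 3, 6])
print(sorted(cells))     # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(cells[(1, 1)][0])  # [(2, 3), (2, 4), (2, 5)]
```

With a real scan you would more likely crop with an imaging library than
with nested lists, but the bookkeeping (one crop per pair of neighbouring
boundaries, keyed by row and column) should stay the same.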

OpenCV is worth considering - you could probably automate some of it (if
there are clear markings ...
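
As a rough idea of what automating it could look like when the table has
clear ruling lines, here is a sketch of a simple projection-profile
heuristic in plain Python. This is not OpenCV code and not a tested
pipeline: the function name find_ruling_lines and both thresholds are
assumptions for illustration, and the image is assumed to be a list of
grayscale pixel rows (0 = black, 255 = white).

```python
def find_ruling_lines(image, dark=128, min_fraction=0.8):
    """Return (row_positions, col_positions) of likely ruling lines.

    A row or column counts as a line when at least min_fraction of its
    pixels are darker than `dark`. Both thresholds are guesses to tune.
    """
    height, width = len(image), len(image[0])
    rows = [y for y in range(height)
            if sum(p < dark for p in image[y]) >= min_fraction * width]
    cols = [x for x in range(width)
            if sum(image[y][x] < dark for y in range(height))
            >= min_fraction * height]
    return rows, cols


# Toy 5x7 page: white background, grid lines at rows 0, 4 and columns 0, 3, 6.
page = [[255] * 7 for _ in range(5)]
for y in (0, 4):
    page[y] = [0] * 7
for row in page:
    for x in (0, 3, 6):
        row[x] = 0
rows, cols = find_ruling_lines(page)
print(rows, cols)  # [0, 4] [0, 3, 6]
```

On real scans the lines are several pixels thick and slightly skewed, so
neighbouring positions would need merging and the result is only a guess
for users to approve or correct.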

Michael

On Thu, Feb 07, 2013 at 11:12:48AM -0500, Tom Morris wrote:
> If your goal is to get the data, services like Captricity have this as a
> standard offering.
> 
> If you actually want to build software to do this, I'd recommend using
> something like OpenCV to generate a guess at segmentation and then have the
> users either approve or correct it. There's a working code example for a
> similar thing here:
> http://stackoverflow.com/questions/10196198/how-to-remove-convexity-defects-in-sudoku-square/10226971#10226971
> (scroll down for the Python version)
> 
> OpenCV has Python bindings and, yes, you could use PyBossa to build this
> type of service (although you'd probably have to host it yourself if you
> wanted to make use of third party libraries such as OpenCV).
> 
> Tom
> 
> On Thu, Feb 7, 2013 at 10:04 AM, Hans Thompson <hans.thompson1 at gmail.com> wrote:
> 
> >
> > Hello open data crusaders. I hope I am properly following the mailing list
> > rules as a newcomer and programming neophyte (some conversational R and
> > learning python at the moment).
> >
> > I want to build a microtasking project to take pdf "pictures" of tables
> > and break them into rows and columns.  This way each cell can be a
> > transcription task with a cell identity.
> >
> > I've thought a lot about how to do this with R (because a superior QC
> > process could be implemented more easily, in my personal experience) but
> > it lacks the kind of picture manipulation tools that I suppose already
> > exist for Python etc.
> >
> > My question: could PyBossa be used to return the row and column position
> > in an image array from a user's click? The user could then click in each
> > space between rows and columns to split the table picture into a table of
> > pictures.
> >
> > Does a better tool exist for this type of task?
> >
> > Thanks.
> > Hans Thompson
> >
> > _______________________________________________
> > okfn-labs mailing list
> > okfn-labs at lists.okfn.org
> > http://lists.okfn.org/mailman/listinfo/okfn-labs
> > Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
> >
> >



-- 
Data Wrangler with the Open Knowledge Foundation (OKFN.org)
GPG/PGP key: http://tentacleriot.eu/mihi.asc
Twitter: @mihi_tr Skype: mihi_tr



