[okfn-labs] Questions on data scraping tables within "picture" pdfs.

Fri Feb 8 01:20:42 UTC 2013

Thanks Everyone!

I realize now I could have been more specific about my requirements.
Sorry about that.  The documents I am trying to convert are not forms with
standard formatting on each page or tables with lines already between
rows/columns.

I'll try using the RMagick script when I get the install working (thanks
Marian) to split the tables up by lines I draw.  What do people think would
be the best way to draw these lines quickly for many tables without a
standard size or placement on a page?
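
Once the line positions for a table are known, turning them into cells is mostly bookkeeping. A minimal sketch in Python, assuming the drawn lines are recorded as pixel coordinates and include the table's outer edges (the function name `cell_boxes` and the example numbers are hypothetical, and this is not the RMagick script mentioned above):

```python
def cell_boxes(x_lines, y_lines):
    """Return {(row, col): (left, top, right, bottom)} for every cell.

    x_lines: x pixel positions of vertical separators, including the
             table's left and right edges.
    y_lines: y pixel positions of horizontal separators, including the
             table's top and bottom edges.
    """
    xs = sorted(set(x_lines))
    ys = sorted(set(y_lines))
    boxes = {}
    # Each pair of neighbouring separators bounds one row or one column.
    for row, (top, bottom) in enumerate(zip(ys, ys[1:])):
        for col, (left, right) in enumerate(zip(xs, xs[1:])):
            boxes[(row, col)] = (left, top, right, bottom)
    return boxes


# Hypothetical table with 2 rows and 3 columns (made-up coordinates).
boxes = cell_boxes(x_lines=[40, 300, 450, 600], y_lines=[100, 130, 160])
```

Each box is in (left, top, right, bottom) order, which is the order image libraries such as Pillow expect for cropping, so every (row, col) key could name one cell image and one transcription task.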

I'd also like to use the PyBossa API (thanks Michael) further but I haven't
seen any tasks that would involve clicking on the image to return the pixel
coordinates within the image.  Do people think that that kind of task would
be easily incorporated into the current PyBossa?
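
Whatever captures the click, it would probably need to convert the click position on the displayed (scaled) page into pixel coordinates of the full-size scan. A rough sketch of that conversion in Python, assuming the task reports the click position and the displayed size; `click_to_pixel` and its parameters are hypothetical names and are not part of the PyBossa API:

```python
def click_to_pixel(click_x, click_y, shown_w, shown_h, full_w, full_h):
    """Map a click on the displayed image to pixel coordinates in the full-size image."""
    if shown_w <= 0 or shown_h <= 0:
        raise ValueError("displayed size must be positive")
    # Scale by the ratio between the full-size scan and the displayed copy.
    px = int(click_x * full_w / shown_w)
    py = int(click_y * full_h / shown_h)
    # Clamp so a click on the far edge still lands inside the image.
    return min(max(px, 0), full_w - 1), min(max(py, 0), full_h - 1)


# Hypothetical numbers: a 2400x3000 scan shown at 800x1000 in the browser.
pixel = click_to_pixel(200, 150, shown_w=800, shown_h=1000, full_w=2400, full_h=3000)
```

The x values collected this way would become column separators and the y values row separators.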

I will also take a look at HTML5 (thanks Stefan) because this seems the
coolest option, but my own limited knowledge is going to hold me back here.

The documents I have in mind for this project are the Comprehensive Annual
Financial Reports (CAFRs) of US cities.  The content of the tables is
standardized by the Governmental Accounting Standards Board (GASB) and
follows Generally Accepted Accounting Principles (GAAP), so comparison of
accounts by city or year is possible.  Each city publishes its CAFR
independently, though, which necessitates a general tool to break apart
tables without relying on Computer Vision to recognize them.  I'd like to
use a series of microtasks with some quality control.  OCR of the final
cells seems smart.
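
One simple form the quality control could take is to give each cell to several volunteers and accept a value only when enough of them agree. A minimal sketch in Python, assuming three transcriptions per cell and a two-vote threshold; the function name `consensus`, the threshold, and the sample figures are all made up for illustration:

```python
from collections import Counter


def consensus(answers, min_agreement=2):
    """Return the most common transcription if enough volunteers agree, else None."""
    # Ignore empty answers and stray whitespace before counting votes.
    cleaned = [a.strip() for a in answers if a and a.strip()]
    if not cleaned:
        return None
    value, count = Counter(cleaned).most_common(1)[0]
    return value if count >= min_agreement else None


# Hypothetical transcriptions keyed by (row, col) cell identity.
transcriptions = {
    (0, 1): ["1,204,500", "1,204,500", "1,204,800"],
    (0, 2): ["87,310", "81,310", "87,810"],
}
accepted = {cell: consensus(votes) for cell, votes in transcriptions.items()}
```

Cells that come back as `None` would be sent out again or checked by hand, and an OCR reading of the cell could be counted as one more vote.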

I'll try out RMagick first.

Also thanks Tom for showing me Captricity.  I'll send in a complex example
and see how well they do!

Hans Thompson

On Feb 7, 2013 11:17 AM, "Michael Bauer" <michael.bauer at okfn.org> wrote:

> Hans, Tom,
>
> I've recently thought about similar problems - I would approach this
> similarly. Probably have users in Pybossa draw tables, columns and rows
> - split out all individual cells - OCR them and have them corrected in a
>   separate pybossa app.
>
> OpenCV is great to think about - you could probably automate some of it (if
> there are clear markings ...
>
> Michael
>
> On Thu, Feb 07, 2013 at 11:12:48AM -0500, Tom Morris wrote:
> > If your goal is to get the data, services like Captricity have this as a
> > standard offering.
> >
> > If you actually want to build software to do this, I'd recommend using
> > something like OpenCV to generate a guess at segmentation and then have the
> > users either approve or correct it. There's a working code example for a
> > similar thing here:
> >
> > http://stackoverflow.com/questions/10196198/how-to-remove-convexity-defects-in-sudoku-square/10226971#10226971
> > (scroll down for the Python version)
> >
> > OpenCV has Python bindings and, yes, you could use PyBossa to build this
> > type of service (although you'd probably have to host it yourself if you
> > wanted to make use of third party libraries such as OpenCV).
> >
> > Tom
> >
> > On Thu, Feb 7, 2013 at 10:04 AM, Hans Thompson <hans.thompson1 at gmail.com> wrote:
> >
> > >
> > > Hello open data crusaders. I hope I am properly following the mailing list
> > > rules as a newcomer and programming neophyte (some conversational R and
> > > learning python at the moment).
> > >
> > > I want to build a microtasking project to take pdf "pictures" of tables
> > > and break them into rows and columns.  This way each cell can be a
> > > transcription task with a cell identity.
> > >
> > > I've thought a lot about how to do this with R (because, in my personal
> > > experience, a superior QC process could be implemented more easily) but
> > > it lacks the kind of picture manipulation tools that I am supposing
> > > already exist for python etc.
> > >
> > > My question:  could pybossa be used to return the row and column of an
> > > image array from a user's click? So the user could click on each
> > > space between rows and columns and split the table picture into a table
> > > of pictures?
> > >
> > > Does a better tool exist for this type of task?
> > >
> > > Thanks.
> > > Hans Thompson
> > >
> > > _______________________________________________
> > > okfn-labs mailing list
> > > okfn-labs at lists.okfn.org
> > > http://lists.okfn.org/mailman/listinfo/okfn-labs
> > > Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
> > >
> > >
>
>
>
> --
> Data Wrangler with the Open Knowledge Foundation (OKFN.org)
> GPG/PGP key: http://tentacleriot.eu/mihi.asc
> Twitter: @mihi_tr Skype: mihi_tr
>