[okfn-labs] Questions on data scraping tables within "picture" pdfs.

Rufus Pollock rufus.pollock at okfn.org
Fri Feb 8 10:25:35 UTC 2013


On 8 February 2013 01:20, Hans Thompson <hans.thompson1 at gmail.com> wrote:
>
> Thanks Everyone!
>
> I realize now I could have been more specific about my requirements.  Sorry about that.  The documents I am trying to convert are not forms with standard formatting on each page or tables with lines already between rows/columns.
>
> I'll try using the RMagick script when I get the install working (thanks Marian) to split the tables up by lines I draw.  What do people think would be the best way to draw these lines quickly for many tables without a standard size or placement on a page?
>
> I'd also like to use the PyBossa API (thanks Michael) further but I haven't seen any tasks that would involve clicking on the image to return the pixel coordinates within the image.  Do people think that that kind of task would be easily incorporated into the current PyBossa?

Yes, that would be very easy. I note that the demo pdf transcribe
application uses pdf.js: https://github.com/PyBossa/pdftranscribe

pdf.js plus some JavaScript supports things like page numbers and click
locations; see [1].

[1]: https://groups.google.com/forum/?fromgroups=#!topic/mozilla.dev.pdf-js/U0Pk9cJp9Pc
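
As a rough illustration of what could be done with those click locations
once they are collected: the sketch below assumes each task returns the x
positions clicked between columns and the y positions clicked between rows,
in image pixels, and turns them into one bounding box per cell. The function
name `cell_boxes` and its parameters are made up for this example; they are
not part of PyBossa or pdf.js.

```python
from typing import Iterable, List, Tuple

# (left, upper, right, lower) in image pixels
Box = Tuple[int, int, int, int]


def cell_boxes(width: int, height: int,
               col_splits: Iterable[int],
               row_splits: Iterable[int]) -> List[List[Box]]:
    """Turn clicked separator positions into a grid of cell boxes.

    col_splits: x coordinates clicked between columns (hypothetical input).
    row_splits: y coordinates clicked between rows (hypothetical input).
    Clicks outside the image and duplicate clicks are ignored.
    """
    # Add the image edges as the outermost boundaries.
    xs = [0] + sorted({x for x in col_splits if 0 < x < width}) + [width]
    ys = [0] + sorted({y for y in row_splits if 0 < y < height}) + [height]

    # One box per (row, column) pair, row by row.
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]


if __name__ == "__main__":
    grid = cell_boxes(300, 200, col_splits=[200, 100], row_splits=[50])
    for r, row in enumerate(grid):
        for c, box in enumerate(row):
            print(f"cell ({r}, {c}): {box}")
```

Each box is in (left, upper, right, lower) order, which is the order the
`Image.crop` method in Pillow takes, so every cell could be cropped out and
sent on as its own transcription task with a row/column identity.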

> I will also take a look at HTML5 (thanks Stefan) cause this seems coolest but my own limited knowledge is going to hold me back here.
>
> The documents I have in mind for this project are the Comprehensive Annual Financial Reports (CAFRs) for US cities.  The content of the tables is standardized by the Governmental Accounting Standards Board (GASB) and follows Generally Accepted Accounting Principles (GAAP), so comparison of accounts by city or year is possible.  Each city publishes its CAFR independently, though, which necessitates a general tool to break apart tables without Computer Vision for recognizing tables.  I'd like to use a series of microtasks with some quality control. OCR of the final cells seems smart.

This sounds really interesting :-) I'm personally very interested in
US city accounts [2] and also imagine this could be very useful in
relation to OpenSpending - http://openspending.org/.

Rufus

[2]: http://notebook.okfn.org/2012/08/05/spending-story-california-city-bankruptcies/

> I'll try out RMagick first.
>
> Also thanks Tom for showing me Captricity.  I'll send in a complex example and see how well they do!
>
> Hans Thompson
>
> On Feb 7, 2013 11:17 AM, "Michael Bauer" <michael.bauer at okfn.org> wrote:
>>
>> Hans, Tom,
>>
>> I've recently thought about similar problems - I would approach this
>> similarly. Probably have users in Pybossa draw tables, columns and rows
>> - split out all individual cells - OCR them and have them corrected in a
>> separate PyBossa app.
>>
>> OpenCV is great to think about - you could probably automate some of it (if
>> there are clear markings ...)
>>
>> Michael
>>
>> On Thu, Feb 07, 2013 at 11:12:48AM -0500, Tom Morris wrote:
>> > If your goal is to get the data, services like Captricity have this as a
>> > standard offering.
>> >
>> > If you actually want to build software to do this, I'd recommend using
>> > something like OpenCV to generate a guess at segmentation and then have the
>> > users either approve or correct it. There's a working code example for a
>> > similar thing here:
>> > http://stackoverflow.com/questions/10196198/how-to-remove-convexity-defects-in-sudoku-square/10226971#10226971
>> > (scroll down for the Python version)
>> >
>> > OpenCV has Python bindings and, yes, you could use PyBossa to build this
>> > type of service (although you'd probably have to host it yourself if you
>> > wanted to make use of third party libraries such as OpenCV).
>> >
>> > Tom
>> >
>> > On Thu, Feb 7, 2013 at 10:04 AM, Hans Thompson <hans.thompson1 at gmail.com> wrote:
>> >
>> > >
>> > > Hello open data crusaders. I hope I am properly following the mailing list
>> > > rules as a newcomer and programming neophyte (some conversational R and
>> > > learning python at the moment).
>> > >
>> > > I want to build a microtasking project to take pdf "pictures" of tables
>> > > and break them into rows and columns.  This way each cell can be a
>> > > transcription task with a cell identity.
>> > >
>> > > I've thought a lot about how to do this with R (because, in my personal
>> > > experience, a superior QC process could be implemented more easily) but
>> > > it lacks the kind of picture manipulation tools that I am supposing
>> > > already exist for Python etc.
>> > >
>> > > My question:  could pybossa be used to return the row and column of an
>> > > image array from a user's click? So the user could click on each
>> > > space between rows and columns and split the table picture into a table
>> > > of pictures?
>> > >
>> > > Does a better tool exist for this type of task?
>> > >
>> > > Thanks.
>> > > Hans Thompson
>> > >
>> > > _______________________________________________
>> > > okfn-labs mailing list
>> > > okfn-labs at lists.okfn.org
>> > > http://lists.okfn.org/mailman/listinfo/okfn-labs
>> > > Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>> > >
>> > >
>>
>>
>>
>> --
>> Data Wrangler with the Open Knowledge Foundation (OKFN.org)
>> GPG/PGP key: http://tentacleriot.eu/mihi.asc
>> Twitter: @mihi_tr Skype: mihi_tr
>
>



