[okfn-labs] Questions on data scraping tables within "picture" pdfs.

Daniel Lombraña González teleyinex at gmail.com
Fri Feb 8 08:17:03 UTC 2013


Dear Hans,

What you have described is something that some colleagues from Brazil are
already working on, and it could be done "easily" in PyBossa. If your data
forms have the same structure every time, it is easier: all you have to do
is ask the volunteers to fill in a table that you provide.

If the table changes from page to page, then you will have to build a
more complex solution :-) Basically, the project will involve showing the
users the PDF page (as an image, or using this template created for
transcribing PDFs <http://crowdcrafting.org/app/pdftranscribe>) and then
working in two steps:

1.- Ask the users to draw the table structure (you can use libraries like
Raphael.JS for this step, or implement your own data grids that mimic the
table structure; check DataTables <http://www.datatables.net/>).
2.- Based on the users' answers, you will be able to "automatically"
derive the most voted structure for the current page and feed it to a
second app that shows the page and asks the users to fill in the data.
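
As a rough illustration of step 2, here is a minimal sketch in Python of how
the "most voted structure" could be derived. Everything in it is an
assumption for illustration, not part of PyBossa: it supposes each volunteer
answer is a dict with the pixel positions of the row and column lines
(outer borders included), votes on the number of lines, and then takes the
median position among the answers that agree.

```python
from collections import Counter
from statistics import median


def consensus_structure(answers):
    """Derive one table structure from several volunteer answers.

    Each answer is assumed (hypothetical format) to look like
    {"rows": [y0, y1, ...], "cols": [x0, x1, ...]}, in pixels.
    """
    # Vote on the shape first: how many row lines and column lines were drawn.
    shapes = Counter((len(a["rows"]), len(a["cols"])) for a in answers)
    shape, votes = shapes.most_common(1)[0]

    # Keep only the answers that agree with the most voted shape.
    agreeing = [a for a in answers if (len(a["rows"]), len(a["cols"])) == shape]

    # For each line, take the median position across the agreeing answers.
    rows = [median(sorted(a["rows"])[i] for a in agreeing) for i in range(shape[0])]
    cols = [median(sorted(a["cols"])[i] for a in agreeing) for i in range(shape[1])]
    return {"rows": rows, "cols": cols, "votes": votes, "total": len(answers)}


def cell_boxes(rows, cols):
    """Turn line positions into (row, col, left, top, right, bottom) boxes."""
    boxes = []
    for r in range(len(rows) - 1):
        for c in range(len(cols) - 1):
            boxes.append((r, c, cols[c], rows[r], cols[c + 1], rows[r + 1]))
    return boxes
```

Each box could then become one task in the second app (crop the page image
to the box and ask for the content of that cell).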

You can also, as several people have already suggested, run OCR on the
pages you want to analyze first, and then export the table the software
produces as a task in PyBossa, so the users can validate the OCR results
(or even fix them). In that case you could use PyBossa as a training
system for your OCR if you want :-)

OCR will help you if the data in the tables has a good structure, is not
handwritten, etc. I would recommend testing several OCR tools: if you like
the results, you are done :-); otherwise you will need to think about a
hybrid solution of OCR + humans, or simply humans.
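
One possible shape for the hybrid option, sketched in Python under stated
assumptions: the OCR tool is supposed to return, for each cell, its text
and a confidence between 0 and 1 (the dict format, the 0.9 threshold and
the function names are all hypothetical). Since the tables in this thread
hold accounting figures, the sketch also sends to humans any cell whose
text does not look like an amount.

```python
import re

# Accepts e.g. "1,234", "500", "$1,234.50", "(5,000)"; an assumption about
# how amounts are printed, to be adapted to the real documents.
AMOUNT = re.compile(r"\(?-?\$?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?\)?")


def looks_like_amount(text):
    """Return True when the whole string matches the amount pattern."""
    return AMOUNT.fullmatch(text.strip()) is not None


def triage(cells, min_conf=0.9):
    """Split OCR results into accepted cells and cells needing human review.

    cells: list of dicts like {"id": "r0c1", "text": "1,234", "conf": 0.97}
    (hypothetical format).
    """
    accepted, needs_review = [], []
    for cell in cells:
        if cell["conf"] >= min_conf and looks_like_amount(cell["text"]):
            accepted.append(cell)
        else:
            needs_review.append(cell)
    return accepted, needs_review
```

Only the cells in the second list would be turned into tasks for the
volunteers, which keeps the human work small when the OCR does well.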

PyBossa will help you send the pages to different users, so you can then
analyze the data they return and create a "canonical" solution (i.e. the
most voted structure, the most selected picture, etc.).
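
For the transcription of a single cell, the "canonical" solution could be
a simple majority vote, sketched here in Python. The function name and the
60% agreement threshold are assumptions for illustration; the input is just
the list of strings that different users typed for the same cell.

```python
from collections import Counter


def canonical_answer(answers, min_agreement=0.6):
    """Return (value, agreement) for one cell transcribed by several users.

    value is None when no answer reaches min_agreement, which signals that
    the cell should be sent to more users.
    """
    # Normalise whitespace so " 1,234 " and "1,234" count as the same vote,
    # and drop empty answers.
    normalized = [" ".join(a.split()) for a in answers if a.strip()]
    if not normalized:
        return None, 0.0

    value, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / len(normalized)
    if agreement < min_agreement:
        return None, agreement
    return value, agreement
```

The agreement value doubles as a basic quality control measure: cells with
low agreement can be queued again instead of being accepted.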

If you need help with this application, please do not hesitate to contact
me again; I'll be really happy to talk with you and try to build the app.

Best regards,

Daniel


On Fri, Feb 8, 2013 at 1:20 AM, Hans Thompson <hans.thompson1 at gmail.com> wrote:

> Thanks Everyone!
>
> I realize now I could have been more specific about my specifications.
> Sorry about that.  The documents I am trying to convert are not forms with
> standard formatting on each page or tables with lines already between
> rows/columns.
>
> I'll try using the RMagick script when I get the install working (thanks
> Marian) to split the tables up by lines I draw.  What do people think would
> be the best way to draw these lines quickly for many tables without a
> standard size or placement on a page?
>
> I'd also like to use the PyBossa API (thanks Michael) further but I
> haven't seen any tasks that would involve clicking on the image to return
> the pixel coordinates within the image.  Do people think that that kind of
> task would be easily incorporated into the current PyBossa?
>
> I will also take a look at HTML5 (thanks Stefan) cause this seems coolest
> but my own limited knowledge is going to hold me back here.
>
> The documents I have in mind for this project are the Comprehensive Annual
> Financial Reports (CAFRs) for US cities.  The content of the tables is
> standardized by the Governmental Accounting Standards Board (GASB) and
> follows Generally Accepted Accounting Principles (GAAP), so comparison of
> accounts by city or year is possible.  Each city publishes its CAFR
> independently though which necessitates a general tool to break apart
> tables without Computer Vision for recognizing tables.  I'd like to use a
> series of microtasks with some quality control. OCR of the final cells
> seems smart.
>
> I'll try out RMagick first.
>
> Also thanks Tom for showing me Captricity.  I'll send in a complex example
> and see how well they do!
>
> Hans Thompson
>
> On Feb 7, 2013 11:17 AM, "Michael Bauer" <michael.bauer at okfn.org> wrote:
>
>> Hans, Tom,
>>
>> I've recently thought about similar problems - I would approach this
>> similarly. Probably have users in PyBossa draw tables, columns and rows
>> - split out all individual cells - OCR them and have them corrected in a
>> separate PyBossa app.
>>
>> OpenCV is great to think about - you could probably automate some of it
>> (if there are clear markings ...
>>
>> Michael
>>
>> On Thu, Feb 07, 2013 at 11:12:48AM -0500, Tom Morris wrote:
>> > If your goal is to get the data, services like Captricity have this as a
>> > standard offering.
>> >
>> > If you actually want to build software to do this, I'd recommend using
>> > something like OpenCV to generate a guess at segmentation and then have
>> > the users either approve or correct it. There's a working code example
>> > for a similar thing here:
>> >
>> > http://stackoverflow.com/questions/10196198/how-to-remove-convexity-defects-in-sudoku-square/10226971#10226971
>> > (scroll down for the Python version)
>> >
>> > OpenCV has Python bindings and, yes, you could use PyBossa to build this
>> > type of service (although you'd probably have to host it yourself if you
>> > wanted to make use of third party libraries such as OpenCV).
>> >
>> > Tom
>> >
>> > On Thu, Feb 7, 2013 at 10:04 AM, Hans Thompson <hans.thompson1 at gmail.com> wrote:
>> >
>> > >
>> > > Hello open data crusaders. I hope I am properly following the mailing
>> > > list rules as a newcomer and programming neophyte (some conversational
>> > > R and learning python at the moment).
>> > >
>> > > I want to build a microtasking project to take pdf "pictures" of
>> > > tables and break them into rows and columns.  This way each cell can
>> > > be a transcription task with a cell identity.
>> > >
>> > > I've thought a lot on how to do this with R (because a superior QC
>> > > process could be implemented easier from my personal experience) but
>> > > it lacks the kind of picture manipulation tools that I am supposing
>> > > already exist for python etc.
>> > >
>> > > My question:  could pybossa be used to return the rows and column of
>> > > an image array from user call from a click? So the user could click
>> > > for each space between row and column and split the table picture
>> > > into a table of pictures?
>> > >
>> > > Does a better tool exist for this type of task?
>> > >
>> > > Thanks.
>> > > Hans Thompson
>> > >
>> > > _______________________________________________
>> > > okfn-labs mailing list
>> > > okfn-labs at lists.okfn.org
>> > > http://lists.okfn.org/mailman/listinfo/okfn-labs
>> > > Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>> > >
>> > >
>>
>>
>> --
>> Data Wrangler with the Open Knowledge Foundation (OKFN.org)
>> GPG/PGP key: http://tentacleriot.eu/mihi.asc
>> Twitter: @mihi_tr Skype: mihi_tr
>>
>
>


-- 
··························································································································································
http://daniellombrana.es
http://www.flickr.com/photos/teleyinex
··························································································································································
Please do NOT use proprietary file formats such as DOC and XLS for
exchanging documents; use PDF, HTML, RTF, TXT, CSV or any other format
that does not require a specific vendor's program to handle the
information it contains.
··························································································································································