[okfn-labs] Questions on data scraping tables within "picture" pdfs.

Tom Morris tfmorris at gmail.com
Fri Feb 8 14:11:45 UTC 2013


On Thu, Feb 7, 2013 at 8:20 PM, Hans Thompson <hans.thompson1 at gmail.com> wrote:

> I realize now I could have been more specific about my requirements.
> Sorry about that.  The documents I am trying to convert are not forms with
> standard formatting on each page or tables with lines already between
> rows/columns.
>

More information is almost always better.  Examples are best, if you can
provide them.


> The documents I have in mind for this project are the Comprehensive Annual
> Financial Reports (CAFRs) for US cities.  The content of the tables is
> standardized by the Governmental Accounting Standards Board (GASB) and
> follows Generally Accepted Accounting Principles (GAAP), so comparison of
> accounts by city or year is possible.  Each city publishes its CAFR
> independently, though, which necessitates a general tool to break apart
> tables without Computer Vision for recognizing tables.  I'd like to use a
> series of microtasks with some quality control. OCR of the final cells
> seems smart.
>

They don't use XBRL for any reporting, do they?  That would make things much
easier.

As for PDFs, the few samples that I looked at were all standard text PDFs
with embedded tables, not scanned image PDFs.  Before resorting to OCR, I'd
look at processing these in the PDF text domain.  Even if you did have to
fall back on OCR, I'd first just try a standard high-end OCR package.  These
can do some reasonably sophisticated layout analysis and I wouldn't rule them
out without running some experiments.
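
As a rough illustration of what working in the PDF text domain might look
like, here is a minimal Python sketch.  It assumes some extraction tool has
already given you each word along with its bounding-box coordinates, with y
measured from the top of the page.  The tuple layout, the function names,
and the tolerance values are all hypothetical choices for illustration, and
real CAFR layouts would likely need tuning.

```python
# Each word is assumed to be a tuple (x0, x1, y, text):
# left edge, right edge, baseline measured from the top of the page, string.

def group_rows(words, y_tol=2.0):
    """Group words into rows when their y positions differ by <= y_tol."""
    rows = []
    for word in sorted(words, key=lambda w: (w[2], w[0])):
        # Compare against the first word of the current row.
        if rows and abs(word[2] - rows[-1][0][2]) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order the words in each row left to right.
    return [sorted(row, key=lambda w: w[0]) for row in rows]


def split_cells(row, min_gap=10.0):
    """Merge words separated by less than min_gap; wider gaps start a cell."""
    cells = []
    prev_x1 = None
    for x0, x1, _y, text in row:
        if prev_x1 is not None and x0 - prev_x1 < min_gap:
            cells[-1] += " " + text
        else:
            cells.append(text)
        prev_x1 = x1
    return cells


def words_to_table(words, y_tol=2.0, min_gap=10.0):
    """Turn positioned words into a list of rows, each a list of cell strings."""
    return [split_cells(row, min_gap) for row in group_rows(words, y_tol)]


if __name__ == "__main__":
    # Made-up sample data, not taken from any real report.
    sample = [
        (50, 80, 100.0, "Property"), (83, 105, 100.4, "taxes"),
        (300, 340, 100.2, "1,234,567"),
        (50, 75, 115.0, "Sales"), (78, 100, 115.0, "taxes"),
        (300, 335, 115.3, "456,789"),
    ]
    for row in words_to_table(sample):
        print(row)
```

The idea is that tables without ruling lines still carry their structure
in the word positions, so large horizontal gaps can stand in for column
boundaries.  A fixed gap threshold is the simplest possible choice; it
would probably break on tightly spaced columns.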

If you've already collected a corpus of documents, perhaps you could point
people at it and you could get some more concrete suggestions based on the
actual documents that you want analyzed.

Tom