[okfn-labs] Questions on data scraping tables within "picture" pdfs.

Hans Thompson hans.thompson1 at gmail.com
Sat Feb 9 01:55:54 UTC 2013


   I still need to spend some time with all the Python-related links, but
I wanted to share some more.  Thanks for the links, Rufus.  It sounds like
it will require some time getting into JavaScript.

    I started this project by asking my city controller's department
(Anchorage, AK) if they could give me the past year's CAFR data in tabular
form and make it public.  After they said this wasn't going to be possible
because the IT department is very busy, I started talking to my elected
officials to get more answers.  What I learned from the process is that my
local government has little interest in publishing financial data; their
main argument against the idea, and I'm generously paraphrasing, was that
"it could be manipulated and used against us".  What does make sense to me,
though, is that there is a legal risk if financial data isn't transcribed
by a rigorous, quality-controlled process that ensures its integrity.

    Perhaps any method of OCR or computer vision could be seen, in legal
terms, as "manipulating the data".  I'm not a lawyer; that's just a concern
I have with automated processes for this set of tasks.  The city CIO, who
is actually very friendly to open data and a sophisticated dude, talked
about this problem, and I now agree with him that any inaccuracies between
the official copy and the transcription would make the whole converted
dataset invalid.  That says to me that I should avoid OCR for recognizing
table dimensions.

    Daniel, I found the Brazilian project when I first got curious about
using microtasks to convert tables, about nine months ago.  I think it is
the best current option (open source or otherwise) for using microtasking
to convert tables with varying dimensions and page orientations.  However,
there are some advantages to splitting the work up into smaller,
individual-cell chunks.  Converting a whole table as one microtask requires
more focus on following the layout than on transcription, and the eye has
to do more work.  Also, having individual cells as tasks allows finer
control over quality: each cell can be assessed for correctness more
easily, and retested more easily if there is an inconsistency.
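
    As a minimal sketch of that kind of cell-level QC (in R, with made-up
values; the idea is that each cell goes to several workers, the majority
answer wins, and any disagreement flags the cell for a retest):

entries <- data.frame(
  cell   = c("1A", "1A", "1A", "2B", "2B", "2B"),
  worker = c("w1", "w2", "w3", "w1", "w2", "w3"),
  value  = c("4521", "4521", "4521", "1037", "1087", "1037"),
  stringsAsFactors = FALSE
)
# majority answer per cell, plus a flag showing whether the workers agreed
check <- function(v) {
  tab <- sort(table(v), decreasing = TRUE)
  data.frame(value = names(tab)[1], agreed = length(tab) == 1)
}
results <- do.call(rbind, lapply(split(entries$value, entries$cell), check))
#    value agreed
# 1A  4521   TRUE
# 2B  1037  FALSE   <- send 2B back out as a new task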

   Tom, I just read that COGNOS is offering an XBRL publishing service, so
maybe I can petition for that change, since the city is in the midst of
converting over to a COGNOS system.  Interesting.

    There are some cities with greater disclosure within their PDFs than
others.  The tool I am trying to create is for the PDFs that are trouble.
And I simply don't trust high-end OCR (I'm thinking ABBYY or Adobe).  I'm
not sure what the best way to take ALL the data from each city would be,
but I am most interested in Anchorage, AK right now because they use the
paper-scan method and that's where I live.  They are definitely not the
only city to use this method, but I can't refer you to any others I've
recorded at the moment because I'm away on vacation and staying warm.
Here's a quick link to the Anchorage CAFR; the tables I want the most are
in the Statistical section:

http://www.muni.org/Departments/finance/controller/Pages/CAFR.aspx

    The first 15 tables in each year's Statistical Section interest me the
most because they are the most consistent across US cities.

    Below is the general process I would like to implement, with some R
code I wrote that divides .jpg tables into individual cells.  I can't find
a way to open and manipulate other image file types in R.  R is what I'm
most familiar with right now, if you are scratching your head asking "why
is this stranger using R?".  There were two small problems with the part I
wrote, both fixed below:

1. The images were cropping with some extra pixels.  That turned out to be
the plot margins, which the script now zeroes out with par(mar = rep(0, 4)).

2. You used to have to close R after running the script; it now calls
dev.off() after each cell image is written, which turns off the active
graphics device.

The limitations I've found using R for this are leading me toward learning
Python; I may come back to R for the QC step.

#TABLE CONVERSION
#1. Convert pdf to images (page sized)
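#   A sketch for step 1, assuming the poppler-utils tool pdftoppm is
#   installed ("cafr.pdf" is a placeholder file name; ImageMagick would
#   work too) -- this writes one page-sized jpeg per page at 150 dpi:
system("pdftoppm -jpeg -r 150 cafr.pdf page")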
#2. Parse out the pages with no tables (Microtask #1).  Create a new subset
#   object collecting all the pages that did have tables.
#3. Go through each page and use a cropping tool to select the body of the
#   table and crop off the title of the table.
#4. Put both image objects together, with their designations, in a single
#   object (like a list in R).
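#   (a sketch of step 4 with placeholder names:
#    page <- list(title = title_img, body = body_img) )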
#5. Begin R code to split the table images (.jpg)
##START##
## Load the package needed for reading .jpgs.  I'm having trouble finding
## any other packages that handle other image file types, or getting them
## to work.
library(ReadImages)
# file = the file path, in quotations
mygraph <- read.jpeg(file)
plot(mygraph)
## Find the columns and rows
# User call for marking the rows: click each horizontal row boundary, from
# the top of the first row down to the bottom of the last, then right-click
# to finish.
rows <- locator(type = 'p', pch = 3, col = 'red', lwd = 1.2, cex = 1.2)
# User call for marking the columns: click each vertical column boundary,
# from the left of the first column across to the right of the last.
cols <- locator(type = 'p', pch = 3, col = 'black', lwd = 1.2, cex = 1.2)
# keep the y coordinates of the row clicks and the x of the column clicks
rows <- as.data.frame(rows)[, 2]
cols <- as.data.frame(cols)[, 1]
# optionally add the outer bounds of the image itself
#rows <- c(0, rows, dim(mygraph)[1])
#cols <- c(0, cols, dim(mygraph)[2])
## create a table of cells (called orderedcellsdf), one row per cell,
## ordered row by row with the columns cycling fastest
nrows <- length(rows) - 1
ncols <- length(cols) - 1
rowstart <- rep(rows[-length(rows)], each  = ncols)
rowend   <- rep(rows[-1],            each  = ncols)
colstart <- rep(cols[-length(cols)], times = nrows)
colend   <- rep(cols[-1],            times = nrows)
orderedcellsdf <- data.frame(rowstart, rowend, colstart, colend)
## name the cells 1A, 1B, ... so their designations survive the split
rownames(orderedcellsdf) <- paste0(rep(1:nrows, each = ncols),
                                   rep(LETTERS[1:ncols], times = nrows))
## write one .jpg per cell, named after its designation
for (i in seq_len(nrow(orderedcellsdf))) {
  jpeg(paste0(rownames(orderedcellsdf)[i], ".jpg"))
  par(mar = rep(0, 4))  # no margins -- these were the source of the extra pixels
  plot(mygraph, ylim = as.numeric(orderedcellsdf[i, 2:1]),
       xlim = as.numeric(orderedcellsdf[i, 3:4]))
  dev.off()  # turn off the jpeg device so R can exit cleanly
}
##END##
#6. Now that the cell images are split out with their designations, they
#   can be fed to takers individually.
#   Now each cell could be sent through OCR or transcribed by a human.  Or
#   why not BOTH!? :-O
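
    As a rough sketch of that last step (assuming the tesseract command-line
OCR tool is installed; cell_files and human_entries are placeholders for the
vector of cell image paths and the matching human transcriptions):

ocr_cell <- function(img) {
  base <- sub("\\.jpg$", "", img)
  system(paste("tesseract", img, base))  # writes <base>.txt
  paste(readLines(paste0(base, ".txt"), warn = FALSE), collapse = " ")
}
ocr <- sapply(cell_files, ocr_cell)
# any cell where OCR and the human entry disagree goes back out for retesting
# needs_review <- cell_files[ocr != human_entries]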




On Fri, Feb 8, 2013 at 6:11 AM, Tom Morris <tfmorris at gmail.com> wrote:

> On Thu, Feb 7, 2013 at 8:20 PM, Hans Thompson <hans.thompson1 at gmail.com> wrote:
>
>> I realize now I could have been more specific about my requirements.
>> Sorry about that.  The documents I am trying to convert are not forms with
>> standard formatting on each page or tables with lines already between
>> rows/columns.
>>
>
> More information is almost always better.  Examples are the best if you
> can provide them.
>
>
>> The documents I have in mind for this project are the Comprehensive Annual
>> Financial Reports (CAFRs) for US cities.  The content of the tables is
>> standardized by the Governmental Accounting Standards Board (GASB) and
>> follows Generally Accepted Accounting Principles (GAAP), so comparisons of
>> accounts by city or year are possible.  Each city publishes its CAFR
>> independently, though, which necessitates a general tool to break apart
>> tables without computer vision for recognizing tables.  I'd like to use a
>> series of microtasks with some quality control. OCR of the final cells
>> seems smart.
>>
>
> They don't use XBRL for any reporting do they?  That would make things
> much easier.
>
> As for PDFs, the few samples that I looked at were all standard text PDFs
> with embedded tables, not scanned image PDFs.  Before resorting to OCR, I'd
> look at processing these in the PDF text domain.  Even if you had to resort
> to OCR, I'd first just try a standard high end OCR package.  They can do
> some reasonably sophisticated layout analysis and I wouldn't rule them out
> without running some experiments.
>
> If you've already collected a corpus of documents, perhaps you could point
> people at it and you could get some more concrete suggestions based on the
> actual documents that you want analyzed.
>
> Tom
>