[okfn-labs] Questions on data scraping tables within "picture" pdfs.

Daniel Lombraña González teleyinex at gmail.com
Mon Feb 11 07:43:01 UTC 2013


Hi there,

Do you know Tabula? (http://tabula.nerdpower.org/) Have a look at it, and
maybe you can contact the author, as they do an amazing job of transcribing
data directly from PDF files. I think it could be worthwhile to talk to
Manuel, as his software does this really well.

Regarding the cell-level control: that would be even better :-) It will be
slower, but the quality will be much better. As I said, I still think the
first task could be to ask users to detect the table structure, and then,
once you have an answer, use it to populate the table by row, column, or
cell.
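
Here is a minimal sketch of that two-stage idea in R, assuming the
structure-detection task returns agreed row and column counts for a page
(the function and field names are hypothetical):

# One transcription microtask per cell of the detected structure
make_cell_tasks <- function(page_id, n_rows, n_cols) {
  cells <- expand.grid(row = 1:n_rows, col = 1:n_cols)
  cells$task_id <- paste(page_id, cells$row, LETTERS[cells$col], sep = "-")
  cells
}
head(make_cell_tasks("page-12", n_rows = 5, n_cols = 4))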

Cheers,

Daniel


On Sat, Feb 9, 2013 at 2:55 AM, Hans Thompson <hans.thompson1 at gmail.com> wrote:

>    I still need to spend some time with all the Python-related links, but
> I wanted to share some more first.  Thanks for the links, Rufus.  It
> sounds like this will require some time getting into JavaScript.
>
>     I started this project out by asking my city controller's department
> (Anchorage, AK) whether they could give me the past year's CAFR data in a
> tabular form and whether they would make it public.  After they said this
> wasn't going to be possible because the IT department is very busy, I
> started talking to my elected officials to get more answers.  What I
> learned from the process was that my local government has little interest
> in publishing financial data; their main argument against the idea, and
> I'm generously paraphrasing, was that "it could be manipulated and used
> against us".  What does make sense to me, though, is that there is a legal
> risk if financial data isn't transcribed by a rigorous, quality-controlled
> process that ensures its integrity.
>
>     Perhaps any method of OCR or computer vision could be seen, in legal
> terms, as "manipulating the data".  I'm not a lawyer; that's just a
> concern I have with automated processes for this set of tasks.  The city
> CIO, who is actually very friendly to open data and a sophisticated dude,
> talked about this problem, and I now agree with him that any inaccuracies
> between the official copy and the transcription would make the whole
> converted dataset invalid.  That says to me that I should avoid OCR for
> recognizing table dimensions.
>
>     Daniel, I found the Brazilian project when I first got curious about
> using microtasks to convert tables, about nine months ago.  I think it is
> the best current option (open source or otherwise) for using microtasking
> to convert tables with various dimensions and different page orientations.
> However, there are some advantages to splitting the tasks up into smaller
> individual-cell chunks.  Converting the whole table as one microtask
> requires more focus on following the orientation than on transcription,
> and the eye has to do more work.  Also, having individual cells as tasks
> allows greater control over quality control: each cell can be assessed for
> correctness more easily, and retested more easily if there is an
> inconsistency (see the sketch just below).
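>
> As a minimal sketch of that per-cell checking in R, assuming each cell is
> transcribed by several people (the input format here is made up):
>
> # Accept a cell when a minimum share of transcribers agree on its value;
> # otherwise flag it (NA) so the cell can be sent back out for retesting.
> check_cell <- function(answers, min_agreement = 2/3) {
>   counts <- sort(table(answers), decreasing = TRUE)
>   if (counts[1] / length(answers) >= min_agreement) {
>     names(counts)[1]   # the accepted value
>   } else {
>     NA                 # no consensus yet
>   }
> }
> check_cell(c("1,234", "1,234", "1.234"))  # -> "1,234"
> check_cell(c("1,234", "1,284", "1.234"))  # -> NA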
>
>    Tom, I just read that COGNOS is offering an XBRL publishing service,
> so maybe I can petition for that change, since the city is in the midst of
> converting over to a COGNOS system.  Interesting.
>
>     There are some cities with greater disclosure within their PDFs than
> others.  This tool I am trying to create is for the PDFs that are
> trouble.  And I simply don't trust high-end OCR (I'm thinking ABBYY or
> Adobe).  I'm not sure what the best way to take ALL the data from each
> city would be, but I am most interested in Anchorage, AK at the moment,
> because they use the paper-scan method and that's where I live.  They are
> definitely not the only city to use this method, but I can't refer you to
> any others I've recorded right now because I'm away on vacation and
> staying warm.  Here's a quick link to the Anchorage CAFR; the tables that
> I want the most are in the Statistics section:
>
> http://www.muni.org/Departments/finance/controller/Pages/CAFR.aspx
>
>     The first 15 tables in each year's Statistical Section interest me
> the most because they are the most consistent across US cities.
>
>     Below is a general process I would like to implement, with some code
> I wrote in R that will divide .jpg tables into individual cells.  I can't
> find a way to open other image file types in R and manipulate them.  I'm
> most familiar with R right now, in case you are scratching your head
> asking "why is this stranger using R?".  There are two small problems I
> need to work out with the part I wrote:
>
> 1. The images are cropping with some extra pixels, and I don't know why
> yet (possibly the device's default plot margins; see the par() call in the
> loop below).
>
> 2. You need to close R after running the script, because I am still
> learning how to turn off the active graphics device (a dev.off() inside
> the loop should take care of this).
>
> The limitations I've found with using R for this are leading me to learn
> Python, though I may come back to R for QC.
>
> #TABLE CONVERSION
> #1. Convert the pdf to images (page sized).
> #2. Parse out pages with no tables (Microtask #1).  Create a new subset
> #   object of all the pages that did have tables.
> #3. Go through each page and use a cropping tool to select the body of
> #   the table and crop off the title of the table.
> #4. Put both image objects together, with their designations, in a
> #   container (like a list in R).
> #5. Begin R code to split the table images (.jpg):
> ##START##
> ## Load the package for reading .jpgs.  I'm having trouble finding any
> ## other packages that handle other image file types, or getting them to work.
> library(ReadImages)
> # file: set this to the image's file path, in quotes
> mygraph <- read.jpeg(file)
> plot(mygraph)
> ## Find columns and rows
> # User call for marking the rows: click the top of the first row, each
> # row boundary in turn, and finally the bottom of the last row.
> rows <- locator(type = 'p', pch = 3, col = 'red', lwd = 1.2, cex = 1.2)
> # User call for marking the columns: click the left of the first column,
> # each column boundary in turn, and finally the right of the last column.
> cols <- locator(type = 'p', pch = 3, col = 'black', lwd = 1.2, cex = 1.2)
> # Keep the relevant coordinate from each set of clicks; sort so the result
> # no longer depends on the click order (rows run top to bottom, i.e.
> # decreasing y in plot coordinates)
> rows <- sort(as.data.frame(rows)[, 2], decreasing = TRUE)
> cols <- sort(as.data.frame(cols)[, 1])
> # Optionally add the outer bounds of the image (dim() gives height, width)
> #rows <- c(dim(mygraph)[1], rows, 0)
> #cols <- c(0, cols, dim(mygraph)[2])
> ## Create a table of cells (called orderedcellsdf).  Row coordinates must
> ## repeat slowly (each = ncols) and column coordinates quickly
> ## (times = nrows) so they line up with the cell names built below.
> nrows <- length(rows) - 1
> ncols <- length(cols) - 1
> rowstart <- rep(rows[1:nrows], each = ncols)
> rowend   <- rep(rows[2:(nrows + 1)], each = ncols)
> colstart <- rep(cols[1:ncols], times = nrows)
> colend   <- rep(cols[2:(ncols + 1)], times = nrows)
> orderedcellsdf <- data.frame(rowstart, rowend, colstart, colend)
> ## Name the rows in the table of cells: "1A", "1B", ..., "2A", ...
> colrep <- rep(LETTERS[1:ncols], times = nrows)
> rowrep <- rep(1:nrows, each = ncols)
> rownames(orderedcellsdf) <- paste(rowrep, colrep, sep = "")
> ## Write one .jpg per cell
> for (i in 1:nrow(orderedcellsdf)) {
>   jpeg(paste(rownames(orderedcellsdf)[i], ".jpg", sep = ""))
>   par(mar = rep(0, 4))  # no margins, so no extra pixels around the cell
>   # ylim is (rowend, rowstart): ascending y, since rows were clicked top-down
>   plot(mygraph, ylim = as.numeric(orderedcellsdf[i, 2:1]),
>        xlim = as.numeric(orderedcellsdf[i, 3:4]))
>   dev.off()  # close the device each pass, so R can keep running
> }
> ##END##
> #6 Now that these cell images are split out with their designations,
> #   they can be fed to workers individually.
> # Each cell could now be sent through OCR or transcribed by a human.  Or
> # why not BOTH!? :-O  (A sketch of the OCR half follows.)
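>
> Here is a minimal sketch of the OCR half in R, assuming the tesseract
> command-line tool is installed and on the PATH (the per-cell file names
> follow the naming scheme above):
>
> # Run one cell image through the tesseract CLI and read the text back.
> # tesseract writes its output to <base>.txt.
> ocr_cell <- function(cellfile) {
>   base <- sub("\\.jpg$", "", cellfile)
>   system(paste("tesseract", cellfile, base))
>   readLines(paste(base, ".txt", sep = ""), warn = FALSE)
> }
> machine <- sapply(list.files(pattern = "\\.jpg$"), ocr_cell)
> # The machine readings can then be compared with the human transcriptions
> # cell by cell, flagging any disagreements for review.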
>
>
>
>
> On Fri, Feb 8, 2013 at 6:11 AM, Tom Morris <tfmorris at gmail.com> wrote:
>
>> On Thu, Feb 7, 2013 at 8:20 PM, Hans Thompson <hans.thompson1 at gmail.com> wrote:
>>
>>> I realize now that I could have been more specific about my
>>> requirements.  Sorry about that.  The documents I am trying to convert
>>> are not forms with standard formatting on each page, or tables with
>>> lines already drawn between rows/columns.
>>>
>>
>> More information is almost always better.  Examples are best, if you can
>> provide them.
>>
>>
>>> The documents I have in mind for this project are the Comprehensive
>>> Annual Financial Reports (CAFRs) for US cities.  The content of the
>>> tables is standardized by the Governmental Accounting Standards Board
>>> (GASB) and follows Generally Accepted Accounting Principles (GAAP), so
>>> comparisons of accounts by city or year are possible.  Each city
>>> publishes its CAFR independently, though, which necessitates a general
>>> tool to break apart tables without computer vision for recognizing
>>> them.  I'd like to use a series of microtasks with some quality
>>> control.  OCR of the final cells seems smart.
>>>
>>
>> They don't use XBRL for any reporting do they?  That would make things
>> much easier.
>>
>> As for PDFs, the few samples that I looked at were all standard text
>> PDFs with embedded tables, not scanned image PDFs.  Before resorting to
>> OCR, I'd look at processing these in the PDF text domain (see the sketch
>> below).  Even if you had to resort to OCR, I'd first just try a standard
>> high-end OCR package.  They can do some reasonably sophisticated layout
>> analysis, and I wouldn't rule them out without running some experiments.
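>>
>> As a minimal sketch of staying in the text domain, assuming the poppler
>> pdftotext tool is installed (the file name is just an example):
>>
>> # -layout preserves the horizontal positions of the text, so table
>> # columns usually survive as runs of whitespace in the output.
>> system("pdftotext -layout cafr2012.pdf cafr2012.txt")
>> pages <- readLines("cafr2012.txt")
>> # From here, columns can be split on the whitespace runs in each line.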
>>
>> If you've already collected a corpus of documents, perhaps you could
>> point people at it and you could get some more concrete suggestions based
>> on the actual documents that you want analyzed.
>>
>> Tom
>>
>
>


-- 
··························································································································································
http://daniellombrana.es
http://www.flickr.com/photos/teleyinex
··························································································································································
Please do NOT use proprietary file formats such as DOC and XLS for
exchanging documents; use PDF, HTML, RTF, TXT, CSV, or any other format
that does not force the use of a specific vendor's program to read the
information contained in it.
··························································································································································