[School-of-data] 'Tabula helps you liberate data tables trapped inside evil PDFs'

Tom Morris tfmorris at gmail.com
Sat Apr 6 21:01:38 UTC 2013


Tabula's principal contribution seems to be the web uploading
interface and queuing mechanism.  It doesn't do page segmentation or
table identification, and its table processing seems somewhat
rudimentary.

Anyone interested in this will likely also want to look at the
ICDAR 2013 Table Competition, which is underway now, and its
associated corpus:
http://www.tamirhassan.com/competition.html


On Sat, Apr 6, 2013 at 1:02 AM, Eldan Goldenberg <eldang at gmail.com> wrote:

> I haven't tried it yet, but it doesn't look like the process for self-hosting on a local OSX or Linux box is too onerous:
> https://github.com/jazzido/tabula/blob/master/README.md#manual-installation-os-x-or-linux

The directions are really only for OS X, not Linux.  I just did an
Ubuntu installation and it requires building the latest OpenCV & MuPDF
from source, so it's not as simple as using a package manager, but
it's doable given 750 MB of free disk space and a spare hour.
I'll contribute the Ubuntu directions back in a pull request to make
it easier for the next person, but the Tabula developers could simplify
matters greatly with a little attention to the dependencies.

On Sat, Apr 6, 2013 at 3:16 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

> [Tabula] still (I think) needs tables fed one at a time

Correct, they don't do any page segmentation or table identification.
The table boundaries need to be hand-drawn for each table and the
resulting CSV data copied individually.  It would be pretty tedious
for a paper with lots of tables.

It's got some other quirks (e.g. putting question marks in empty
cells) and limitations (e.g. word segmentation is dodgy).  It would be
interesting to see how its quality compares to PDF2Table or some of
the commercial alternatives (a good meta project).
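
The question-mark quirk can be worked around after the fact.  What
follows is only a minimal sketch in Python, not anything Tabula itself
provides: the function name blank_placeholder_cells and the assumption
that an empty cell arrives as a lone "?" are illustrative, and a cell
whose real value is "?" would be blanked as well.

```python
import csv
import io


def blank_placeholder_cells(csv_text, placeholder="?"):
    """Return csv_text with every cell equal to `placeholder` emptied.

    Hypothetical helper: assumes empty cells were exported as a lone "?".
    Cells that merely contain a question mark (e.g. "What?") are kept.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Compare the stripped cell so " ? " is treated like "?".
        writer.writerow(
            ["" if cell.strip() == placeholder else cell for cell in row]
        )
    return out.getvalue()


if __name__ == "__main__":
    pasted = "Region,2011,2012\nNorth,?,14\nSouth,9,?\n"
    print(blank_placeholder_cells(pasted), end="")
```

Since each table's CSV has to be copied out individually anyway, the
pasted text can be run through the function one table at a time.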

Tom



