[School-of-data] 'Tabula helps you liberate data tables trapped inside evil PDFs'

Peter Murray-Rust pm286 at cam.ac.uk
Sat Apr 6 21:45:47 UTC 2013


On Sat, Apr 6, 2013 at 10:01 PM, Tom Morris <tfmorris at gmail.com> wrote:

> Tabula's principal contribution seems to be the web uploading
> interface and queuing mechanism.  It doesn't do page segmentation or
> table identification and its table processing seems somewhat
> rudimentary.
>

Everyone starts somewhere. For me the main virtue is that it's Open and
enthusiastic.


>
> People who are interested in this will likely be interested in the
> ICDAR 2013 Table competition which is underway now and its associated
> corpus.
> http://www.tamirhassan.com/competition.html
>

Thanks for this. I would certainly be interested in any code which was
openly re-usable. It's valuable to have more than one tool anyway.


>
> Correct, they don't do any page segmentation or table identification.
> The table boundaries need to be hand-drawn for each table and the
> resulting CSV data copied individually.  It would be pretty tedious
> for a paper with lots of tables.
>

But this combines very well with ami2 (bitbucket.org/svg2xml-dev) which
does page segmentation and can identify tables from captions. So between
the two we are a long way down the road.



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069