[School-of-data] 'Tabula helps you liberate data tables trapped inside evil PDFs'
Peter Murray-Rust
pm286 at cam.ac.uk
Sat Apr 6 21:45:47 UTC 2013
On Sat, Apr 6, 2013 at 10:01 PM, Tom Morris <tfmorris at gmail.com> wrote:
> Tabula's principal contribution seems to be the web uploading
> interface and queuing mechanism. It doesn't do page segmentation or
> table identification and its table processing seems somewhat
> rudimentary.
>
Everyone starts somewhere. For me the main virtue is that it's Open and
enthusiastic.
>
> People who are interested in this will likely be interested in the
> ICDAR 2013 Table competition which is underway now and its associated
> corpus.
> http://www.tamirhassan.com/competition.html
>
>
Thanks for this. I would certainly be interested in any code which was
openly re-usable. It's valuable to have more than one tool anyway.
>
> Correct, they don't do any page segmentation or table identification.
> The table boundaries need to be hand-drawn for each table and the
> resulting CSV data copied individually. It would be pretty tedious
> for a paper with lots of tables.
>
But this combines very well with ami2 (bitbucket.org/svg2xml-dev) which
does page segmentation and can identify tables from captions. So between
the two we are a long way down the road.
--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069