[School-of-data] 'Tabula helps you liberate data tables trapped inside evil PDFs'

Tom Morris tfmorris at gmail.com
Sat Apr 6 22:30:46 UTC 2013


On Sat, Apr 6, 2013 at 5:45 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

>
> On Sat, Apr 6, 2013 at 10:01 PM, Tom Morris <tfmorris at gmail.com> wrote:
>
>>
>> People who are interested in this will likely be interested in the
>> ICDAR 2013 Table competition which is underway now and its associated
>> corpus.
>> http://www.tamirhassan.com/competition.html
>>
>
> Thanks for this. I would certainly be interested in any code which was
> openly re-usable. It's valuable to have more than one tool anyway.
>

I think the corpus is valuable independent of any of the contest
submissions.  It has extracts from 67 documents containing a variety of
different table types, along with ground-truth information about the tables,
a scoring methodology, and a tool set.  It allows one to take a data-driven
approach to evaluating tools and algorithms.

The one drawback of the corpus for this forum is that it's entirely
government documents from the EU & US, so it's not very representative of
scientific publications.  Is there a similar corpus for journal articles
(or any effort underway to produce one)?

Tom