[School-of-data] 'Tabula helps you liberate data tables trapped inside evil PDFs'

Peter Murray-Rust pm286 at cam.ac.uk
Sun Apr 7 08:02:27 UTC 2013

On Sat, Apr 6, 2013 at 11:30 PM, Tom Morris <tfmorris at gmail.com> wrote:

> I think the corpus is valuable independent of any of the contest
> submissions.  It has extracts from 67 documents containing a variety of
> different table types along with ground truth information about the tables
> and a scoring methodology and tool set.  It allows one to take a data
> driven approach to evaluating tools and algorithms.

I agree this is valuable as well.

> The one drawback of the corpus for this forum is that it's entirely
> government documents from the EU & US, so it's not very representative of
> scientific publications.  Is there a similar corpus for journal articles
> (or any effort underway to produce one)?

It's very difficult to get mainstream publishers to allow material to be
published as a corpus - we tried this some years ago with chemistry, where 3
publishers allowed us to use content as long as we didn't publish it. So
it's not infrequent to find studies where the corpus is not available.

We have analyzed 500,000 chemical reactions in patents - with an element of
ground truth from complementary chemical information in the paper. But
almost all other publications are closed so we haven't been able to analyze
them.

We shall be exposing tables with AMI2 and it's probable that the CC-BY
publishers have a wide enough spread of table types. We can certainly
compare our findings with the tamirhassan corpus.


Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge