[open-science] New PDF Table transcription for CrowdCrafting/PyBossa

Daniel Lombraña González teleyinex at gmail.com
Mon Sep 23 07:40:55 UTC 2013


Hi,


On Fri, Sep 20, 2013 at 1:57 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

> Many thanks Daniel
>

You are welcome!


>
> Anders Pedersen and I had a constructive discussion about tables and their
> taxonomy and I am looking at some of the ones he sent me. (As you know I am
> looking at how machines analyze tables - I think this is directly
> complementary to your application).
>

100% with you :-) The application/template I've created is for cases where
OCR is not enough, for example old documents where all you have is a PDF
with scanned images inside it and the cells are handwritten. The future of
these applications will be a mixture of an automated tool and verification
by humans, who fix the errors and so improve the algorithm (i.e. the human
output is fed back into the neural network to train it again).
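
To make that loop a bit more concrete, here is a minimal sketch in Python.
It is not the actual CrowdCrafting/PyBossa code: the function names, the
simple majority vote and the min_agreement threshold are all assumptions
made up for illustration. The idea is only that several volunteers
transcribe the same cell, their answers are compared with the OCR guess,
and the confirmed corrections are kept as new training examples.

```python
from collections import Counter


def reconcile_cell(ocr_guess, volunteer_answers, min_agreement=2):
    """Return (value, needs_review, is_correction) for one table cell.

    Hypothetical rule: accept the most common volunteer answer only if at
    least `min_agreement` volunteers gave it; otherwise keep the OCR guess
    and flag the cell for another round of human review.
    """
    answers = [a.strip() for a in volunteer_answers if a and a.strip()]
    if not answers:
        return ocr_guess, True, False
    value, votes = Counter(answers).most_common(1)[0]
    if votes < min_agreement:
        return ocr_guess, True, False
    # A correction is an agreed human value that differs from the OCR guess.
    return value, False, value != ocr_guess


def collect_training_examples(cells):
    """Keep only the cells where humans agreed and the OCR was wrong.

    `cells` is an iterable of (cell_id, ocr_guess, volunteer_answers).
    The returned (cell_id, corrected_value) pairs are what would be fed
    back to retrain the automated tool.
    """
    examples = []
    for cell_id, ocr_guess, volunteer_answers in cells:
        value, needs_review, is_correction = reconcile_cell(
            ocr_guess, volunteer_answers
        )
        if is_correction:
            examples.append((cell_id, value))
    return examples
```

Cells where the volunteers disagree stay flagged for review instead of
being guessed, so only agreed corrections reach the training set.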


>
> Many "tables" are not rectangular tables, but simply ways of laying out
> information using Excel or HTML tables. Common problems are nested tables,
> tables which concatenate tables, merged cells etc. Others simply change
> semantics at random places. These are objectively difficult to describe! I'm
> not aware of a formal classification of this problem but it would be
> valuable. [https://en.wikipedia.org/wiki/Table_%28information%29 hints at
> the problem]
>

Yes, the problem is much bigger than simple tables! Probably with all our
efforts we will end up with a platform that will be very useful for almost
all scenarios :-)

Cheers,

Daniel

>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>



-- 
http://daniellombrana.es
http://citizencyberscience.net
http://www.shuttleworthfoundation.org/fellows/daniel-lombrana/
··························································································································································
Please do NOT use proprietary file formats to share files
like DOC or XLS, instead use PDF, HTML, RTF, TXT, CSV or
any other format that does not impose on the user the employment
of any specific software to work with the information inside the files.
··························································································································································