[okfn-labs] New PDF Table transcription for CrowdCrafting/PyBossa

Stefan Wehrmeyer stefan.wehrmeyer at okfn.org
Sat Sep 14 15:04:11 UTC 2013

Hi Daniel,

Tabula has shown that manually transcribing simple PDF tables is not really necessary if the table contains proper text (and is not an image):

If the table in the PDF is an image, crowd transcription makes much more sense. Even then, I would still propose doing some pre-computation first.
I wrote a library called carpenter that uses OpenCV to find rectangular structures in images and converts them to HTML tables. It then goes on to use tesseract to OCR the individual cells:

The table extraction and the OCR are both error-prone, so having something like CrowdCrafting check the results of each step would probably be in order.
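
To make the extraction step more concrete, here is a rough sketch of how detected cell rectangles could be grouped into rows and rendered as an HTML table. This is not carpenter's actual code: the function name `cells_to_html`, the `(x, y, w, h)` box format and the `row_tolerance` parameter are assumptions made for illustration, and the rectangle detection (OpenCV) and the OCR (tesseract) are assumed to have already run.

```python
from html import escape


def cells_to_html(boxes, texts, row_tolerance=10):
    """Group cell boxes into rows by their top edge and render an HTML table.

    boxes: list of (x, y, w, h) tuples, one per detected cell (assumed format).
    texts: OCR result for each box, in the same order as boxes.
    row_tolerance: max vertical offset (pixels) from the first cell of a row.
    """
    # Sort cells top to bottom, then left to right.
    cells = sorted(zip(boxes, texts), key=lambda c: (c[0][1], c[0][0]))

    rows = []
    for box, text in cells:
        # Compare against the top edge of the first cell in the current row.
        if rows and abs(box[1] - rows[-1][0][0][1]) <= row_tolerance:
            rows[-1].append((box, text))
        else:
            rows.append([(box, text)])

    lines = ["<table>"]
    for row in rows:
        row.sort(key=lambda c: c[0][0])  # order cells left to right
        tds = "".join("<td>{}</td>".format(escape(t)) for _, t in row)
        lines.append("<tr>{}</tr>".format(tds))
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    boxes = [(110, 12, 100, 30), (10, 10, 100, 30),
             (10, 50, 100, 30), (110, 48, 100, 30)]
    texts = ["B", "A", "C", "D & E"]
    print(cells_to_html(boxes, texts))
```

A crowd task could then show the rendered table next to the source image, so that both the row grouping and the OCR text of each cell can be checked separately.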


On 13.09.2013, at 14:35 , Daniel Lombraña González <teleyinex at gmail.com> wrote:

> Hi there!
> Today I'm really happy to announce a new application/template for PyBossa that can be used in CrowdCrafting.org for transcribing tables locked in PDF files :-D
> The application is very similar to the PDF transcription one, as it is a new version of it, but showing how you can integrate a tabular data library to format the transcriptions easily.
> The application basically loads a PDF file (that can be hosted in your public Dropbox folder!) and asks you how many columns the table on the page has, if any. Then, if the answer is 5, a new table will be created automatically, adding a new row every time you complete one! Simple and clean!
> Each row is stored as a list in a JSON object, making it really easy to parse and export to other formats.
> Here you have a short Youtube video showing the app: http://www.youtube.com/watch?v=yfnJHALzlZc
> The application: http://crowdcrafting.org/app/pdftabletranscribe/
> And the official Tweet: https://twitter.com/teleyinex/status/378474287532744704
> NOTE: this app works really well when each page contains only one table and no cells are merged. For other cases the template should be adapted; this is just the minimum version to work with. The handsontable library is really awesome, so you can adapt it to your needs without problems.
> All the best,
> Daniel
> -- 
> http://daniellombrana.es
> http://citizencyberscience.net
> http://www.shuttleworthfoundation.org/fellows/daniel-lombrana/
> ··························································································································································
> Please do NOT use proprietary file formats to share files
> like DOC or XLS, instead use PDF, HTML, RTF, TXT, CSV or
> any other format that does not impose on the user the employment
> of any specific software to work with the information inside the files.
> ··························································································································································
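
Regarding the rows-as-lists JSON format mentioned in the announcement quoted above, exporting it to CSV could look roughly like this. The exact structure of the stored answer is not given in the announcement, so the top-level `"rows"` key and the function name `rows_to_csv` are assumptions for illustration only.

```python
import csv
import io
import json


def rows_to_csv(answer_json):
    """Convert a JSON string like {"rows": [[...], [...]]} into CSV text.

    The "rows" key is a hypothetical name; adapt it to the real answer format.
    """
    data = json.loads(answer_json)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(data["rows"])  # each row is a plain list of cell values
    return out.getvalue()


if __name__ == "__main__":
    answer = '{"rows": [["Year", "Budget"], ["2012", "1,500"]]}'
    print(rows_to_csv(answer), end="")
```

Because each row is already a flat list, the `csv` module can write it directly and takes care of quoting cells that contain commas.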

Stefan Wehrmeyer
Project lead, FragDenStaat.de
stefan.wehrmeyer at okfn.org
+49 151 15550559
Open Knowledge Foundation Deutschland e.V.
Gneisenaustr. 52 
10961 Berlin

Donate to FragDenStaat.de:
