[okfn-labs] New PDF Table transcription for CrowdCrafting/PyBossa

Daniel Lombraña González teleyinex at gmail.com
Fri Sep 20 09:26:17 UTC 2013


Hi Stefan,

On Sat, Sep 14, 2013 at 5:04 PM, Stefan Wehrmeyer <stefan.wehrmeyer at okfn.org> wrote:

> Hi Daniel,
>
> tabula has shown that transcribing simple PDF tables is really not
> necessary, if the table contains proper text (and it's not an image):
> http://tabula.nerdpower.org/
>

Indeed! But you said it: proper text and it is not an image :-)


>
> If the PDF contains the table as an image, this makes much more sense.
> However, I would still propose to do some pre-computation.
> I wrote a library called carpenter that uses OpenCV to find rectangular
> structures in images and converts them to HTML tables. It then goes on to
> use tesseract to OCR the individual cells:
> https://github.com/stefanw/carpenter
>
> The table extraction and the OCR are both error-prone, and having something
> like crowdcrafting to check the results of each step would probably be in
> order.
>

Yes, that's basically what I said after my first e-mail: the application
I've created can actually be populated that way. However, bear in mind
that for some groups the knowledge (and IT skills) required to do all
those tasks could be challenging. For that reason, the CrowdCrafting.org
solution involves just a few simple steps:

1.- Upload the PDF file to your own Dropbox Public folder
2.- Copy its public link
3.- Copy the PDF Google Spreadsheet doc template, and replace the URL and
number of pages that you want to transcribe with the new ones
4.- Import it into your CrowdCrafting app and you are done!
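
Steps 2 and 3 could also be scripted. This is a minimal sketch, assuming
the template boils down to one task row per page holding the PDF link and a
page number. The column names `pdf_url` and `page`, the helper names and
the example link are hypothetical, not taken from the actual template:

```python
import csv
import io


def build_task_rows(pdf_url, num_pages):
    """Return one task dict per page (hypothetical columns: pdf_url, page)."""
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    return [{"pdf_url": pdf_url, "page": page}
            for page in range(1, num_pages + 1)]


def tasks_to_csv(rows):
    """Serialize the task rows as CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["pdf_url", "page"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder link; replace it with the public link copied in step 2.
    print(tasks_to_csv(build_task_rows("https://example.org/report.pdf", 3)))
```

The resulting text could then be pasted into a copy of the spreadsheet,
provided its columns are renamed to match whatever the template expects.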

But again, don't get me wrong, I totally agree with you. Running OCR first
is much better :-)
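
On the export side, the announcement quoted below says that each
transcribed row is stored as a list in a JSON object. Here is a minimal
sketch of turning such an answer into CSV, assuming the rows sit under a
key named `rows`. That key and the helper name are hypothetical, and the
real answer layout may differ:

```python
import csv
import io
import json


def rows_json_to_csv(answer_json, key="rows"):
    """Convert a JSON object holding a list of row lists into CSV text.

    The key name is an assumption made for illustration only.
    """
    data = json.loads(answer_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in data.get(key, []):
        writer.writerow(row)  # each row is a plain list of cell values
    return buffer.getvalue()


if __name__ == "__main__":
    sample = '{"rows": [["Year", "Total"], ["2012", "1,5"]]}'
    print(rows_json_to_csv(sample))
```

Cells containing a comma are quoted by the csv module, so the output can be
opened directly in a spreadsheet.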

Cheers,

Daniel

>
> Cheers
> Stefan
>
>
> On 13.09.2013, at 14:35 , Daniel Lombraña González <teleyinex at gmail.com>
> wrote:
>
> > Hi there!
> >
> > Today I'm really happy to announce a new application/template for
> > PyBossa that can be used in CrowdCrafting.org for transcribing tables
> > locked in PDF files :-D
> >
> > The application is very similar to the PDF transcription one, as it is a
> > new version of it, but it shows how you can integrate a tabular data
> > library to format the transcriptions easily.
> >
> > The application basically loads a PDF file (which can be hosted in your
> > public Dropbox folder!) and asks you how many columns the table on the
> > page has, if any. Then, if the answer is 5, a new table will be
> > automatically created, adding a new row every time you complete one!
> > Simple and clean!
> >
> > Each row is stored as a list in a JSON object, making it really easy to
> > parse and export to other formats.
> >
> > Here is a short YouTube video showing the app:
> > http://www.youtube.com/watch?v=yfnJHALzlZc
> >
> > The application: http://crowdcrafting.org/app/pdftabletranscribe/
> >
> > And the official Tweet:
> > https://twitter.com/teleyinex/status/378474287532744704
> >
> > NOTE: this app works really well when each page contains only one
> > table and there are no merged cells. For other cases the template should
> > be adapted; this is just the minimal version to start from. The
> > handsontable library is really awesome, so you can adapt it to your needs
> > without problems.
> >
> > All the best,
> >
> > Daniel
> >
> > --
> > http://daniellombrana.es
> > http://citizencyberscience.net
> > http://www.shuttleworthfoundation.org/fellows/daniel-lombrana/
> >
> ··························································································································································
> > Please do NOT use proprietary file formats to share files
> > like DOC or XLS, instead use PDF, HTML, RTF, TXT, CSV or
> > any other format that does not impose on the user the employment
> > of any specific software to work with the information inside the files.
> >
> ··························································································································································
>
> --
> Stefan Wehrmeyer
> Project lead, FragDenStaat.de
> stefan.wehrmeyer at okfn.org
> +49 151 15550559
> Open Knowledge Foundation Deutschland e.V.
> Gneisenaustr. 52
> 10961 Berlin
> http://www.okfn.de
>
> Donate to FragDenStaat.de:
> https://fragdenstaat.de/hilfe/spenden/
>
>
>
>

