[open-science] software to extract text from pdf

Fri Jun 21 07:03:00 UTC 2013

Hi all,

I know of some efforts that do the OCR with human volunteers, because of
all the problems that you have already described ;-)

For your problem (scanned images converted into PDFs), my recommendation
is to:

1.- Use OCR software to create a first version of the scanned documents.
2.- Ask volunteers to improve the OCR output on a platform like
CrowdCrafting.org (take a look at the PDF Transcribe app: it loads PDFs and
you can easily transcribe them).
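
As a rough illustration of these two steps, here is a minimal Python
sketch. It does not use the PyBossa or CrowdCrafting API: the function
names, the task fields and the "at least two volunteers agree" rule are
assumptions made for illustration, and `ocr` stands in for whatever OCR
tool you choose.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional


def build_tasks(pdf_urls: List[str], ocr: Callable[[str], str]) -> List[Dict[str, str]]:
    """Step 1: run OCR once per PDF and keep the draft next to the PDF URL.

    `ocr` is a hypothetical callable that takes a PDF URL and returns text.
    """
    return [{"pdf_url": url, "ocr_draft": ocr(url)} for url in pdf_urls]


def consensus(answers: List[str], min_agreement: int = 2) -> Optional[str]:
    """Step 2: accept a transcription once enough volunteers agree on it.

    Whitespace is normalized before comparing, so answers that differ only
    in spacing count as the same transcription. Returns None when no
    transcription has reached `min_agreement` votes yet.
    """
    if not answers:
        return None
    normalized = [" ".join(answer.split()) for answer in answers]
    text, votes = Counter(normalized).most_common(1)[0]
    return text if votes >= min_agreement else None


if __name__ == "__main__":
    # A fake OCR function, only for the demonstration.
    tasks = build_tasks(["http://example.org/scan-001.pdf"],
                        lambda url: "Tne quick brown fox")
    volunteer_answers = [
        "The quick brown fox",
        "The  quick brown fox",   # same text, extra space
        "The quick brown f0x",    # a volunteer typo
    ]
    print(tasks[0]["ocr_draft"])
    print(consensus(volunteer_answers))
```

The OCR draft is only a starting point for the volunteers; asking several
people to correct the same page and keeping the answer they agree on is
one simple way to filter out individual typos.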

There are several examples out there using a mix of the two previous steps.
For example:

1.- http://crowdcrafting.org/app/pdftranscribe/ Loads a PDF from a server,
and you ask the volunteer whatever you like: transcribe the figures,
describe the structure of a table, etc. It is open source and can be used
directly in CrowdCrafting, as it is powered by PyBossa (you need to host the
PDF files on a server with a simple configuration).
2.- http://www.oldweather.org/ This is a web tool from the Zooniverse team,
who are transcribing documents with the help of hundreds of
volunteers ;-) I don't know if it is open source, but you can definitely
contact them.
3.- http://boinc.cs.uct.ac.za/transcribe_bushman/ This project uses the
BOSSA framework, the inspiration for PyBossa, to have volunteers transcribe
really complex text from scanned images.
4.- https://www.zooniverse.org/project/ancientlives Similar to the Bushman
project, but with support from the Zooniverse team.
5.- Finally, http://www.pgdp.net/c/ is a community of volunteers who help
transcribe public domain books into electronic formats. I think this is a
great community, and you can talk to them to get more information about
their techniques, solutions, etc.

There are probably more solutions out there, but these are the ones I know.
All of them have their pros and cons.

I can help you with the PDF Transcribe solution, as I'm its developer, so
if you want to give it a try, let me know.

All the best,

Daniel
-- 
··························································································································································
http://daniellombrana.es
http://citizencyberscience.net
··························································································································································
Please do NOT use proprietary file formats such as DOC and XLS for
exchanging documents; use PDF, HTML, RTF, TXT, CSV or any other format
that does not force the use of a program from a specific vendor to
handle the information it contains.
··························································································································································