[School-of-data] [open-science] PDF Extraction Tools (Michael Bauer)

Daniel Lombraña González teleyinex at gmail.com
Fri Dec 21 10:51:00 UTC 2012

Hi there,

You can have a look at these other libraries:

http://mozilla.github.com/pdf.js/ <- This one is used in this PyBossa app
for transcribing PDFs that need human help, for example scanned images,
vector graphics, etc. Check the demo app here



On Fri, Dec 21, 2012 at 11:21 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

> I and collaborators are very actively working on this and making all
> resources available under OKD-compliant terms. The project is code-named
> "AMI2" and is initially targeted at scientific and technical publications.
> We use the excellent Apache PDFBox http://pdfbox.apache.org/ to extract
> the contents of the PDF and are layering tools on top of this. The end
> product - RSN! - will be XHTML with domain-specific inserts for math,
> chemistry, graphs and tables. The vision is to read millions of papers each
> year and extract data automatically.
> We would very much welcome a community effort to help with things like
> character encodings, sample material, knowledge of fonts, alpha-testing,
> etc. The work is blogged under #ami2 on my blog - see
> http://blogs.ch.cam.ac.uk/pmr/2012/10/20/opencontentmining-the-ami2-project-for-understanding-the-scientific-literature-and-why-i-love-pdf/.
> Be aware that "PDF" is not a simple concept. It can include:
> * photographs of paper artifacts (challenging)
> * scanned text (requires OCR)
> * text with characters (common in modern PDF publications)
> * text as vector graphics (i.e. no font info)
> * bitmaps of tables, graphs, etc.
> * vector graphics of graphs etc.
> Where the document has vector graphics (which come from EPS files, PPT,
> etc.) or characters, the ease and quality of extraction are much higher
> than for bitmaps. So whether any particular instance is feasible depends
> on the details.
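
To make that distinction concrete, here is a rough Python sketch of the list above. It is not AMI2 code: the enum, the function names and the bitmap grouping are made up for illustration, and they only restate what the list says (bitmaps are harder, scanned text needs OCR).

```python
from enum import Enum


class PdfContent(Enum):
    """Kinds of content a PDF can hold, following the list above."""
    PHOTOGRAPH = "photograph of a paper artifact"
    SCANNED_TEXT = "scanned text"
    CHARACTER_TEXT = "text with characters"
    VECTOR_TEXT = "text as vector graphics (no font info)"
    BITMAP_FIGURE = "bitmap of a table or graph"
    VECTOR_FIGURE = "vector graphics of a graph"


# Assumption: these three kinds are stored as pixels; the others are stored
# as characters or vector paths.
BITMAP_KINDS = frozenset({
    PdfContent.PHOTOGRAPH,
    PdfContent.SCANNED_TEXT,
    PdfContent.BITMAP_FIGURE,
})


def is_bitmap(kind):
    """True if the content is pixel data rather than characters or vectors."""
    return kind in BITMAP_KINDS


def needs_ocr(kind):
    """Only scanned text is marked 'requires OCR' in the list."""
    return kind is PdfContent.SCANNED_TEXT


if __name__ == "__main__":
    for kind in PdfContent:
        outlook = "harder (bitmap)" if is_bitmap(kind) else "easier (characters/vectors)"
        ocr = ", needs OCR" if needs_ocr(kind) else ""
        print(f"{kind.value}: {outlook}{ocr}")
```

A real document usually mixes several of these kinds, so in practice a check like this would have to run per page or per object rather than per file.
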
> Ross Mounce (OKF Panton fellow) and I are working very actively on this
> area.
>>> For the next School of Data tutorial I would like to cover data
>>> extraction from PDFs (and text) as well as OCR.
>>> Does anyone here have experience with tools that do not require coding
>>> skills to extract data and text from PDFs?
> It's my intention to make this an automatic process - we have a little way
> to go but it's months not years.
> We'd be very interested in anyone who likes hacking this sort of thing.
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dept. of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069

Please do NOT use proprietary file formats such as DOC and XLS to
exchange documents; use PDF, HTML, RTF, TXT, CSV or any other format
that does not force the use of one particular vendor's program to
handle the information it contains.