[open-science] [School-of-data] PDF Extraction Tools (Michael Bauer)

Fri Dec 21 11:14:24 UTC 2012

Hi,

There is a web service hosted at the University of Manchester for converting scientific articles from PDF to XML:
http://pdfx.cs.man.ac.uk/
It is based on a product called Utopia Documents or UtopiaDocs:
http://www.utopiadocs.com/
I do not know whether it uses Apache PDFBox.

I have copied in the contact for the web service and the info contact for UtopiaDocs.

Best Wishes

Andy
http://www.geog.leeds.ac.uk/people/a.turner/

From: open-science-bounces at lists.okfn.org [mailto:open-science-bounces at lists.okfn.org] On Behalf Of Peter Murray-Rust
Sent: 21 December 2012 10:22
To: Adam Stiles
Cc: school-of-data at lists.okfn.org; open-science
Subject: Re: [open-science] [School-of-data] PDF Extraction Tools (Michael Bauer)

My collaborators and I are working very actively on this and making all resources available under OKD-compliant terms. The project is code-named "AMI2" and is initially targeted at scientific and technical publications.

We use the excellent Apache PDFBox http://pdfbox.apache.org/ to extract the contents of the PDF and are layering tools on top of this. The end product - RSN! - will be XHTML with domain-specific inserts for math, chemistry, graphs and tables. The vision is to read millions of papers each year and extract data automatically.
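
To make the "XHTML with domain-specific inserts" idea concrete, here is a minimal sketch in Python using only the standard library. It is not AMI2 code. The function name build_xhtml, its parameters, and the choice of a MathML element as the "math insert" are all assumptions made for illustration; only the two namespace URIs are standard.

```python
import xml.etree.ElementTree as ET

# Standard W3C namespace URIs for XHTML and MathML.
XHTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("", XHTML_NS)      # serialize XHTML as the default namespace
ET.register_namespace("m", MATHML_NS)    # serialize MathML elements with an "m:" prefix


def build_xhtml(title, paragraphs, math_identifier=None):
    """Wrap already-extracted paragraphs in XHTML (hypothetical helper).

    If math_identifier is given, a tiny MathML fragment is appended to
    stand in for a domain-specific insert.
    """
    html = ET.Element(f"{{{XHTML_NS}}}html")
    head = ET.SubElement(html, f"{{{XHTML_NS}}}head")
    ET.SubElement(head, f"{{{XHTML_NS}}}title").text = title
    body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
    for text in paragraphs:
        ET.SubElement(body, f"{{{XHTML_NS}}}p").text = text
    if math_identifier is not None:
        # Assumption: a math insert is represented as MathML inside the body.
        math = ET.SubElement(body, f"{{{MATHML_NS}}}math")
        ET.SubElement(math, f"{{{MATHML_NS}}}mi").text = math_identifier
    return ET.tostring(html, encoding="unicode")


if __name__ == "__main__":
    print(build_xhtml("Sample paper", ["First extracted paragraph."], "x"))
```

The paragraphs here would come from whatever extraction layer sits underneath (PDFBox in our case); the sketch only shows the wrapping step.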

We would very much welcome a community effort to help with things like character encodings, sample material, knowledge of fonts, alpha-testing, etc. The work is blogged under #ami2 on my blog - see http://blogs.ch.cam.ac.uk/pmr/2012/10/20/opencontentmining-the-ami2-project-for-understanding-the-scientific-literature-and-why-i-love-pdf/ .

Be aware that "PDF" is not a simple concept. It can include:
* photographs of paper artifacts (challenging)
* scanned text (requires OCR)
* text stored as characters (common in modern PDF publications)
* text as vector graphics (i.e. no font info)
* bitmaps of tables, graphs, etc.
* vector graphics of graphs etc.

Where the document has vector graphics (which come from EPS files, PPT, etc.) or characters, the ease and quality of extraction are much higher than for bitmaps. So whether any particular instance is feasible depends on the details.
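
As a rough illustration of how the cases above differ, the sketch below labels an already-decompressed PDF page content stream by the kinds of drawing operators it contains. It is a heuristic written for this message, not part of PDFBox or AMI2. The operator names (Tj, TJ, Do, S, f, ...) come from the PDF specification; the function name classify_content_stream and the label strings are my own. It assumes no nested parentheses inside strings and no inline images, both of which a real parser would have to handle.

```python
import re

# Operator names as defined in the PDF content-stream syntax.
TEXT_OPS = {"Tj", "TJ", "'", '"'}                              # show text as characters
PAINT_OPS = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*"}   # stroke/fill vector paths
XOBJECT_OP = "Do"                                              # paint an XObject (often a bitmap)


def classify_content_stream(stream):
    """Return a set of rough labels for one decompressed content stream."""
    # Remove literal strings "(...)" and hex strings "<...>" so that their
    # contents are never mistaken for operators.
    stripped = re.sub(r"\((?:\\.|[^\\()])*\)", " ", stream)
    stripped = re.sub(r"<[0-9A-Fa-f\s]*>", " ", stripped)
    tokens = stripped.replace("[", " ").replace("]", " ").split()

    labels = set()
    for token in tokens:
        if token in TEXT_OPS:
            labels.add("characters")
        elif token in PAINT_OPS:
            labels.add("vector graphics")
        elif token == XOBJECT_OP:
            # Could also be a form XObject; only the resource dictionary
            # says whether it is really an image.
            labels.add("xobject")
    return labels


if __name__ == "__main__":
    page = "BT /F1 12 Tf 72 720 Td (Table 1: yields) Tj ET 100 100 m 200 200 l S"
    print(classify_content_stream(page))
```

A page that comes back with only "xobject" is likely a scan or photograph and needs OCR; a page with only "vector graphics" and no "characters" may be text drawn as outlines with no font information.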

Ross Mounce (OKF Panton fellow) and I are working very actively on this area.

> For the next School of Data tutorial I would like to cover Data extraction
> from PDFs (and text) as well as OCR.
>
> Does anyone here have experience with Tools that do not require coding
> skills to extract data and text from PDFs?

It's my intention to make this an automatic process - we have a little way to go but it's months not years.

We'd be very interested in hearing from anyone who likes hacking on this sort of thing.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069