[open-science] [School-of-data] PDF Extraction Tools (Michael Bauer)

Peter Murray-Rust pm286 at cam.ac.uk
Fri Dec 21 10:21:41 UTC 2012


My collaborators and I are very actively working on this and making all
resources available under OKD-compliant terms. The project is code-named
"AMI2" and is initially targeted at scientific and technical publications.

We use the excellent Apache PDFBox http://pdfbox.apache.org/ to extract the
contents of the PDF and are layering tools on top of this. The end product
- RSN! - will be XHTML with domain-specific inserts for math, chemistry,
graphs and tables. The vision is to read millions of papers each year and
extract data automatically.

We would very much welcome a community effort to help with things like
character encodings, sample material, knowledge of fonts, alpha-testing,
etc. The work is blogged under #ami2 on my blog - see
http://blogs.ch.cam.ac.uk/pmr/2012/10/20/opencontentmining-the-ami2-project-for-understanding-the-scientific-literature-and-why-i-love-pdf/.

Be aware that "PDF" is not a simple concept. It can include:
* photographs of paper artifacts (challenging)
* scanned text (requires OCR)
* text stored as characters (common in modern PDF publications)
* text as vector graphics (i.e. no font info)
* bitmaps of tables, graphs, etc.
* vector graphics of graphs etc.

Where the document has vector graphics (which come from EPS files, PPT,
etc.) or characters, the ease and quality of extraction is much higher than
for bitmaps. So whether any particular instance is feasible depends on the
details.
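
To make the distinction above concrete, here is a minimal sketch of how one
might decide which extraction routes apply to a page. It is not AMI2 code and
does not call PDFBox: the PageSummary class, its fields and the route names
are all hypothetical, and it assumes some earlier step has already counted
the characters, vector paths and bitmaps on the page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageSummary:
    """Hypothetical counts of the primitives found on one PDF page."""
    characters: int    # glyphs drawn with font information
    vector_paths: int  # lines and curves drawn as vector graphics
    bitmaps: int       # embedded raster images (scans, photos, figures)


def extraction_routes(page: PageSummary) -> List[str]:
    """Return the routes that apply to a page, easiest first.

    Characters and vector paths can be read from the PDF content directly;
    bitmaps need OCR or image analysis, which is harder and lossier.
    """
    routes = []
    if page.characters > 0:
        routes.append("read characters directly")
    if page.vector_paths > 0:
        routes.append("interpret vector paths")
    if page.bitmaps > 0:
        routes.append("OCR or image analysis")
    return routes


if __name__ == "__main__":
    # A modern born-digital page: text as characters plus a vector plot.
    print(extraction_routes(PageSummary(characters=3200, vector_paths=45, bitmaps=0)))
    # A scanned page: a single bitmap and nothing else.
    print(extraction_routes(PageSummary(characters=0, vector_paths=0, bitmaps=1)))
```

A real page often mixes several kinds of content, which is why the function
returns a list rather than a single answer.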

Ross Mounce (OKF Panton fellow) and I are working very actively on this
area.


>> For the next School of Data tutorial I would like to cover Data extraction
>> from PDFs (and text) as well as OCR.
>>
>> Does anyone here have experience with Tools that do not require coding
>> skills to extract data and text from PDFs?
>>
>>
It's my intention to make this an automatic process - we have a little way
to go but it's months not years.

We'd be very interested in anyone who likes hacking this sort of thing.


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069