[School-of-data] [open-science] PDF Extraction Tools (Michael Bauer)

Fri Dec 21 11:18:19 UTC 2012

On Fri, Dec 21, 2012 at 11:14 AM, Andy Turner <A.G.D.Turner at leeds.ac.uk> wrote:

> Hi,
>
> There is a web service hosted at the University of Manchester for
> converting scientific articles from PDF to XML:
>
> http://pdfx.cs.man.ac.uk/
>
> It is based on a product called Utopia Documents or UtopiaDocs:
>
> http://www.utopiadocs.com/
>
> Maybe this uses Apache PDFBox, and maybe not.
>

I am familiar with this. It does a reasonable but not complete job. It's
run by Steve Pettifer.

> I copied in the contact for the web service and the info contact of
> UtopiaDocs.
>
> Best Wishes
>
> Andy
> http://www.geog.leeds.ac.uk/people/a.turner/
>
> *From:* open-science-bounces at lists.okfn.org [mailto:
> open-science-bounces at lists.okfn.org] *On Behalf Of *Peter Murray-Rust
> *Sent:* 21 December 2012 10:22
> *To:* Adam Stiles
> *Cc:* school-of-data at lists.okfn.org; open-science
> *Subject:* Re: [open-science] [School-of-data] PDF Extraction Tools
> (Michael Bauer)
>
> I and collaborators are very actively working on this and making all
> resources available under OKD-compliant terms. The project is code-named
> "AMI2" and is initially targeted at scientific and technical publications.
>
> We use the excellent Apache PDFBox http://pdfbox.apache.org/ to extract
> the contents of the PDF and are layering tools on top of this. The end
> product - RSN! - will be XHTML with domain-specific inserts for math,
> chemistry, graphs and tables. The vision is to read millions of papers each
> year and extract data automatically.
>
> We would very much welcome a community effort to help with things like
> character encodings, sample material, knowledge of fonts, alpha-testing,
> etc. The work is blogged under #ami2 on my blog - see
> http://blogs.ch.cam.ac.uk/pmr/2012/10/20/opencontentmining-the-ami2-project-for-understanding-the-scientific-literature-and-why-i-love-pdf/.
>
> Be aware that "PDF" is not a simple concept. It can include:
> * photographs of paper artifacts (challenging)
> * scanned text (requires OCR)
> * text with characters (common in modern PDF publications)
> * text as vector graphics (i.e. no font info)
> * bitmaps of tables, graphs, etc.
> * vector graphics of graphs etc.
>
> Where the document has vector graphics (which come from EPS files, PPT,
> etc.) or characters, the ease and quality of extraction is much higher than
> for bitmaps. So whether any particular instance is feasible depends on
> details.
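
The distinction drawn in the quoted list and paragraph above can be sketched in a few lines of Python. This is only an illustration of the triage idea, not code from AMI2 or PDFBox: the names PdfContent, BITMAP_KINDS, expected_ease and triage are hypothetical, and grouping photographs and scans with bitmaps is an assumption (they are normally stored as pixel images).

```python
from enum import Enum


class PdfContent(Enum):
    """The six kinds of PDF content listed in the message above."""
    PHOTOGRAPH = "photograph of a paper artifact"
    SCANNED_TEXT = "scanned text"
    CHARACTER_TEXT = "text with characters"
    VECTOR_TEXT = "text as vector graphics"
    BITMAP_FIGURE = "bitmap of a table or graph"
    VECTOR_FIGURE = "vector graphics of a graph"


# Assumption: photographs and scans are grouped with bitmaps, because
# they are usually stored as pixel images inside the PDF.
BITMAP_KINDS = {
    PdfContent.PHOTOGRAPH,
    PdfContent.SCANNED_TEXT,
    PdfContent.BITMAP_FIGURE,
}


def expected_ease(kind: PdfContent) -> str:
    """Return 'lower' for bitmap-based content, 'higher' otherwise."""
    return "lower" if kind in BITMAP_KINDS else "higher"


def triage(kinds):
    """Group content kinds by expected ease of extraction."""
    groups = {"higher": [], "lower": []}
    for kind in kinds:
        groups[expected_ease(kind)].append(kind)
    return groups


if __name__ == "__main__":
    # Print each group with the human-readable description of its members.
    for ease, kinds in triage(PdfContent).items():
        print(ease, [k.value for k in kinds])
```

A real pipeline would have to detect the content kind per page or per region before applying a rule like this; the sketch only records which of the listed kinds fall on which side of the vector/character versus bitmap divide.
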
>
> Ross Mounce (OKF Panton fellow) and I are working very actively on this
> area.
>
> For the next School of Data tutorial I would like to cover data extraction
> from PDFs (and text) as well as OCR.
>
> Does anyone here have experience with tools that do not require coding
> skills to extract data and text from PDFs?
>
>
> It's my intention to make this an automatic process - we have a little way
> to go but it's months not years.
>
> We'd be very interested in anyone who likes hacking this sort of thing.
>
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069