[School-of-data] [open-science] PDF Extraction Tools (Michael Bauer)

Fri Dec 21 10:51:00 UTC 2012

Hi there,

You can have a look at these other libraries:

http://poppler.freedesktop.org/
http://mozilla.github.com/pdf.js/ <- This one is used in a PyBossa app
for transcribing PDFs that need human input, for example scanned images
or vector graphics. Check the demo app here:
http://crowdcrafting.org/app/pdftranscribe
http://www.gnupdf.org/Main_Page
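
As a rough illustration of what using the first of these looks like, here
is a minimal Python sketch that calls Poppler's pdftotext command-line tool.
It assumes pdftotext is installed and on the PATH; the function names
(build_pdftotext_command, extract_text, pages_without_text) are made up for
this example and are not part of Poppler. It also relies on pdftotext
ending each page with a form feed character, so pages that come back empty
can be flagged as likely scans that need OCR or a human transcriber.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str, layout: bool = True) -> List[str]:
    """Build the argument list for Poppler's pdftotext tool.

    The trailing "-" asks pdftotext to write the text to stdout.
    """
    cmd = ["pdftotext"]
    if layout:
        # -layout tries to keep the physical layout of the page.
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str, layout: bool = True) -> Optional[str]:
    """Return the text of a PDF, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def pages_without_text(text: str) -> List[int]:
    """1-based numbers of pages with no extractable text (likely scans)."""
    pages = text.split("\f")
    if pages and pages[-1] == "":
        pages.pop()  # drop the empty piece after the final form feed
    return [i for i, page in enumerate(pages, start=1) if not page.strip()]
```

Something like pages_without_text(extract_text("paper.pdf") or "") would
then give the pages worth sending to OCR or to the transcription app;
"paper.pdf" is just a placeholder file name.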

Cheers,

Daniel

On Fri, Dec 21, 2012 at 11:21 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

> I and collaborators are very actively working on this and making all
> resources available under OKD-compliant terms. The project is code-named
> "AMI2" and is initially targeted at scientific and technical publications.
>
> We use the excellent Apache PDFBox http://pdfbox.apache.org/ to extract
> the contents of the PDF and are layering tools on top of this. The end
> product - RSN! - will be XHTML with domain-specific inserts for math,
> chemistry, graphs and tables. The vision is to read millions of papers each
> year and extract data automatically.
>
> We would very much welcome a community effort to help with things like
> character encodings, sample material, knowledge of fonts, alpha-testing,
> etc. The work is blogged under #ami2 on my blog - see
> http://blogs.ch.cam.ac.uk/pmr/2012/10/20/opencontentmining-the-ami2-project-for-understanding-the-scientific-literature-and-why-i-love-pdf/.
>
> Be aware that "PDF" is not a simple concept. It can include:
> * photographs of paper artifacts (challenging)
> * scanned text (requires OCR)
> * text with characters (common in modern PDF publications)
> * text as vector graphics (i.e. no font info)
> * bitmaps of tables, graphs, etc.
> * vector graphics of graphs etc.
>
> Where the document has vector graphics (which come from EPS files, PPT,
> etc.) or characters, the ease and quality of extraction are much higher
> than for bitmaps. So whether any particular instance is feasible depends
> on the details.
>
> Ross Mounce (OKF Panton fellow) and I are working very actively on this
> area.
>
>
>>> For the next School of Data tutorial I would like to cover data
>>> extraction from PDFs (and text) as well as OCR.
>>>
>>> Does anyone here have experience with tools that do not require coding
>>> skills to extract data and text from PDFs?
>>>
>>>
> It's my intention to make this an automatic process - we have a little way
> to go but it's months not years.
>
> We'd be very interested in anyone who likes hacking this sort of thing.
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069

-- 
··························································································································································
http://daniellombrana.es
http://www.flickr.com/photos/teleyinex
··························································································································································
Please do NOT use proprietary file formats such as DOC and XLS for
exchanging documents; use PDF, HTML, RTF, TXT, CSV or any other format
that does not require a specific vendor's program to handle the
information it contains.
··························································································································································