[open-science] Extracting and indexing information from scientific literature ("the PDF Cow")

Jessy Kate Schingler jessy at jessykate.com
Wed Apr 18 19:10:08 UTC 2012


Peter/all, a friend showed me PDF.js the other day, and I was quite blown
away: it's a library for parsing PDFs in JavaScript and rendering them as
HTML5 canvas elements, complete with fancy fonts and everything. AFAIU it's
still under development, but here's the site:
http://andreasgal.com/2011/06/15/pdf-js/

And here's their demo:
http://mozilla.github.com/pdf.js/web/viewer.html
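
For a taste of the API, here's a minimal rendering sketch against the
present-day pdfjs-dist npm package. This is my own rough sketch rather than
their sample code, and the call signatures have changed since the 2011 post
above, so treat the details as assumptions:

    import * as pdfjsLib from "pdfjs-dist";

    // Render page 1 of a PDF onto an HTML5 canvas element.
    // (In a real app pdf.js also wants a worker script configured; omitted here.)
    async function renderFirstPage(url: string, canvas: HTMLCanvasElement): Promise<void> {
      const pdf = await pdfjsLib.getDocument(url).promise; // parse the PDF entirely in JS
      const page = await pdf.getPage(1);                   // pages are 1-indexed
      const viewport = page.getViewport({ scale: 1.5 });   // pick a zoom level
      canvas.width = viewport.width;
      canvas.height = viewport.height;
      const context = canvas.getContext("2d")!;
      await page.render({ canvasContext: context, viewport }).promise; // draw onto the canvas
    }

Relevant to this thread: in current builds the same page object also exposes
getTextContent() for pulling out the text layer per page, though I haven't
checked how far back that part of the API goes.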

Similarly, another friend, Stian Haklev, has been looking into similarity
hashing for research papers:
http://reganmian.net/wiki/fuzzy_text_matching?
http://moultano.wordpress.com/article/simple-simhashing-3kbzhsxyg4467-6/
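
For anyone who hasn't bumped into simhashing before, here's a self-contained
toy sketch in TypeScript (my own illustration, not Stian's code and not the
implementation from the post above): every token votes on every bit of a
small fingerprint, so near-duplicate papers end up with fingerprints only a
few bit-flips apart.

    // FNV-1a 32-bit hash for individual tokens.
    function fnv1a(token: string): number {
      let h = 0x811c9dc5;
      for (let i = 0; i < token.length; i++) {
        h ^= token.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;
      }
      return h >>> 0;
    }

    // Build a 32-bit simhash fingerprint: each token's hash casts a +1/-1
    // vote for every bit position; the fingerprint keeps the majority bits.
    function simhash(text: string): number {
      const votes = new Array<number>(32).fill(0);
      for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
        const h = fnv1a(token);
        for (let bit = 0; bit < 32; bit++) {
          votes[bit] += (h >>> bit) & 1 ? 1 : -1;
        }
      }
      let fingerprint = 0;
      for (let bit = 0; bit < 32; bit++) {
        if (votes[bit] > 0) fingerprint |= 1 << bit;
      }
      return fingerprint >>> 0;
    }

    // Hamming distance between fingerprints: fewer differing bits = more similar.
    function hammingDistance(a: number, b: number): number {
      let x = (a ^ b) >>> 0;
      let count = 0;
      while (x) {
        count += x & 1;
        x >>>= 1;
      }
      return count;
    }

Deduplicating a pile of paper texts then reduces to comparing fingerprints by
Hamming distance; in practice you'd use 64 bits and weight tokens (e.g. by
TF-IDF), but the shape of the algorithm is the same.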

It seems like there's a lot of interest in this topic, but the efforts are
disparate. With a bit more coordination, I bet some great momentum could be
generated.


On Wed, Apr 18, 2012 at 11:53 AM, Bryan Bishop <kanzure at gmail.com> wrote:

> On Wed, Apr 18, 2012 at 1:47 PM, Peter Murray-Rust <pm286 at cam.ac.uk>
> wrote:
> > Extracting information from general PDFs is impossible, and has been
> > likened to "converting a hamburger back to a cow" (I am sometimes
> > credited with this aphorism but I didn't create it). A generic PDF may
> > be a bitmap, may contain only vector strokes, and may have "order
> > backwards in words". However, for scientific publications, which are
> > largely mechanised, there is quite a lot that can be done.
> >
> > A lot of people have thrown themselves at this, and it's a time sink.
> > However, the technology is gradually getting better and I am reasonably
> > confident that certain information can be fairly well extracted. For
> > example, it is possible to extract chemical structures from certain
> > types of images, and also graphs and spectra.
> >
> > Many of the previous efforts have either ended up lost or been
> > incorporated into closed programs. I am wondering if there is a
> > critical mass of people who are sufficiently interested that we can
> > collate resources and experience in this area, because otherwise
> > everyone ends up reinventing it.
>
> I currently have a team that is aiming to index >99% of science,
> including PDFs. But naturally it's not very public at the moment. In
> general, the approach is to get metadata from the publishers because,
> frankly, OCR is magic. Tesseract doesn't work. OCR doesn't work. Why
> on earth would any OCR program think []><>*#!^&()__-- appears so often
> in English text? That's not right at all.
>
> - Bryan
> http://heybryan.org/
> 1 512 203 0507
>



-- 
Jessy
http://jessykate.com

