[open-science] Extracting and indexing information from scientific literature ("the PDF Cow")

Peter Murray-Rust pm286 at cam.ac.uk
Wed Apr 18 20:14:55 UTC 2012


On Wed, Apr 18, 2012 at 7:53 PM, Bryan Bishop <kanzure at gmail.com> wrote:

> On Wed, Apr 18, 2012 at 1:47 PM, Peter Murray-Rust <pm286 at cam.ac.uk>
> wrote:
> > Extracting information from general PDFs is impossible, and likened to
> > "converting a hamburger back to a cow" (I am sometimes credited with this
>
> I currently have a team that is aiming to index >99% of science,
> including PDFs. But naturally it's not very public at the moment. In
> general, the approach is to get metadata from the publishers because,
> frankly, OCR is magic. Tesseract doesn't work. OCR doesn't work. Why
> on earth would any OCR program think []><>*#!^&()__-- appears so often
> in English text? That's not right at all.
>
>
I don't understand this. Are you running a closed company? If so, good luck,
but I am only interested in Open collaboration at this stage. This isn't
for moralistic reasons but because it is only by having open code that we
can make sufficient progress.

And indexing is more than publisher metadata (which actually can be
extracted by several means). It's about domain-specific searching - e.g. we
can search patents for chemistry in chemical language.

P.


> - Bryan
> http://heybryan.org/
> 1 512 203 0507
>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069