[open-science] fyi - Using OpenCV & Tesseract open source OCR for equation recognition
Peter Murray-Rust
pm286 at cam.ac.uk
Mon Feb 4 23:22:09 UTC 2013
On Mon, Feb 4, 2013 at 7:42 PM, Bryan Bishop <kanzure at gmail.com> wrote:
> On Mon, Feb 4, 2013 at 12:42 PM, Tom Morris <tfmorris at gmail.com> wrote:
>
>>
>> http://ayoungprogrammer.blogspot.ca/2013/01/part-3-making-ocr-for-equations.html
>>
>
> Is there an open source library (possibly using tesseract+opencv or gnu
> gift) that can help with extracting metadata from journal articles, or
> bibliographic items? I would rather look at something that already exists
> instead of writing it on my own (something I see myself doing eventually).
Distinguish between born-digital docs ("PDFs") and scanned documents. The
former, which have been almost universal since ca. 2003, are much easier.
If you have these then we can reconstruct the information using the Open
PDF2SVG (http://bitbucket.org/petermr/pdf2svg). This is being developed to
extract paragraphs, metadata, etc. (SVGPlus). The third phase is to extract
chemistry, maths, etc. I show a chemical example in
http://blogs.ch.cam.ac.uk/pmr/2013/02/03/announce-we-ami-can-now-extract-semantic-information-from-scientific-pdfs/
where we extract a spectrum. Maths will be analogous.
Any volunteers would be very valuable. You don't have to be a Java
programmer, because some of the challenge is hacking fonts... and the
disciplines. It's about to take off - join in.
P.
> - Bryan
> http://heybryan.org/
> 1 512 203 0507
>
> _______________________________________________
> open-science mailing list
> open-science at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-science
> Unsubscribe: http://lists.okfn.org/mailman/options/open-science
>
>
--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069