[okfn-discuss] Mendeley: beyond PDF, annotations in general, other thoughts (some OT)

Michael Fourman Michael.Fourman at ed.ac.uk
Fri Jan 1 13:58:37 UTC 2010


One as-yet-undocumented iDEA lab project at Edinburgh is generating topic indexes for browsing relatively large collections of academic papers (currently several thousand, with plans for 10x to 100x that).

(See http://homepages.inf.ed.ac.uk/mfourman/research/topics/uoe.xml for an early test example. Best viewed with a WebKit browser [Safari, Chrome], but also with latest Firefox [with some UI features missing].)

We're mining online PDF texts, and find that around one third of the PDFs that academics at Edinburgh publish online don't easily yield text.

I have slightly different needs from someone wanting a text version for annotation (I just need a bag of words). I'm resorting to OCR, using a combination of convert (ImageMagick), tesseract (code.google.com/p/tesseract-ocr/), aspell, and a stemmer to produce the bag of words I need.
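
A minimal sketch of such a pipeline, under stated assumptions and not the actual iDEA lab code: ocr_pdf shells out to convert and tesseract (both assumed to be on the PATH, with 300 dpi as an assumed resolution), a plain vocabulary set stands in for the aspell check, and crude_stem is a toy suffix stripper standing in for a real stemmer. All function names are hypothetical.

```python
import re
import subprocess
from collections import Counter
from pathlib import Path


def ocr_pdf(pdf_path, workdir):
    """Rasterise a PDF with ImageMagick, then OCR each page with tesseract.

    Assumes `convert` and `tesseract` are installed and on the PATH.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    # One PNG per page; 300 dpi is an assumption, not a value from the post.
    subprocess.run(
        ["convert", "-density", "300", str(pdf_path),
         str(workdir / "page-%03d.png")],
        check=True,
    )
    chunks = []
    for image in sorted(workdir.glob("page-*.png")):
        base = image.with_suffix("")  # tesseract appends ".txt" itself
        subprocess.run(["tesseract", str(image), str(base)], check=True)
        chunks.append(
            Path(f"{base}.txt").read_text(encoding="utf-8", errors="ignore")
        )
    return "\n".join(chunks)


def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, vocabulary=None):
    """Count stemmed tokens, optionally dropping words not in `vocabulary`.

    The vocabulary filter stands in for a spell check such as aspell:
    tokens the dictionary does not know are treated as OCR noise.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if vocabulary is not None:
        tokens = [t for t in tokens if t in vocabulary]
    return Counter(crude_stem(t) for t in tokens)
```

The text-processing half can be tried without any OCR tools installed, for example bag_of_words("Topic indexes for papers", vocabulary={"topic", "indexes", "for", "papers"}). Dropping unknown words rather than correcting them is a design choice that suits a bag of words, where a few lost tokens matter less than spurious ones.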
 
The ocropus project, which also builds on tesseract, may be closer to what you want. (code.google.com/p/ocropus/)

VelOCRaptor (http://blog.velocraptor.com/) provides an OS X tool (not open, but based on ocropus) that uses OCR to add searchable text to PDFs.

It would be good to establish an open version of something similar, together with tools for manual correction and for learning from those corrections to improve the automation. I plan to propose an MSc project along these lines.

With best wishes for the New Year,

Michael

On 1 Jan 2010, at 12:00, okfn-discuss-request at lists.okfn.org wrote:

> On Fri, Dec 4, 2009 at 9:44 AM, Philippe Aigrain
> <philippe.aigrain at sopinspace.com> wrote:
>> Does not fit your immediate needs of annotating PDF, but in our new version
>> of the co-ment annotation system, we took a strong orientation of using
>> simple structured text formats such as markdown. For PDFs containing text,
>> it is relatively easy to go PDF to markdown. Of course for PDF containing
>> images of texts, this is another story.
>> 
>> See www.co-ment.net for existing co-ment
>> www.co-ment.org for future version
> 

Professor Michael Fourman FBCS CITP
Director, iDEA lab
Informatics Forum
10 Crichton Street
Edinburgh
EH8 9AB 
http://idea.ed.ac.uk/
For diary appointments contact :
mdunlop2(at)ed-dot-ac-dot-uk
+44 131 650 2690

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.




