[okfn-discuss] Suggestions for how to convert scans to text (with open source OCR tools)

Mon Apr 23 10:01:22 UTC 2007

As discussed a little while ago[1], part of the Open Shakespeare project 
involves getting hold of introductory materials on Shakespeare's life 
and works. In particular we're looking to use information on Shakespeare 
from the 11th Edition of Encyclopaedia Britannica (which is now out of 
copyright):

   http://p.knowledgeforge.net/shakespeare/trac/ticket/24

As detailed there, we are fortunate that someone has already scanned a 
full copy of the EB 11th edition and put the scans on Wikisource. I've 
now written some code to automate grabbing the tiffs from the wiki:

<http://p.knowledgeforge.net/shakespeare/svn/trunk/src/shakespeare/src/eb.py>

So the next step is to OCR the resulting tiffs back into usable text. 
I was wondering whether anyone on the list has suggestions (and 
experience) on how to go about doing this.
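
To make the question concrete, here is a rough sketch of the kind of batch 
step I have in mind. It is only an illustration: it assumes a command-line 
OCR tool invoked as `tesseract <image> <output base>` (tesseract is one 
open source candidate, not a decision), and the function names and 
directory layout are made up for the example rather than taken from eb.py.

```python
import subprocess
from pathlib import Path


def ocr_command(tiff_path, out_dir):
    """Build the command line for one page image.

    Assumption: the OCR tool is called as `tesseract IMAGE OUTPUTBASE`
    and appends `.txt` to OUTPUTBASE itself.
    """
    tiff_path = Path(tiff_path)
    out_base = Path(out_dir) / tiff_path.stem
    return ["tesseract", str(tiff_path), str(out_base)]


def ocr_directory(tiff_dir, out_dir, run=subprocess.run):
    """Run OCR over every .tif/.tiff file in tiff_dir.

    Returns the text file paths the tool is expected to write, in
    sorted filename order. `run` is injectable so the loop can be
    exercised without the OCR tool installed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for tiff in sorted(Path(tiff_dir).iterdir()):
        if tiff.suffix.lower() not in (".tif", ".tiff"):
            continue  # skip anything that is not a scan
        cmd = ocr_command(tiff, out_dir)
        run(cmd, check=True)  # raises if the OCR tool reports failure
        results.append(Path(cmd[2] + ".txt"))
    return results
```

Something like `ocr_directory("eb_tiffs", "eb_text")` would then leave one 
text file per scanned page, ready for proofreading. Whether the output is 
good enough on these scans is exactly what I'd like advice on.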

Regards,

Rufus

[1]: http://lists.okfn.org/pipermail/okfn-discuss/2007-March/000375.html