[okfn-discuss] Suggestions for how to convert scans to text (with open source OCR tools)
Rufus Pollock
rufus.pollock at okfn.org
Mon Apr 23 10:01:22 UTC 2007
As discussed a little previously[1], part of the Open Shakespeare project
involves getting hold of introductory materials on Shakespeare's life
and works. In particular we're looking to use information on Shakespeare
from the 11th Edition of Encyclopaedia Britannica (which is now out of
copyright):
http://p.knowledgeforge.net/shakespeare/trac/ticket/24
As detailed there, we are fortunate in that someone has already scanned a
full copy of the EB 11th edition and put the scans on Wikisource. I've now
written some code to automate grabbing the TIFFs off the wiki:
<http://p.knowledgeforge.net/shakespeare/svn/trunk/src/shakespeare/src/eb.py>
So the next step is to OCR the resulting TIFFs back into usable
text. I was wondering whether anyone on the list has suggestions
(and experience) on how to go about doing this.
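
To make the question concrete, here is a rough sketch of the kind of batch
step in question. It assumes the open source tesseract command-line tool is
installed and on the PATH, and that the downloaded scans sit in one directory;
the function names and directory layout are hypothetical and are not taken
from eb.py. Other OCR engines could be swapped in by changing the command.

```python
import subprocess
from pathlib import Path


def ocr_command(tiff_path, out_dir):
    """Build the tesseract argument list for a single scanned page."""
    tiff_path = Path(tiff_path)
    # tesseract takes an output *base* name and appends ".txt" itself.
    out_base = Path(out_dir) / tiff_path.stem
    return ["tesseract", str(tiff_path), str(out_base)]


def ocr_directory(tiff_dir, out_dir):
    """Run tesseract over every .tif/.tiff file in tiff_dir, in name order.

    Returns the paths of the text files tesseract is expected to write.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    scans = sorted(
        p for p in Path(tiff_dir).iterdir()
        if p.suffix.lower() in (".tif", ".tiff")
    )
    for scan in scans:
        # check=True stops the batch on the first page tesseract rejects.
        subprocess.run(ocr_command(scan, out), check=True)
    return [out / (scan.stem + ".txt") for scan in scans]
```

Calling ocr_directory("tiffs", "text") would then leave one .txt file per
page in the "text" directory, ready for proofreading and clean-up.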
Regards,
Rufus
[1]: http://lists.okfn.org/pipermail/okfn-discuss/2007-March/000375.html