[okfn-discuss] OCR assistance with open shakespeare
Rufus Pollock
rufus.pollock at okfn.org
Tue Aug 14 13:36:02 UTC 2007
One of the next things we want to do for open shakespeare is provide an
open introduction to his works. The obvious idea for this was to use the
Shakespeare entry in the 11th edition of the Encyclopaedia Britannica, as
detailed in this ticket:
http://p.knowledgeforge.net/shakespeare/trac/ticket/24
I've now written code to grab the relevant tiffs off wikimedia:
http://p.knowledgeforge.net/shakespeare/svn/trunk/src/shakespeare/src/eb.py
You can also find them online (28 pages) starting at:
http://upload.wikimedia.org/wikipedia/commons/scans/EB1911_tiff/VOL24%20SAINTE-CLAIRE%20DEVILLE-SHUTTLE/ED4A800.TIF
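For anyone who wants to pull the pages down without checking out the
repository, here is a rough sketch of the idea. It is not the code in eb.py.
The names tiff_url and fetch_pages and the eb_tiffs directory are made up for
illustration, and the caller has to supply the page filenames, since only the
first one is given above.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve

# Base location and volume directory taken from the URL quoted above.
BASE_URL = "http://upload.wikimedia.org/wikipedia/commons/scans/EB1911_tiff/"
VOLUME_DIR = "VOL24 SAINTE-CLAIRE DEVILLE-SHUTTLE"


def tiff_url(filename):
    """Build the download URL for one page scan (hypothetical helper)."""
    # quote() percent-encodes the spaces in the volume directory name.
    return BASE_URL + quote(VOLUME_DIR) + "/" + quote(filename)


def fetch_pages(filenames, dest_dir="eb_tiffs"):
    """Download each named TIFF into dest_dir, skipping files already there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in filenames:
        target = dest / name
        if not target.exists():
            urlretrieve(tiff_url(name), str(target))
        saved.append(target)
    return saved
```

Calling fetch_pages(["ED4A800.TIF"]) would fetch the first page; the remaining
filenames need to be looked up on the wikimedia listing.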
The next step is to OCR these scans (after that we can move on to
proofing, whether by ourselves or via http://pgdp.net). When I first had
a stab at this back in April I tried using gocr. Unfortunately the
results were so bad that they were unusable. Recently an old OCR engine
of HP's has been released as open source under the name tesseract:
http://code.google.com/p/tesseract-ocr/
It looks like it might be better, though I haven't had a chance to play
with it. Is there anyone out there who has access to a decent OCR
system, or time to play with tesseract, and who could have a go at
OCRing these TIFs?
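
If you do try tesseract, a batch run over the downloaded pages could look
roughly like the sketch below. It is untested against these scans and assumes
the tesseract binary is installed and on the PATH, using its basic
"tesseract IMAGE OUTPUTBASE" invocation, which writes OUTPUTBASE.txt. The
function names and directory layout are made up for illustration.

```python
import subprocess
from pathlib import Path


def tesseract_command(tif_path, out_dir):
    """Return the command line that OCRs one TIFF into out_dir/<stem>.txt."""
    tif_path = Path(tif_path)
    # tesseract appends ".txt" to the output base name itself.
    out_base = Path(out_dir) / tif_path.stem
    return ["tesseract", str(tif_path), str(out_base)]


def ocr_directory(tif_dir, out_dir):
    """Run tesseract on every .tif/.tiff file in tif_dir, in filename order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = sorted(
        p for p in Path(tif_dir).iterdir()
        if p.suffix.lower() in (".tif", ".tiff")
    )
    text_files = []
    for tif in pages:
        # check=True raises CalledProcessError if tesseract fails on a page.
        subprocess.run(tesseract_command(tif, out), check=True)
        text_files.append(out / (tif.stem + ".txt"))
    return text_files
```

Something like ocr_directory("eb_tiffs", "eb_text") would then leave one text
file per page, ready for proofing.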
~rufus