[okfn-discuss] OCR assistance with open shakespeare

Rufus Pollock rufus.pollock at okfn.org
Tue Aug 14 13:36:02 UTC 2007


One of the next things we want to do for open shakespeare is provide an 
open introduction to his works. The obvious idea for this was to use the 
Shakespeare entry in the 11th edition of the Encyclopaedia Britannica, as 
detailed in this ticket:

http://p.knowledgeforge.net/shakespeare/trac/ticket/24

I've now written code to grab the relevant tiffs off wikimedia:

http://p.knowledgeforge.net/shakespeare/svn/trunk/src/shakespeare/src/eb.py

You can also find them online (28 pages) starting at:

http://upload.wikimedia.org/wikipedia/commons/scans/EB1911_tiff/VOL24%20SAINTE-CLAIRE%20DEVILLE-SHUTTLE/ED4A800.TIF
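
For anyone who wants to pull the pages down without checking out the 
repository, a minimal sketch of the download step might look like the 
following. This is not the code in eb.py: fetch_pages and local_name are 
hypothetical names, and the list of page URLs has to be supplied by the 
caller, since only the first of the 28 is given above.

```python
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlsplit


def local_name(url):
    """Derive a local filename from the last path segment of a URL."""
    return unquote(urlsplit(url).path.rsplit("/", 1)[-1])


def fetch_pages(urls, dest_dir):
    """Download each URL into dest_dir, skipping files already present.

    Assumption: `urls` is the full list of page URLs, built by the caller.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = dest / local_name(url)
        if not target.exists():
            urllib.request.urlretrieve(url, str(target))
        saved.append(target)
    return saved
```

Skipping files that already exist means an interrupted run can simply be 
restarted.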

The next step is to OCR these scans (after that we can move on to 
proofing, whether by ourselves or via http://pgdp.net). When I first had 
a stab at this back in April I tried using gocr. Unfortunately the 
results were so bad that they were unusable. Recently an old OCR engine 
of HP's has been released as open source under the name tesseract:

   http://code.google.com/p/tesseract-ocr/

It looks like it might be better, though I haven't had a chance to play 
with it. Is there anyone out there who has access to a decent OCR system, 
or time to play with tesseract, and who could have a go at OCRing these 
TIFs?
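
If it helps, a rough sketch of batch-running tesseract over a directory 
of downloaded pages might look like this. It assumes the tesseract binary 
is installed and on the PATH, and that it accepts the usual 
"tesseract <image> <output base> -l <lang>" form, writing the text to 
<output base>.txt. The function names and directory layout are made up for 
illustration, and I haven't checked how well the engine copes with these 
particular scans.

```python
import subprocess
from pathlib import Path


def tesseract_command(tif_path, out_base, lang="eng"):
    """Build the command line: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", str(tif_path), str(out_base), "-l", lang]


def ocr_directory(tif_dir, out_dir, lang="eng"):
    """Run tesseract on every TIFF in tif_dir; return the expected .txt paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = sorted(
        p for p in Path(tif_dir).iterdir()
        if p.suffix.lower() in (".tif", ".tiff")
    )
    written = []
    for page in pages:
        out_base = out_dir / page.stem
        # check=True raises CalledProcessError if tesseract fails on a page.
        subprocess.run(tesseract_command(page, out_base, lang), check=True)
        written.append(Path(str(out_base) + ".txt"))
    return written
```

Something like ocr_directory("eb_tiffs", "eb_text") would then leave one 
text file per page, ready for proofing.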

~rufus



