[okfn-discuss] OCR assistance with open shakespeare

Rufus Pollock rufus.pollock at okfn.org
Fri Aug 31 09:37:55 UTC 2007

Nate Olson wrote:
> Rufus,
> Have you had any feedback about this? Don't recall seeing any replies  
> come across the list, though I could have missed something.

Jonathan has mentioned (off-list) that he could help with proof-reading,
but you're the first on-list :)

Since I wrote the original mail I have done some experimenting of my
own and have sorted out the OCRing using tesseract[^1], so we now have
a nice plain-text version of the EB entry on Shakespeare:


What we now need to do is 'proof' this to correct the OCR errors. *This
kind of thing is perfect for distributed volunteers, so if you'd like to
help out just step up and start correcting one of the sections*
(this is also the kind of thing that might be suitable for Summer of
Content ...).

To avoid multiple people working on the same section, it is probably best
either to mail the list or to mail me off-list (then I can act as
coordinator). To commit your changes, either sign up for a knowledgeforge
account so I can give you commit access to subversion, or just mail me the
changes and I'll merge them. Finally, note that:

1. I have put a link to the source tif at the top of each section (each
original page has been divided into 2 in the plain text, with _0 or _1
appended to the file name to indicate the left and right columns)

2. Everything from page 24 onwards is pretty much garbage. I think this is
because we are into the bibliography, which has a non-standard layout, but
I'm not entirely sure.


[^1]: Specifically, I downloaded and compiled tesseract and established
that it was pretty promising -- performing far, far better than gocr, for
example. Unfortunately it does not yet deal with multi-column text, so
the two columns in the EB were run together. However, a bit of hacking
later and I had a script to automate chopping the pages up and running
them through tesseract:
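
As a rough illustration of that approach (a sketch only, not the script
referred to above), here is a minimal Python version. It assumes that
ImageMagick's `convert` and the `tesseract` binary are on the PATH, and
that the gutter between the two columns sits at the horizontal midpoint
of each page. The function names are hypothetical.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def column_paths(page: Path) -> list[Path]:
    """Return the two per-column image paths for one scanned page.

    Uses the _0 (left) / _1 (right) suffix convention mentioned above.
    """
    return [page.with_name(f"{page.stem}_{i}{page.suffix}") for i in (0, 1)]


def build_commands(page: Path) -> list[list[str]]:
    """Build the external commands for one page without running them."""
    # ImageMagick: "-crop 2x1@" cuts the image into two equal side-by-side
    # tiles; the %d in the output name numbers them 0 and 1.
    # Assumption: the column gutter is at the midpoint of the page.
    tile_pattern = page.with_name(f"{page.stem}_%d{page.suffix}")
    split = ["convert", str(page), "-crop", "2x1@", "+repage",
             str(tile_pattern)]
    # "tesseract <image> <outputbase>" writes the text to <outputbase>.txt
    ocr = [["tesseract", str(col), str(col.with_suffix(""))]
           for col in column_paths(page)]
    return [split] + ocr


def ocr_page(page: Path) -> None:
    """Split one page into columns and OCR each column in turn."""
    for cmd in build_commands(page):
        subprocess.run(cmd, check=True)
```

Calling `ocr_page(Path("page.tif"))` would then leave `page_0.txt` and
`page_1.txt` next to the scan. Pages whose columns are not centred would
need an explicit crop offset instead of the even split.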

