[okfn-discuss] OCR assistance with open shakespeare
Rufus Pollock
rufus.pollock at okfn.org
Fri Aug 31 09:37:55 UTC 2007
Nate Olson wrote:
> Rufus,
>
> Have you had any feedback about this? Don't recall seeing any replies
> come across the list, though I could have missed something.
Jonathan has mentioned (off-list) that he could help with proof-reading,
but you're the first on-list :)
Since I wrote the original mail I have done some experimentation of my
own, with the result that I sorted out the OCRing using tesseract[^1] and
we now have a nice plain text version of the EB entry on Shakespeare:
<http://knowledgeforge.net/shakespeare/svn/trunk/shksprdata/ancillary/britannica-11th.txt>
What we now need to do is 'proof' this to correct the OCR errors. *This
kind of thing is perfect for distributed volunteers, so if you'd like to
help out just step up and start correcting one of the sections*
(this is also the kind of thing that might be suitable for Summer of
Content ...).
To avoid multiple people working on the same section it is probably best
either to mail the list or mail me off-list (then I can act as a
coordinator). To commit your changes either sign up for a knowledgeforge
account so I can give you commit access to subversion or just mail me the
changes and I'll merge them. Finally note that:
1. I have put a link to the source tif at the top of each section (each
original page has been divided into 2 in the plain text, with _0 or _1
appended to the file name to indicate the left and right columns)
2. Everything from page 24 onwards is pretty much garbage. I think this is
because we are into the bibliography, which has a non-standard layout, but
I'm not entirely sure.
~rufus
[^1]: Specifically I downloaded and compiled tesseract and established
it was pretty promising -- performing far far better than gocr for
example. Unfortunately it does not yet deal with multi-column text so
the two columns in the EB were run together. However a bit of hacking
later and I had a script to automate chopping the pages up and running
them through tesseract:
<http://knowledgeforge.net/shakespeare/svn/trunk/shakespeare/src/eb.py>
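
For anyone who wants a feel for the approach without reading eb.py, here is
a minimal sketch of the idea described above: cut each page image into a
left and a right half, save them with _0 / _1 appended to the file name, and
run each half through the tesseract command line. This is not the actual
script. The function names are made up for illustration, the split is
assumed to fall exactly at the middle of the page, and Pillow is assumed to
be installed for the image handling.

```python
import subprocess
from pathlib import Path


def column_boxes(width, height):
    """Crop boxes (left, upper, right, lower) for the two page halves.

    Assumption: the gutter between the columns sits at the horizontal
    midpoint of the page. Real scans may need an offset.
    """
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


def column_path(page_path, index):
    """File name for one column: the page name with _0 or _1 appended."""
    p = Path(page_path)
    return p.with_name(f"{p.stem}_{index}{p.suffix}")


def tesseract_command(image_path, output_base):
    """Basic CLI form: tesseract <image> <outputbase> (writes <outputbase>.txt)."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_page(page_path):
    """Split one page into two columns and OCR each; returns [left, right] text."""
    from PIL import Image  # Pillow, assumed to be installed

    texts = []
    with Image.open(page_path) as page:
        for index, box in enumerate(column_boxes(*page.size)):
            col = column_path(page_path, index)
            page.crop(box).save(col)
            base = col.with_suffix("")  # drop the image extension
            subprocess.run(tesseract_command(col, base), check=True)
            texts.append(Path(f"{base}.txt").read_text())
    return texts
```

Calling ocr_page on a single page tif would leave the two column images and
their .txt files next to the original. A fixed midpoint split is the
simplest thing that could work for a regular two-column layout, and it would
not be expected to cope with pages laid out differently.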