[okfn-discuss] New Project Proposal: Online Full Texts of OED (1st edition) and Britannica 11th edition
Rufus Pollock
rufus.pollock at okfn.org
Thu Nov 1 16:58:14 UTC 2007
Working with the Shakespeare entry from the 11th edition of the
Encyclopaedia Britannica over the last year, and particularly the
experience of using tesseract, has got me thinking about a couple of
potential projects along the same lines:
1. OCRing all of the EB 11th edition and putting it up online. If we put
the text into something editable, this might also be a quicker way to do
the proofing than the Distributed Proofreaders (pgdp) approach, which is
currently working on some of the earlier volumes but is proceeding
fairly slowly. Interestingly, some people have done this kind of thing
already (see the examples at the end of the Wikipedia article on the EB
11th edition [1]), but all of them seem to be closed (i.e. they claim
copyright on the results).
2. Kragen Sitaker did amazing work back in 2005/2006 'liberating' the
OED first edition, which is now (mostly) in the public domain [2]. He
posted fairly good scans of volumes 1-6 on archive.org (see [2]).
However, at the time he was unable to do much on the OCR front (no doubt
because of the poor performance of open source OCR, particularly on a
text as complex as the OED, which has lots of non-standard English and
font changes). With a better open source OCR engine such as tesseract it
would be possible to convert the OED scans back into text and perhaps
wikify the result to allow for gradual proof-editing and correction.
What do people think? Would this be something worth investigating
further? For example, I don't yet know how well tesseract would work on
the OED text, and this would obviously affect the value/cost trade-off.
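One rough way to put a number on "how well tesseract would work" would be to hand-correct a sample page and compare the raw OCR output against it. The sketch below is only an illustration under that assumption: it computes a character error rate (edit distance divided by the length of the corrected text). The function names and the sample strings are hypothetical and are not taken from any existing tool or from the OED scans.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, reference_text: str) -> float:
    """Edit distance between OCR output and a hand-corrected reference,
    normalised by the length of the reference."""
    if not reference_text:
        raise ValueError("reference_text must not be empty")
    return edit_distance(ocr_text, reference_text) / len(reference_text)


if __name__ == "__main__":
    # Hypothetical strings standing in for one OCRed line and its
    # hand-corrected version; real input would come from a sample page.
    ocr_line = "a word of obscnre origin"
    corrected_line = "a word of obscure origin"
    print(character_error_rate(ocr_line, corrected_line))
```

Running this over a handful of proofed sample pages from each work would give a first estimate of how much correction effort each page is likely to need.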
~rufus
[1]:<http://en.wikipedia.org/wiki/Encyclop%C3%A6dia_Britannica_Eleventh_Edition>
[2]:<http://blog.okfn.org/2006/03/17/open-version-of-the-oed/>
<http://lists.canonical.org/pipermail/kragen-tol/2006-March/000816.html>