New Project Proposal: Online Full Texts of OED (1st edition) and Britannica 11th edition

Rufus Pollock rufus.pollock at okfn.org
Thu Nov 1 16:58:14 UTC 2007

Working with the Shakespeare entry from the 11th edition of the 
Encyclopaedia Britannica over the last year, and particularly the 
experience of using tesseract for OCR, has got me thinking about a 
couple of potential projects along the same lines:

1. OCRing all of the EB 11th edition and putting it up online. If we put 
the text somewhere editable, this might also be a quicker way to do the 
proofing than the pgdp (Distributed Proofreaders) approach, which is 
currently working on some of the earlier volumes but proceeding fairly 
slowly. Some people have already done this kind of thing (see the 
examples at the end of the Wikipedia article on the EB 11th edition [1]), 
but all of them seem to be closed (i.e. they claim copyright on the 
results).

2. Kragen Sitaker did amazing work back in 2005/2006 'liberating' the 
OED first edition, which is now (mostly) in the public domain [2]. He 
posted fairly good scans of volumes 1-6 on archive.org (see [2]). 
However, at the time he was unable to do much on the OCR front, no doubt 
because of the poor performance of open source OCR, particularly on a 
text as complex as the OED, with its many non-standard English forms and 
font changes. With a better open source OCR engine now available, it 
would be possible to convert the OED scans back into text and perhaps 
wikify the result to allow for gradual proof-editing and correction.

What do people think? Would this be something worth investigating 
further? For example, I don't yet know how well tesseract would work on 
the OED text, and this would obviously affect the value/cost trade-off.
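
One rough way to put a number on "how well tesseract would work" would be 
to hand-proof a page or two and compare the OCR output against it. The 
sketch below is only an illustration of that idea, not part of any 
existing tooling: the function name char_accuracy and the sample strings 
are made up, and it assumes the OCR output and the proofed text for a 
page are already available as plain strings.

```python
from difflib import SequenceMatcher


def char_accuracy(ocr_text: str, proofed_text: str) -> float:
    """Fraction of characters that line up between OCR output and a
    hand-proofed reference, from 0.0 (nothing matches) to 1.0 (identical).

    Hypothetical helper for illustration; a crude measure, not a
    standard OCR benchmark.
    """
    if not ocr_text and not proofed_text:
        return 1.0
    # autojunk=False stops difflib from ignoring very common characters
    # on long inputs, which would distort the score for a full page.
    matcher = SequenceMatcher(None, ocr_text, proofed_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(ocr_text), len(proofed_text))


if __name__ == "__main__":
    # Made-up sample: a single OCR misread ("c" for "e").
    print(char_accuracy("Shakespcare", "Shakespeare"))
```

Averaging this over a handful of sample pages from each work would give 
a first, very approximate feel for how much proofing effort the raw OCR 
would leave.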




