[humanities-dev] OCRing text

iain emsley iain_emsley at austgate.co.uk
Sat Feb 18 11:06:49 UTC 2012


Morning, 

I'm having a catch-up day at the moment. At last year's Textcamp, we
discussed building a book scanner in both hardware and software. The
project has gone quiet for a bit, but I've been thinking about it over
the last few days, following a conversation with Mark at Dev8d about
Bibserver.

Looking on the Textus wiki, there is an action item for OCR to digitise
books.

I've written some test code here using Flask and the Tesseract OCR
library (http://code.google.com/p/tesseract-ocr/) which has produced
some, er, interesting results (some training wheels required!) but it is
the basis of something.
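
As a rough illustration of the Tesseract step only (this is not the test
code mentioned above, and the Flask part is left out), a minimal Python
sketch could shell out to the tesseract command line tool. It assumes
the tesseract binary is installed and on the PATH; the function names
build_tesseract_command and ocr_image, and the runner parameter, are
made up for this example.

```python
import subprocess
import tempfile
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, lang="eng", runner=subprocess.run):
    """OCR one image and return the recognised text.

    Tesseract writes its result to <output_base>.txt, so a temporary
    directory is used to collect it. `runner` is a hypothetical hook so
    the function can be exercised without Tesseract installed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "page"
        runner(build_tesseract_command(image_path, output_base, lang), check=True)
        return output_base.with_suffix(".txt").read_text(encoding="utf-8")
```

Calling ocr_image("scan.png") would then return whatever text Tesseract
recognised in that (hypothetical) file, training wheels and all.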

I had thought about forking Bibserver as the base project and writing
something which accepts an upload from the web or the file system (if
one were to attach a camera or homebrew scanner to a laptop), loads the
image, and has it parsed by Tesseract before the result is stored. I
guess the next step would be to use something like ShareJS to allow
communal editing of the parsed text. I suppose something like Node.js
could be used as a listener on the file system to load any new images
into the system, but perhaps that is a little way down the line. I know
that there are other considerations (like getting the data, distributed
proof-reading, and so on) but is this worth beginning to build some
working code for?
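
To make the "watch a folder, OCR new images, store the text" idea a bit
more concrete, here is a small sketch in plain Python rather than a file
system listener in another runtime. It is only one possible shape: the
folder layout, the name process_new_images, and the ocr callable (any
function that takes an image path and returns text) are assumptions for
illustration, and it polls rather than listening for events.

```python
from pathlib import Path

# File extensions treated as page images (an assumption for this sketch).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def process_new_images(inbox, store, ocr, seen):
    """OCR every image in `inbox` not yet recorded in `seen`.

    The recognised text is written to `store` as <image stem>.txt.
    Returns the list of text files written on this pass.
    """
    inbox, store = Path(inbox), Path(store)
    store.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(inbox.iterdir()):
        # Skip non-image files and anything handled on an earlier pass.
        if image.suffix.lower() not in IMAGE_SUFFIXES or image.name in seen:
            continue
        target = store / (image.stem + ".txt")
        target.write_text(ocr(image), encoding="utf-8")
        seen.add(image.name)
        written.append(target)
    return written
```

Calling this in a loop with a short sleep between passes would give a
crude listener; the stored text files could then be handed to whatever
editing or proof-reading layer sits on top.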

I can see there has been some conversation about it, but was wondering
where it had got to, as I have a personal need for something like this
and know some people who are interested in the general project. I'd be
interested in putting something together and posting it to my github
account. I would, however, like to avoid treading on anyone's toes on
this.

I am also interested in building the book scanner. I know that the
Instructables site has one or two builds, but there also seems to be a
pattern (from news reports about the British Library archives) for one
which scans newspapers and larger items and is more of a flatbed
scanner. I'm just curious about the hardware as well, and about perhaps
coming up with something that can be shared to allow anyone else to
build their own.

Just a thought but is it something that can be fleshed out and made
useful to the dev list?

Thanks. 

Iain




