[humanities-dev] OCRing text

Pedro Markun pedro at esfera.mobi
Sat Feb 18 22:28:10 UTC 2012


Hey all,

I'm trying to do something quite similar here: getting some (a lot!) of
congressmen's speeches from before ~98, which have not been digitized yet.

I actually adapted the code from the OKFN DataDigitizer demo:
http://blog.okfn.org/2011/11/17/introducing-the-data-digitizer/
to be used as an OCR correction tool.

Really rough around the edges, but it's working. There's also a Python scraper
I wrote to actually get the images and run Tesseract on them.
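
A minimal sketch of that fetch-and-OCR step, for anyone who wants to try it.
This is not the actual scraper: the function names, the URL and the default
language are placeholders I made up, and it assumes the tesseract command-line
program is installed and on the PATH.

```python
import subprocess
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit


def local_name(url):
    """Return the file name at the end of the URL path (hypothetical helper)."""
    return PurePosixPath(urlsplit(url).path).name


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argv for the CLI form: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def fetch_and_ocr(url, workdir, lang="eng"):
    """Download one page image, OCR it, and return the path of the text file."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    image_path = workdir / local_name(url)
    urllib.request.urlretrieve(url, str(image_path))

    # tesseract appends ".txt" to the output base on its own
    output_base = image_path.with_suffix("")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return Path(str(output_base) + ".txt")
```

Calling fetch_and_ocr("http://example.org/scans/page-001.png", "pages") would
leave pages/page-001.png and pages/page-001.txt on disk (example.org is a
stand-in for the real source of the page images).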

I'm now trying to create a training set, but I'm still researching the proper
fonts to emulate the old typewriter they used in the speech records.

(Everything is going in real slow motion because I just had a baby. But I will
follow the thread :) )

[]'s
Pedro Markun

On Sat, Feb 18, 2012 at 9:06 AM, iain emsley <iain_emsley at austgate.co.uk> wrote:

> Morning,
>
> Having a catch up day at the moment. At last year's Textcamp, we
> discussed building a bookscanner in both hardware and software. The
> project's gone quiet for a bit but I've been thinking about it in the last
> few days, post conversation with Mark at Dev8d about Bibserver.
>
> Looking on the Textus wiki, there is an action for an OCR to digitise
> books.
>
> I've written some test code here using Flask and the Tesseract OCR
> library (http://code.google.com/p/tesseract-ocr/) which has produced
> some, er, interesting results (some training wheels required!) but it is
> the basics of something.
>
> I had thought about forking Bibserver as the base project and writing
> something which allows an upload from web or file system (if one was to
> attach a camera or homebrew scanner to a laptop) and to load the image
> and get it parsed by Tesseract before being stored. I guess the next
> step would be to use something like sharejs to allow communal editing of
> the parsed text. I suppose something like Nodejs could be used as a
> listener on a file system to load any new images into the system but
> perhaps that is a little down the line. I know that there are other
> considerations (like getting the data, distributed proof-reading, and so
> on) but is this worth beginning to build some working code for?
>
> I can see there is conversation about it but was wondering where it was
> as I have a personal need for something like this and know some people
> who are interested in the general project. I'd be interested in
> putting something together and posting it to my github account. I would
> however like to avoid treading on anyone's toes on this.
>
> I am also interested in building the book scanner as well. I know that
> the Instructables site has built one or two but there does also seem to
> be a pattern (from news reports about the British Library archives) for
> one which scans newspapers and larger items and is more of a flat bed
> scanner. I'm just curious about the hardware as well and perhaps coming
> up with something that can be shared to allow anyone else to build their
> own.
>
> Just a thought but is it something that can be fleshed out and made
> useful to the dev list?
>
> Thanks.
>
> Iain
>
>
> _______________________________________________
> humanities-dev mailing list
> humanities-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/humanities-dev
>