[humanities-dev] OCRing text

Sat Feb 18 22:34:49 UTC 2012

Also, on book scanning, another member of the Transparency Hacker
community tried this one:

http://bookscanner.pbworks.com/w/page/40965440/FrontPage

Said it works quite well (for the price and assembly-time).

We're also discussing building a machine to digitize legal case files (they
are prohibitively expensive to copy around here and run to ~7000 pages). The
idea, which is being discussed with the guys from MetaMaquina (a hacker
startup focused on 3D printers and printing), is to make it portable and
light using some printed parts.

(Still trying to get funds for this, though)

[]'s
Pedro Markun

On Sat, Feb 18, 2012 at 8:28 PM, Pedro Markun <pedro at esfera.mobi> wrote:

> Hey all,
>
> I'm trying to accomplish something quite similar around here, to get hold
> of some (a lot!) of the congressmen's speeches from before ~98, which have
> not been digitized.
>
> I actually adapted the code from the OKFN DataDigitizer demo:
> http://blog.okfn.org/2011/11/17/introducing-the-data-digitizer/
> to be used as an OCR correction tool.
>
> Really rough around the edges, but it's working. There's also a Python
> scraper I wrote to actually get the images and run Tesseract on them.
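
A minimal sketch of what a "get the images and run Tesseract on them" step
could look like, not the actual scraper mentioned above. The names
ImageLinkParser, find_image_urls, tesseract_command and ocr_image are
hypothetical, the assumption that pages expose scans as plain <img> tags is
a guess, and "por" (Portuguese) is an assumed language choice. The only
external piece relied on is the documented command-line form
"tesseract <image> <output base> -l <lang>".

```python
import subprocess
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collect the src attribute of every <img> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative links against the page URL.
                self.image_urls.append(urljoin(self.base_url, src))


def find_image_urls(html, base_url):
    """Return absolute URLs for all images referenced in the HTML text."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls


def tesseract_command(image_path, output_base, lang="por"):
    """Build the command line; Tesseract writes its text to <output_base>.txt."""
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_image(image_path, output_base, lang="por"):
    """Run Tesseract (must be installed and on PATH) and return the text."""
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Downloading each URL to disk is left out on purpose; any HTTP client would
do, and the saved file path is what gets passed to ocr_image.
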
>
> I'm now trying to create a training set, but still researching proper
> fonts to emulate the old typewriter they used in the speech records.
>
> (Everything is in real slow motion cuz I just had a baby. But I will
> follow the thread :) )
>
> []'s
> Pedro Markun
>
>
> On Sat, Feb 18, 2012 at 9:06 AM, iain emsley <iain_emsley at austgate.co.uk> wrote:
>
>> Morning,
>>
>> Having a catch up day at the moment. At last year's Textcamp, we
>> discussed building a bookscanner in both hardware and software. The
>> project's gone quiet for a bit but I've been thinking about it in the last
>> few days, post conversation with Mark at Dev8d about Bibserver.
>>
>> Looking on the Textus wiki, there is an action for an OCR to digitise
>> books.
>>
>> I've written some test code here using Flask and the Tesseract OCR
>> library (http://code.google.com/p/tesseract-ocr/) which has produced
>> some, er, interesting results (some training wheels required!) but it is
>> the basics of something.
>>
>> I had thought about forking Bibserver as the base project and writing
>> something which allows an upload from web or file system (if one was to
>> attach a camera or homebrew scanner to a laptop) and to load the image
>> and get it parsed by Tesseract before being stored. I guess the next
>> step would be to use something like ShareJS to allow communal editing of
>> the parsed text. I suppose something like Node.js could be used as a
>> listener on a file system to load any new images into the system but
>> perhaps that is a little down the line. I know that there are other
>> considerations (like getting the data, distributed proof-reading, and so
>> on) but is this worth beginning to build some working code for?
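
The "listener on a file system" idea above can be sketched without Node.js.
The block below swaps in plain Python polling instead, purely as an
illustration under stated assumptions: new_images, watch and
IMAGE_EXTENSIONS are hypothetical names, the extension list is a guess at
what a camera or homebrew scanner would produce, and handle_image stands for
whatever step comes next (for example handing the file to Tesseract and
storing the result).

```python
import os
import time

# Assumed set of file types a scanner or camera might drop into the folder.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def new_images(folder, seen):
    """Return image paths in folder that are not in seen, and record them."""
    found = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        extension = os.path.splitext(name)[1].lower()
        if extension in IMAGE_EXTENSIONS and path not in seen:
            seen.add(path)
            found.append(path)
    return found


def watch(folder, handle_image, interval=5.0, max_polls=None):
    """Poll folder and call handle_image(path) once for each new image.

    With max_polls=None the loop runs until the process is stopped.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in new_images(folder, seen):
            handle_image(path)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)
```

Something like watch("scans", print) would print each new file as it shows
up. Polling is the simplest thing that works across platforms; an
event-based watcher would react faster but needs OS-specific support.
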
>>
>> I can see there is conversation about it but was wondering where it was
>> as I have a personal need for something like this and know some people
>> who are interested in the general project. I'd be interested in
>> putting something together and posting it to my github account. I would
>> however like to avoid treading on anyone's toes on this.
>>
>> I am also interested in building the book scanner. I know that
>> the Instructables site has built one or two but there does also seem to
>> be a pattern (from news reports about the British Library archives) for
>> one which scans newspapers and larger items and is more of a flatbed
>> scanner. I'm just curious about the hardware as well and perhaps coming
>> up with something that can be shared to allow anyone else to build their
>> own.
>>
>> Just a thought but is it something that can be fleshed out and made
>> useful to the dev list?
>>
>> Thanks.
>>
>> Iain