[humanities-dev] OCRing text

iain emsley iain_emsley at austgate.co.uk
Sun Feb 19 12:22:38 UTC 2012


Hi Pedro, 

I'd not come across that link (but I know the DIY bookscanner project
which inspired it quite well since I had the same source). 

The one thing that strikes me is that it appears to be command-line based
(which could be hidden from the casual user by running it as a resource
behind a micro-framework like Flask) and that it outputs to the file
system rather than to a database. 
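
As a rough illustration of that idea (a minimal sketch, not code from
either project): the command-line call can be wrapped in a function and
the recognised text written to a database instead of the file system.
The names run_tesseract, init_db and ocr_to_db and the "pages" table
are hypothetical, and the sketch assumes a Tesseract build that accepts
"stdout" as its output base. A Flask view could then call ocr_to_db on
an uploaded file.

```python
import sqlite3
import subprocess


def run_tesseract(image_path):
    """Run the tesseract command-line tool and return the recognised text.

    Assumption: the installed tesseract accepts "stdout" as the output
    base, so the text is printed rather than written to a .txt file.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def init_db(conn):
    """Create the (hypothetical) pages table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY, "
        "image_path TEXT NOT NULL, "
        "text TEXT NOT NULL)"
    )


def ocr_to_db(conn, image_path, ocr=run_tesseract):
    """OCR one image, store the text in the database, return the row id.

    The ocr argument can be swapped for another engine or a test stub.
    """
    text = ocr(image_path)
    cur = conn.execute(
        "INSERT INTO pages (image_path, text) VALUES (?, ?)",
        (image_path, text),
    )
    conn.commit()
    return cur.lastrowid
```

Keeping the OCR call as a parameter means the storage side can be tried
out without Tesseract installed, and a different engine can be dropped
in later.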

I'm interested in trying to build something that could be used when the
user is mobile (so with a small camera, as Homer does, or a mobile
phone), so it would need a web interface. 

I suppose the best thing to do is to tidy up my existing code and figure
out the changes to a forked Bibserver front-end and then write a blog
post about it :). 

The longest part will be playing with Tesseract / OCRopus to find out
which is the best and what OCRopus will provide in terms of image /
language handling. 
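
One possible way to put a number on "which is the best" (a sketch under
assumptions, not an established benchmark for either engine) is to
hand-correct a sample page and compute a character error rate for each
engine's output against it. The function names and the engine labels
below are hypothetical; the edit distance is plain Levenshtein distance
written out with the standard library only.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the hand-corrected text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rank_engines(reference, outputs):
    """Return (error_rate, engine_name) pairs, lowest error rate first.

    outputs maps a (hypothetical) engine label to its OCR text.
    """
    return sorted(
        (character_error_rate(reference, text), name)
        for name, text in outputs.items()
    )
```

For example, rank_engines(corrected_text, {"tesseract": text_a,
"ocropus": text_b}) would list the engine with the fewest character
errors first, where text_a and text_b are whatever each engine produced
for the same page image.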


Iain
On Sat, 2012-02-18 at 20:34 -0200, Pedro Markun wrote:
> Also, on book scanning, another member of the Transparency Hacker
> community tried this one:
> 
> http://bookscanner.pbworks.com/w/page/40965440/FrontPage
> 
> Said it works quite well (for the price and assembly-time).
> 
> We're also discussing building a machine to digitize legal case
> records (they are prohibitively expensive to copy around here and run
> to ~7000 pages). The idea, which is being discussed with the guys from
> MetaMaquina (a hacker startup focused on 3D printers and printing), is
> to make it portable and light using some printed parts.
> 
> (Still trying to get funds for this, though)
> 
> []'s
> Pedro Markun
> 
> On Sat, Feb 18, 2012 at 8:28 PM, Pedro Markun <pedro at esfera.mobi>
> wrote:
>         Hey all,
>         
>         I'm trying to accomplish something quite similar around here
>         to get some (a lot!) of congressmen's speeches from before ~98,
>         which are not digitized around here.
>         
>         I actually adapted the code from the OKFN DataDigitizer demo:
>         http://blog.okfn.org/2011/11/17/introducing-the-data-digitizer/
>         to be used as an OCR correction tool.
>         
>         Really rough around the edges, but it's working. There's also a
>         Python scraper I wrote to actually get the images and
>         run Tesseract on them.
>         
>         I'm now trying to create a training set, but still researching
>         proper fonts to emulate the old typewriter they used in
>         the speech records.
>         
>         (Everything is in a real slowmotion cuz I just had a baby. But
>         will follow the thread :) )
>         
>         []'s
>         Pedro Markun
>         
>         
>         On Sat, Feb 18, 2012 at 9:06 AM, iain emsley
>         <iain_emsley at austgate.co.uk> wrote:
>                 Morning,
>                 
>                 Having a catch up day at the moment. At last year's
>                 Textcamp, we
>                 discussed building a bookscanner in both hardware and
>                 software. The
>                 project's gone quiet for a bit but I've been thinking
>                 about it in the last
>                 few days, post conversation with Mark at Dev8d about
>                 Bibserver.
>                 
>                 Looking on the Textus wiki, there is an action for an
>                 OCR to digitise
>                 books.
>                 
>                 I've written some test code here using Flask and the
>                 Tesseract OCR
>                 library (http://code.google.com/p/tesseract-ocr/)
>                 which has produced
>                 some, er, interesting results (some training wheels
>                 required!) but it is
>                 the basics of something.
>                 
>                 I had thought about forking Bibserver as the base
>                 project and writing
>                 something which allows an upload from web or file
>                 system (if one was to
>                 attach a camera or homebrew scanner to a laptop) and
>                 to load the image
>                 and get it parsed by Tesseract before being stored. I
>                 guess the next
>                 step would be to use something like sharejs to allow
>                 communal editing of
>                 the parsed text. I suppose something like Nodejs could
>                 be used as a
>                 listener on a file system to load any new images into
>                 the system but
>                 perhaps that is a little down the line. I know that
>                 there are other
>                 considerations (like getting the data, distributed
>                 proof-reading, and so
>                 on) but is this worth beginning to build some working
>                 code for?
>                 
>                 I can see there is conversation about it but was
>                 wondering where it was
>                 as I have a personal need for something like this and
>                 know some people
>                 who are interested in the general project. I'd be
>                 interested in
>                 putting something together and posting it to my github
>                 account. I would
>                 however like to avoid treading on anyone's toes on
>                 this.
>                 
>                 I am also interested in building the book scanner as
>                 well. I know that
>                 the Instructables site has built one or two but there
>                 does also seem to
>                 be a pattern (from news reports about the British
>                 Library archives) for
>                 one which scans newspapers and larger items and is
>                 more of a flat bed
>                 scanner. I'm just curious about the hardware as well
>                 and perhaps coming
>                 up with something that can be shared to allow anyone
>                 else to build their
>                 own.
>                 
>                 Just a thought but is it something that can be fleshed
>                 out and made
>                 useful to the dev list?
>                 
>                 Thanks.
>                 
>                 Iain
>                 
>                 
>                 _______________________________________________
>                 humanities-dev mailing list
>                 humanities-dev at lists.okfn.org
>                 http://lists.okfn.org/mailman/listinfo/humanities-dev
>         
>         
> 





