[humanities-dev] OCRing text
iain emsley
iain_emsley at austgate.co.uk
Sun Feb 19 12:22:38 UTC 2012
Hi Pedro,
I'd not come across that link (but I know the DIY bookscanner project
which inspired it quite well since I had the same source).
The one thing that strikes me is that it appears to be command-line based
(which could be hidden from the casual user by putting it behind a
micro-framework like Flask) and it outputs to the file system
rather than to a database.
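
A minimal sketch of that idea, using only the Python standard library: run a command-line OCR tool and store its text in a database rather than in files. The `pages` table, the function names and the default command are assumptions made for illustration; in particular it assumes a Tesseract build whose command-line tool prints to standard output when the output base is given as `stdout`. A Flask view could call `ocr_to_db` after saving an uploaded image.

```python
import sqlite3
import subprocess

# Assumption: the tesseract binary is on PATH and writes recognised text to
# standard output when the output base argument is "stdout".
DEFAULT_COMMAND = ("tesseract", "{image}", "stdout")


def init_db(conn):
    """Create the (hypothetical) pages table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY, image TEXT NOT NULL, body TEXT NOT NULL)"
    )
    conn.commit()


def ocr_to_db(conn, image_path, command=DEFAULT_COMMAND):
    """Run a command-line OCR tool on one image and store its text output."""
    # Substitute the image path into the command template, argument by argument.
    argv = [part.format(image=image_path) for part in command]
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    cur = conn.execute(
        "INSERT INTO pages (image, body) VALUES (?, ?)",
        (image_path, result.stdout),
    )
    conn.commit()
    return cur.lastrowid
```

Because the command is passed in as a parameter, the same function could wrap a different OCR tool without any change to the storage code.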
I'm interested in trying to build something that could be used if the
user was mobile (so with a small camera as Homer does or also a mobile
phone) so it would need a web interface.
I suppose the best thing to do is to tidy up my existing code and figure
out the changes to a forked Bibserver front-end and then write a blog
post about it :).
The longest part will be playing with Tesseract / OCRopus to find out
which is the better of the two and what OCRopus will provide in terms of
image / language handling.
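
One rough way to compare the engines, assuming a hand-corrected transcription of a few sample pages is available, is to score each engine's output against that reference. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a proper character error rate; the function names and the engine labels passed in are illustrative only.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, reference_text):
    """Return a similarity ratio between 0 and 1 for OCR output vs. a reference."""
    # Collapse whitespace so that line-wrapping differences are not penalised.
    a = " ".join(ocr_text.split())
    b = " ".join(reference_text.split())
    # autojunk=False turns off the heuristic that ignores frequent characters
    # in long sequences, which would distort scores on page-length text.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def rank_engines(outputs, reference_text):
    """Sort a {engine_name: ocr_text} mapping by similarity, best first."""
    scores = {
        name: similarity(text, reference_text) for name, text in outputs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The ratio is only a coarse measure, so it is better suited to ranking engines on the same pages than to reporting an absolute accuracy figure.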
Iain
On Sat, 2012-02-18 at 20:34 -0200, Pedro Markun wrote:
> Also, on book scanning, another member of the Transparency Hacker
> community tried this one:
>
> http://bookscanner.pbworks.com/w/page/40965440/FrontPage
>
> Said it works quite well (for the price and assembly-time).
>
> We're also discussing building a machine to digitize law processes
> (they have a prohibitive cost to copy around here and are ~7000 pages
> long). The idea, which is being discussed with the guys from
> MetaMaquina (a hacker startup focused on 3D printers and printing), is
> to make it portable and light using some printed parts.
>
> (Still trying to get funds for this, though)
>
> []'s
> Pedro Markun
>
> On Sat, Feb 18, 2012 at 8:28 PM, Pedro Markun <pedro at esfera.mobi>
> wrote:
> Hey all,
>
> I'm trying to accomplish something quite similar around here
> to get some (a lot!) of congressmen's speeches from before ~98,
> which are not digitized around here.
>
> Actually adapted the code from the OKFN DataDigitizer demo:
> http://blog.okfn.org/2011/11/17/introducing-the-data-digitizer/
> to be used as an OCR correction tool.
>
> Really rough around the edges, but it's working. There's also a
> python scraper I wrote to actually get the images and
> tesseract them.
>
> I'm now trying to create a training set, but still researching
> proper fonts to emulate the old typewriter they used in
> the speech records.
>
> (Everything is in a real slowmotion cuz I just had a baby. But
> will follow the thread :) )
>
> []'s
> Pedro Markun
>
>
> On Sat, Feb 18, 2012 at 9:06 AM, iain emsley
> <iain_emsley at austgate.co.uk> wrote:
> Morning,
>
> Having a catch up day at the moment. At last year's Textcamp, we
> discussed building a bookscanner in both hardware and software. The
> project's gone quiet for a bit but I've been thinking about it in the
> last few days, post conversation with Mark at Dev8d about Bibserver.
>
> Looking on the Textus wiki, there is an action for an OCR to digitise
> books.
>
> I've written some test code here using Flask and the Tesseract OCR
> library (http://code.google.com/p/tesseract-ocr/) which has produced
> some, er, interesting results (some training wheels required!) but it
> is the basics of something.
>
> I had thought about forking Bibserver as the base project and writing
> something which allows an upload from web or file system (if one was to
> attach a camera or homebrew scanner to a laptop) and to load the image
> and get it parsed by Tesseract before being stored. I guess the next
> step would be to use something like sharejs to allow communal editing of
> the parsed text. I suppose something like Nodejs could be used as a
> listener on a file system to load any new images into the system but
> perhaps that is a little down the line. I know that there are other
> considerations (like getting the data, distributed proof-reading, and so
> on) but is this worth beginning to build some working code for?
>
> I can see there is conversation about it but was wondering where it was
> as I have a personal need for something like this and know some people
> who are interested in the general project. I'd be interested in
> putting something together and posting it to my github account. I would
> however like to avoid treading on anyone's toes on this.
>
> I am also interested in building the book scanner as well. I know that
> the Instructables site has built one or two but there does also seem to
> be a pattern (from news reports about the British Library archives) for
> one which scans newspapers and larger items and is more of a flat bed
> scanner. I'm just curious about the hardware as well and perhaps coming
> up with something that can be shared to allow anyone else to build their
> own.
>
> Just a thought but is it something that can be fleshed out and made
> useful to the dev list?
>
> Thanks.
>
> Iain
>
>
> _______________________________________________
> humanities-dev mailing list
> humanities-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/humanities-dev
>
>
>