[humanities-dev] OCRing text

Pedro Markun pedro at esfera.mobi
Sun Feb 19 22:11:50 UTC 2012


I've tried OCRopus a bit, and at least for Brazilian Portuguese I would
rather stick with Tesseract.

The first results are quite nice if you have a good-resolution picture in
the right TIFF format. For specific sets of documents I was hoping to
build a web app which makes it easier to build the training sets: it
auto-generates the text with the correct spacing and font, so people can
print it at home, scan it and upload it back through a web interface.

About mobile, the idea behind the 3D-printed book scanner is exactly that:
creating a lightweight system which can be assembled quickly (the first
sketches look like a War of the Worlds tripod with two LED lamps to
illuminate the text) and can be carried around.

After the images are captured, they will be run through a script
which will convert them to the proper 2-bit TIFF format, (ideally) adjust
brightness and contrast, and then OCR them.
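The steps of such a script could be sketched roughly as below. This is only an illustration under assumptions: it works on a grayscale page held as a list of rows of 0-255 values, all function names are hypothetical, and reading the capture and writing the TIFF file would need an imaging library that is not shown. The only external piece is the standard "tesseract image outputbase -l lang" command line, with "por" as the Portuguese language code.

```python
import subprocess


def stretch_contrast(pixels):
    """Linearly rescale grayscale values so they span the full 0-255 range."""
    flat = [v for row in pixels for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Flat image: nothing to stretch.
        return [row[:] for row in pixels]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in pixels]


def binarize(pixels, threshold=128):
    """Map each pixel to 0 (ink) or 255 (paper) using a fixed threshold."""
    return [[255 if v >= threshold else 0 for v in row] for row in pixels]


def tesseract_command(tiff_path, output_base, lang="por"):
    """Build the command line: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", tiff_path, output_base, "-l", lang]


def ocr(tiff_path, output_base, lang="por"):
    """Run Tesseract, which writes its text to <output_base>.txt."""
    subprocess.check_call(tesseract_command(tiff_path, output_base, lang))
    return output_base + ".txt"
```

A fixed threshold is the simplest choice; unevenly lit pages from a portable scanner would probably need something adaptive.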

Then it will expose both the scan and the OCR text online, so people can
improve the text, ideally using some sort of overlay layer (at least for
proper positioning).
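One possible way to get the positions for such a layer, assuming the OCR step is asked for hOCR output (HTML in which each word is a span of class "ocrx_word" whose title attribute carries "bbox x0 y0 x1 y1"). The sketch below only extracts each word with its box; the class and function names are hypothetical, and drawing the layer over the scan is not shown.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title such as 'bbox 1 2 3 4; ...'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a word span with a known box.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def word_boxes(hocr):
    """Return a list of (word, (x0, y0, x1, y1)) from an hOCR string."""
    parser = WordBoxParser()
    parser.feed(hocr)
    return parser.words
```

With the boxes in hand, a web page could place each editable word over its spot on the scan, so corrections keep their position.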


[]'s
Pedro Markun

On Sun, Feb 19, 2012 at 10:29 AM, iain emsley <iain_emsley at austgate.co.uk> wrote:

> Todd,
>
> I've stayed away from ocropus so far because the build process just
> seems unnecessarily tortuous. Time to dive in!
>
> My sense is that it uses Tesseract as underlying engine so it copes with
> some of the language issues. This version appears to be under some heavy
> development to make it more Python based and less reliant on C++ so
> perhaps this will make it easier in future releases.
>
> I'll probably dive into it soon enough and give it a go.
>
> Iain
>
>
> On Sat, 2012-02-18 at 14:38 -0800, todd.d.robbins at gmail.com wrote:
> > What's the general sense of tesseract vs. ocropus? Which is better?
> > I've been trying to get ocropus to play nice with OS X and it's not
> > pretty.
> >
> > Tod
> > _______________________________________________
> > humanities-dev mailing list
> > humanities-dev at lists.okfn.org
> > http://lists.okfn.org/mailman/listinfo/humanities-dev