[humanities-dev] OCRing text

Rufus Pollock rufus.pollock at okfn.org
Mon Feb 27 14:33:45 UTC 2012


Just wanted to jump in here as OCR'ing stuff is something I (and
others at the OKF) have long been interested in e.g.

<http://ideas.okfn.org/ideas/108/put-11th-edition-of-encylopaedia-brittanica-online-in-reusable-format>
<http://ideas.okfn.org/ideas/20/oxford-english-dictionary-1st-ed-full-text-online>

I'm wondering if there is a need for (and interest in) building a simple
OCR service:

<http://ideas.okfn.org/ideas/106/pdf-tiff-scan-to-text-conversion-service>

Integrated with PyBossa / TEXTUS it could provide a nice scan / ocr /
transcribe workflow.

On 19 February 2012 22:11, Pedro Markun <pedro at esfera.mobi> wrote:
> I've tried OCRopus a bit and at least for Brazilian Portuguese I would
> rather stick with tesseract.

That's a nice piece of info as I wondered whether OCRopus was useful.
I last used tesseract a few years ago and it looks like it has come on
quite a bit.

> The first results are quite nice if you've got a good resolution picture in the
> right TIFF format. For specific sets of documents I was hoping to build a
> web app which makes it easier to build the training sets: it auto-generates the
> text with correct spacing and font, so people can print it at home, scan it and
> upload it back through a web interface.

Nice :-)

Rufus

> About mobile, the idea behind the 3D-printed bookscanner is exactly that:
> creating a lightweight system which can be assembled quickly (the first
> sketches look like a War of the Worlds tripod with two LED lamps to
> illuminate the text) and can be carried around.
>
> After the images are captured, they will be run through a script which
> will convert them to the proper 2bit TIFF format, (ideally) adjust brightness
> and contrast and then OCR them.
>
> Then it will expose online both the scan and the OCR text, so people can
> improve the text. Ideally using some sort of overlap layer (at least for
> proper positioning).
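
The capture-to-OCR step described above (convert to TIFF, adjust
brightness and contrast, then OCR) could be sketched roughly as below.
This is only a sketch under assumptions, not the script Pedro has in
mind: it assumes the ImageMagick `convert` and `tesseract` command-line
tools are installed and on the PATH, it reads "2bit TIFF" as a black and
white (thresholded) TIFF, and the function names, the contrast and
threshold values and the "por" language code are illustrative choices.

```python
import subprocess
from pathlib import Path


def preprocess_command(src, dst, brightness=0, contrast=20, threshold="50%"):
    """Build an ImageMagick command that greyscales, adjusts and binarises a scan.

    The brightness/contrast/threshold defaults are arbitrary starting points
    (assumptions), to be tuned per batch of scans.
    """
    return [
        "convert", str(src),
        "-colorspace", "Gray",
        "-brightness-contrast", f"{brightness}x{contrast}",
        "-threshold", threshold,
        str(dst),
    ]


def ocr_command(tiff, output_base, lang="por"):
    """Build a tesseract command; tesseract itself appends .txt to output_base."""
    return ["tesseract", str(tiff), str(output_base), "-l", lang]


def process_scan(src, workdir, lang="por"):
    """Run both steps for one captured image; return (tiff_path, text_path)."""
    src, workdir = Path(src), Path(workdir)
    tiff = workdir / (src.stem + ".tif")
    output_base = workdir / src.stem
    # check=True raises CalledProcessError if either tool exits non-zero.
    subprocess.run(preprocess_command(src, tiff), check=True)
    subprocess.run(ocr_command(tiff, output_base, lang), check=True)
    return tiff, workdir / (src.stem + ".txt")
```

Calling process_scan("page01.jpg", "out") would leave out/page01.tif
and out/page01.txt side by side, which is the pair (scan plus OCR text)
that would then be exposed online for correction.
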
>
>
> []'s
> Pedro Markun
>
>
> On Sun, Feb 19, 2012 at 10:29 AM, iain emsley <iain_emsley at austgate.co.uk>
> wrote:
>>
>> Todd,
>>
>> I've stayed away from ocropus so far because the build process just
>> seems unnecessarily tortuous. Time to dive in!
>>
>> My sense is that it uses Tesseract as its underlying engine so it copes with
>> some of the language issues. This version appears to be under some heavy
>> development to make it more Python based and less reliant on C++ so
>> perhaps this will make it easier in future releases.
>>
>> I'll probably dive into it soon enough and give it a go.
>>
>> Iain
>>
>>
>> On Sat, 2012-02-18 at 14:38 -0800, todd.d.robbins at gmail.com wrote:
>> > What's the general sense of tesseract vs. ocropus? Which is better?
>> > I've been trying to get ocropus to play nice with OS X and it's not
>> > pretty.
>> >
>> > Tod
>> > _______________________________________________
>> > humanities-dev mailing list
>> > humanities-dev at lists.okfn.org
>> > http://lists.okfn.org/mailman/listinfo/humanities-dev



-- 
Co-Founder, Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/



