[okfn-br] Legislative OCR - New OKFN Br project

Pedro Markun pedro em esfera.mobi
Thu Mar 1 19:20:00 UTC 2012


Dear all,

I've been talking a bit about this on the OKFN international lists,
specifically humanities-dev em lists.okfn.org for anyone who wants to dig
up the history: using OCR + crowdsourcing to digitize large bodies of
text.

There has been a lot of discussion, and a fair number of experiments have
accumulated over that time. Some of it is compiled in the pad, but I'll
try to document it on the wiki and centralize the information next.

But I think it would be good to get hands-on, so I'm going to start running
a project here to test the combo OCR + Textus + PyBossa to build a document
transcription framework. As the initial set I'll use the historical speeches
published in the Diário Oficial between 66 and 68, since that has a
practical application for the book my father is writing and for some of
THacker's apps.

Just giving a heads-up: anyone interested in collaborating, jump in.
For now I think we can keep reporting progress and updates here on the
list, so everyone stays in the loop.

Best,
Pedro Markun


---------- Forwarded message ----------
From: Rufus Pollock <rufus.pollock em okfn.org>
Date: Mon, Feb 27, 2012 at 11:33 AM
Subject: Re: [humanities-dev] OCRing text
To: Pedro Markun <pedro em esfera.mobi>
Cc: iain emsley <iain_emsley em austgate.co.uk>, humanities-dev em lists.okfn.org


Just wanted to jump in here as OCR'ing stuff is something I (and
others at the OKF) have long been interested in e.g.

<http://ideas.okfn.org/ideas/108/put-11th-edition-of-encylopaedia-brittanica-online-in-reusable-format>
<http://ideas.okfn.org/ideas/20/oxford-english-dictionary-1st-ed-full-text-online>

I'm wondering if there is a need (and interest) in building a simple
OCR service:

<http://ideas.okfn.org/ideas/106/pdf-tiff-scan-to-text-conversion-service>

Integrated with PyBossa / TEXTUS it could provide a nice scan / ocr /
transcribe workflow.

On 19 February 2012 22:11, Pedro Markun <pedro em esfera.mobi> wrote:
> I've tried OCRopus a bit and at least for Brazilian Portuguese I would
> rather stick with tesseract.

That's a nice piece of info as I wondered whether OCRopus was useful.
I last used tesseract a few years ago and it looks like it has come on
quite a bit.

> The first results are quite nice if you have a good-resolution picture in
> the right TIFF format. For specific sets of documents I was hoping to build
> a web app which makes it easier to build the training sets: it auto-generates
> the text with correct spacing and font, so people can print it at home, scan
> it and upload it back through a web interface.

Nice :-)

Rufus

> About mobile, the idea behind the 3D-printed book scanner is exactly that:
> creating a lightweight system which can be assembled quickly (the first
> sketches look like a War of the Worlds tripod with two LED lamps to
> illuminate the text) and can be carried around.
>
> After the images are captured, they will be run through a script which
> will convert them to the proper 2bit TIFF format, (ideally) adjust
> brightness and contrast, and then OCR them.
>
> Then it will expose online both the scan and the ocrtext, so people can
> improve the text. Ideally using some sort of overlap layer (at least for
> proper positioning).
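
A minimal sketch of the capture-to-text step described above (convert the scan, clean it up, then OCR it), assuming the ImageMagick `convert` and `tesseract` command-line tools are installed along with Tesseract's Portuguese language data (`por`). The function names, the 50% threshold and the file layout are illustrative assumptions, not something specified in this thread; the thresholding step is only one guess at what "the proper TIFF format" involves.

```python
import subprocess
from pathlib import Path


def build_commands(scan, lang="por", workdir="."):
    """Return the two command lines for one scanned page (nothing is run)."""
    scan = Path(scan)
    tiff = Path(workdir) / (scan.stem + ".tif")
    out_base = Path(workdir) / scan.stem  # tesseract appends ".txt" itself

    # ImageMagick: grayscale, stretch contrast, then threshold to black/white.
    # The 50% threshold is an assumption and will need tuning per document set.
    convert_cmd = [
        "convert", str(scan),
        "-colorspace", "Gray",
        "-normalize",
        "-threshold", "50%",
        str(tiff),
    ]

    # Tesseract CLI: tesseract <image> <output base> -l <language>
    ocr_cmd = ["tesseract", str(tiff), str(out_base), "-l", lang]
    return convert_cmd, ocr_cmd


def ocr_page(scan, lang="por", workdir="."):
    """Run both steps for one page and return the path of the text file."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(scan, lang, workdir):
        subprocess.run(cmd, check=True)  # raises if either tool fails
    return Path(workdir) / (Path(scan).stem + ".txt")
```

Calling `ocr_page("scans/page_001.jpg", workdir="out")` would leave the cleaned TIFF and the recognized text side by side in `out/`, which is the pair a correction interface would need to show. Keeping command construction separate from execution makes it easy to inspect or adjust the image clean-up without rerunning the OCR.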
>
>
> []'s
> Pedro Markun
>
>
> On Sun, Feb 19, 2012 at 10:29 AM, iain emsley <iain_emsley em austgate.co.uk>
> wrote:
>>
>> Todd,
>>
>> I've stayed away from ocropus so far because the build process just
>> seems unnecessarily tortuous. Time to dive in!
>>
>> My sense is that it uses Tesseract as underlying engine so it copes with
>> some of the language issues. This version appears to be under some heavy
>> development to make it more Python based and less reliant on C++ so
>> perhaps this will make it easier in future releases.
>>
>> I'll probably dive into it soon enough and give it a go.
>>
>> Iain
>>
>>
>> On Sat, 2012-02-18 at 14:38 -0800, todd.d.robbins em gmail.com wrote:
>> > What's the general sense of tesseract vs. ocropus? Which is better?
>> > I've been trying to get ocropus to play nice with OS X and it's not
>> > pretty.
>> >
>> > Tod
>> > _______________________________________________
>> > humanities-dev mailing list
>> > humanities-dev em lists.okfn.org
>> > http://lists.okfn.org/mailman/listinfo/humanities-dev
>>
>



--
Co-Founder, Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/