[humanities-dev] OCRing text

iain emsley iain_emsley at austgate.co.uk
Mon Feb 27 20:30:14 UTC 2012


Sounds good, which is why I suggested starting work on it, as I'd
already been doing something on this.

I had assumed (and hopefully the list will correct me if I'm wrong) that
Textus was being built on the Bibserver code, which is where the
conversations had been going as I understood them.

So...

I've been working on using Tesseract and some Python-Tesseract wrappers
(using http://wiki.github.com/hoffstaetter/python-tesseract, which is
GPL 3. Is this OKD compliant? I haven't seen anything against it; if it
isn't, I'll have to write something that is). The idea is to take in
some data from a form (the data model needs sorting out if this is to
fit in with Textus; do we have any movement here?) and return it in
BibJSON. Something along the lines of:
{ record: {
      author: [{name: <author name>}], 
      title : <title>, 
      description : <publish details>, 
      image_location: </path/to/stored/file>,
      text : <ocr'd text>
   }
}

which builds on the current model that I can see. It does need to be
extended and made more flexible, but I need to dive further into BibJSON
first.
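
To make the shape above concrete, here is a minimal Python sketch that
assembles such a record and serialises it as JSON. The function name
build_record and the sample values are hypothetical; the field names are
the ones proposed above, not an agreed BibJSON schema.

```python
import json


def build_record(author_names, title, description, image_location, text):
    """Assemble a record in the shape proposed above.

    The field names mirror the proposal in this email; they are an
    assumption, not a settled BibJSON schema.
    """
    return {
        "record": {
            "author": [{"name": name} for name in author_names],
            "title": title,
            "description": description,
            "image_location": image_location,
            "text": text,
        }
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    record = build_record(
        author_names=["Example Author"],
        title="Example Title",
        description="Example publisher, 1900",
        image_location="/path/to/stored/file.tif",
        text="Text produced by the OCR step",
    )
    print(json.dumps(record, indent=2))
```

Keeping the authors as a list of objects leaves room for extra
per-author fields later without changing the overall shape.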

I had thought/hoped that it could be a plugin to Textus or another
project, as well as being pluggable into Bibserver to create a web- or
network-based system as outlined in an earlier email.

If we could turn it into a basic system with an API to store scans for
users, as Todd has just suggested, or simply to give scans a place, then
all the better. It does perhaps raise some other questions, such as
whether we could interact with Distributed Proofreaders. Another email
and conversation?

I must take a deeper look at PyBossa (I did have a quick scan), but
equally that would be great to tap into. Again, perhaps another email
and conversation?

Building a DIY Bookscanner is largely a done project, but I thought it
might be fun to explore the area in terms of interacting with software.
It seemed a logical end point at the time to go from storing to
creating.

Putting the code together appears relatively straightforward; it is the
training that looks likely to take time.
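
For the code side, a rough sketch of the OCR step might look like the
following. It assumes the tesseract command-line binary is installed and
on the PATH, and calls it directly rather than through the wrapper
mentioned above; the function names and the default language code are
illustrative only.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for the tesseract command-line tool.

    tesseract writes its result to <output_base>.txt.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run tesseract on one image and return the recognised text.

    Assumes the tesseract binary and the language data for `lang`
    are installed; raises CalledProcessError if tesseract fails.
    """
    subprocess.check_call(tesseract_command(image_path, output_base, lang))
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling ocr_image("page.tif", "page") would leave the text in page.txt
and return it. Training Tesseract for a particular typeface is a
separate step that this sketch does not cover.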

Just some thoughts which turned out longer than I'd thought. Once I've
got something more concrete and in a repo, I'll write a blog post.

Iain

On Mon, 2012-02-27 at 14:33 +0000, Rufus Pollock wrote:
> Just wanted to jump in here as OCR'ing stuff is something I (and
> others at the OKF) have long been interested in e.g.
> 
> <http://ideas.okfn.org/ideas/108/put-11th-edition-of-encylopaedia-brittanica-online-in-reusable-format>
> <http://ideas.okfn.org/ideas/20/oxford-english-dictionary-1st-ed-full-text-online>
> 
> I'm wondering if there is a need (and interest) in building a simple
> OCR service:
> 
> <http://ideas.okfn.org/ideas/106/pdf-tiff-scan-to-text-conversion-service>
> 
> Integrated with PyBossa / TEXTUS it could provide a nice scan / ocr /
> transcribe workflow.
>
> On 19 February 2012 22:11, Pedro Markun <pedro at esfera.mobi> wrote:
> > I've tried OCRopus a bit and at least for Brazilian Portuguese I would
> > rather stick with Tesseract.
> 
> That's a nice piece of info as I wondered whether OCRopus was useful.
> I last used tesseract a few years ago and it looks like it has come on
> quite a bit.
> 
> > The first results are quite nice if you have a good-resolution picture in the
> > right TIFF format. For specific sets of documents I was hoping to build a
> > web app which makes it easier to build the training sets: it auto-generates the
> > text with correct spacing and font so people can print it at home, scan it and
> > upload it back through a web interface?
> 
> Nice :-)
> 
> Rufus
> 
> > About mobile, the idea behind the 3D-printed bookscanner is exactly that:
> > creating a lightweight system which can be assembled quickly (the first
> > sketches look like a War of the Worlds tripod with two LED lamps to
> > illuminate the text) and can be carried around.
> >
> > After the images are captured, they will be passed through a script which
> > will convert them to the proper 2bit TIFF format, (ideally) adjust brightness
> > and contrast and then OCR them.
> >
> > Then it will expose online both the scan and the OCR text, so people can
> > improve the text, ideally using some sort of overlay layer (at least for
> > proper positioning).
> >
> >
> > []'s
> > Pedro Markun
> >
> >
> > On Sun, Feb 19, 2012 at 10:29 AM, iain emsley <iain_emsley at austgate.co.uk>
> > wrote:
> >>
> >> Todd,
> >>
> >> I've stayed away from ocropus so far because the build process just
> >> seems unnecessarily tortuous. Time to dive in!
> >>
> >> My sense is that it uses Tesseract as its underlying engine, so it copes
> >> with some of the language issues. This version appears to be under heavy
> >> development to make it more Python-based and less reliant on C++, so
> >> perhaps this will make it easier in future releases.
> >>
> >> I'll probably dive into it soon enough and give it a go.
> >>
> >> Iain
> >>
> >>
> >> On Sat, 2012-02-18 at 14:38 -0800, todd.d.robbins at gmail.com wrote:
> >> > What's the general sense of tesseract vs. ocropus? Which is better?
> >> > I've been trying to get ocropus to play nice with OS X and it's not
> >> > pretty.
> >> >
> >> > Tod
> >> > _______________________________________________
> >> > humanities-dev mailing list
> >> > humanities-dev at lists.okfn.org
> >> > http://lists.okfn.org/mailman/listinfo/humanities-dev