[okfn-labs] OCR API?

Matthew Fullerton matt.fullerton at gmail.com
Tue Jan 6 18:34:21 UTC 2015


Hi labs,
I now have a working setup here:
http://beta.offenedaten.de:9998/tika

This uses Apache Tika with Tesseract. So far I've installed the English and
German language packs; I think we could also add Russian
(http://packages.ubuntu.com/trusty/tesseract-ocr-rus).

Test by doing things like:

    curl -T tiff_example.tif http://beta.offenedaten.de:9998/tika

Fuller instructions here:
https://wiki.apache.org/tika/TikaOCR
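
For anyone who prefers a script to curl, here is a rough Python sketch of the
same upload. It assumes the server accepts an HTTP PUT on /tika and honours an
"Accept: text/plain" header, as a stock tika-server should; the function names
are made up for illustration, and the explicit Content-Type is only there to
stop urllib from defaulting to a form-encoded type.

```python
import urllib.request

# Endpoint from this mail; swap in your own server (e.g. http://localhost:9998/tika).
TIKA_URL = "http://beta.offenedaten.de:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request roughly equivalent to `curl -T <file> <url>`."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            # Assumption: ask for plain text rather than XHTML output.
            "Accept": "text/plain",
            # Avoid urllib's default of application/x-www-form-urlencoded.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path: str, url: str = TIKA_URL, timeout: float = 120.0) -> str:
    """Upload the file at `path` and return the server's response as text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_text("tiff_example.tif") should then return whatever text
Tesseract recognises in the image.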

You can run your own using Docker by doing:

    sudo docker build -t tika github.com/mattfullerton/tika-tesseract-docker
    sudo docker run -d -p 9998:9998 tika

I'm very open to improvements to the Docker build files; I am no expert
there.

What is still missing (AFAIK) is a way to detect that standard text
extraction from a PDF has 'failed' (for example, a scanned PDF with no text
layer) and then fall back to Tesseract. We should look into that.
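
One possible way to do that detection, purely as a sketch: treat the
extraction as failed when it yields almost no letters or digits per page, and
only then run OCR. The threshold, the function names and the two callables
passed in are all hypothetical; this is a heuristic of my own, not something
Tika provides.

```python
from typing import Callable


def looks_like_failed_extraction(text: str, page_count: int = 1,
                                 min_chars_per_page: int = 50) -> bool:
    """Guess whether plain text extraction missed the real content.

    Heuristic (an assumption, not Tika behaviour): a scanned PDF usually
    yields next to no letters or digits, so flag the result when the count of
    alphanumeric characters is below a per-page threshold.
    """
    alnum = sum(1 for ch in text if ch.isalnum())
    return alnum < min_chars_per_page * max(page_count, 1)


def extract_with_fallback(path: str,
                          extract_text: Callable[[str], str],
                          ocr_text: Callable[[str], str],
                          page_count: int = 1) -> str:
    """Try normal extraction first; run OCR only if the result looks empty."""
    text = extract_text(path)
    if looks_like_failed_extraction(text, page_count):
        return ocr_text(path)
    return text
```

The threshold would need tuning on real documents: PDFs that are mostly
images with a short caption could still slip under or over it.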

As discussed in greater detail on https://github.com/okfn/ideas/issues/88,
I think running a more complex, general service for data processing 'jobs',
as Friedrich has suggested, is a great idea. I just haven't had time to
contribute much in that direction so far.

Best,
Matt


On 1 November 2014 at 12:06, Christian Ledermann <
christian.ledermann at gmail.com> wrote:

> maybe scavenge some code from
>
> https://plone.org/products/pdftoocr
> or
> https://www.nathanvangheem.com/news/using-plone-as-a-document-repository
>
> sorry I am short - sprinting
>
> On 31 October 2014 14:40, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> > I've booted this idea issue for us to track stuff:
> >
> > https://github.com/okfn/ideas/issues/88
> >
> > (As this idea is a bit different from the PDF to text I linked earlier).
> > Please dump any links or thoughts in there as well as on list :-)
> >
> > Rufus
> >
> > On 31 October 2014 14:22, Friedrich Lindenberg
> > <friedrich.lindenberg at okfn.org> wrote:
> >>
> >> Especially their online service, ocrsdk.com, is very good - we tried it
> >> out with yanukovichleaks for Russian and Ukrainian docs, but they
> >> wouldn't give us any free queries :) Still, I think we need to have a
> >> better open source stack, for all of those projects where you don't
> >> really want to send all of your documents to three dozen untrusted
> >> services.
> >>
> >> - Friedrich
> >>
> >> On Fri, Oct 31, 2014 at 4:29 PM, Ivan Begtin <ibegtin at gmail.com> wrote:
> >>>
> >>> Hi Matthew.
> >>>
> >>> Do-it-yourself OCR is a great idea; it's just not so simple for
> >>> languages other than English. I don't know of any free OCR engine or
> >>> service for the Russian language.
> >>>
> >>> A few times we have partnered with a Russian OCR company, Abbyy, and
> >>> they gave us free access to their online OCR API:
> >>> http://www.abbyyonline.com
> >>>
> >>> I could ask them to provide free access to their OCR service. I can't
> >>> promise that they will do it, but the chance is well above 50%.
> >>>
> >>>
> >>> Best Regards,
> >>>    Ivan Begtin
> >>>
> >>> 2014-10-31 14:51 GMT+03:00 Matthew Fullerton <matt.fullerton at gmail.com>:
> >>>>
> >>>> Hi okfn-labs,
> >>>> The need for OCR comes up again and again. I suppose it's often enough
> >>>> to run the document(s) through the program of choice (I would be
> >>>> interested in what this is right now) and then deal with the
> >>>> consequences. What I want to know is whether there are any
> >>>> do-it-yourself service/API solutions out there that I could get up
> >>>> and running on a server to use on a permanent basis?
> >>>>
> >>>> If there isn't one yet I will probably be helping to cobble one
> >>>> together. So advice, experience and expressions of interest would
> >>>> also be appreciated.
> >>>>
> >>>> Best,
> >>>> Matt
> >>>>
> >>>> _______________________________________________
> >>>> okfn-labs mailing list
> >>>> okfn-labs at lists.okfn.org
> >>>> https://lists.okfn.org/mailman/listinfo/okfn-labs
> >>>> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Best Regards,
> >>>    Ivan Begtin
> >>>
> >>> Director of NGO "Informational Culture"
> >>> email: ibegtin at infoculture.ru
> >>> phone: +7 499 500 96 58, +7 910 426 68 83
> >>> website: http://infoculture.ru
> >
> >
> >
> > --
> >
> > Rufus Pollock
> >
> > Founder and President | skype: rufuspollock | @rufuspollock
> >
> > Open Knowledge - see how data can change the world
> >
> > http://okfn.org/ | @okfn | Open Knowledge on Facebook |  Blog
> >
> > The Open Knowledge Foundation is a not-for-profit organisation.  It is
> > incorporated in England & Wales as a company limited by guarantee, with
> > company number 05133759.  VAT Registration № GB 984404989. Registered
> office
> > address: Open Knowledge Foundation, St John’s Innovation Centre, Cowley
> > Road, Cambridge, CB4 0WS, UK.
> >
>
>
>
> --
> Best Regards,
>
> Christian Ledermann
>
> London - UK
> Mobile : +44 7474997517
>
> <*)))>{
>
> If you save the living environment, the biodiversity that we have left,
> you will also automatically save the physical environment, too. But If
> you only save the physical environment, you will ultimately lose both.
>
> 1) Don’t drive species to extinction
>
> 2) Don’t destroy a habitat that species rely on.
>
> 3) Don’t change the climate in ways that will result in the above.
>
> }<(((*>