[okfn-labs] OCR API?

todd.d.robbins at gmail.com todd.d.robbins at gmail.com
Tue Jan 6 21:38:42 UTC 2015


Matt,

This is great! I just tested it out on some JP2 and PNG files I had lying
around and it worked well.

–Tod


On Tue, Jan 6, 2015 at 11:34 AM, Matthew Fullerton <matt.fullerton at gmail.com> wrote:

> Hi labs,
> I now have a working setup here:
> http://beta.offenedaten.de:9998/tika
>
> This is using Apache Tika with Tesseract. So far I've installed the English
> and German languages; I think we could add Russian
> (http://packages.ubuntu.com/trusty/tesseract-ocr-rus).
>
> Test by doing things like:
>
>     curl -T tiff_example.tif http://beta.offenedaten.de:9998/tika
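
The same upload can be scripted. This is only a minimal Python sketch: it assumes the server takes an HTTP PUT on /tika (which is what `curl -T` sends) and answers with the extracted text. The names `build_tika_request` and `extract_text` are made up for illustration, and the two headers are my assumptions about how to request plain text and let the server detect the file type itself.

```python
import urllib.request

# Endpoint quoted in this thread; swap in your own host if you run the container.
TIKA_URL = "http://beta.offenedaten.de:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the raw file bytes."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",  # curl -T uploads with PUT
        headers={
            "Accept": "text/plain",  # assumption: ask for plain-text output
            # assumption: a generic type so the server detects the real format
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path: str, url: str = TIKA_URL, timeout: float = 120.0) -> str:
    """Upload the file at `path` and return whatever text the server sends back."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text("tiff_example.tif")` should then behave like the curl line, with a generous timeout because OCR on large scans can be slow.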
>
> Fuller instructions here:
> https://wiki.apache.org/tika/TikaOCR
>
> You can run your own using Docker by doing:
>
>     sudo docker build -t tika github.com/mattfullerton/tika-tesseract-docker
>     sudo docker run -d -p 9998:9998 tika
>
> I'm very open to improvements to the Docker build files; I am no expert
> there.
>
> What is lacking now (AFAIK) is detection that standard text extraction
> from a PDF 'failed' with a fallback to tesseract. We should look into that.
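
One rough way to approach that detection, as a sketch only: treat an extraction as 'failed' when it yields almost no letters or digits per page, and only then hand the file to OCR. Nothing below is part of Tika; `looks_like_failed_extraction`, `extract_with_fallback` and the threshold of 25 characters per page are hypothetical, and `extract` and `ocr` stand in for whatever callables actually do the work.

```python
from typing import Callable


def looks_like_failed_extraction(
    text: str, pages: int = 1, min_chars_per_page: int = 25
) -> bool:
    """Heuristic: very few letters/digits per page suggests a scanned PDF.

    The threshold is an arbitrary starting point, not a tested value.
    """
    useful = sum(ch.isalnum() for ch in text)
    return useful < min_chars_per_page * max(pages, 1)


def extract_with_fallback(
    data: bytes,
    extract: Callable[[bytes], str],
    ocr: Callable[[bytes], str],
    pages: int = 1,
) -> str:
    """Try normal text extraction first; run OCR only if it looks empty."""
    text = extract(data)
    if looks_like_failed_extraction(text, pages):
        return ocr(data)
    return text
```

Keeping the two steps as plain callables means the check stays independent of how the text layer is read or which OCR engine is installed.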
>
> As discussed in greater detail on https://github.com/okfn/ideas/issues/88,
> I think running a more complex, general service for data processing 'jobs'
> as Friedrich has suggested is a great idea. I just haven't had time to
> contribute much in that direction so far.
>
> Best,
> Matt
>
>
> On 1 November 2014 at 12:06, Christian Ledermann <
> christian.ledermann at gmail.com> wrote:
>
>> maybe scavenge some code from
>>
>> https://plone.org/products/pdftoocr
>> or
>> https://www.nathanvangheem.com/news/using-plone-as-a-document-repository
>>
>> sorry I am short - sprinting
>>
>> On 31 October 2014 14:40, Rufus Pollock <rufus.pollock at okfn.org> wrote:
>> > I've booted this idea issue for us to track stuff:
>> >
>> > https://github.com/okfn/ideas/issues/88
>> >
>> > (As this idea is a bit different from the PDF to text I linked earlier).
>> > Please dump any links or thoughts in there as well as on list :-)
>> >
>> > Rufus
>> >
>> > On 31 October 2014 14:22, Friedrich Lindenberg
>> > <friedrich.lindenberg at okfn.org> wrote:
>> >>
>> >> Especially their online service, ocrsdk.com, is very good - we tried it
>> >> out with yanukovichleaks for Russian and Ukrainian docs, but they
>> >> wouldn't give us any free queries :) Still, I think we need to have a
>> >> better open source stack, for all of those projects where you don't
>> >> really want to send all of your documents to three dozen untrusted
>> >> services.
>> >>
>> >> - Friedrich
>> >>
>> >>> On Fri, Oct 31, 2014 at 4:29 PM, Ivan Begtin <ibegtin at gmail.com> wrote:
>> >>>
>> >>> Hi Matthew.
>> >>>
>> >>> Do-it-yourself OCR is a great idea, but it's not so simple for
>> >>> languages other than English. I don't know of any free OCR engine or
>> >>> service for the Russian language.
>> >>>
>> >>> A few times we partnered with a Russian OCR company, Abbyy, and they
>> >>> provided us with free access to their online OCR API -
>> >>> http://www.abbyyonline.com
>> >>>
>> >>> I could ask them to provide free access to their OCR service. I can't
>> >>> promise that they will do it, but the chance is much more than 50%.
>> >>>
>> >>>
>> >>> Best Regards,
>> >>>    Ivan Begtin
>> >>>
>> >>> 2014-10-31 14:51 GMT+03:00 Matthew Fullerton <matt.fullerton at gmail.com>:
>> >>>>
>> >>>> Hi okfn-labs,
>> >>>> The need for OCR comes up again and again. I suppose often enough it's
>> >>>> enough to run the document(s) through the program of choice (I would
>> >>>> be interested in what this is right now) and then deal with the
>> >>>> consequences. What I want to know is whether there are any
>> >>>> do-it-yourself service/API solutions out there that I could get up
>> >>>> and running on a server to use on a permanent basis?
>> >>>>
>> >>>> If there isn't one yet I will probably be helping to cobble one
>> >>>> together. So advice, experience and expressions of interest would
>> >>>> also be appreciated.
>> >>>>
>> >>>> Best,
>> >>>> Matt
>> >>>>
>> >>>> _______________________________________________
>> >>>> okfn-labs mailing list
>> >>>> okfn-labs at lists.okfn.org
>> >>>> https://lists.okfn.org/mailman/listinfo/okfn-labs
>> >>>> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >
>> >
>> >
>>
>>
>>
>> --
>> Best Regards,
>>
>> Christian Ledermann
>>
>
>


-- 
Tod Robbins
Digital Asset Manager, MLIS
todrobbins.com | @todrobbins <http://www.twitter.com/#!/todrobbins>