[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Hans Thompson hans.thompson1 at gmail.com
Fri Jun 21 16:49:06 UTC 2013


I started a conversation on a similar topic a couple of months ago and wanted
to share my progress on an R-based PDF table extractor that uses microtasks
to set columns and rows. It also includes some QC functions for the OCR and
transcription of each cell.

https://github.com/hansthompson/pdf.table.converteR

Hans Thompson


On Fri, Jun 21, 2013 at 8:03 AM, Stefan Wehrmeyer <stefan.wehrmeyer at okfn.org> wrote:

> Hi Rufus,
>
> inspired by Tabula (http://tabula.nerdpower.org) I started working on
> table extraction from images in PDFs with the goal to define a structure
> template for a page and then use that template on subsequent pages:
> https://github.com/stefanw/carpenter
>
> Carpenter uses OpenCV on images to detect tables and tesseract on the
> table cells to extract text, and limits the set of characters to
> digits/punctuation if the cell likely contains a numeric value. With such
> tricks I had moderately promising success, but did not continue further.
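
A rough sketch of the character-restriction trick described above, for
anyone who wants to try it. This is not Carpenter's actual code. The
two-pass approach, the 0.6 threshold and the helper names (looks_numeric,
tesseract_config, ocr_cell) are assumptions made for illustration.
tessedit_char_whitelist is a real Tesseract config variable, but how well
it is honoured varies with the Tesseract version and engine, so treat the
second pass as something to verify on your own scans.

```python
import string

# Characters allowed in a cell that we believe holds a number (an assumption;
# adjust for your data, e.g. currency symbols).
NUMERIC_CHARS = string.digits + ".,-%"


def looks_numeric(text, threshold=0.6):
    """Guess whether a first-pass OCR result is probably a number.

    Returns True when at least `threshold` of the non-whitespace
    characters are digits or numeric punctuation.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    hits = sum(c in NUMERIC_CHARS for c in chars)
    return hits / len(chars) >= threshold


def tesseract_config(numeric):
    """Build a Tesseract config string for a single table cell."""
    config = "--psm 7"  # page segmentation mode 7: a single line of text
    if numeric:
        config += " -c tessedit_char_whitelist=" + NUMERIC_CHARS
    return config


def ocr_cell(cell_image):
    """Two-pass OCR of one cell image.

    Needs the pytesseract package and a tesseract binary; the import is
    done here so the helpers above can be used without either.
    """
    import pytesseract

    first = pytesseract.image_to_string(cell_image, config=tesseract_config(False))
    if looks_numeric(first):
        # Second pass restricted to digits/punctuation.
        return pytesseract.image_to_string(cell_image, config=tesseract_config(True))
    return first
```

The first pass runs unrestricted so that text cells are left alone; only
cells that already look mostly numeric get the restricted second pass,
which is where confusions such as O/0 or l/1 are most likely to be fixed.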
>
> I wanted Carpenter to become a web interface (think Refine) for structured
> OCR extraction tasks, template definition and OCR training. It's far from
> finished, but may be a starting point if you want to work on the topic.
>
> Cheers
> Stefan
>
> On 20.06.2013, at 21:34 , Rufus Pollock <rufus.pollock at okfn.org> wrote:
>
> > Hi All,
> >
> > I'm writing to ask folks to share their recommendations on workflows and
> open tools for extracting (machine-readable) text from (bulk) scanned text
> (i.e. OCR etc).
> >
> > (Note I'm also, like many other folks, interested in extracting
> structured data (e.g. tables) from normal or scanned PDF but I'm not asking
> about that in this thread ...)
> >
> > Context - or the problems I'm interested in
> >
> > I've been interested in making the 1st edition of the Oxford English
> Dictionary (now in public domain) available online in an open form for a
> while - here's the entry in the Labs Ideas tracker about this which details
> some of the history [1].
> >
> > Key point is that thanks to Kragen Sitaker's work in 2005/2006 the whole
> text got scanned and is uploaded to archive.org. However, it needed (and
> needs AFAIK) OCR'ing and then proofing.
> >
> > Back in the day I took a stab at this using tesseract plus shell scripts
> (code now here https://github.com/okfn/oed) but it wasn't great:
> >
> > - Tesseract quality on non-standard dictionary text was poor
> > - Chopping up pages (both individual and columns from pages) needed
> bespoke automation and was error-prone
> > - Not clear what the best way to do proofing was once done (for the work for
> Open Shakespeare and Encyclopaedia Britannica we just used a wiki)
> >
> > Things have obviously moved on in the last 5 years and I was wondering
> what the best tools to use for this are today (e.g. is tesseract still the
> best open-source option).
> >
> > Rufus
> >
> > PS: if you're also interested in this project please let me know :-)
> >
> > [1]: https://github.com/okfn/ideas/issues/50
> > _______________________________________________
> > okfn-labs mailing list
> > okfn-labs at lists.okfn.org
> > http://lists.okfn.org/mailman/listinfo/okfn-labs
> > Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>
> --
> Stefan Wehrmeyer
> Projektleiter FragDenStaat.de
> stefan.wehrmeyer at okfn.org
> +49 151 15550559
> Open Knowledge Foundation Deutschland e.V.
> Gneisenaustr. 52
> 10961 Berlin
> http://www.okfn.de
>
> Spenden Sie für FragDenStaat.de:
> https://fragdenstaat.de/hilfe/spenden/