[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Fri Jun 21 16:03:43 UTC 2013

Hi Rufus,

Inspired by Tabula (http://tabula.nerdpower.org), I started working on table extraction from images in PDFs, with the goal of defining a structure template for one page and then applying that template to subsequent pages:
https://github.com/stefanw/carpenter
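
To illustrate the template idea, here is a minimal sketch. It is not Carpenter's actual code; the `Cell` class and `cell_boxes` function are hypothetical names, and it assumes that every page shares the same layout, so that cells can be stored as fractions of the page size and mapped to pixel boxes for each new page:

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class Cell:
    """One table cell, stored as fractions (0.0 to 1.0) of the page size.

    Hypothetical structure for illustration, not taken from Carpenter.
    """
    name: str
    left: float
    top: float
    right: float
    bottom: float


def cell_boxes(template: List[Cell], page_width: int, page_height: int) -> List[Tuple[str, Box]]:
    """Scale a relative template to pixel boxes for one concrete page."""
    boxes = []
    for cell in template:
        box = (
            round(cell.left * page_width),
            round(cell.top * page_height),
            round(cell.right * page_width),
            round(cell.bottom * page_height),
        )
        boxes.append((cell.name, box))
    return boxes


# Define the template once, then reuse it for pages scanned at any resolution.
template = [
    Cell("year", 0.0, 0.1, 0.5, 0.2),
    Cell("amount", 0.5, 0.1, 1.0, 0.2),
]
boxes = cell_boxes(template, page_width=1000, page_height=2000)
```

Storing fractions rather than pixels is one possible design choice; it keeps the template usable when subsequent pages are scanned at a different resolution.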

Carpenter uses OpenCV on the page images to detect tables and tesseract on the individual table cells to extract text. If a cell likely contains a numeric value, it limits the set of recognised characters to digits and punctuation. With such tricks I had moderately promising results, but I did not continue further.
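
A rough sketch of the character-limiting trick follows. It is not Carpenter's code: the `looks_numeric` heuristic, its threshold and the function names are my assumptions for illustration. It uses tesseract's `tessedit_char_whitelist` variable via the `-c` option, and assumes a tesseract version that accepts `stdout` as the output base:

```python
import subprocess
from typing import List

# Characters allowed when a cell is believed to hold a number (assumed set).
NUMERIC_CHARS = "0123456789.,-"


def looks_numeric(text: str, threshold: float = 0.6) -> bool:
    """Guess from a first OCR pass whether the cell holds a number.

    Hypothetical heuristic: the share of digits among the
    non-whitespace characters must reach `threshold`.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    digits = sum(c.isdigit() for c in chars)
    return digits / len(chars) >= threshold


def tesseract_command(image_path: str, numeric: bool) -> List[str]:
    """Build the tesseract call, restricting characters for numeric cells."""
    cmd = ["tesseract", image_path, "stdout"]
    if numeric:
        cmd += ["-c", "tessedit_char_whitelist=" + NUMERIC_CHARS]
    return cmd


def ocr_cell(image_path: str) -> str:
    """OCR one cell image; rerun with the whitelist if it looks numeric."""
    first = subprocess.run(
        tesseract_command(image_path, numeric=False),
        capture_output=True, text=True, check=True,
    ).stdout
    if not looks_numeric(first):
        return first.strip()
    second = subprocess.run(
        tesseract_command(image_path, numeric=True),
        capture_output=True, text=True, check=True,
    ).stdout
    return second.strip()
```

Running tesseract twice per numeric cell is slower, but the second pass cannot confuse, say, "0" with "O" because letters are no longer allowed. How well the whitelist is honoured may depend on the tesseract version and engine in use.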

I wanted Carpenter to become a web interface (think Refine) for structured OCR extraction tasks, template definition and OCR training. It's far from finished, but may be a starting point if you want to work on the topic.

Cheers
Stefan

On 20.06.2013, at 21:34 , Rufus Pollock <rufus.pollock at okfn.org> wrote:

> Hi All,
> 
> I'm writing to ask folks to share their recommendations on workflows and open tools for extracting (machine-readable) text from (bulk) scanned text (i.e. OCR etc).
> 
> (Note I'm also, like many other folks, interested in extracting structured data (e.g. tables) from normal or scanned PDF but I'm not asking about that in this thread ...)
> 
> Context - or the problems I'm interested in
> 
> I've been interested in making the 1st edition of the Oxford English Dictionary (now in public domain) available online in an open form for a while - here's the entry in the Labs Ideas tracker about this which details some of the history [1].
> 
> Key point is that thanks to Kragen Sitaker's work in 2005/2006 the whole text was scanned and uploaded to archive.org. However, it needed (and needs AFAIK) OCR'ing and then proofing.
> 
> Back in the day I took a stab at this using tesseract plus shell scripts (code now here https://github.com/okfn/oed) but it wasn't great:
> 
> - Tesseract quality on non-standard dictionary text was poor
> - Chopping up pages (both individual and columns from pages) needed bespoke automation and was error-prone
> - It was not clear what the best way to do proofing was once the OCR was done (for the work for Open Shakespeare and Encyclopaedia Britannica we just used a wiki)
> 
> Things have obviously moved on in the last 5 years and I was wondering what the best tools to use for this are today (e.g. is tesseract still the best open-source option?).
> 
> Rufus
> 
> PS: if you're also interested in this project please let me know :-)
> 
> [1]: https://github.com/okfn/ideas/issues/50

-- 
Stefan Wehrmeyer
Project lead, FragDenStaat.de
stefan.wehrmeyer at okfn.org
+49 151 15550559
Open Knowledge Foundation Deutschland e.V.
Gneisenaustr. 52 
10961 Berlin
http://www.okfn.de

Donate to FragDenStaat.de:
https://fragdenstaat.de/hilfe/spenden/