[okfn-labs] Best practice for OCR workflows (re OED 1st edition project)
Stefan Wehrmeyer
stefan.wehrmeyer at okfn.org
Fri Jun 21 16:03:43 UTC 2013
Hi Rufus,
Inspired by Tabula (http://tabula.nerdpower.org), I started working on table extraction from images in PDFs, with the goal of defining a structure template for a page and then applying that template to subsequent pages:
https://github.com/stefanw/carpenter
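To illustrate the template idea, here is a minimal sketch in Python. It is not taken from Carpenter; the names `Cell` and `grid_template` and the choice to store cell positions as fractions of the page size are assumptions made for this example. Storing fractions rather than pixels is one way to let a template defined on one page be reused on pages rendered at a different resolution.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Cell:
    """One table cell, stored as fractions of page width/height (0.0 to 1.0)."""
    row: int
    col: int
    left: float
    top: float
    right: float
    bottom: float

    def to_pixels(self, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
        """Scale the cell to a crop box (left, top, right, bottom) in pixels."""
        return (
            round(self.left * page_width),
            round(self.top * page_height),
            round(self.right * page_width),
            round(self.bottom * page_height),
        )


def grid_template(col_edges: List[float], row_edges: List[float]) -> List[Cell]:
    """Build a template from the fractional positions of the table's ruling lines."""
    cells = []
    for r in range(len(row_edges) - 1):
        for c in range(len(col_edges) - 1):
            cells.append(
                Cell(r, c, col_edges[c], row_edges[r], col_edges[c + 1], row_edges[r + 1])
            )
    return cells


# Define the template once (hypothetical edge positions, e.g. taken from the first page) ...
template = grid_template(col_edges=[0.1, 0.5, 0.9], row_edges=[0.2, 0.3, 0.4])

# ... then reuse it on later pages, even if they were rendered at another size.
for width, height in [(1000, 2000), (2480, 3508)]:
    boxes = [cell.to_pixels(width, height) for cell in template]
    print(width, height, boxes)
```

Each pixel box could then be used to crop a cell image out of the page before handing it to the OCR step.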
Carpenter uses OpenCV on the page images to detect tables and tesseract on the table cells to extract text. If a cell likely contains a numeric value, it limits the set of characters to digits and punctuation. With such tricks I had moderately promising success, but I did not continue further.
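The numeric-cell trick can be sketched roughly as follows. This is an illustration, not Carpenter's actual code: the two-pass approach, the 50% digit threshold, the exact whitelist, and the helper names are all assumptions. The `tessedit_char_whitelist` variable and the page segmentation mode option are part of the tesseract command line, but the spelling `--psm` applies to newer releases (older 3.x releases use `-psm`), and how strictly the whitelist is honoured can depend on the Tesseract version and engine.

```python
import subprocess
from typing import List

# Assumed whitelist: digits plus the punctuation commonly found in numbers.
NUMERIC_WHITELIST = "0123456789.,-%"


def looks_numeric(text: str, threshold: float = 0.5) -> bool:
    """Guess whether a first-pass OCR result is meant to be a number.

    Heuristic (an assumption): at least `threshold` of the non-space
    characters are digits.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    digits = sum(ch.isdigit() for ch in chars)
    return digits / len(chars) >= threshold


def tesseract_command(image_path: str, numeric: bool = False) -> List[str]:
    """Build a tesseract call that prints the text of one cell image to stdout."""
    # --psm 7 asks tesseract to treat the image as a single line of text.
    cmd = ["tesseract", image_path, "stdout", "--psm", "7"]
    if numeric:
        cmd += ["-c", "tessedit_char_whitelist=" + NUMERIC_WHITELIST]
    return cmd


def ocr_cell(image_path: str) -> str:
    """Two-pass OCR: unrestricted first, then restricted if the cell looks numeric."""
    first = subprocess.run(
        tesseract_command(image_path), capture_output=True, text=True, check=True
    ).stdout.strip()
    if not looks_numeric(first):
        return first
    second = subprocess.run(
        tesseract_command(image_path, numeric=True),
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return second
```

Calling `ocr_cell("cell.png")` requires the `tesseract` binary on the PATH; `looks_numeric` and `tesseract_command` can be tried without it. The restricted second pass is meant to stop confusions such as "l" for "1" or "O" for "0" in cells that are clearly numbers.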
I wanted Carpenter to become a web interface (think Refine) for structured OCR extraction tasks, template definition and OCR training. It's far from finished, but may be a starting point if you want to work on the topic.
Cheers
Stefan
On 20.06.2013, at 21:34 , Rufus Pollock <rufus.pollock at okfn.org> wrote:
> Hi All,
>
> I'm writing to ask folks to share their recommendations on workflows and open tools for extracting (machine-readable) text from (bulk) scanned text (i.e. OCR etc).
>
> (Note I'm also, like many other folks, interested in extracting structured data (e.g. tables) from normal or scanned PDF but I'm not asking about that in this thread ...)
>
> Context - or the problems I'm interested in
>
> I've been interested in making the 1st edition of the Oxford English Dictionary (now in public domain) available online in an open form for a while - here's the entry in the Labs Ideas tracker about this which details some of the history [1].
>
> Key point is that thanks to Kragen Sitaker's work in 2005/2006 the whole text got scanned and is uploaded to archive.org. However, it needed (and needs AFAIK) OCR'ing and then proofing.
>
> Back in the day I took a stab at this using tesseract plus shell scripts (code now here https://github.com/okfn/oed) but it wasn't great:
>
> - Tesseract quality on non-standard dictionary text was poor
> - Chopping up pages (both individual and columns from pages) needed bespoke automation and was error-prone
> - Not clear what the best way to do proofing was once the OCR was done (for the work for Open Shakespeare and Encyclopaedia Britannica we just used a wiki)
>
> Things have obviously moved on in the last 5 years and I was wondering what the best tools to use for this are today (e.g. is tesseract still the best open-source option?).
>
> Rufus
>
> PS: if you're also interested in this project please let me know :-)
>
> [1]: https://github.com/okfn/ideas/issues/50
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
--
Stefan Wehrmeyer
Project lead, FragDenStaat.de
stefan.wehrmeyer at okfn.org
+49 151 15550559
Open Knowledge Foundation Deutschland e.V.
Gneisenaustr. 52
10961 Berlin
http://www.okfn.de
Donate to FragDenStaat.de:
https://fragdenstaat.de/hilfe/spenden/