[okfn-labs] [open-humanities] Best practice for OCR workflows (re OED 1st edition project)

James Cummings james.cummings at it.ox.ac.uk
Mon Aug 26 07:25:50 UTC 2013


Hi all,

Dare I suggest that the TEI dictionaries module has been created for just this sort of data?
This can be as detailed or simplistic as one needs and, more importantly, one can document exactly how one is using the TEI in a machine-processable customisation file.  Having done similar conversions before, I suspect it will all depend on whether the OCR conversion preserves the presentational markup of the original source at sufficient granularity.
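
To give a rough idea of the simple end of that range, here is a minimal sketch in Python that builds one entry using elements from the TEI dictionaries module (entry, form, orth, gramGrp, pos, sense, def). The make_entry helper and the sample headword and definition are purely illustrative assumptions, not text taken from the OED or from any existing conversion.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def make_entry(headword, pos, definition):
    """Build a very simple TEI <entry> (hypothetical helper, illustration only)."""
    # Serialise TEI as the default namespace so output has no ns0: prefixes.
    ET.register_namespace("", TEI_NS)

    def tag(name):
        return "{%s}%s" % (TEI_NS, name)

    entry = ET.Element(tag("entry"))
    form = ET.SubElement(entry, tag("form"), {"type": "lemma"})
    ET.SubElement(form, tag("orth")).text = headword
    gram = ET.SubElement(entry, tag("gramGrp"))
    ET.SubElement(gram, tag("pos")).text = pos
    sense = ET.SubElement(entry, tag("sense"))
    ET.SubElement(sense, tag("def")).text = definition
    return entry


if __name__ == "__main__":
    # Sample values are made up for illustration, not quoted from the dictionary.
    example = make_entry("abacus", "sb.", "a calculating frame")
    print(ET.tostring(example, encoding="unicode"))
```

A fuller encoding would add etymologies and quotations (for example with cit and quote), but how far one can go depends on what the OCR preserved.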

James

--
Dr James Cummings, Academic IT Services, University of Oxford



-------- Original message --------
From: Rufus Pollock <rufus.pollock at okfn.org>
Date: 2013/08/26 03:28 (GMT+00:00)
To: Jonathan Gray <jonathan.gray at okfn.org>
Cc: okfn-labs <okfn-labs at lists.okfn.org>,open-humanities <open-humanities at lists.okfn.org>,Adam Green <adam.green at okfn.org>,Public Domain discuss list <pd-discuss at lists.okfn.org>
Subject: Re: [open-humanities] [okfn-labs] Best practice for OCR workflows (re OED 1st edition project)


On 24 August 2013 11:28, Jonathan Gray <jonathan.gray at okfn.org> wrote:
Regarding plans for an open version of the 1st edition of the OED, I thought some of you might be interested in this piece from Cory Doctorow yesterday:

http://www.theguardian.com/technology/2013/aug/23/oxford-english-dictionary-future-digitally

What do we need to move forward with an Open OED project [1]? It would be really cool if there were any way to break the dictionary down into entries that people could help to proofread and correct. Any thoughts on that front? Anyone else interested in helping?

Right now it would not need a lot to move forward - a small amount of time / effort could probably result in headway being made :-) What's most needed (see plan below) is someone to spend an hour or two getting a massive XML OCR output into a more usable form. I'm a bit time-constrained at present so it probably won't be me immediately ;-) but I've put together an outline of what you'd probably want to do<https://github.com/okfn/oed/issues/3> - most of it inlined below (advice here warmly welcomed - I'm something of a novice with the OCR and ABBYY XML stuff!)

More detail ...

Thanks to Tom Morris and others we now know where we can get the ABBYY OCR versions the Internet Archive automatically makes - see the thread in the Ideas issue here:

https://github.com/okfn/ideas/issues/50

I've just knocked together a very rough plan here: https://github.com/okfn/oed/issues/1

<quote>
## Do a Trial Run (first 20 pages of vol 1)

* [x] Locate scans #2<https://github.com/okfn/oed/issues/2>
* [x] Get hold of OCR of volume 1 - #3<https://github.com/okfn/oed/issues/3> - In Progress - we can grab Internet Archive versions e.g. https://ia600401.us.archive.org/7/items/oed01arch/oed01arch_abbyy.gz
* [ ] Check the OCR and proof-edit - #4<https://github.com/okfn/oed/issues/4>
* [ ] Based on this experience, work out how we scale
</quote>

Right now the place we need to do work is getting usable OCR output. As documented in the issue<https://github.com/okfn/oed/issues/3> my guess of the steps is:

* Grab the XML from Internet Archive
* Gunzip
* Convert the massive XML to something more manageable, suggest both
  * Chop into per page
  * Go from full ABBYY XML to something better
* Upload the results somewhere shareable - e.g. assets.okfnlabs.org/p/oed<http://assets.okfnlabs.org/p/oed>
* Move on to #4 (review and proof)
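
The gunzip and chop-into-pages steps above could be sketched roughly as follows in Python. This is only a sketch under assumptions: it assumes the ABBYY XML nests `page`, `line` and `charParams` elements (one character per `charParams`), which should be checked against the actual file, and `iter_pages` and `split_to_files` are hypothetical helper names. It streams the file so the whole XML never has to fit in memory.

```python
import gzip
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (page_number, list_of_text_lines), one page at a time.

    Assumes <page> elements containing <line> elements whose characters
    are held in <charParams> children (an assumption about the schema).
    """
    page_number = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page_number += 1
        lines = []
        for line in elem.iter():
            if local_name(line.tag) != "line":
                continue
            chars = [c.text or "" for c in line.iter()
                     if local_name(c.tag) == "charParams"]
            lines.append("".join(chars))
        yield page_number, lines
        elem.clear()  # drop the finished page's children to save memory


def split_to_files(gz_path, out_dir):
    """Write one plain-text file per page, e.g. out_dir/page-0001.txt."""
    os.makedirs(out_dir, exist_ok=True)
    with gzip.open(gz_path, "rb") as stream:
        for number, lines in iter_pages(stream):
            name = os.path.join(out_dir, "page-%04d.txt" % number)
            with open(name, "w", encoding="utf-8") as out:
                out.write("\n".join(lines) + "\n")
```

Something like `split_to_files("oed01arch_abbyy.gz", "pages")` would then be run on the downloaded file. Plain text per page throws away the font and layout information, so a version that keeps some of the formatting attributes may be a better basis for the proofing step.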

If there were any simple tasks that people could do, Adam said he could help publicise this to readers of the Public Domain Review.

That would be great. As Tom Morris pointed out there's a neat Firefox extension that allows you to do OCR correction in the browser. There are plenty of other neat approaches we could try from etherpads to browser extensions and git repos ...

Rufus


[1] https://github.com/okfn/oed