[pd-discuss] [okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Mon Aug 26 01:52:42 UTC 2013

On 24 August 2013 11:28, Jonathan Gray <jonathan.gray at okfn.org> wrote:

> Regarding plans for an open version of the 1st edition of the OED, I
> thought some of you might be interested in this piece from Cory Doctorow
> yesterday:
>
>
> http://www.theguardian.com/technology/2013/aug/23/oxford-english-dictionary-future-digitally
>
> What do we need to move forward with an Open OED project [1]? It would be
> really cool if there were any way to break the dictionary down into entries
> that people could help to proofread and correct. Any thoughts on that
> front? Anyone else interested in helping?
>

Right now it would not need a lot to move forward - a small amount of time
/ effort could probably result in headway being made :-) What's most needed
(see plan below) is *someone to spend an hour or two* getting
a massive XML OCR output into a more usable form. I'm a bit
time-constrained at present so it probably won't be me immediately ;-) but
I've put together an outline of what you'd probably want to
do <https://github.com/okfn/oed/issues/3> -
most of it inlined below (advice here warmly welcomed - I'm something of a
novice with the OCR and ABBYY XML stuff!)

*More detail ...*

Thanks to Tom Morris and others we now know where we can get the ABBYY OCR
versions the Internet Archive automatically makes - see the thread in the
Ideas issue here:

https://github.com/okfn/ideas/issues/50

I've just knocked together a very rough plan here:
https://github.com/okfn/oed/issues/1

<quote>
## Do a Trial Run (first 20 pages of vol 1)

* [x] Locate scans - #2 <https://github.com/okfn/oed/issues/2>
* [x] Get hold of OCR of volume 1 - #3 <https://github.com/okfn/oed/issues/3> -
  In Progress - we can grab Internet Archive versions e.g.
  https://ia600401.us.archive.org/7/items/oed01arch/oed01arch_abbyy.gz
* [ ] Check the OCR and proof-edit - #4 <https://github.com/okfn/oed/issues/4>
* [ ] Based on this experience work out how we scale
</quote>

Right now the place we need to do work is getting usable OCR output.
As documented in the issue <https://github.com/okfn/oed/issues/3> my guess
of the steps is:

* Grab the XML from Internet Archive
* Gunzip
* Convert the massive XML to something more manageable, suggest both
  * Chop into per page
  * Go from full ABBYY XML to something better
* Upload the results somewhere shareable - e.g. assets.okfnlabs.org/p/oed
* Move on to #4 (review and proof)
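
As a very rough sketch of the gunzip-and-chop steps above (not a tested
pipeline): the Python below streams the gzipped XML and writes one plain-text
file per page. It assumes the ABBYY output nests `<page>`, `<line>` and
`<charParams>` elements with one character per `<charParams>`, which should be
checked against the real file; the function names and the `page-0001.txt`
naming are made up for illustration.

```python
import gzip
import itertools
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def page_text(page):
    """Join the characters of each <line> in a page into plain text lines.

    Assumption: each character sits in its own <charParams> element.
    """
    lines = []
    for elem in page.iter():
        if local_name(elem.tag) == "line":
            chars = [c.text or "" for c in elem.iter()
                     if local_name(c.tag) == "charParams"]
            lines.append("".join(chars))
    return "\n".join(lines)


def iter_pages(gz_path):
    """Yield (page_number, text) without loading the whole file in memory."""
    with gzip.open(gz_path, "rb") as fh:
        number = 0
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if local_name(elem.tag) == "page":
                number += 1
                yield number, page_text(elem)
                elem.clear()  # drop the page subtree we have already processed


def write_trial_pages(gz_path, out_dir, limit=20):
    """Write the first `limit` pages as page-0001.txt, page-0002.txt, ..."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for number, text in itertools.islice(iter_pages(gz_path), limit):
        path = os.path.join(out_dir, f"page-{number:04d}.txt")
        with open(path, "w", encoding="utf-8") as out:
            out.write(text)
        count += 1
    return count
```

After downloading oed01arch_abbyy.gz from the Internet Archive URL in the
plan, calling `write_trial_pages("oed01arch_abbyy.gz", "pages")` would cover
the 20-page trial run. Plain text throws away the coordinates and confidence
data in the XML, so it may not be the "something better" we finally want, but
it is probably enough to start proofing.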

> If there were any simple tasks that people could do, Adam said he could
> help publicise this to readers of the Public Domain Review.
>

That would be great. As Tom Morris pointed out there's a neat Firefox
extension that allows you to do OCR correction in the browser. There are
plenty of other neat approaches we could try from etherpads to browser
extensions and git repos ...

Rufus

>
> [1] https://github.com/okfn/oed
>