[pd-discuss] [open-humanities] [okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Mon Aug 26 09:10:38 UTC 2013

Forwarding James's reply, which initially bounced...

On 26 August 2013 09:25, James Cummings <james.cummings at it.ox.ac.uk> wrote:

>  Hi all,
>
>  Dare I suggest that the TEI dictionaries module has been created for
> just this sort of data?
> This can be as detailed or simplistic as one needs and more importantly
> one can document exactly how one is using the TEI in a machine processable
> customisation file.  Having done similar conversions before suggests it
> will all depend on whether there is sufficient granularity of markup in the
> OCR conversion of the presentational markup from the original source.
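
To make the TEI suggestion a little more concrete, here is a minimal sketch in Python of what one dictionary entry might look like when built with elements from the TEI dictionaries module (entry, form, orth, gramGrp, pos, sense, def). The helper name make_entry, the choice of fields, and the sample values are all assumptions for illustration only; a real OED conversion would need far richer markup (etymology, quotations, nested senses) and a documented customisation.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; registering it with an empty prefix makes it the
# default namespace when the tree is serialised.
TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def q(name):
    """Return the namespace-qualified tag for a TEI element name."""
    return f"{{{TEI_NS}}}{name}"


def make_entry(headword, pos, definition):
    """Build a very small TEI <entry> (hypothetical helper, not project code)."""
    entry = ET.Element(q("entry"))
    form = ET.SubElement(entry, q("form"), {"type": "lemma"})
    ET.SubElement(form, q("orth")).text = headword
    gram = ET.SubElement(entry, q("gramGrp"))
    ET.SubElement(gram, q("pos")).text = pos
    sense = ET.SubElement(entry, q("sense"))
    ET.SubElement(sense, q("def")).text = definition
    return entry


if __name__ == "__main__":
    # Placeholder values, not taken from the dictionary itself.
    entry = make_entry("example", "n.", "A placeholder definition.")
    print(ET.tostring(entry, encoding="unicode"))
```

How much of this structure can be filled in automatically depends, as James says, on how much presentational detail (bold headwords, italics, and so on) survives in the OCR output.
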
>
>  James
>
>  --
> Dr James Cummings, Academic IT Services, University of Oxford
>
>
>
> -------- Original message --------
> From: Rufus Pollock <rufus.pollock at okfn.org>
> Date: 2013/08/26 03:28 (GMT+00:00)
> To: Jonathan Gray <jonathan.gray at okfn.org>
> Cc: okfn-labs <okfn-labs at lists.okfn.org>,open-humanities <
> open-humanities at lists.okfn.org>,Adam Green <adam.green at okfn.org>,Public
> Domain discuss list <pd-discuss at lists.okfn.org>
> Subject: Re: [open-humanities] [okfn-labs] Best practice for OCR workflows
> (re OED 1st edition project)
>
>
>  On 24 August 2013 11:28, Jonathan Gray <jonathan.gray at okfn.org> wrote:
>
>> Regarding plans for an open version of the 1st edition of the OED, I
>> thought some of you might be interested in this piece from Cory Doctorow
>> yesterday:
>>
>>
>> http://www.theguardian.com/technology/2013/aug/23/oxford-english-dictionary-future-digitally
>>
>>  What do we need to move forward with an Open OED project [1]? It would
>> be really cool if there were any way to break the dictionary down into
>> entries that people could help to proofread and correct. Any thoughts on
>> that front? Anyone else interested in helping?
>>
>
>  Right now it would not take a lot to move forward - a small amount of
> time / effort could probably make real headway :-) What's most needed
> (see plan below) is *someone to spend an hour or two* getting a massive
> XML OCR output into a more usable form. I'm a bit time-constrained at
> present so it probably won't be me immediately ;-) but I've put together
> an outline of what you'd probably want to do
> <https://github.com/okfn/oed/issues/3> - most of it inlined below (advice
> here warmly welcomed - I'm something of a novice with the OCR and ABBYY
> XML stuff!)
>
>  *More detail ...*
>
>  Thanks to Tom Morris and others we now know where we can get the ABBYY
> OCR versions the Internet Archive automatically makes - see the thread in
> the Ideas issue here:
>
>  https://github.com/okfn/ideas/issues/50
>
>  I've just knocked together a very rough plan here:
> https://github.com/okfn/oed/issues/1
>
>  <quote>
>  ## Do a Trial Run (first 20 pages of vol 1)
>
>  * [x] Locate scans - #2 <https://github.com/okfn/oed/issues/2>
> * [x] Get hold of OCR of volume 1 - #3 <https://github.com/okfn/oed/issues/3> - In Progress - we can grab Internet Archive versions, e.g.
>   https://ia600401.us.archive.org/7/items/oed01arch/oed01arch_abbyy.gz
> * [ ] Check the OCR and proof-edit - #4 <https://github.com/okfn/oed/issues/4>
> * [ ] Based on this experience, work out how we scale
>  </quote>
>
>  Right now the place we need to do work is getting usable OCR output. As
> documented in the issue <https://github.com/okfn/oed/issues/3> my guess at
> the steps is:
>
>  * Grab the XML from Internet Archive
> * Gunzip
> * Convert the massive XML to something more manageable, suggest both
>   * Chop into per page
>   * Go from full ABBYY XML to something better
> * Upload the results somewhere shareable - e.g. assets.okfnlabs.org/p/oed
> * Move on to #4 (review and proof)
>
>
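
The gunzip and chop-into-pages steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a tested pipeline: it assumes the ABBYY XML wraps each page in a "page" element, each text line in a "line" element, and each character in a "charParams" element, and it matches those names regardless of namespace. The function names, the output file naming, and the tiny SAMPLE document are hypothetical and only there for illustration. The XML is streamed with iterparse so the whole file never has to sit in memory.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from pathlib import Path

# Tiny made-up document in the ASSUMED page/line/charParams shape.
# The namespace is a placeholder and the content is not real OED data.
SAMPLE = (
    b'<document xmlns="urn:example:placeholder">'
    b"<page><block><text><par>"
    b"<line><formatting><charParams>a</charParams>"
    b"<charParams>b</charParams></formatting></line>"
    b"<line><formatting><charParams>c</charParams></formatting></line>"
    b"</par></text></block></page>"
    b"<page><block><text><par>"
    b"<line><formatting><charParams>d</charParams></formatting></line>"
    b"</par></text></block></page>"
    b"</document>"
)


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_page_texts(stream, max_pages=None):
    """Yield (page_number, plain_text) for each page element in the stream."""
    page_no = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page_no += 1
        lines = []
        for node in elem.iter():
            if local_name(node.tag) != "line":
                continue
            # Join the single characters stored in each charParams element.
            chars = [
                c.text or ""
                for c in node.iter()
                if local_name(c.tag) == "charParams"
            ]
            lines.append("".join(chars))
        yield page_no, "\n".join(lines)
        elem.clear()  # drop the finished page's children to save memory
        if max_pages is not None and page_no >= max_pages:
            break


def split_gzipped_ocr(gz_path, out_dir, max_pages=20):
    """Gunzip on the fly and write one text file per page; return the count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with gzip.open(gz_path, "rb") as stream:
        for page_no, text in iter_page_texts(stream, max_pages):
            (out / f"page-{page_no:04d}.txt").write_text(text, encoding="utf-8")
            count += 1
    return count


if __name__ == "__main__":
    for number, text in iter_page_texts(io.BytesIO(SAMPLE)):
        print(number, repr(text))
```

For the trial run one would call something like split_gzipped_ocr("oed01arch_abbyy.gz", "pages", max_pages=20) on the downloaded file. Plain text per page throws away the formatting information, so a second pass that keeps font and position attributes would likely be needed for anything beyond proofreading.
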
>>  If there were any simple tasks that people could do, Adam said he could
>> help publicise this to readers of the Public Domain Review.
>>
>
>  That would be great. As Tom Morris pointed out there's a neat Firefox
> extension that allows you to do OCR correction in the browser. There are
> plenty of other neat approaches we could try from etherpads to browser
> extensions and git repos ...
>
>  Rufus
>
>
>>
>>  [1] https://github.com/okfn/oed
>>

-- 

Jonathan Gray

Director of Policy and Ideas  | *@jwyg <https://twitter.com/jwyg>*

The Open Knowledge Foundation <http://okfn.org/>

Empowering through Open Knowledge

okfn.org  |  @okfn <http://twitter.com/OKFN>  |  OKF on
Facebook <https://www.facebook.com/OKFNetwork>  |
Blog <http://blog.okfn.org/>  |  Newsletter <http://okfn.org/about/newsletter>