[pd-discuss] [open-humanities] [okfn-labs] Best practice for OCR workflows (re OED 1st edition project)
Jonathan Gray
jonathan.gray at okfn.org
Mon Aug 26 09:10:38 UTC 2013
Forwarding James's reply, which initially bounced...
On 26 August 2013 09:25, James Cummings <james.cummings at it.ox.ac.uk> wrote:
> Hi all,
>
> Dare I suggest that the TEI dictionaries module has been created for
> just this sort of data?
> This can be as detailed or simplistic as one needs and more importantly
> one can document exactly how one is using the TEI in a machine processable
> customisation file. Having done similar conversions before, I'd suggest it
> will all depend on whether the OCR conversion preserves the presentational
> markup of the original source with sufficient granularity.
>
> James
>
> --
> Dr James Cummings, Academic IT Services, University of Oxford
>
>
>
> -------- Original message --------
> From: Rufus Pollock <rufus.pollock at okfn.org>
> Date: 2013/08/26 03:28 (GMT+00:00)
> To: Jonathan Gray <jonathan.gray at okfn.org>
> Cc: okfn-labs <okfn-labs at lists.okfn.org>,open-humanities <
> open-humanities at lists.okfn.org>,Adam Green <adam.green at okfn.org>,Public
> Domain discuss list <pd-discuss at lists.okfn.org>
> Subject: Re: [open-humanities] [okfn-labs] Best practice for OCR workflows
> (re OED 1st edition project)
>
>
> On 24 August 2013 11:28, Jonathan Gray <jonathan.gray at okfn.org> wrote:
>
>> Regarding plans for an open version of the 1st edition of the OED, I
>> thought some of you might be interested in this piece from Cory Doctorow
>> yesterday:
>>
>>
>> http://www.theguardian.com/technology/2013/aug/23/oxford-english-dictionary-future-digitally
>>
>> What do we need to move forward with an Open OED project [1]? It would
>> be really cool if there were any way to break the dictionary down into
>> entries that people could help to proofread and correct. Any thoughts on
>> that front? Anyone else interested in helping?
>>
>
> Right now it would not need a lot to move forward - a small amount of
> time / effort could probably result in headway being made :-) Right now
> (see plan below) what's most needed is *someone to spend an hour or two*
> getting a massive XML OCR output into a more usable form. I'm a bit
> time-constrained at present so it probably won't be me immediately ;-) but
> I've put together an outline of what you'd probably want to do
> <https://github.com/okfn/oed/issues/3> - most of it inlined below (advice
> here warmly welcomed - I'm something of a novice with the OCR and ABBYY
> XML stuff!)
>
> *More detail ...*
>
> Thanks to Tom Morris and others we now know where we can get the ABBYY
> OCR versions the Internet Archive automatically makes - see the thread in
> the Ideas issue here:
>
> https://github.com/okfn/ideas/issues/50
>
> I've just knocked together a very rough plan here:
> https://github.com/okfn/oed/issues/1
>
> <quote>
> ## Do a Trial Run (first 20 pages of vol 1)
>
> * [x] Locate scans #2 <https://github.com/okfn/oed/issues/2>
> * [x] Get hold of OCR of volume 1 - #3 <https://github.com/okfn/oed/issues/3> - In Progress - we can grab Internet Archive versions e.g.
> https://ia600401.us.archive.org/7/items/oed01arch/oed01arch_abbyy.gz
> * [ ] Check the OCR and proof-edit - #4 <https://github.com/okfn/oed/issues/4>
>
> * [ ] Based on this experience work out how we scale
> </quote>
>
> Right now the place we need to do work is getting usable OCR output. As documented
> in the issue <https://github.com/okfn/oed/issues/3> my guess of the steps
> is:
>
> * Grab the XML from Internet Archive
> * Gunzip
> * Convert the massive XML to something more manageable, suggest both:
>   * Chop into per-page pieces
>   * Go from full ABBYY XML to something better
> * Upload the results somewhere shareable - e.g. assets.okfnlabs.org/p/oed
> * Move on to #4 (review and proof)
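
[A minimal sketch of the gunzip and chop-into-pages steps listed above. It is
not code from the thread. It assumes the ABBYY XML stores each page in a
<page> element, each text line in a <line> element, and each character in its
own <charParams> element; that layout should be checked against the real file.
The names local_name, iter_page_text and split_into_pages, and the
page-0001.txt naming scheme, are hypothetical.]

```python
import gzip
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip any '{namespace}' prefix so we can match on the bare tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_page_text(stream):
    """Yield (page_number, plain_text) for each <page> element in the stream.

    Uses iterparse so the whole document never has to sit in memory at once.
    ASSUMPTION: characters live in <charParams> elements inside <line>
    elements; adjust the tag names if the real file differs.
    """
    page_no = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page_no += 1
        lines = []
        for node in elem.iter():
            if local_name(node.tag) != "line":
                continue
            chars = [c.text or "" for c in node.iter()
                     if local_name(c.tag) == "charParams"]
            lines.append("".join(chars))
        yield page_no, "\n".join(lines)
        elem.clear()  # drop this page's children once it has been emitted


def split_into_pages(gz_path, out_dir, limit=None):
    """Gunzip on the fly and write one text file per page.

    `limit` stops after that many pages (e.g. for a trial run).
    Returns the list of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with gzip.open(gz_path, "rb") as fh:
        for page_no, text in iter_page_text(fh):
            if limit is not None and page_no > limit:
                break
            target = out / f"page-{page_no:04d}.txt"
            target.write_text(text, encoding="utf-8")
            written.append(target)
    return written
```

[With oed01arch_abbyy.gz already downloaded from the Internet Archive URL in
the plan, split_into_pages("oed01arch_abbyy.gz", "pages", limit=20) would
cover the first 20 pages of the trial run. Plain text per page throws away the
layout and formatting attributes, so a TEI conversion would likely need a
richer intermediate form.]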
>
>
>> If there were any simple tasks that people could do, Adam said he could
>> help publicise this to readers of the Public Domain Review.
>>
>
> That would be great. As Tom Morris pointed out there's a neat Firefox
> extension that allows you to do OCR correction in the browser. There are
> plenty of other neat approaches we could try from etherpads to browser
> extensions and git repos ...
>
> Rufus
>
>
>>
>> [1] https://github.com/okfn/oed
>>
>
--
Jonathan Gray
Director of Policy and Ideas | *@jwyg <https://twitter.com/jwyg>*
The Open Knowledge Foundation <http://okfn.org/>
Empowering through Open Knowledge
okfn.org | @okfn <http://twitter.com/OKFN> | OKF on
Facebook<https://www.facebook.com/OKFNetwork> |
Blog <http://blog.okfn.org/> | Newsletter<http://okfn.org/about/newsletter>