[open-humanities] [okfn-labs] Best practice for OCR workflows (re OED 1st edition project)

Tue Aug 27 10:10:16 UTC 2013

Super - it would be great to have your help Dan!

To start coordinating next steps, I suggest that anyone interested in helping
out add their name to this GitHub issue, and then we can take things from
there:

https://github.com/okfn/oed/issues/5

J.

On 26 August 2013 15:57, dan o'huiginn <ohuiginn at gmail.com> wrote:

> "It would be really cool if there were any way to break the dictionary
> down into entries that people could help to proofread and correct. Any
> thoughts on that front?"
>
> The good thing about dictionary entries is that they are all in order. So
> headwords are words that appear at the start of a column, and which share
> their first few letters with many other words on the page. If we can also
> detect the larger/bolder text, so much the better.
>
> So I'd suggest we:
>  - guess headwords present on each page, as above
>  - provisionally break up the OCR'd text, based on what is between two
> headwords (may break, depending on how well the OCR has handled columns)
>  - make an approximate selection of the boundaries of text for an entry
>
> Then we will have an approximate list of all the headwords in the OED,
> with page number and location on the page.
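
A minimal Python sketch of the heuristic Dan describes above, for anyone who
wants to try it on a page of OCR output. It assumes the page is already a list
of text lines in reading order. The function names and the `prefix_len` and
`min_shared` thresholds are illustrative choices, not anything taken from the
okfn/oed repo. It ignores font size and weight, so expect false positives on
continuation lines that happen to start with a similar word.

```python
import re
from collections import Counter


def first_word(line):
    """Return the leading alphabetic token of a line, lowercased, or None."""
    m = re.match(r"\s*([A-Za-z][A-Za-z'-]*)", line)
    return m.group(1).lower() if m else None


def guess_headwords(lines, prefix_len=2, min_shared=3):
    """Return (line_index, word) pairs that look like headwords.

    A line-initial word is kept when its first `prefix_len` letters are
    shared by at least `min_shared` line-initial words on the page, and it
    does not sort before the previously accepted headword.
    """
    firsts = [first_word(line) for line in lines]
    prefix_counts = Counter(w[:prefix_len] for w in firsts if w)
    guesses = []
    last = ""
    for i, w in enumerate(firsts):
        if not w or prefix_counts[w[:prefix_len]] < min_shared:
            continue
        if w < last:  # out of alphabetical order, probably running text
            continue
        guesses.append((i, w))
        last = w
    return guesses


def split_entries(lines, guesses):
    """Provisionally cut the page into (headword, text) chunks."""
    entries = []
    for n, (start, word) in enumerate(guesses):
        end = guesses[n + 1][0] if n + 1 < len(guesses) else len(lines)
        text = " ".join(s.strip() for s in lines[start:end])
        entries.append((word, text))
    return entries
```

Calling `split_entries(lines, guess_headwords(lines))` gives the provisional
per-entry text. The line indices in the guesses are what would later be mapped
back to positions on the page image.
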
>
> We should then be able to present a page for each word, containing:
>  - an image of the page, centred on the correct headword. The user should
> be able to move and redraw the boundaries if we got that wrong.
>  - OCR'd text, with the possibility to correct it
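
For the "image centred on the headword" part, a small sketch of the arithmetic,
assuming we know the headword's pixel position on the page scan. The function
name and the argument layout are hypothetical. It only computes the crop
rectangle and clamps it so the view never runs off the page.

```python
def crop_box(page_w, page_h, cx, cy, view_w, view_h):
    """Return (left, top, right, bottom) for a view centred on (cx, cy).

    The view is shrunk to the page size if it is larger than the page,
    and shifted back inside the page when the headword is near an edge.
    """
    view_w = min(view_w, page_w)
    view_h = min(view_h, page_h)
    left = min(max(cx - view_w // 2, 0), page_w - view_w)
    top = min(max(cy - view_h // 2, 0), page_h - view_h)
    return left, top, left + view_w, top + view_h
```

The returned tuple uses the (left, top, right, bottom) order that most image
libraries expect for a crop, and a user-adjusted boundary could simply replace
it.
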
>
> At this point we would have a useful-but-cumbersome system -- it would
> already be one of the most comprehensive free dictionaries online.
>
> Incidentally, let me say I'm very excited about this project. I've been
> idly poking at the OED for years, but never had enough time/commitment to
> produce anything usable out of it. But with a little momentum, I believe
> this project could get a long way quite quickly.
>
> Dan
>
>
>
> On Mon, Aug 26, 2013 at 11:10 AM, Jonathan Gray <jonathan.gray at okfn.org> wrote:
>
>> Forwarding James's reply, which initially bounced...
>>
>> On 26 August 2013 09:25, James Cummings <james.cummings at it.ox.ac.uk> wrote:
>>
>>>  Hi all,
>>>
>>>  Dare I suggest that the TEI dictionaries module has been created for
>>> just this sort of data?
>>> This can be as detailed or simplistic as one needs and more importantly
>>> one can document exactly how one is using the TEI in a machine processable
>>> customisation file.  Having done similar conversions before suggests it
>>> will all depend on whether there is sufficient granularity of markup in the
>>> OCR conversion of the presentational markup from the original source.
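
To make the TEI suggestion concrete, here is a rough Python sketch that builds
one entry using element names from the TEI dictionaries module (`entry`,
`form`, `orth`, `gramGrp`, `pos`, `sense`, `def`). The helper name is
hypothetical, the TEI namespace is left out for brevity, and the sample
headword and definition are placeholders, not text from the OED.

```python
import xml.etree.ElementTree as ET


def tei_entry(headword, pos, definition):
    """Build a very small TEI-style <entry> element for one headword."""
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form", type="lemma")
    ET.SubElement(form, "orth").text = headword
    gram = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram, "pos").text = pos
    sense = ET.SubElement(entry, "sense")
    ET.SubElement(sense, "def").text = definition
    return entry


def to_xml(entry):
    """Serialise an entry element to a unicode string."""
    return ET.tostring(entry, encoding="unicode")
```

How much further the markup can go (etymology, quotations, numbered senses)
depends, as James says, on how much presentational detail survives in the OCR.
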
>>>
>>>  James
>>>
>>>  --
>>> Dr James Cummings, Academic IT Services, University of Oxford
>>>
>>>
>>>
>>> -------- Original message --------
>>> From: Rufus Pollock <rufus.pollock at okfn.org>
>>> Date: 2013/08/26 03:28 (GMT+00:00)
>>> To: Jonathan Gray <jonathan.gray at okfn.org>
>>> Cc: okfn-labs <okfn-labs at lists.okfn.org>,open-humanities <
>>> open-humanities at lists.okfn.org>,Adam Green <adam.green at okfn.org>,Public
>>> Domain discuss list <pd-discuss at lists.okfn.org>
>>> Subject: Re: [open-humanities] [okfn-labs] Best practice for OCR
>>> workflows (re OED 1st edition project)
>>>
>>>
>>>  On 24 August 2013 11:28, Jonathan Gray <jonathan.gray at okfn.org> wrote:
>>>
>>>> Regarding plans for an open version of the 1st edition of the OED, I
>>>> thought some of you might be interested in this piece from Cory Doctorow
>>>> yesterday:
>>>>
>>>>
>>>> http://www.theguardian.com/technology/2013/aug/23/oxford-english-dictionary-future-digitally
>>>>
>>>>  What do we need to move forward with an Open OED project [1]? It
>>>> would be really cool if there were any way to break the dictionary down
>>>> into entries that people could help to proofread and correct. Any thoughts
>>>> on that front? Anyone else interested in helping?
>>>>
>>>
>>>  Right now it would not need a lot to move forward - a small amount of
>>> time / effort could probably result in headway being made :-) Right now
>>> (see plan below) what's most needed is *someone to spend an hour or two*
>>> getting a massive XML OCR output into a more usable form. I'm a bit
>>> time-constrained at present so it probably won't be me immediately ;-) but
>>> I've put together an outline of what you'd probably want to do
>>> <https://github.com/okfn/oed/issues/3> - most of it inlined below (advice
>>> here warmly welcomed - I'm something of a novice with the OCR and ABBYY
>>> XML stuff!)
>>>
>>>  *More detail ...*
>>>
>>>  Thanks to Tom Morris and others we now know where we can get the ABBYY
>>> OCR versions the Internet Archive automatically makes - see the thread in
>>> the Ideas issue here:
>>>
>>>  https://github.com/okfn/ideas/issues/50
>>>
>>>  I've just knocked together a very rough plan here:
>>> https://github.com/okfn/oed/issues/1
>>>
>>>  <quote>
>>>  ## Do a Trial Run (first 20 pages of vol 1)
>>>
>>>  * [x] Locate scans #2 <https://github.com/okfn/oed/issues/2>
>>> * [x] Get hold of OCR of volume 1 - #3 <https://github.com/okfn/oed/issues/3>
>>>   - In Progress - we can grab Internet Archive versions e.g.
>>>   https://ia600401.us.archive.org/7/items/oed01arch/oed01arch_abbyy.gz
>>> * [ ] Check the OCR and proof-edit - #4 <https://github.com/okfn/oed/issues/4>
>>> * [ ] Based on this experience work out how we scale
>>>  </quote>
>>>
>>>  Right now the place we need to do work is getting usable OCR output.
>>> As documented in the issue <https://github.com/okfn/oed/issues/3> my
>>> guess of the steps is:
>>>
>>>  * Grab the XML from Internet Archive
>>> * Gunzip
>>> * Convert the massive XML to something more manageable, suggest both
>>>   * Chop into per page
>>>   * Go from full ABBYY XML to something better
>>> * Upload the results somewhere shareable - e.g.
>>> assets.okfnlabs.org/p/oed
>>> * Move on to #4 (review and proof)
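
The steps above can be sketched in Python roughly as follows. This is a sketch
under assumptions, not a tested pipeline: it assumes the file is ABBYY
FineReader-style XML in which each page is a `page` element containing `line`
elements whose characters sit in `charParams` elements. Those element names
should be checked against the real file. The helper names and the
`page_0001.txt` naming are made up for illustration. The file is streamed, so
it never has to be fully unzipped or loaded into memory.

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path


def fetch(url, dest):
    """Download `url` to the local path `dest`."""
    urllib.request.urlretrieve(url, dest)


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (page_number, [line_text, ...]) from gzipped ABBYY-style XML.

    `source` is a file path or a binary file object. Element names are
    assumptions about the schema (see the note above the code).
    """
    with gzip.open(source, "rb") as fh:
        page_no = 0
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if local(elem.tag) != "page":
                continue
            page_no += 1
            lines = []
            for line in elem.iter():
                if local(line.tag) != "line":
                    continue
                chars = [c.text or "" for c in line.iter()
                         if local(c.tag) == "charParams"]
                lines.append("".join(chars))
            yield page_no, lines
            elem.clear()  # drop the page's children to keep memory down


def write_pages(source, out_dir):
    """Write one plain-text file per page and return the page count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for page_no, lines in iter_pages(source):
        name = f"page_{page_no:04d}.txt"
        (out / name).write_text("\n".join(lines), encoding="utf-8")
        count += 1
    return count
```

Usage would be something like `fetch(url, "oed01arch_abbyy.gz")` with the
Internet Archive URL given earlier in this thread, followed by
`write_pages("oed01arch_abbyy.gz", "pages")`. Plain text per page throws away
coordinates and font information, so a second output that keeps those would be
needed for headword detection.
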
>>>
>>>
>>>>  If there were any simple tasks that people could do, Adam said he
>>>> could help publicise this to readers of the Public Domain Review.
>>>>
>>>
>>>  That would be great. As Tom Morris pointed out there's a neat Firefox
>>> extension that allows you to do OCR correction in the browser. There are
>>> plenty of other neat approaches we could try from etherpads to browser
>>> extensions and git repos ...
>>>
>>>  Rufus
>>>
>>>
>>>>
>>>>  [1] https://github.com/okfn/oed
>>>>
>>
>>
>>
>> --
>>
>> Jonathan Gray
>>
>> Director of Policy and Ideas  | @jwyg <https://twitter.com/jwyg>
>>
>> The Open Knowledge Foundation <http://okfn.org/>
>>
>> Empowering through Open Knowledge
>>
>> okfn.org  |  @okfn <http://twitter.com/OKFN>  |  OKF on Facebook <https://www.facebook.com/OKFNetwork>  |
>> Blog <http://blog.okfn.org/>  |  Newsletter <http://okfn.org/about/newsletter>
>>
>
