[okfn-labs] [open-humanities] Best practice for OCR workflows (re OED 1st edition project)

Mon Aug 26 13:57:01 UTC 2013

"It would be really cool if there were any way to break the dictionary down
into entries that people could help to proofread and correct. Any thoughts
on that front?"

The good thing about dictionary entries is that they are all in order. So
headwords are words that appear at the start of a column, and which share
their first few letters with many other words on the page. If we can also
detect the larger/bolder text, so much the better.

So I'd suggest we:
 - guess headwords present on each page, as above
 - provisionally break up the OCR'd text, based on what is between two
headwords (may break, depending on how well the OCR has handled columns)
 - make an approximate selection of the boundaries of text for an entry

Then we will have an approximate list of all the headwords in the OED, with
page number and location on the page.
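
A rough sketch of the first two steps, in Python, in case it helps someone get started. Everything here is an assumption for illustration: the function names, the `prefix_len` and `min_count` parameters, and the sample lines (made up, not taken from the OED). It only looks at line-initial words and alphabetical order, and ignores bold or large text and the column problems mentioned above.

```python
import re
from collections import Counter

# First word of a line: letters, apostrophes and hyphens only.
FIRST_WORD = re.compile(r"\s*([A-Za-z][A-Za-z'-]*)")


def guess_headwords(lines, prefix_len=2, min_count=2):
    """Return (line_index, word) pairs for lines that probably start an entry.

    Heuristic only: keep line-initial words whose first `prefix_len` letters
    are shared by at least `min_count` line-initial words on the page, then
    drop any candidate that sorts before the previously accepted one.
    """
    firsts = []
    for i, line in enumerate(lines):
        m = FIRST_WORD.match(line)
        if m:
            firsts.append((i, m.group(1)))

    prefix_counts = Counter(w[:prefix_len].lower() for _, w in firsts)

    headwords = []
    last = ""
    for i, word in firsts:
        key = word.lower()
        if prefix_counts[key[:prefix_len]] < min_count:
            continue  # prefix too rare on this page
        if key < last:
            continue  # would break alphabetical order
        headwords.append((i, word))
        last = key
    return headwords


def split_entries(lines, headwords):
    """Provisionally cut the page text at each guessed headword."""
    entries = []
    for n, (start, word) in enumerate(headwords):
        end = headwords[n + 1][0] if n + 1 < len(headwords) else len(lines)
        entries.append({
            "headword": word,
            "first_line": start,
            "last_line": end - 1,
            "text": "\n".join(lines[start:end]),
        })
    return entries


# Made-up sample lines, standing in for one OCR'd column.
sample_page = [
    "Abandon, v. To give up to the control of",
    "another; to surrender.",
    "Abase, v. To lower in rank or",
    "esteem; to humble.",
    "Abate, v. To beat down, put an",
    "end to.",
]

for entry in split_entries(sample_page, guess_headwords(sample_page)):
    print(entry["headword"], entry["first_line"], entry["last_line"])
```

The order check is greedy, so one false positive early on a page can hide real headwords after it. A real version would probably want to score candidates rather than accept the first one that fits.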

We should then be able to present a page for each word, containing:
 - an image of the page, centred on the correct headword. The user should
be able to move and redraw the boundaries if we got that wrong.
 - OCR'd text, with the possibility to correct it

At this point we would have a useful-but-cumbersome system -- it would
already be one of the most comprehensive free dictionaries online.

Incidentally, let me say I'm very excited about this project. I've been
idly poking at the OED for years, but never had enough time/commitment to
produce anything usable out of it. But with a little momentum, I believe
this project could get a long way quite quickly.

Dan

On Mon, Aug 26, 2013 at 11:10 AM, Jonathan Gray <jonathan.gray at okfn.org> wrote:

> Forwarding James's reply, which initially bounced...
>
> On 26 August 2013 09:25, James Cummings <james.cummings at it.ox.ac.uk> wrote:
>
>>  Hi all,
>>
>>  Dare I suggest that the TEI dictionaries module has been created for
>> just this sort of data?
>> This can be as detailed or simplistic as one needs and, more importantly,
>> one can document exactly how one is using the TEI in a machine-processable
>> customisation file.  Having done similar conversions before, I suspect it
>> will all depend on whether the OCR conversion captures the presentational
>> markup of the original source with sufficient granularity.
>>
>>  James
>>
>>  --
>> Dr James Cummings, Academic IT Services, University of Oxford
>>
>>
>>
>> -------- Original message --------
>> From: Rufus Pollock <rufus.pollock at okfn.org>
>> Date: 2013/08/26 03:28 (GMT+00:00)
>> To: Jonathan Gray <jonathan.gray at okfn.org>
>> Cc: okfn-labs <okfn-labs at lists.okfn.org>,open-humanities <
>> open-humanities at lists.okfn.org>,Adam Green <adam.green at okfn.org>,Public
>> Domain discuss list <pd-discuss at lists.okfn.org>
>> Subject: Re: [open-humanities] [okfn-labs] Best practice for OCR
>> workflows (re OED 1st edition project)
>>
>>
>>  On 24 August 2013 11:28, Jonathan Gray <jonathan.gray at okfn.org> wrote:
>>
>>> Regarding plans for an open version of the 1st edition of the OED, I
>>> thought some of you might be interested in this piece from Cory Doctorow
>>> yesterday:
>>>
>>>
>>> http://www.theguardian.com/technology/2013/aug/23/oxford-english-dictionary-future-digitally
>>>
>>>  What do we need to move forward with an Open OED project [1]? It would
>>> be really cool if there were any way to break the dictionary down into
>>> entries that people could help to proofread and correct. Any thoughts on
>>> that front? Anyone else interested in helping?
>>>
>>
>>  Right now it would not need a lot to move forward - a small amount of
>> time / effort could probably result in headway being made :-) Right now
>> (see plan below) what's most needed is *someone to spend an hour or two*
>> getting a massive XML OCR output into a more useable form. I'm a bit
>> time-constrained at present so it probably won't be me immediately ;-) but
>> I've put together an outline of what you'd probably want to do
>> <https://github.com/okfn/oed/issues/3> - most of it inlined below (advice
>> here warmly welcomed - I'm something of a novice with the OCR and ABBYY
>> XML stuff!)
>>
>>  *More detail ...*
>>
>>  Thanks to Tom Morris and others we now know where we can get the ABBYY
>> OCR versions the Internet Archive automatically makes - see the thread in
>> the Ideas issue here:
>>
>>  https://github.com/okfn/ideas/issues/50
>>
>>  I've just knocked together a very rough plan here:
>> https://github.com/okfn/oed/issues/1
>>
>>  <quote>
>>  ## Do a Trial Run (first 20 pages of vol 1)
>>
>>  * [x] Locate scans #2 <https://github.com/okfn/oed/issues/2>
>> * [x] Get hold of OCR of volume 1 - #3 <https://github.com/okfn/oed/issues/3>
>>   - In Progress - we can grab Internet Archive versions e.g.
>>   https://ia600401.us.archive.org/7/items/oed01arch/oed01arch_abbyy.gz
>> * [ ] Check the OCR and proof-edit - #4 <https://github.com/okfn/oed/issues/4>
>> * [ ] Based on this experience work out how we scale
>>  </quote>
>>
>>  Right now the place we need to do work is getting usable OCR output. As documented
>> in the issue <https://github.com/okfn/oed/issues/3> my guess of the
>> steps is:
>>
>>  * Grab the XML from Internet Archive
>> * Gunzip
>> * Convert the massive XML to something more manageable, suggest both
>>   * Chop into per page
>>   * Go from full ABBYY XML to something better
>> * Upload the results somewhere shareable - e.g. assets.okfnlabs.org/p/oed
>> * Move on to #4 (review and proof)
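
A minimal sketch of the gunzip and chop-into-pages steps, assuming the archive has already been downloaded and that each page sits in an element whose local name is `page`, with text held in `line` and `charParams` elements. I believe that is how ABBYY FineReader XML is laid out, but please check against the real file. The function names, output file naming and `max_pages` parameter are my own inventions.

```python
import gzip
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def split_pages(gz_path, out_dir, max_pages=None):
    """Stream a gzipped XML file and write one small XML file per page.

    Returns the number of pages written. Uses iterparse so the whole
    document never has to be held in memory at once.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = 0
    with gzip.open(gz_path, "rb") as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            written += 1
            out_path = os.path.join(out_dir, f"page-{written:04d}.xml")
            ET.ElementTree(elem).write(
                out_path, encoding="utf-8", xml_declaration=True
            )
            elem.clear()  # drop the page's children to free memory
            if max_pages is not None and written >= max_pages:
                break
    return written


def page_lines(page_elem):
    """Collect plain-text lines from one page element.

    Assumes each character is the text of a 'charParams' element nested
    somewhere inside a 'line' element.
    """
    lines = []
    for line in page_elem.iter():
        if local_name(line.tag) != "line":
            continue
        chars = [
            cp.text or ""
            for cp in line.iter()
            if local_name(cp.tag) == "charParams"
        ]
        lines.append("".join(chars))
    return lines
```

For the trial run, something like `split_pages("oed01arch_abbyy.gz", "pages", max_pages=20)` would give the first 20 pages as separate files, and `page_lines(ET.parse("pages/page-0001.xml").getroot())` would give plain text to proofread. Matching on local names sidesteps having to know the exact namespace URI.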
>>
>>
>>>  If there were any simple tasks that people could do, Adam said he
>>> could help publicise this to readers of the Public Domain Review.
>>>
>>
>>  That would be great. As Tom Morris pointed out there's a neat Firefox
>> extension that allows you to do OCR correction in the browser. There are
>> plenty of other neat approaches we could try from etherpads to browser
>> extensions and git repos ...
>>
>>  Rufus
>>
>>
>>>
>>>  [1] https://github.com/okfn/oed
>>>
>
>
>
> --
>
> Jonathan Gray
>
> Director of Policy and Ideas  | *@jwyg <https://twitter.com/jwyg>*
>
> The Open Knowledge Foundation <http://okfn.org/>
>
> Empowering through Open Knowledge
>
> okfn.org  |  @okfn <http://twitter.com/OKFN>  |  OKF on Facebook <https://www.facebook.com/OKFNetwork>  |
> Blog <http://blog.okfn.org/>  |  Newsletter <http://okfn.org/about/newsletter>
>
> _______________________________________________
> open-humanities mailing list
> open-humanities at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-humanities
> Unsubscribe: http://lists.okfn.org/mailman/options/open-humanities
>
>