[humanities-dev] Shakespeare Annotations

Nick Stenning nick at whiteink.com
Wed Apr 18 15:36:39 UTC 2012

I think this may end up proving very useful. I'm embarrassed to say I
didn't even know there were diff utilities in the standard library...


On 14 Apr 2012, at 09:28, Pedro Markun <pedro at esfera.mobi> wrote:


Kinda hackish, but I've been using difflib to accomplish something
similar in a project of ours...
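
A rough sketch of how difflib could give an offset mapping between two
near-identical texts. This is only an illustration, not the code from our
project: the helper names (build_blocks, map_offset) and the two sample
strings are made up, and running SequenceMatcher over a whole play at once
may be slow, so working scene by scene or line by line is probably wiser.

```python
import difflib
from typing import Optional


def build_blocks(source: str, target: str):
    """Return the matching blocks between two near-identical texts."""
    # autojunk=False stops difflib from ignoring very frequent characters,
    # which matters when comparing long texts character by character.
    matcher = difflib.SequenceMatcher(None, source, target, autojunk=False)
    return matcher.get_matching_blocks()


def map_offset(blocks, offset: int) -> Optional[int]:
    """Translate a character offset in source into an offset in target.

    Returns None when the offset falls in a stretch that has no match,
    e.g. a character present in only one of the two editions.
    """
    for block in blocks:
        if block.a <= offset < block.a + block.size:
            return block.b + (offset - block.a)
    return None


# Hypothetical sample strings standing in for the two editions.
fc_text = "Enter HECATE, to the other three Witches."
os_text = "Enter HECATE to the other three Witches"

blocks = build_blocks(fc_text, os_text)
mapped = map_offset(blocks, fc_text.index("Witches"))
```

Offsets that come back as None are the ones that would still need a human
to look at them.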

Also, what about a really blunt approach, something like matching on the
'first two words, last two words' of each quote... isn't that enough?
Anything that doesn't get picked up by that could then be inserted manually.
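
Something along these lines is what I have in mind for the blunt approach.
It is just a sketch under my own assumptions: the function names, the search
window of twice the quote length, and the rule that a quote must match
exactly once are all arbitrary choices, and the sample text is only there
to show the call.

```python
import re
from typing import Optional, Tuple


def _words_pattern(words):
    """Compile a pattern matching the words separated by any whitespace."""
    return re.compile(r"\s+".join(re.escape(w) for w in words))


def locate_by_anchors(quote: str, text: str) -> Optional[Tuple[int, int]]:
    """Find (start, end) of quote in text using its first and last two words.

    Returns None when nothing matches or when the match is ambiguous, so
    those annotations can be placed by hand.
    """
    words = quote.split()
    if not words:
        return None
    if len(words) <= 4:
        # Too short for separate anchors: look for the whole quote instead.
        match = _words_pattern(words).search(text)
        return (match.start(), match.end()) if match else None

    head = _words_pattern(words[:2])
    tail = _words_pattern(words[-2:])
    window = 2 * len(quote)  # arbitrary limit on how far the tail may be
    hits = []
    for h in head.finditer(text):
        t = tail.search(text, h.end(), h.end() + window)
        if t:
            hits.append((h.start(), t.end()))
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample, only to show the call.
text = "Double, double toil and trouble;\nFire burn, and cauldron bubble."
quote = "double toil and trouble; Fire burn"
span = locate_by_anchors(quote, text)
```

The match is case-sensitive, so quotes stored in a different case from the
target text would fall through to the manual pile.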

Pedro Markun

On Thu, Apr 12, 2012 at 3:13 PM, Nick Stenning <nick.stenning at okfn.org> wrote:

> Ok, so I've now had a look at these properly. Here's where I'm at:
> As far as I can tell, the texts you've used are almost identical to the
> Moby texts. The key word here is "almost". Unfortunately, the fact they're
> not identical rules out a simple conversion from [FinalsClub HTML
> rendering] -> [character offsets] -> [Open Shakespeare HTML rendering].
> It's their similarity, however, that would make it all the more tragic if
> we simply created another edition of each play and added your annotations
> on those, rather than attempting to display them on the same texts that our
> people have annotated.
> So, what I have to do, somehow, is establish a mapping from [character
> offset in FC edition] to [character offset in OS/Moby edition]. I don't
> know of any tools that exist to do this, but I have some ideas of how it
> could be done which I will code up if I need to.
> As far as the annotations themselves go, there are a few issues that need
> resolving before they can be used. Here's a slightly cleaned up (and
> truncated) version of one of the annotations:
>     {
>       "text": " '\"Hecate\" is also ... scene i). '",
>       "uri": "The Tragedy of Macbeth 7.html",
>       "ranges": [{
>         "start": "/span[19]",
>         "end": "/span[20]",
>         "startOffset": 49,
>         "endOffset": 55
>       }],
>       "quote": " 'HECATE ",
>       "finalsclub_id": 5029
>     }
> As you can see, it apparently starts in '/span[19]' and ends in
> '/span[20]', which can't possibly be true given it contains just the text
> "HECATE", and each span represents an entire scene. This appears to be the
> case for all the annotations: they always end at least in the next scene!
> Now, I could just subtract 1 from the index of each "end" xpath, but it
> would be good if David could have a look in his code and see if this is the
> right thing to do.
> In addition, there's a lot of odd quoting going on in the "quote" and
> "text" fields -- but that's relatively easy for me to fix up. Ditto
> "ranges" being an array, not a single object.
> Anyway, the action points from this email are:
> 1) Anyone reading this who knows of tools to fuzzily align similar texts:
> please let us know.
> 2) David: could you check the code that generates these XPaths?
> Best,
> Nick
> _______________________________________________
> humanities-dev mailing list
> humanities-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/humanities-dev
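
On the XPath point in the quoted message: if subtracting 1 from the index
of each "end" xpath does turn out to be the right thing to do (that still
needs confirming against the code that generates them), the fix could look
roughly like the sketch below. The function names are made up for
illustration, and the sample record is the truncated one from the quoted
message with the "text" field left out.

```python
import copy
import re


def shift_end_xpath(xpath: str) -> str:
    """Subtract 1 from the trailing [n] index of an XPath like '/span[20]'."""
    def repl(match):
        index = int(match.group(1))
        # XPath indices are 1-based, so never go below 1.
        return "[%d]" % max(index - 1, 1)

    return re.sub(r"\[(\d+)\]$", repl, xpath)


def fix_annotation(raw: dict) -> dict:
    """Return a copy of the annotation with every range's end index shifted."""
    fixed = copy.deepcopy(raw)
    for rng in fixed["ranges"]:
        rng["end"] = shift_end_xpath(rng["end"])
    return fixed


annotation = {
    "uri": "The Tragedy of Macbeth 7.html",
    "ranges": [{
        "start": "/span[19]",
        "end": "/span[20]",
        "startOffset": 49,
        "endOffset": 55,
    }],
    "quote": " 'HECATE ",
    "finalsclub_id": 5029,
}
fixed = fix_annotation(annotation)
```

The original record is left untouched, so the shifted and unshifted
versions can be compared before anything is committed.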