[humanities-dev] Shakespeare Annotations

Sat Apr 14 08:27:32 UTC 2012

Nick,

It's kinda hackish, but I've been using difflib to accomplish something
similar in a project of ours:

http://docs.python.org/library/difflib.html
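
A rough sketch of how difflib could give the offset mapping you describe (FC edition offset -> OS/Moby edition offset). This is not the code from our project; the function names `build_offset_map` and `map_offset` and the sample strings are made up for illustration, and I'm assuming both editions are available as plain strings. Character-level matching over a whole play may be slow, so running it scene by scene is probably wiser.

```python
import difflib


def build_offset_map(fc_text, os_text):
    """Return the non-empty matching blocks between the two editions."""
    # autojunk=False stops difflib's heuristic from ignoring very common
    # characters, which matters when comparing long texts char by char.
    matcher = difflib.SequenceMatcher(None, fc_text, os_text, autojunk=False)
    # get_matching_blocks() always ends with a zero-length sentinel; drop it.
    return [block for block in matcher.get_matching_blocks() if block.size > 0]


def map_offset(blocks, fc_offset):
    """Translate an offset in the FC text to the OS text.

    Returns None when the offset falls in a stretch that has no
    counterpart in the other edition (those need manual attention).
    """
    for a, b, size in blocks:
        if a <= fc_offset < a + size:
            return b + (fc_offset - a)
    return None


# Hypothetical sample strings, not taken from either edition.
fc_sample = "ACT III. Enter HECATE"
os_sample = "Enter HECATE"
sample_blocks = build_offset_map(fc_sample, os_sample)
hecate_in_os = map_offset(sample_blocks, fc_sample.index("HECATE"))
```

Offsets that land inside text present in only one edition come back as None, so they can be collected and fixed by hand.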

Also, if you fall back on a really blunt approach, something like matching
the 'first two words, last two words' of each quote... isn't that enough?
Anything that doesn't get picked up that way could then be inserted manually.
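
To make the blunt approach concrete, here is a minimal sketch. The name `locate_by_edge_words` and the sample text are hypothetical, and I'm assuming each annotation's "quote" field has already been cleaned of the odd quoting. It looks for the first two words of the quote in the target edition, then for the last two words from that point on, tolerating differences in whitespace.

```python
import re


def locate_by_edge_words(quote, target, n=2):
    """Return (start, end) offsets of the quote in target, or None."""
    words = quote.split()
    if not words:
        return None
    # Build patterns from the first n and last n words, allowing any
    # run of whitespace between them.
    head = r"\s+".join(re.escape(w) for w in words[:n])
    tail = r"\s+".join(re.escape(w) for w in words[-n:])
    start = re.search(head, target)
    if start is None:
        return None
    # Search for the tail from where the head starts, so quotes shorter
    # than 2 * n words (where head and tail overlap) still work.
    end = re.compile(tail).search(target, start.start())
    if end is None:
        return None
    return start.start(), end.end()


# Hypothetical sample; note the differing whitespace in the quote.
sample_target = "Why, how now, Hecate! you look angerly.\nHave I not reason"
sample_quote = "how now,  Hecate!\nyou look"
span = locate_by_edge_words(sample_quote, sample_target)
```

This takes the first hit only, so a quote whose opening words recur earlier in the text will land in the wrong place; those, plus anything returning None, would be the ones to insert manually.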

[]'s
Pedro Markun

On Thu, Apr 12, 2012 at 3:13 PM, Nick Stenning <nick.stenning at okfn.org> wrote:

> Ok, so I've now had a look at these properly. Here's where I'm at:
>
> As far as I can tell, the texts you've used are almost identical to the
> Moby texts. The key word here is "almost". Unfortunately, the fact they're
> not identical rules out a simple conversion from [FinalsClub HTML
> rendering] -> [character offsets] -> [Open Shakespeare HTML rendering].
> It's their similarity, however, that would make it all the more tragic if
> we simply created another edition of each play and added your annotations
> on those, rather than attempting to display them on the same texts that our
> people have annotated.
>
> So, what I have to do, somehow, is establish a mapping from [character
> offset in FC edition] to [character offset in OS/Moby edition]. I don't
> know of any tools that exist to do this, but I have some ideas of how it
> could be done which I will code up if I need to.
>
> As far as the annotations themselves go, there are a few issues that need
> resolving before they can be used. Here's a slightly cleaned up (and
> truncated) version of one of the annotations:
>
>     {
>       "text": " '\"Hecate\" is also ... scene i). '",
>       "uri": "The Tragedy of Macbeth 7.html",
>       "ranges": [{
>         "start": "/span[19]",
>         "end": "/span[20]",
>         "startOffset": 49,
>         "endOffset": 55
>       }],
>       "quote": " 'HECATE ",
>       "finalsclub_id": 5029
>     }
>
> As you can see, it apparently starts in '/span[19]' and ends in
> '/span[20]', which can't possibly be true given it contains just the text
> "HECATE", and each span represents an entire scene. This appears to be the
> case for all the annotations: they always end at least in the next scene!
> Now, I could just subtract 1 from the index of each "end" xpath, but it
> would be good if David could have a look in his code and see if this is the
> right thing to do.
>
> In addition, there's a lot of odd quoting going on in the "quote" and
> "text" fields -- but that's relatively easy for me to fix up. Ditto
> "ranges" being an array, not a single object.
>
> Anyway, the action points from this email are:
>
> 1) Anyone reading this who knows of tools to fuzzily align similar texts:
> please let us know.
> 2) David: could you check the code that generates these XPaths?
>
> Best,
> Nick
>
> _______________________________________________
> humanities-dev mailing list
> humanities-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/humanities-dev
>
>