[humanities-dev] Shakespeare Annotations

Thu Apr 12 18:43:18 UTC 2012

Ah Nick,

I think you have come across one of our "gotchas". I believe our system computes offsets by word count, not character count.

I'll talk to David about tweaking our data.  Just tell us what we can do to best help you. 

Warmly,
Andrew

On Apr 12, 2012, at 2:13 PM, Nick Stenning <nick.stenning at okfn.org> wrote:

> Ok, so I've now had a look at these properly. Here's where I'm at:
> 
> As far as I can tell, the texts you've used are almost identical to the Moby texts. The key word here is "almost". Unfortunately, the fact that they're not identical rules out a simple conversion from [FinalsClub HTML rendering] -> [character offsets] -> [Open Shakespeare HTML rendering]. It's their similarity, however, that would make it all the more tragic if we simply created another edition of each play and added your annotations to those, rather than attempting to display them on the same texts that our people have annotated.
> 
> So, what I have to do, somehow, is establish a mapping from [character offset in FC edition] to [character offset in OS/Moby edition]. I don't know of any tools that exist to do this, but I have some ideas of how it could be done which I will code up if I need to.
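
One way the offset mapping described above might be sketched in Python is with the standard library's difflib. This is only a rough illustration, not the approach anyone on this thread has committed to: the names build_offset_map and map_offset are hypothetical, and character-level matching over a whole play may be slow, so aligning scene by scene or line by line first could be worth trying.

```python
from bisect import bisect_right
from difflib import SequenceMatcher


def build_offset_map(source: str, target: str):
    """Return the matching blocks between two texts as
    (source_start, target_start, length) tuples."""
    # autojunk=False stops difflib from discarding frequent characters,
    # which matters when comparing long texts character by character.
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    # difflib appends a zero-length sentinel block; drop it.
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size]


def map_offset(blocks, offset: int):
    """Map a character offset in the source text to the target text.

    Returns None when the offset falls in a region that differs between
    the two editions, so the caller can decide how to handle it.
    """
    starts = [src_start for src_start, _, _ in blocks]
    i = bisect_right(starts, offset) - 1
    if i < 0:
        return None
    src_start, tgt_start, length = blocks[i]
    if offset < src_start + length:
        return tgt_start + (offset - src_start)
    return None


# Toy example: the "target" edition has two extra leading spaces.
fc_text = "Enter HECATE."
os_text = "  Enter HECATE."
blocks = build_offset_map(fc_text, os_text)
start = map_offset(blocks, fc_text.index("HECATE"))
print(os_text[start:start + 6])  # HECATE
```

Offsets that land inside a changed region come back as None here; how to place those annotations (nearest match, manual review, and so on) is left open.
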
> 
> As far as the annotations themselves go, there are a few issues that need resolving before they can be used. Here's a slightly cleaned-up (and truncated) version of one of the annotations:
> 
>     {
>       "text": " '\"Hecate\" is also ... scene i). '", 
>       "uri": "The Tragedy of Macbeth 7.html", 
>       "ranges": [{
>         "start": "/span[19]", 
>         "end": "/span[20]", 
>         "startOffset": 49, 
>         "endOffset": 55
>       }], 
>       "quote": " 'HECATE ", 
>       "finalsclub_id": 5029
>     }
> 
> As you can see, it apparently starts in '/span[19]' and ends in '/span[20]', which can't possibly be true given it contains just the text "HECATE", and each span represents an entire scene. This appears to be the case for all the annotations: they always end at least in the next scene! Now, I could just subtract 1 from the index of each "end" xpath, but it would be good if David could have a look in his code and see if this is the right thing to do.
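
The "subtract 1 from the end index" workaround mentioned above could look roughly like the following Python sketch. Whether this is actually the right correction is exactly the open question for David, so treat it as an assumption; the helper names decrement_span_index and fix_range are hypothetical.

```python
import re

# Matches an XPath whose final step is span[N], e.g. "/span[20]".
_SPAN_INDEX = re.compile(r"^(?P<prefix>.*/span\[)(?P<index>\d+)\]$")


def decrement_span_index(xpath: str) -> str:
    """Return the XPath with the index of its trailing span[N] reduced by one."""
    match = _SPAN_INDEX.match(xpath)
    if match is None:
        raise ValueError(f"unexpected XPath shape: {xpath!r}")
    index = int(match.group("index"))
    if index <= 1:
        # XPath positions are 1-based, so there is nothing to step back to.
        raise ValueError(f"cannot decrement span index in {xpath!r}")
    return f"{match.group('prefix')}{index - 1}]"


def fix_range(rng: dict) -> dict:
    """Return a copy of one entry of "ranges" with its "end" XPath corrected."""
    fixed = dict(rng)
    fixed["end"] = decrement_span_index(rng["end"])
    return fixed


example = {"start": "/span[19]", "end": "/span[20]",
           "startOffset": 49, "endOffset": 55}
print(fix_range(example)["end"])  # /span[19]
```
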
> 
> In addition, there's a lot of odd quoting going on in the "quote" and "text" fields -- but that's relatively easy for me to fix up. Ditto "ranges" being an array, not a single object.
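
A minimal Python sketch of that kind of clean-up follows. The exact quoting rules in the raw export are not spelled out in this thread, so the code simply assumes that stray whitespace and single quotes at either end of "quote" and "text" are noise, and that "ranges" should end up as a list as in the sample above. The names clean_field and normalise_annotation are hypothetical.

```python
def clean_field(value: str) -> str:
    """Strip surrounding whitespace and stray single quotes from a field.

    Assumption: a quote mark at either end is export noise. This would
    also eat a genuine leading apostrophe (e.g. "'Tis"), so real data
    may need a more careful rule.
    """
    return value.strip().strip("'").strip()


def normalise_annotation(raw: dict) -> dict:
    """Return a cleaned copy of an annotation dict."""
    cleaned = dict(raw)
    for key in ("quote", "text"):
        if isinstance(cleaned.get(key), str):
            cleaned[key] = clean_field(cleaned[key])
    # Accept either a single range object or a list of them.
    ranges = cleaned.get("ranges", [])
    if isinstance(ranges, dict):
        ranges = [ranges]
    cleaned["ranges"] = list(ranges)
    return cleaned


raw = {
    "quote": " 'HECATE ",
    "ranges": {"start": "/span[19]", "end": "/span[20]",
               "startOffset": 49, "endOffset": 55},
    "finalsclub_id": 5029,
}
print(normalise_annotation(raw)["quote"])  # HECATE
```
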
> 
> Anyway, the action points from this email are:
> 
> 1) Anyone reading this who knows of tools to fuzzily align similar texts: please let us know.
> 2) David: could you check the code that generates these XPaths?
> 
> Best,
> Nick
> _______________________________________________
> humanities-dev mailing list
> humanities-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/humanities-dev