[humanities-dev] Shakespeare Annotations
Nick Stenning
nick at whiteink.com
Wed Apr 18 15:36:39 UTC 2012
I think this may end up proving very useful. I'm embarrassed to say I
didn't even know there were diff utilities in the standard library...
Cheers,
N
On 14 Apr 2012, at 09:28, Pedro Markun <pedro at esfera.mobi> wrote:
Nick,
kinda hackish, but I've been using difflib:
http://docs.python.org/library/difflib.html
To accomplish something similar in a project of ours...
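
For what it's worth, here is a minimal sketch of how difflib could give the offset mapping Nick describes further down. The helper names (build_offset_map, map_offset) and the two sample strings are made up for illustration; only SequenceMatcher and get_matching_blocks come from the standard library. Character-level matching on a whole play may well be slow, so matching per scene or per line first is probably wiser.

```python
from bisect import bisect_right
from difflib import SequenceMatcher


def build_offset_map(source, target):
    """Return the non-empty matching blocks (a, b, size) between two texts."""
    # autojunk=False stops difflib from ignoring very frequent characters
    # in long inputs, which would otherwise break character-level matching.
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return [m for m in matcher.get_matching_blocks() if m.size > 0]


def map_offset(blocks, offset):
    """Map an offset in source to target; None if it falls in a changed span."""
    starts = [block.a for block in blocks]
    i = bisect_right(starts, offset) - 1
    if i < 0:
        return None
    block = blocks[i]
    if offset < block.a + block.size:
        return block.b + (offset - block.a)
    return None


# Hypothetical stand-ins for the FinalsClub and Open Shakespeare texts.
fc_text = "Enter HECATE, meeting the three Witches."
os_text = "Enter HECATE to the other three Witches."

blocks = build_offset_map(fc_text, os_text)
mapped = map_offset(blocks, fc_text.index("HECATE"))
print(mapped, os_text[mapped:mapped + 6])
```

Offsets that land in text present in only one edition come back as None, so those annotations would still need another strategy or a manual pass.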
Also, if you fall back on a really blunt approach, something like matching
on the 'first two words, last two words' of each quote... isn't that enough?
Anything that doesn't get picked up that way could be inserted manually.
[]'s
Pedro Markun
On Thu, Apr 12, 2012 at 3:13 PM, Nick Stenning <nick.stenning at okfn.org> wrote:
> Ok, so I've now had a look at these properly. Here's where I'm at:
>
> As far as I can tell, the texts you've used are almost identical to the
> Moby texts. The key word here is "almost". Unfortunately, the fact they're
> not identical rules out a simple conversion from [FinalsClub HTML
> rendering] -> [character offsets] -> [Open Shakespeare HTML rendering].
> It's their similarity, however, that would make it all the more tragic if
> we simply created another edition of each play and added your annotations
> on those, rather than attempting to display them on the same texts that our
> people have annotated.
>
> So, what I have to do, somehow, is establish a mapping from [character
> offset in FC edition] to [character offset in OS/Moby edition]. I don't
> know of any tools that exist to do this, but I have some ideas of how it
> could be done which I will code up if I need to.
>
> As far as the annotations themselves go, there are a few issues that need
> resolving before they can be used. Here's a slightly cleaned up (and
> truncated) version of one of the annotations:
>
> {
>   "text": " '\"Hecate\" is also ... scene i). '",
>   "uri": "The Tragedy of Macbeth 7.html",
>   "ranges": [{
>     "start": "/span[19]",
>     "end": "/span[20]",
>     "startOffset": 49,
>     "endOffset": 55
>   }],
>   "quote": " 'HECATE ",
>   "finalsclub_id": 5029
> }
>
> As you can see, it apparently starts in '/span[19]' and ends in
> '/span[20]', which can't possibly be true given it contains just the text
> "HECATE", and each span represents an entire scene. This appears to be the
> case for all the annotations: they always end at least in the next scene!
> Now, I could just subtract 1 from the index of each "end" xpath, but it
> would be good if David could have a look in his code and see if this is the
> right thing to do.
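
A minimal sketch of the "subtract 1" workaround, in case it does turn out to be the right thing to do. The function names are hypothetical, and it assumes every "end" XPath finishes with a numeric index such as [20].

```python
import re

# Matches a trailing positional index such as "[20]".
_TRAILING_INDEX = re.compile(r"\[(\d+)\]$")


def decrement_end_xpath(xpath):
    """Return xpath with its final [n] index reduced by one."""
    match = _TRAILING_INDEX.search(xpath)
    if match is None:
        return xpath  # no trailing index, leave untouched
    index = int(match.group(1))
    if index <= 1:
        raise ValueError("cannot decrement index in %r" % xpath)
    return "%s[%d]" % (xpath[:match.start()], index - 1)


def fix_range(rng):
    """Return a copy of one range dict with a corrected "end" XPath."""
    fixed = dict(rng)
    fixed["end"] = decrement_end_xpath(rng["end"])
    return fixed


original = {"start": "/span[19]", "end": "/span[20]",
            "startOffset": 49, "endOffset": 55}
fixed = fix_range(original)
print(fixed["end"])
```

This only patches the symptom. If the generating code is off by one for some other reason, the offsets may need adjusting as well.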
>
> In addition, there's a lot of odd quoting going on in the "quote" and
> "text" fields -- but that's relatively easy for me to fix up. Ditto
> "ranges" being an array, not a single object.
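
One way the quoting cleanup might look, going only by the sample annotation above. It assumes the stray wrapper is always a single quote plus surrounding whitespace; clean_field is a made-up name.

```python
def clean_field(value):
    """Strip surrounding whitespace and one stray single quote at each end."""
    cleaned = value.strip()
    if cleaned.startswith("'"):
        cleaned = cleaned[1:]
    if cleaned.endswith("'"):
        cleaned = cleaned[:-1]
    return cleaned.strip()


# Field values copied from the sample annotation.
quote = clean_field(" 'HECATE ")
text = clean_field(" '\"Hecate\" is also ... scene i). '")
print(repr(quote), repr(text))
```

A field that legitimately ends in an apostrophe would lose it here, so the rule would need checking against the full set of annotations.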
>
> Anyway, the action points from this email are:
>
> 1) Anyone reading this who knows of tools to fuzzily align similar texts:
> please let us know.
> 2) David: could you check the code that generates these XPaths?
>
> Best,
> Nick
>
> _______________________________________________
> humanities-dev mailing list
> humanities-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/humanities-dev
>
>