[annotator-dev] Question about offset

Wed May 9 12:47:00 UTC 2012

Hi Sebastian,

Reply inline below.

On Tue, May 8, 2012 at 9:49 AM, Sebastian Hellmann
<hellmann at informatik.uni-leipzig.de> wrote:
> Hi again,
>
> [...]
>
> We encountered quite a big problem however:
> In the underlying HTML characters are, e.g. "Qualität" , but when you
> get the selection via Javascript (e.g.  window.getSelection() or innerHTML()
> ) the escaping is already resolved and it is normally impossible to get the
> HTML as is with Javascript.

Yes, the offset calculations performed by Annotator refer to the
"normalised" representation of the DOM, which is not the same thing as
the HTML source. Rob Sanderson of the OAC had a similar question a
couple of weeks ago and you can see my response to his email here:

-----

Hi Robert,

Sorry for the delay in getting back to you. Offsets are calculated in
Annotator as the sum of the lengths of the relevant text nodes in the
DOM. How this is actually calculated is presumably covered by the DOM
spec, but what we do is split a text node at the point where an
annotation starts or ends, and then count the length of all those text
nodes preceding it.

So, by way of example, with a DOM that starts looking like this:

    <p><textNode nodeValue='To be or not to be: that is the question'></p>

Say someone selected 'not to be'. Annotator would slice the DOM until
it looked like this:

    <p>
      <textNode nodeValue='To be or '>
      <textNode nodeValue='not to be'>
      <textNode nodeValue=': that is the question'>
    </p>

Then the offsets would be calculated as follows:

   nodes = para.childNodes
   startOffset = nodes[0].nodeValue.length
   endOffset = startOffset + nodes[1].nodeValue.length
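
As a rough check of the arithmetic above, here is a minimal Python sketch that
models the three text nodes as plain strings. The list and variable names are
illustrative only and are not part of Annotator:

```python
# Text nodes of the paragraph after it has been split around the selection.
# Plain strings stand in for DOM text nodes here (an assumption for the sketch).
nodes = ["To be or ", "not to be", ": that is the question"]

# Everything before the selection contributes to the start offset.
start_offset = len(nodes[0])
# The end offset is the start offset plus the length of the selected node.
end_offset = start_offset + len(nodes[1])

# Slicing the concatenated text with these offsets recovers the selection.
full_text = "".join(nodes)
selection = full_text[start_offset:end_offset]

print(start_offset, end_offset, repr(selection))  # 9 18 'not to be'
```

If the selection began in a later text node, the start offset would be the sum
of the lengths of all the text nodes preceding it.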

Lastly, I'll point out that in practice that means that:

  a) HTML entities are converted to their ASCII/Unicode equivalents
and are counted as a single character
  b) HTML tags are ignored -- only TextNodes are involved in the offset
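
Points a) and b) can be illustrated outside the browser with a small sketch
that uses Python's standard html.parser module. This is not how Annotator
itself works; the TextCollector class, the normalised_text helper, and the
sample markup are all made up for the example:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only text content, with entities already decoded."""

    def __init__(self):
        # convert_charrefs=True turns &auml;, &amp; etc. into single characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Tags never reach this method, so they add nothing to the offsets.
        self.chunks.append(data)


def normalised_text(html):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


source = "<p>Qualit&auml;t &amp; <em>more</em></p>"
text = normalised_text(source)

print(len(source), len(text))  # 40 15
print(text.index("more"))      # 11: entities count once, <em> counts for nothing
```

One caveat worth checking before relying on this: Python counts code points,
while string lengths in a browser count UTF-16 code units, so the two can
disagree for characters outside the Basic Multilingual Plane.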

Hope that answers your questions.

Best,
Nick

On Fri, Apr 20, 2012 at 16:21, Robert Sanderson <azaroth42 at gmail.com> wrote:
> Hi guys,
>
> Just a quick question about the Annotate-It text selection mechanism...
>
> How do you determine the number of characters to count in? For
> example, does an html entity count as 1 character or as the actual
> number of characters that make up the entity? Thus, does "&amp;" count
> for 1 or 5? And the same for HTML tags?
>
> And the same questions for quoting the exact phrase. Do you record
> tags and entities, or just the extracted plain text?
>
>
> Many thanks!
>
> Rob

-----

See also the discussion of offset character counting in the draft OAC spec:

  http://www.openannotation.org/spec/extension/#SelectorOffset

From a technical point of view, this will require that if you do the
character-counting server-side, you may well need to include an
implementation of large parts of the W3C DOM specification. This is a
bit of a pain, for sure, but such libraries are available for most
languages now:

Ruby:            http://nokogiri.org/

Python:          http://docs.python.org/library/htmlparser.html
                 http://lxml.de/

Javascript/Node: https://github.com/tmpvar/jsdom

PHP:             http://php.net/manual/en/book.dom.php

Best wishes,
Nick