[annotator-dev] working on fm

Rufus Pollock rufus.pollock at okfn.org
Thu Mar 15 19:56:11 UTC 2012


On 15 March 2012 18:06, Adam Hyde <adam.hyde at sourcefabric.org> wrote:
> hi
>
> I installed it on FLOSS Manuals as a trial. Seems to work well:
> http://booki.flossmanuals.net/a-webpage-is-a-book/_draft
>
> Will trial it for a bit. I wrote a short blog post about it:
> http://blog.booki.cc/
>
> Also, what are the plans for annotations on changing text? I know it's
> a tricky issue - any strategies? We were considering trying to get
> TinyMCE to generate unique IDs per P tag...sort of a purple number
> idea...don't know if that would be a good strategy or not as I'm not too
> close to annotation thinking...any ideas?

We've thought a lot about this (and there's ongoing thought about it
as part of the related Textus project [1]). Summary from my point of view:

[1]: http://textusproject.org/

* Ultimately, to handle text changes you have to do some kind of
migration of annotation references.
* Annotation addresses are based on a pointer to some fixed identifier
in the text + a character offset from there. E.g. the identifier could
be an element id, paragraph id, xpath etc (note an xpath identifier is
really just a special kind of offset ...).
* The more atomic your addresses (i.e. the smaller the area each one
covers), the smaller your migration problem (but the worse your
interference with the text) [2]

* In essence your migration will run an algorithm such as the following:

  * Compare the two texts.
  * For all annotations on atomic sections whose identifier and
content are unchanged, we need do nothing.
  * For all sections whose identifier has changed but whose content
is unchanged, update the relevant annotation identifiers (note it could
be difficult to work out the changes in identifiers to make this
possible -- e.g. suppose you have cut and pasted one paragraph in a
document. This will change all xpaths following the cut section and
before its reinsertion).
  * For all sections with changed content, update the offsets.
[2]: http://blog.okfn.org/2007/01/24/thinking-about-annotation/
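
A rough sketch of the three migration cases above, in Python. Everything
here is illustrative: the Annotation class, the idea of a document as a
dict mapping section identifier to section text, and the helper names are
assumptions made for this example, not part of the annotator. Offsets are
moved using difflib from the standard library.

```python
from dataclasses import dataclass, replace
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    section_id: str  # fixed identifier in the text (element id, xpath, ...)
    start: int       # character offsets relative to the start of that section
    end: int


def shift_range(old: str, new: str, start: int, end: int) -> Optional[Tuple[int, int]]:
    """Map [start, end) from `old` to `new`; None if either end sits in edited text."""
    new_start = new_end = None
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            continue
        if i1 <= start < i2:
            new_start = j1 + (start - i1)
        if i1 < end <= i2:
            new_end = j1 + (end - i1)
    if new_start is None or new_end is None:
        return None
    return new_start, new_end


def migrate(annotations: List[Annotation],
            old_doc: Dict[str, str],
            new_doc: Dict[str, str]) -> Tuple[List[Annotation], List[Annotation]]:
    """Return (migrated, orphaned) annotations for the new version of the text."""
    # Section text -> new identifier, only where that text is unique in new_doc.
    by_content: Dict[str, Optional[str]] = {}
    for sid, text in new_doc.items():
        by_content[text] = None if text in by_content else sid

    migrated, orphaned = [], []
    for ann in annotations:
        old_text = old_doc[ann.section_id]
        new_text = new_doc.get(ann.section_id)
        if new_text == old_text:
            # Identifier and content unchanged: nothing to do.
            migrated.append(ann)
        elif by_content.get(old_text):
            # Content unchanged but identifier changed: update the identifier.
            migrated.append(replace(ann, section_id=by_content[old_text]))
        elif new_text is not None:
            # Same identifier, changed content: update the offsets.
            moved = shift_range(old_text, new_text, ann.start, ann.end)
            if moved is None:
                orphaned.append(ann)
            else:
                migrated.append(replace(ann, start=moved[0], end=moved[1]))
        else:
            orphaned.append(ann)
    return migrated, orphaned
```

Note that the second case only works here because the section text is
matched exactly and is unique; with path-style identifiers and edited
content at the same time, working out which section went where is the
hard part, as described above.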

In general this shows that identifiers which are tied to paths in the
document are especially bad. However, they are one of the *only*
options if you can't interfere with the original document (e.g. by
inserting your own ids!) -- the other option I know of here is to
hash small string sections of the document to generate your
identifiers. This does not require interfering with your document but
still generates addresses into the document. However, it is
computationally costly and more fragile to character changes.
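
To make the hashing option concrete, here is a minimal sketch in Python.
It is only one possible reading of the idea: the 32-character window, the
truncated SHA-1 digest and the function names are all assumptions chosen
for illustration. The address is the hash of the short run of text
starting at the annotation, plus the annotation length.

```python
import hashlib
from typing import Optional, Tuple

WINDOW = 32  # characters hashed per section; an arbitrary choice for this sketch


def fingerprint(chunk: str) -> str:
    """Short, stable digest of a small string section."""
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()[:12]


def make_address(text: str, start: int, end: int) -> Tuple[str, int]:
    """Address a range by the hash of the window beginning at `start`."""
    return fingerprint(text[start:start + WINDOW]), end - start


def resolve(text: str, address: Tuple[str, int]) -> Optional[Tuple[int, int]]:
    """Find the addressed range again by hashing every window in `text`."""
    digest, length = address
    for i in range(len(text)):
        if fingerprint(text[i:i + WINDOW]) == digest:
            return i, i + length
    return None  # the hashed section was edited, so the address no longer resolves
```

Both drawbacks show up directly: resolve() hashes a window at every
character position, and a single changed character inside the hashed
window means the address is lost, even though text inserted or removed
elsewhere in the document does no harm.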

Thus, one extreme option, which would make updating significantly
easier but which requires that you have complete control of your html
text, is to insert identifier marks (e.g. in html <span id="{id}"></span>),
say, every sentence and configure the annotator to use these ids
when generating URIs for annotations ...
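
A small sketch of what inserting such marks could look like, in Python.
It assumes the paragraph is plain text with no markup inside it, uses a
deliberately naive sentence splitter, and the id scheme ("s-1", "s-2",
...) and the URI layout are made up for this example; how the annotator
would actually be configured to use the ids is not shown.

```python
import html
import re

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def mark_sentences(paragraph: str, prefix: str = "s") -> str:
    """Put an empty <span id="..."></span> mark in front of every sentence."""
    sentences = [s for s in SENTENCE_END.split(paragraph.strip()) if s]
    return " ".join(
        f'<span id="{prefix}-{n}"></span>{html.escape(sentence)}'
        for n, sentence in enumerate(sentences, start=1)
    )


def annotation_uri(page_url: str, mark_id: str, start: int, end: int) -> str:
    """Hypothetical URI layout: page, then mark id, then offsets from the mark."""
    return f"{page_url}#{mark_id}:{start}-{end}"
```

In practice the ids would need to be assigned once and then kept with the
text as it is edited; renumbering sentences on every save would bring
back the same problem as path-based identifiers.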

Rufus



