[annotator-dev] working on fm

Adam Hyde adam.hyde at sourcefabric.org
Fri Mar 16 02:02:02 UTC 2012


Thanks Rufus, that is very interesting. How would it affect things if the
read and write environments were the same? Is it possible to change the xpath
identifier dynamically as the text gets altered?


Adam Hyde
Booktype Project Lead

On 15 Mar 2012 20:56, "Rufus Pollock" <rufus.pollock at okfn.org> wrote:

On 15 March 2012 18:06, Adam Hyde <adam.hyde at sourcefabric.org> wrote:
> hi
> I installed it on FLO...
We've thought a lot about this (and there's ongoing thought about it
as part of the related Textus project [1]). Summary from my point of view:

[1]: http://textusproject.org/

* Ultimately, to handle text changes you have to do some kind of
migration of annotation references.
* Annotation addresses are based on a pointer to some fixed identifier
in the text plus a character offset from there. E.g. the identifier could
be an element id, paragraph id, xpath etc. (note an xpath identifier is
really just a special kind of offset ...).
* The more atomic your addresses (i.e. the smaller the area they cover),
the fewer your migration problems (but the worse your interference
with the text) [2]

* In essence, your migration will run an algorithm such as the following:

  * Compare the two texts.
  * For all annotations on atomic sections whose identifier and
    content are unchanged, we need do nothing.
  * For all sections whose identifier has changed but whose content
    is unchanged, update the relevant annotation identifiers (note it
    could be difficult to work out the changes in identifiers to make
    this possible -- e.g. suppose you have cut and pasted one paragraph
    in a document. This will change all xpaths following the cut section
    and before its reinsertion).
  * For all sections with changed content, update the offsets.
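
To make the three cases above concrete, here is a rough sketch in Python. It is only an illustration, not how the annotator does it: the `Annotation` class, the `migrate` function and the idea of storing the quoted text alongside the offsets are all assumptions made for the example. Sections are passed in as plain dicts mapping identifier to text, and changed content is handled with a naive string search rather than a real diff.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation address: fixed identifier + character offsets."""
    section_id: str  # element id, paragraph id, xpath, ...
    start: int       # offset from the start of the section
    end: int
    quote: str       # annotated text, stored so it can be re-located later


def migrate(annotations, old_sections, new_sections):
    """Migrate annotations from old_sections to new_sections.

    Both arguments map section identifier -> section text, and every
    annotation is assumed to point at a key of old_sections.
    Returns (migrated, orphaned).
    """
    # Index the new sections by content so that moved-but-unchanged
    # sections can be found under their new identifier.
    ids_by_content = {}
    for sid, text in new_sections.items():
        ids_by_content.setdefault(text, []).append(sid)

    migrated, orphaned = [], []
    for ann in annotations:
        old_text = old_sections[ann.section_id]

        # Case 1: identifier and content unchanged -> nothing to do.
        if new_sections.get(ann.section_id) == old_text:
            migrated.append(ann)
            continue

        # Case 2: content unchanged but identifier changed -> re-point.
        # Only done when the match is unambiguous.
        candidates = ids_by_content.get(old_text, [])
        if len(candidates) == 1:
            migrated.append(replace(ann, section_id=candidates[0]))
            continue

        # Case 3: content changed -> recompute the offsets. Naive approach:
        # search for the first occurrence of the quoted text.
        new_text = new_sections.get(ann.section_id, "")
        pos = new_text.find(ann.quote)
        if pos >= 0:
            migrated.append(replace(ann, start=pos, end=pos + len(ann.quote)))
        else:
            orphaned.append(ann)  # needs manual or fuzzy re-attachment

    return migrated, orphaned
```

Case 2 is where path-based identifiers hurt: the sketch can only re-point an annotation when exactly one new section has the same text, and a section that was both moved and edited falls through to the orphaned list.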

[2]: http://blog.okfn.org/2007/01/24/thinking-about-annotation/

In general this shows that identifiers which are tied to paths in the
document are especially bad. However, they are one of the *only*
options if you can't interfere with the original document (e.g. by
inserting your own ids!) -- the other option I know of here is to do
hashing of small string sections of the document to generate your
identifiers. This does not require interfering with your document but
still generates addresses into the document. However, it is
computationally costly and more fragile to character changes.
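
A minimal sketch of the hashing idea, again in Python and again only illustrative: the function name `sentence_ids`, the crude regex sentence splitter, the whitespace/case normalisation and the choice of a truncated SHA-1 digest are all assumptions of this example, not something the annotator provides.

```python
import hashlib
import re


def sentence_ids(text, digest_chars=12):
    """Return a list of (identifier, start, end) for sentence-like chunks.

    The identifier is a truncated SHA-1 of the normalised chunk, so it
    depends only on the chunk's content, not on its position.
    """
    result = []
    # Crude splitter: a run of non-terminators plus any closing punctuation.
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        chunk = raw.strip()
        if not chunk:
            continue
        # Normalise whitespace and case so trivial edits keep the same id.
        normalised = " ".join(chunk.split()).lower()
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        # Skip leading whitespace so offsets point at the chunk itself.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        result.append((digest[:digest_chars], start, start + len(chunk)))
    return result
```

Moving a sentence leaves its identifier untouched, which is the attraction. Changing a single character in it produces a completely different identifier, which is the fragility mentioned above, and every chunk has to be re-hashed whenever the text changes.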

Thus, one extreme option that would make updating significantly
easier, but which requires that you have complete control of your html
text, is to insert identifier marks (e.g. in html <span id="{id}"></span>),
say, every sentence and configure the annotator to utilize these ids
when generating URIs for annotations ...
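
As a rough illustration of the marker approach, the sketch below takes a plain-text paragraph and emits html with an empty marker span in front of each sentence. The function name `mark_sentences`, the `s1`, `s2`, ... id scheme and the regex sentence splitter are assumptions for the example; configuring the annotator to actually use these ids is not shown.

```python
import html
import re

# Either a sentence ending in terminal punctuation (plus trailing
# whitespace), or whatever unterminated text is left at the very end.
SENTENCE = re.compile(r"[^.!?]*[.!?]+\s*|[^.!?]+$")


def mark_sentences(text, prefix="s"):
    """Return (marked_html, sentences_by_id) for a plain-text paragraph."""
    parts = []
    sentences_by_id = {}
    for n, match in enumerate(SENTENCE.finditer(text), start=1):
        marker_id = f"{prefix}{n}"
        sentence = match.group()
        sentences_by_id[marker_id] = sentence
        # Empty marker span, followed by the escaped sentence text.
        parts.append(f'<span id="{marker_id}"></span>{html.escape(sentence)}')
    return "".join(parts), sentences_by_id
```

An annotation address then becomes a marker id plus a character offset from that marker. For this to help with migration, the ids have to be stored with the text and kept when it is edited (new sentences get fresh ids) rather than regenerated by renumbering, otherwise they shift just like paths do.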
