[annotator-dev] working on fm

Sat Mar 17 04:04:50 UTC 2012

On Fri, Mar 16, 2012 at 2:09 AM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> On 16 March 2012 02:02, Adam Hyde <adam.hyde at sourcefabric.org> wrote:
>>
>> Hi
>>
>> Thanks Rufus that is very interesting. How would it affect things if the
>> read and write environment was the same? Is it possible to change the xpath
>> identifier dynamically as the text gets altered?
>
> That's exactly what you'd need to do, and what I was suggesting would be needed :-)
>
> The other point I was making was that if you control the text (not
> the case in many of annotator's use cases e.g. bookmarklet or
> OpenShakespeare) you can start inserting lots of special tags and
> possibly utilizing those as your addressing points (though you would
> probably need to change annotator a bit ...)
>
> I have a feeling there is a reasonable 80/20 solution out there.

Having just got back from an Open Annotation WG meeting (Open
Annotation Collaboration + Annotation Ontology), I'll chime in with
what that data model is starting to look like and how it impacts this
problem.

The data model has two notions of target.
1) Simple targets are just URIs. These are referred to as a "Target".
2) More complex selections of document segments are referred to as
"Specific Targets".

Specific targets have a source document URI and 0 or more selectors.
Taking annotator in this direction would mean nesting the ranges as
properties of the target, also in recognition of the idea that
annotations may refer to multiple targets simultaneously (specific or
non-specific).

In that working group's model, the range, as annotator calls it, is
actually a particular class of selector, and there may be many.
I find this idea extremely compelling as a way to address the problem
of changing text.
In our conversation these past two days, we decided to make no attempt
to recommend a priority or otherwise specify the relationship between
multiple selectors.
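
To make the shape of that model concrete, here is a minimal sketch in
Python. The class names, the "kind" labels and the example URIs are
all made up for illustration; they are not taken from the working
group's vocabulary or from annotator itself.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Selector:
    """One way of describing a segment of the source document."""
    kind: str    # hypothetical labels: "id", "range", "text-context"
    value: dict  # selector-specific details


@dataclass
class SpecificTarget:
    """A source document URI plus zero or more selectors."""
    source: str
    selectors: List[Selector] = field(default_factory=list)


@dataclass
class Annotation:
    text: str
    # A plain string stands in for a simple target (just a URI).
    targets: List[Union[str, SpecificTarget]] = field(default_factory=list)


note = Annotation(
    text="Nice turn of phrase",
    targets=[
        "http://example.org/book",  # simple target
        SpecificTarget(
            source="http://example.org/book/chapter-1",
            selectors=[
                Selector("id", {"id": "para-12"}),
                Selector("range", {"start": "/div[1]/p[2]", "startOffset": 4,
                                   "end": "/div[1]/p[2]", "endOffset": 27}),
                Selector("text-context", {"prefix": "It was ",
                                          "exact": "the best of times",
                                          "suffix": ", it was"}),
            ],
        ),
    ],
)
```

Note that the selectors are an unordered bag as far as the model is
concerned: the list order above implies no priority.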

In other words, we could store information that uniquely identifies
the paragraph, such as an id attribute, when applicable. We could also
store the current range XPath/offset data. We could also store a
textual context (prefix + suffix text), or anything else we can come
up with. I'm fairly convinced that if we record the selection in
multiple formats, we can say with some confidence that we have
successfully re-anchored the annotation after the text has changed, as
long as some
subset of the selectors still finds a match. So while the id attribute
may not match, or the XPath may have changed, there may still be a
textual match.
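
A rough sketch of that re-anchoring idea, under some loud
simplifications: the document is modelled as a list of paragraphs
(dicts with "id" and "text"), a paragraph index plus character offset
stands in for the XPath/offset range, and every function and key name
is hypothetical. Picking the location that the most selectors agree on
is just one possible policy; the working group deliberately did not
specify one.

```python
def by_id(paragraphs, sel):
    """Find the paragraph with the stored id and look for the quoted text in it."""
    for index, para in enumerate(paragraphs):
        if para["id"] == sel["id"]:
            start = para["text"].find(sel["exact"])
            if start != -1:
                return (index, start)
    return None


def by_position(paragraphs, sel):
    """Stand-in for XPath/offset: paragraph index plus character offset."""
    index, start = sel["index"], sel["start"]
    if index < len(paragraphs):
        end = start + len(sel["exact"])
        if paragraphs[index]["text"][start:end] == sel["exact"]:
            return (index, start)
    return None


def by_context(paragraphs, sel):
    """Search every paragraph for prefix + exact + suffix."""
    needle = sel["prefix"] + sel["exact"] + sel["suffix"]
    for index, para in enumerate(paragraphs):
        pos = para["text"].find(needle)
        if pos != -1:
            return (index, pos + len(sel["prefix"]))
    return None


RESOLVERS = {"id": by_id, "position": by_position, "context": by_context}


def reanchor(paragraphs, selectors):
    """Return (location, kinds of selectors that agreed), or (None, [])."""
    votes = {}
    for sel in selectors:
        hit = RESOLVERS[sel["kind"]](paragraphs, sel)
        if hit is not None:
            votes.setdefault(hit, []).append(sel["kind"])
    if not votes:
        return None, []
    best = max(votes, key=lambda loc: len(votes[loc]))
    return best, votes[best]


selectors = [
    {"kind": "id", "id": "p1", "exact": "best of times"},
    {"kind": "position", "index": 0, "start": 11, "exact": "best of times"},
    {"kind": "context", "prefix": "the ", "exact": "best of times", "suffix": ", it"},
]
original = [{"id": "p1",
             "text": "It was the best of times, it was the worst of times."}]
edited = [{"id": "intro", "text": "A new opening paragraph."},
          {"id": "p-renamed",
           "text": "Truly, it was the best of times, it was the worst of times."}]

# All three selectors agree on the original text; only the textual
# context survives the edit, which is still enough to re-anchor.
print(reanchor(original, selectors))
print(reanchor(edited, selectors))
```

The returned list of agreeing selectors gives a crude confidence
measure: three matches is solid, one match may deserve a flag for
human review.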

I hope that doesn't sound like an explosion of complexity. It can be
kept quite simple. For example, the current annotator JSON is 1-1
compatible just by moving the ranges down into the target (or not,
after all, there's nothing about JSON that suggests a particular
mapping to concepts in a formal vocabulary... ultimately we do what we
want). I'm just offering possibilities for inspiration.
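
As a sketch of what "moving the ranges down into the target" could
look like: the "uri", "ranges", "quote" and "text" keys below are how I
recall annotator's stored JSON, so treat them as an assumption, and the
"targets", "source" and "selectors" keys are invented here purely for
illustration.

```python
import copy


def nest_ranges(annotation):
    """Return a copy with top-level 'uri' and 'ranges' moved under one target."""
    converted = copy.deepcopy(annotation)
    ranges = converted.pop("ranges")
    source = converted.pop("uri")
    converted["targets"] = [{"source": source, "selectors": ranges}]
    return converted


def flatten_ranges(annotation):
    """Inverse of nest_ranges, for annotations with exactly one target."""
    converted = copy.deepcopy(annotation)
    targets = converted.pop("targets")
    if len(targets) != 1:
        raise ValueError("only single-target annotations can be flattened")
    converted["uri"] = targets[0]["source"]
    converted["ranges"] = targets[0]["selectors"]
    return converted


flat = {
    "uri": "http://example.org/book/chapter-1",
    "quote": "the best of times",
    "text": "Nice turn of phrase",
    "ranges": [{"start": "/div[1]/p[2]", "startOffset": 4,
                "end": "/div[1]/p[2]", "endOffset": 27}],
}
nested = nest_ranges(flat)
```

The round trip is lossless for single-target annotations, which is all
the current format can express anyway.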

-Randall

>
> rufus
>
>> Adam
>>
>> Adam Hyde
>> Booktype Project Lead
>>
>> On 15 Mar 2012 20:56, "Rufus Pollock" <rufus.pollock at okfn.org> wrote:
>>
>> On 15 March 2012 18:06, Adam Hyde <adam.hyde at sourcefabric.org> wrote:
>>> hi
>>>
>>> I installed it on FLO...
>>
>> We've thought a lot about this (and there's ongoing thought about this
>> as part of related Textus project [1]). Summary from my point of view:
>>
>> [1]: http://textusproject.org/
>>
>> * Ultimately to handle text changes you have to do some kind of
>> migration of annotation references.
>> * Annotation addresses are based on a pointer to some fixed identifier
>> in the text + character offset from there. E.g. identifier could be
>> element id, paragraph id, xpath etc (note an xpath identifier is
>> really just a special kind of offset ...).
>> * The more atomic (i.e. the smaller the area they cover) your addresses,
>> the fewer your migration problems (but the worse your interference
>> with the text) [2]
>>
>>  * In essence your migration will run an algorithm such as the following
>>
>>  * Compare two texts.
>>   * For all annotations with atomic sections whose identifier and
>> content is unchanged we need do nothing
>>   * For all sections whose identifier has changed but whose content
>> is unchanged update the relevant annotation identifiers (note it could
>> be difficult to work out the changes in identifiers to make this
>> possible -- e.g. suppose you have cut and pasted one paragraph in a
>> document. This will change all xpaths following the cut section and
>> before its reinsertion)
>>   * For all sections with changed content update the offsets
>>
>> [2]: http://blog.okfn.org/2007/01/24/thinking-about-annotation/
>>
>> In general this shows that identifiers which are tied to paths in
>> the document are especially bad. However they are one of the *only*
>> options if you can't interfere with the original document (e.g. by
>> inserting your own ids!) -- the other option I know of here is to do
>> hashing of small string sections of the document to generate your
>> identifiers. This does not require interfering with your document but
>> still generates addresses into the document. However, it is
>> computationally costly and more fragile to character changes.
>>
>> Thus, one extreme option, that would make updating significantly
>> easier, but which requires that you have complete control of your html
>> text, is to insert identifier marks (e.g. in html <span id="{id}"></span>),
>> say, every sentence and configure the annotator to utilize these ids
>> when generating URIs for annotations ...
>>
>> Rufus
>
>
>
> --
> Co-Founder, Open Knowledge Foundation
> Promoting Open Knowledge in a Digital Age
> http://www.okfn.org/ - http://blog.okfn.org/