[annotator-dev] default analyzer for tags field

Gergely, Ujvari ujvari at hypothes.is
Thu Aug 1 14:37:48 UTC 2013


Hi Andrew!

I basically agree with what Randall has written. In our use case (and, it
seems, in yours too) we don't want stopwords filtered out of the tag index.
I personally think that setting the field to not_analyzed would be better
for tags, but I can go with the second proposed solution as well (setting
up our own analyzer and tokenizer).
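
For reference, the two options could look roughly like the sketch below.
This is only an illustration, assuming the ES string-mapping syntax of the
time. The names tag_tokenizer, tag_analyzer and emulate_tag_tokens are made
up for this example, and the lowercase filter is an extra assumption that
nobody in the thread proposed. The small helper only approximates in plain
Python what such a tokenizer would emit; it does not talk to ES.

```python
import re

# Option 1: index each tag as a single opaque token (no analysis at all).
TAGS_NOT_ANALYZED = {
    'type': 'string',
    'index_name': 'tag',
    'index': 'not_analyzed',
}

# Option 2: a custom analyzer with no stop filter. The names
# 'tag_tokenizer' and 'tag_analyzer' are hypothetical.
TAG_ANALYSIS_SETTINGS = {
    'analysis': {
        'tokenizer': {
            'tag_tokenizer': {
                'type': 'pattern',
                'pattern': r'[\s_]+',  # split on whitespace and underscores
            },
        },
        'analyzer': {
            'tag_analyzer': {
                'type': 'custom',
                'tokenizer': 'tag_tokenizer',
                'filter': ['lowercase'],  # assumption, not from the thread
            },
        },
    },
}

TAGS_CUSTOM_ANALYZER = {
    'type': 'string',
    'index_name': 'tag',
    'analyzer': 'tag_analyzer',
}


def emulate_tag_tokens(tag):
    """Approximate locally the tokens option 2 would index for a tag."""
    return [t.lower() for t in re.split(r'[\s_]+', tag) if t]
```

With the first mapping, a search for "foo" would not match a tag "foo bar".
With the second, emulate_tag_tokens gives the same tokens for "foo_bar" and
"foo bar", and a tag such as "the" is kept because no stop filter is
configured.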

Thank you,
Gergely
 

On 2013.08.01 at 2:34, Randall Leeds wrote:
> I believe he's suggesting to use an analyzer without stop words.
>
> An easy solution would be to set it to not_analyzed, which is still
> searchable as opaque tokens. It means that searches for "foo" would
> not turn up things tagged "foo bar", but that might be the right idea.
>
> Alternatively, we could set up an analyzer with a special-purpose
> tokenizer so that, for example, "foo_bar" and "foo bar" would be
> treated the same, and would both be indexed as both "foo" and "bar".
>
>
> On Wed, Jul 31, 2013 at 5:06 PM, Andrew Magliozzi
> <andrew at finalsclub.org> wrote:
>
>     Hey Gergely,
>
>     Roughly what might you suggest we do to improve upon our current
>     schema?
>
>     Warmly,
>     Andrew
>
>
>
>
>     On Jul 31, 2013, at 6:49 PM, Randall Leeds <tilgovi at hypothes.is>
>     wrote:
>
>>     My guess would be that there was no intention in particular here.
>>
>>
>>     On Tue, Jul 30, 2013 at 4:33 PM, Gergely, Ujvari
>>     <ujvari at hypothes.is> wrote:
>>
>>         Hello!
>>
>>         I have a theoretical question about how the tag index should work.
>>
>>         The |tags| field is defined like this in annotation.py:
>>
>>         |'tags': {'type': 'string', 'index_name': 'tag'}|
>>
>>         But no analyzer was set up for the search, so ES uses its
>>         own default analyzer, which drops common stopwords,
>>         for example:
>>
>>         |"a", "an", "and", "are", "as", "at", "be", "but", "by",
>>           "for", "if", "in", "into", "is", "it",
>>           "no", "not", "of", "on", "or", "such",
>>           "that", "the", "their", "then", "there", "these",
>>           "they", "this", "to", "was", "will", "with"
>>         |
>>
>>         This means that searching for these stopwords does not return
>>         any results.
>>
>>         My question: is this an intentional decision to avoid
>>         trivial tags? If yes, wouldn't it make sense to prevent these
>>         tags from being created, since they're not searchable?
>>
>>         Thanks
>>         Gergely
>>          
>>
>>
>>         _______________________________________________
>>         annotator-dev mailing list
>>         annotator-dev at lists.okfn.org
>>         <mailto:annotator-dev at lists.okfn.org>
>>         http://lists.okfn.org/mailman/listinfo/annotator-dev
>>         Unsubscribe: http://lists.okfn.org/mailman/options/annotator-dev
>>
>>
>
>
