[ckan-dev] Issue with new Solr Schema + faceting with extras
Rufus Pollock
rufus.pollock at okfn.org
Thu Dec 8 13:04:14 UTC 2011
2011/12/8 Adrià Mercader <amercadero at gmail.com>:
> Hi all,
> Another step in the long road to Solr enlightenment :)
>
> Recently John introduced some changes in the schema to fix #1455 (show
> extras in search results with all_fields=true)
> https://github.com/okfn/ckan/commit/05b675a4314ad269c6e6a095d57e3f2a21e771eb#diff-0
>
> That worked great, but introduced some changes in the way facets for
> extras are created, which produced some weird results:
> (extras_publishertype should be "primary_source" and extras_filetype
> should be "activity" / "organisation")
>
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">
> <lst name="extras_publishertype">
> <int name="primari">1180</int>
> <int name="primarysourc">1180</int>
> <int name="sourc">1180</int>
> </lst>
> <lst name="extras_filetype">
> <int name="activ">1173</int>
> <int name="organis">7</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> </lst>
>
> After spending a while playing with the properties of the "extras_*"
> field, the conclusion is that dynamicField + type="text" does not play
> well with faceting. Two options here:
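With facet fields you almost always want type "string": a "text" field
is tokenised and stemmed, which is why the facet values above come back
as "primari" / "sourc". (You can try some slightly fancier stuff, e.g.
ignoring case and stripping whitespace, but I don't think it is worth
the effort.)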
Note also that, for that reason, facet fields and query fields almost
always want to be separate in the Solr schema. A standard convention is
to take any standard field and add a facet field named
{fieldname}_facet. Since what you want to facet on is somewhat
configurable, this usually has to be set up in client code. As an
example, see the dynamic facet field in the OpenSpending schema:
https://github.com/okfn/openspending/blob/master/solr/openspending_schema.xml#L221
(It is, of course, OpenSpending client code that creates the contents
of these facet fields.)
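A minimal sketch of what that client code might look like, assuming the
schema has a string-typed dynamic field matching "*_facet". The function
name build_solr_doc, the facet_keys parameter and the package layout are
hypothetical, made up for illustration; this is not CKAN's or
OpenSpending's actual indexing code.

```python
def build_solr_doc(package, facet_keys):
    """Build a flat Solr document with separate search and facet fields.

    Hypothetical helper: assumes the schema defines "extras_*" as an
    analysed text field and "*_facet" as a plain string field.
    """
    doc = {"id": package["id"]}
    for key, value in package.get("extras", {}).items():
        # Analysed copy, used for full-text search.
        doc["extras_" + key] = value
        if key in facet_keys:
            # Untokenised copy, used only for faceting.
            doc[key + "_facet"] = value
    return doc


package = {
    "id": "example-package",
    "extras": {"publishertype": "primary_source", "filetype": "activity"},
}
doc = build_solr_doc(package, facet_keys={"publishertype"})
```

Faceting on publishertype_facet would then count the whole value
"primary_source" rather than its stemmed tokens, while extras_publishertype
stays searchable as text.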
> 1 - Change the type of the extras_* dynamicField to "string", which
> fixes the issue. But the "string" type is much more limited in terms
> of searching (case sensitive, no synonyms... that's why it's used on
> titles, notes, etc.) so the results dependent on "complex" strings in
> extras could lose quality
>
> 2 - Apart from the "extras_*" field, when indexing, we add all extras
> in the main namespace of the package (so in the Solr index there are
> e.g. "publishertype" and "filetype" fields). In our schema there is a
This also speaks to the longstanding issue of having the extras in the
main namespace, at least in the API and all externally facing
interfaces (i.e. not having some 'extra' subdictionary or 'extra'
prefix on the field names). See http://trac.ckan.org/ticket/1240
Rufus
> catch-all dynamicField with type "string" that recognizes these
> fields:
>
> <dynamicField name="*" type="string" indexed="true" stored="false"/>
>
> So if you facet by these fields, you get the expected results:
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">
> <lst name="publishertype">
> <int name="primary_source">1180</int>
> </lst>
> <lst name="filetype">
> <int name="activity">1173</int>
> <int name="organisation">7</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> </lst>
>
> This option is a good one because we don't change the schema and
> extras_* still has type="text", but there is a slight chance that if
> some project/ extension is faceting extras they may need to facet by
> "extraname" instead of "extras_extraname". The only one I know of is
> IATI, which I'm happy to change, but there may be another one out
> there.
>
> I would vote option 2 if nobody objects / has an alternative
>
> Sorry for the long email,
>
> Adrià
--
Co-Founder, Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/