I have been setting up SOLR (version 1.3) as a search index for the Darwin Correspondence project. While making a few changes today I ran into a really annoying issue related to the way the schema configuration works. The SOLR schema (schema.xml) allows you to set up Analyzers and Filters, which control how terms are indexed and how searches are executed.
I needed names to match even when the case is not exact and when they contain special characters (e.g. "u" needs to match a name containing "ü"). The field started out like this:
<fieldType name="name" class="solr.StrField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true"/>
For my first attempt I added an analyzer to the field like so:
<fieldType name="name" class="solr.StrField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
I loaded data into SOLR, tried out some searches, and got none of the results I expected: I was getting exact matches only, as if I had no analyzers at all. When I checked the SOLR admin analysis page, it indicated that the filters were working, and the tests there even seemed to show that things were OK. Unfortunately, I found out that SOLR does not actually execute the analyzers if the field class is set to solr.StrField. It doesn't fail or log any errors, but your searches will not work the way you expect them to. Changing the field over to class solr.TextField fixed the problem.
The correct configuration for the field is this:
<fieldType name="name" class="solr.TextField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
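One thing worth noting about the configuration above: only the index analyzer folds accents. A query typed without the accent ("muller") is unaffected, but a query typed with it ("Müller", an illustrative name, not one from the project data) would be lowercased to "müller" while the index holds "muller", so it would likely miss. A possible variation, offered as an untested sketch that assumes you also want accented queries to match, adds the same accent filter on the query side:

```xml
<!-- Sketch only: query-side analyzer with accent folding added, so that
     accented query terms are reduced the same way as the indexed terms. -->
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Assumption: reuse the accent filter already used by the index analyzer -->
  <filter class="solr.ISOLatin1AccentFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

This would take the place of the existing query analyzer element inside the fieldType. Because it only changes how queries are analyzed, the indexed data should not need to be reloaded, though SOLR has to pick up the modified schema.xml before the change takes effect.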
I spent a few hours figuring this out so I hope that this saves someone a little time.