Friday, June 12, 2009

Tricky SOLR schema issue with StrField

I have been setting up SOLR (version 1.3) as a search index for the Darwin Correspondence project. While making a few changes today I ran into a really annoying issue with the way the schema configuration works. The SOLR schema (schema.xml) lets you set up analyzers and filters that control how terms are indexed and how searches are executed.

I needed names to match even when the case differs or when the name contains special characters (e.g. a search for "u" needs to match a name containing "ü"). The field started out like this:
<fieldType name="name" class="solr.StrField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true"/>
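As a rough illustration of the matching behaviour wanted here, the following Python sketch folds case and accents on both the query and the stored name before comparing them. It is not how SOLR implements its filters: it uses Unicode decomposition from the standard library, the helper names `fold` and `matches` are made up for this example, and it compares whole strings rather than tokens. "Müller" is just a sample name.

```python
import unicodedata


def fold(text):
    """Lowercase, trim, and strip combining accents (illustration only)."""
    decomposed = unicodedata.normalize("NFKD", text.strip().lower())
    # Drop the combining marks left behind by decomposition, e.g. the
    # diaeresis that NFKD splits off from "ü".
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query, stored_name):
    # Apply the same folding to both sides, which is roughly what a
    # pair of index-time and query-time analyzers is meant to achieve.
    return fold(query) == fold(stored_name)


print(matches("muller", "Müller"))  # True
print(matches("MULLER", "müller"))  # True
print(matches("miller", "Müller"))  # False
```

Note that characters without a Unicode decomposition (for example "ß" or "ø") pass through this sketch unchanged, so it only approximates what a dedicated accent filter does.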
For my first attempt I added an analyzer to the field like so:
<fieldType name="name" class="solr.StrField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
I loaded data into SOLR and tried out some searches, but I was getting exact matches only, as if I had no analyzers at all. When I checked the Solr admin analysis page, it indicated that the filters were working, and the tests there even seemed to show that things were OK. Unfortunately, I found out that SOLR does not actually execute the analyzers if the field class is set to solr.StrField. It doesn't fail or log any errors, but your searches will not work the way you expect them to. Changing the field over to class solr.TextField fixed the problem.
The correct configuration for the field is this:
<fieldType name="name" class="solr.TextField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
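
Since SOLR gives no warning when a solr.StrField type carries analyzers, a small check run against schema.xml can catch the mistake before any data is loaded. The Python sketch below is one hypothetical way to do that with the standard library. The function name `find_ignored_analyzers` and the sample schema are invented for illustration, and it only recognises the short class name solr.StrField, not a fully qualified Java class name.

```python
import xml.etree.ElementTree as ET


def find_ignored_analyzers(schema_xml):
    """Return names of solr.StrField types that declare <analyzer> children."""
    root = ET.fromstring(schema_xml)
    flagged = []
    # Check both spellings, since some schemas use a lowercase "fieldtype".
    for tag in ("fieldType", "fieldtype"):
        for field_type in root.iter(tag):
            is_str = field_type.get("class") == "solr.StrField"
            if is_str and field_type.find("analyzer") is not None:
                flagged.append(field_type.get("name"))
    return flagged


# A made-up, trimmed-down schema: one broken type and one correct type.
SAMPLE = """
<schema name="example" version="1.1">
  <types>
    <fieldType name="name" class="solr.StrField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="name_text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
"""

print(find_ignored_analyzers(SAMPLE))  # ['name']
```

To check a real file, read it first and pass the text in, for example `find_ignored_analyzers(open("schema.xml").read())`, adjusting the path to wherever your schema lives.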

I spent a few hours figuring this out so I hope that this saves someone a little time.

3 comments:

Shalin Shekhar Mangar said...

Aaron, I agree that this violates the principle of least surprise and we should at least log a warning. Can you please open an issue on the Solr JIRA?

R.L. said...

Hi Aaron,

I am running into the exact same issue that you had. My analyzers work fine in the solr/admin/analysis.jsp. However, the response I get when I query shows that the analyzers have not been run. I have tried using solrJ and posting xml with solr's own post.jar, but nothing seems to work.

I have set my fieldtypes to solr.TextField. Do you have any insight why I'm still running into this problem?

Here is my fieldtype (pretend the parentheses are "<>"):

(fieldType name="keywords" class="solr.TextField" sortMissingLast="true" omitNorms="true")

(analyzer type="index")
(tokenizer class="solr.WhitespaceTokenizerFactory" /)
(filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords-ch.txt" /)
(filter class="solr.TrimFilterFactory" /)
(/analyzer)
(/fieldType)

CicciManolesta said...

I have the same problem with Solr 1.4, and I've changed StrField to TextField.