Wednesday, June 17, 2009

PHP dash in class and method names

I ran into what seems like a common issue when working with PHP and SimpleXML today. Parsing XML is normally pretty easy:
<?php
header('Content-type: text/plain');

$xmlData = <<<XML
<?xml version='1.0'?>
<trees>
<fruit>
<apple name='apple' type='Deciduous' has-fruit='Y' />
<pear name='pear' type='Deciduous' has-fruit='Y' />
</fruit>
<pine>
<white name='whitepine' type='Coniferous' has-fruit='N' />
</pine>
</trees>
XML;

$xml = simplexml_load_string($xmlData);

echo "Testing SimpleXml";
echo "\n".$xmlData;
echo "\nName:".$xml->fruit->apple->getName();
echo " Type:".$xml->fruit->apple->attributes()->type;
?>
Output (after the echoed heading and XML):
Name:apple Type:Deciduous

However, if the name of your attribute contains a hyphen (dash), things get a bit more interesting. SimpleXML exposes attributes as object properties, and a property name written directly after "->" cannot contain "-" because PHP would parse it as a subtraction. To make it work, wrap the attribute name in braces and single quotes (e.g. "{'has-fruit'}").
echo "\n".$xmlData;
echo "\nName:".$xml->fruit->apple->getName();
echo " Type:".$xml->fruit->apple->attributes()->type;
echo "\nFruit?:".$xml->fruit->apple->attributes()->{'has-fruit'};
Output (after the echoed XML):
Name:apple Type:Deciduous
Fruit?:Y
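
The same brace syntax also works for element names that contain a hyphen, and SimpleXML additionally lets you read attributes with array syntax, which avoids the problem entirely. Here is a minimal sketch; the "fruit-tree" element and the sample document are made up for illustration and are not part of the example above:

```php
<?php
// Hypothetical sample data: both the element and the attribute contain a hyphen.
$xml = simplexml_load_string(
    "<trees><fruit-tree name='apple' has-fruit='Y' /></trees>"
);

// Element with a hyphen in its name: use the brace syntax.
$elementName = $xml->{'fruit-tree'}->getName();

// Attribute with a hyphen: array syntax on the element needs no braces.
$viaArray = (string) $xml->{'fruit-tree'}['has-fruit'];

// Attribute with a hyphen: the brace syntax on attributes().
$viaBraces = (string) $xml->{'fruit-tree'}->attributes()->{'has-fruit'};

echo "Element:" . $elementName . "\n";
echo "Fruit? (array):" . $viaArray . "\n";
echo "Fruit? (braces):" . $viaBraces . "\n";
```

The (string) casts are there because SimpleXML returns objects rather than plain strings, which matters as soon as you compare the value instead of just echoing it.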

Friday, June 12, 2009

Tricky SOLR schema issue with StrField

I have been setting up SOLR (version 1.3) as a search index for the Darwin Correspondence project. While making a few changes I ran into a really annoying issue today related to the way the schema configuration works. The SOLR schema (schema.xml) lets you set up analyzers and filters that control how terms are indexed and how searches are executed.

I needed names to match even when the case is not exact and when they contain special characters (e.g. "u" needs to match a name with "ü"). The field started out like this:
<fieldType name="name" class="solr.StrField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true">
For my first attempt I added an analyzer to the field like so:
<fieldType name="name" class="solr.StrField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true">
<analyzer type="index">
<tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
I loaded data into SOLR, tried out some searches, and did not get the results I expected: only exact matches came back, as if I had no analyzers at all. When I checked the SOLR admin analysis page, it indicated that the filters were working, and the tests there even seemed to show that things were ok. Unfortunately, I found out that SOLR does not actually execute the analyzers if the field class is set to solr.StrField. It doesn't fail or log any errors, but your searches will not work the way you expect them to. Changing the field over to class solr.TextField fixed the problem.
The correct configuration for the field is this:
<fieldType name="name" class="solr.TextField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true">
<analyzer type="index">
<tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
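
To make the goal of the lowercase and accent filters more concrete, here is a rough PHP sketch of the kind of normalization they are meant to produce. This is only an illustration and not how SOLR implements it: the normalizeName function is hypothetical, the character map is a small hand-picked sample, and tokenizing is skipped completely.

```php
<?php
// Illustration only: approximates trim + accent folding + lowercasing
// on a single value. The map below is a tiny assumed sample, not the
// full table that an accent filter would cover.
function normalizeName(string $name): string
{
    $folded = strtr($name, [
        'ü' => 'u', 'Ü' => 'U',
        'ö' => 'o', 'Ö' => 'O',
        'é' => 'e', 'É' => 'E',
    ]);

    // After folding, the remaining letters in these samples are ASCII,
    // so the plain strtolower() is enough here.
    return strtolower(trim($folded));
}

$indexed = normalizeName(' Müller ');
$query   = normalizeName('muller');

// With the same normalization on both sides, the two forms compare equal.
var_dump($indexed === $query);
```

With solr.StrField the value is stored and matched exactly as given, so nothing like this happens and "muller" never meets "Müller".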

I spent a few hours figuring this out so I hope that this saves someone a little time.