Thursday, May 10, 2012

Using geohashing in Solr

LatLonType might be sufficient for many Solr spatial search use cases.  It's a pretty good option and performs decently in the benchmarks I've seen.  However, I'm interested in geohashing.  While there isn't a great option out of the box (yet), SOLR-2155 is interesting and I've given it a try.  It was written by David Smiley, whose Solr 3 book I use almost every day.

I grabbed the v1.0.5 JAR from SOLR-2155's GitHub repo.  It's a little unclear where to drop it, but I decided on /usr/local/solr/solr/lib/ based on David's advice (see bottom of thread).  The instructions on the GitHub page for configuring schema.xml and solrconfig.xml are spot-on.  Obviously you have to restart Solr for these changes to take effect.

Then feeding in data is a snap.  I'm using Lily, so I created a STRING field in Lily and wired it up to point to the solr2155.solr.schema.GeoHashField Solr field.  When creating Lily records, I send in "lat + ',' + lon" values.  I haven't been able to benchmark Solr (or Lily) indexing, but there's no noticeable drop-off from my previous attempts with double, tdouble, or LatLonType.
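To make the "lat + ',' + lon" format concrete, here's a minimal Python sketch of building that string before handing it to whatever creates the record.  The function name to_geoloc and the range checks are my own additions for illustration; they aren't part of Lily or SOLR-2155.

```python
def to_geoloc(lat: float, lon: float) -> str:
    """Format a latitude/longitude pair as a "lat,lon" string.

    Hypothetical helper: the name and the validation are illustrative only.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    # repr() gives the shortest string that round-trips the float,
    # so 3.9 stays "3.9" instead of picking up extra digits.
    return f"{lat!r},{lon!r}"


print(to_geoloc(3.9, -57.25))
```

Latitude goes first, matching the order used in the bounding box query.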

Searching performance is great.  Again, no benchmarks, but everything feels snappy (great technical term, I know).  Bounding box queries like this come back as quickly as text or string searches, which is really thrilling:

http://s1:8983/solr/select?q=*:*&fq=geoloc:[1,-60 TO 5,-55]&shards=s2:8983/solr,s1:8983/solr,s3:8983/solr,s4:8983/solr
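If you'd rather build that URL in code than by hand, something like the following Python sketch should work.  The function name bbox_query_url and its parameters are made up for this example; the only things taken from the query above are the q, fq, and shards parameters and the "[lat,lon TO lat,lon]" range syntax.  Note that urlencode percent-escapes the brackets, commas, and spaces, so the result looks noisier than the hand-typed version but decodes to the same thing.

```python
from urllib.parse import urlencode


def bbox_query_url(host, field, lower_left, upper_right, shards=()):
    """Build a Solr select URL with a bounding box filter query.

    Hypothetical helper: lower_left and upper_right are (lat, lon) tuples.
    """
    lat1, lon1 = lower_left
    lat2, lon2 = upper_right
    params = [
        ("q", "*:*"),
        # Range syntax as used in the post: field:[lat,lon TO lat,lon]
        ("fq", f"{field}:[{lat1},{lon1} TO {lat2},{lon2}]"),
    ]
    if shards:
        # Distributed search: comma-separated list of shard addresses
        params.append(("shards", ",".join(shards)))
    return f"http://{host}/solr/select?" + urlencode(params)


url = bbox_query_url(
    "s1:8983",
    "geoloc",
    (1, -60),
    (5, -55),
    shards=["s2:8983/solr", "s1:8983/solr", "s3:8983/solr", "s4:8983/solr"],
)
print(url)
```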

One issue I've seen is the precision of values in Solr result sets.  It looks like Java double rounding (e.g., a value that is really 3.9 comes back as 3.89999233).  The underlying data (3.9 in this case) is preserved, but Solr returns the modified value.  So just be careful about what you display.
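One plausible contributor (this is my guess, not something I've confirmed in the SOLR-2155 code) is that a geohash identifies a small cell rather than an exact point, so decoding it gives back the center of the cell instead of the original coordinates.  Here's a rough Python sketch of the standard public geohash algorithm to show the effect.  It is not SOLR-2155's implementation, and the function names are mine.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def encode(lat, lon, length=12):
    """Encode a point by repeatedly halving the lon/lat ranges."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    even = True  # bits alternate, starting with longitude
    while len(chars) < length:
        if even:
            mid = (lon_lo + lon_hi) / 2
            bit = 1 if lon >= mid else 0
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if lat >= mid else 0
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | bit
        even = not even
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def decode(geohash):
    """Decode to the center of the cell the geohash describes."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    even = True
    for ch in geohash:
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            if even:
                mid = (lon_lo + lon_hi) / 2
                lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
            even = not even
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


# Round trip: the decoded latitude is close to 3.9 but not exactly 3.9.
lat, lon = decode(encode(3.9, -57.5, length=8))
print(lat, lon)
```

Longer geohashes shrink the cell, so the decoded value gets closer to the original, but it generally won't match digit for digit.  If you need to display the exact value, one option is to keep it in a separate stored field.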

I'll try to get some benchmarks but here's a good brief with some numbers.