Sunday, December 30, 2012

Solr 4.0 - solid geospatial now comes standard

I have a fairly large corpus of events with spatial data indexed in Solr.  LatLonType won't cut it for me.  I'm a huge fan of David Smiley's work on SOLR-2155, which I used to scale my app in Solr 3.  David embraced a spatial-grid approach, and I've found his modules outperform LatLonType in my use cases - other benchmarks support this.

Where SOLR-2155 required an additional JAR, Solr 4.0 ships with a solid geospatial solution.  I've been using solr.SpatialRecursivePrefixTreeFieldType with pretty much standard options.
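For reference, my field definition looks something like this (the field and type names are my own choices, and the numeric options are the commonly suggested starting points from the Solr 4 example schema - tune them for your data):

```xml
<!-- in schema.xml: a geospatial field type and a field using it -->
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
           geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees"/>
<field name="geoloc" type="location_rpt" indexed="true" stored="true"/>
```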

There was a lot of weirdness in Solr 3 which is now fixed.  Namely, bounding "boxes" were actually circles - now we get proper bounding rectangles.  In fact, polygons are within reach: if you don't mind some LGPL in your app, you can include JTS and get polygon support with Solr 4.
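With JTS on the classpath, a polygon filter is just WKT in the query.  A sketch (assumes a field named geoloc; note that WKT coordinates are in lon lat order, and the polygon must close back on its first point):

```
fq=geoloc:"Intersects(POLYGON((-60 1, -55 1, -55 5, -60 5, -60 1)))"
```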

I don't have metrics to quantify performance yet.  Once I get more data indexed, I'll try to post some numbers.  Thanks David!

Sunday, September 23, 2012

Lily cluster setup - HDFS permission issue and solution

I really enjoy using Lily, a framework for easily working with HBase and Solr.  I hadn't set up a Lily cluster in a while and was perplexed by these kinds of errors when starting the Lily service from my datanodes:

Caused by: org.apache.hadoop.ipc.RemoteException: Permission denied: user=lily, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x

When I started the Lily server as root it seemed to work, but that wasn't an optimal solution.  The error meant my HDFS directory permissions weren't set properly, since Lily contacts HDFS as whatever user you start the service as.  I had to set the owner/group of /lily to lily:lily from the namenode:

[user@master01 ~]$ sudo su hdfs
[sudo] password for user:
bash-4.1$ hadoop fs -chown -R lily:lily /lily

This made the HDFS folder happy, and my datanodes were able to start the Lily service without running as root.

In some cases, you might need to create the lily directory just like you do when setting up mapred.  To do so:

[user@master01 ~]$ sudo su hdfs
[sudo] password for user:
bash-4.1$ hadoop fs -mkdir /lily

Friday, August 3, 2012

AWS Free Usage Tier experience

I've had an app in the wild for about 1.5 years.  I'm in pure maintenance mode, so other than replying to user emails, I check out error reports, reviews, and download stats every couple of months.  Occasionally, I'll peek into the persistence layer to see what the raw data looks like (shout out to sdbtool!).  But it's been some time since I looked at how many resources I'm consuming on AWS.  I built this app thinking I would always fall within the Free Tier pricing level.

My AWS usage report tells me 2 things:
  • I'm a very small fry.
  • The AWS Free Usage Tier is pretty generous.

Before getting to the AWS stats, here's my use case.  I probably have only a couple hundred active users today.  I'm using AWS for SimpleDB persistence (no graph processing here!) so all the compute time is dedicated to CRUD with the database.  SimpleDB rows are attribute-value pairs (Strings) and I average around half-a-KB of data per row.  I have about 25k rows of data in my SimpleDB instance.

My SimpleDB usage for July 2012:
  • 0.052 Compute-Hours (25 free)
  • 0.000529 GB-Month of storage (1 GB-Month free)

I was confused by the GB-Month number.  I thought it meant the total amount of data stored in my SimpleDB instance, but it must refer to the amount of new data.  So that's half a MB of new data coming in, which translates to around 1,600 new items ('rows' in SimpleDB parlance) across my domains (SimpleDB's word for 'tables') in July.  The total amount of disk space is actually represented by the TimedStorage-ByteHrs field of the detailed usage report.

My AWS usage for July 2012:
  • 0.001 GB data transfer in (1 GB free)
  • 0.002 GB data transfer out (1 GB free)

Since I serve up much more data than I ingest, the data-out number is my only concern.  That's 2 MB of data leaving AWS, and my app is fielding about 18K Select requests for items (SimpleDB's term for 'rows') each month.  Note that a request might not be for an entire item - it could be for 1 or more attributes ('columns').

Add these numbers up and I obviously have room to scale!  In fact, I could get almost a 500x increase in activity and still fall within the free usage tier.  Since I only coded this app for fun and experience, I probably won't grow but it's fun to know where I'm at.  It's really exciting to consider all the backend options out there for developers (GAE is similarly generous).
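A quick back-of-the-envelope check on that figure, using the numbers from the reports above - compute time turns out to be the binding constraint:

```python
# July 2012 usage vs. free-tier allowance: (used, free) pairs from the reports above
limits = {
    'compute_hours':    (0.052,    25.0),
    'storage_gb_month': (0.000529, 1.0),
    'transfer_in_gb':   (0.001,    1.0),
    'transfer_out_gb':  (0.002,    1.0),
}

# how many times current activity could grow before hitting each cap
headroom = {k: free / used for k, (used, free) in limits.items()}
bottleneck = min(headroom, key=headroom.get)

print(bottleneck, round(headroom[bottleneck]))  # -> compute_hours 481
```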

Saturday, June 30, 2012

Simple model for writing cloud apps in Python

I've been writing a lot of apps against so-called 'big data' persistence and indexing backends recently and wanted to draw attention to a simple yet performant Python strategy.  For those who want to store data in HBase and index in Solr, I've found success with the happybase and solrpy modules.

happybase is a new Python module with a very clean API built on Thrift.  You can tell the creator (a really nice guy named Wouter) has strived for maximum simplicity, since the API consists of a grand total of 3 classes: Connection, Table, and Batch!  To use an example from the docs:

import happybase

connection = happybase.Connection('hostname')
table = connection.table('table-name')
table.put('row-key', {'family:qual1': 'value1',
                      'family:qual2': 'value2'})

Pretty simple, huh?  'Big data' often feels very Pythonic to me, so I love keeping values in dictionaries, which obviously works great with the happybase API.  I did my writes via Batch (if you're not using Batch, you should question whether HBase is a good fit for your use case) and got good, though unquantified, performance.  If I have time, I'll go back and run some benchmarks, but it was in the ballpark of other tools I've used, like Lily (which uses Avro instead of Thrift).

On to solrpy.  I've been bitten before pushing large amounts of data over HTTP (particularly at index time), but I find solrpy performant.  It works like you would expect - let's steal one of their examples:

import solr

# create a connection to a solr server
s = solr.Solr('http://example.org:8983/solr')

# add a document to the index
doc = dict(
    title='Lucene in Action',
    author=['Erik Hatcher', 'Otis Gospodnetić'])
s.add(doc, commit=True)

# do a search
response ='title:lucene')
for hit in response.results:
    print hit['title']

Again, notice data is passed using a dictionary!  I'm sending lots of docs to the Solr server at a time, so the Solr.add_many() function comes in handy.  Hundreds of thousands of small/medium docs (all units relative, obviously) can be added in just a few minutes via a single persistent HTTP connection.  When I profile the add_many() call, there is a sizable lag before docs start to flow to Solr, which I'd guess means the data is being aggregated locally - I wish it streamed, but oh well.
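Because of that local buffering, I feed add_many() fixed-size chunks to keep memory bounded.  A simple sketch (the chunk size is arbitrary, and the commented usage assumes the `s` connection from the example above):

```python
def chunked(iterable, size=1000):
    """Yield lists of at most `size` items from any iterable."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # leftover partial chunk
        yield batch

# usage sketch: push docs to solrpy in manageable batches
# for batch in chunked(docs, 5000):
#     s.add_many(batch)
#     s.commit()
```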

These modules prove you can get your Python apps up and running very easily in an HBase/Solr environment.  Have fun!

Thursday, May 10, 2012

Using geohashing in Solr

LatLonType might be sufficient for many use-cases of Solr spatial searching.  It's a pretty good option and performs decently in benchmarks I've seen.  However, I'm interested in geohashing.  While there isn't a great option out of the box (yet), SOLR-2155 is interesting and I've given it a try.  It was written by David Smiley, whose Solr 3 book I use almost every day.
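To make the idea concrete, here's a from-scratch sketch of the encoding itself: alternately bisect the longitude and latitude intervals, collect the bits, and base32-encode them.  This is the textbook geohash algorithm, not SOLR-2155's actual implementation:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=11):
    """Encode a lat/lon pair into a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # bits alternate, starting with longitude
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon
    # pack each run of 5 bits into one base32 character
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)

# the classic example point from the geohash literature
print(geohash_encode(57.64911, 10.40744))  # -> u4pruydqqvj
```

Notice that truncating a geohash just widens the cell, which is why prefix matching maps so naturally onto index tries.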

I grabbed the v1.0.5 JAR from SOLR-2155's github repo.  It's a little confusing where to drop it but I decided on /usr/local/solr/solr/lib/ based on David's advice (see bottom of thread).  The instructions on how to configure schema.xml and solrconfig.xml on the github page are spot-on.  Obviously you have to restart Solr for these changes to take effect.

Then feeding in data is a snap.  I'm using Lily, so I created a STRING field in Lily and wired it up to point to the solr2155.solr.schema.GeoHashField Solr field.  When creating Lily records, I send in "lat + ',' + lon" values.  I haven't been able to benchmark Solr (or Lily) indexing, but there's no noticeable drop-off from my previous attempts with double, tdouble, or LatLonType.

Searching performance is great.  Again, no benchmarks but everything feels snappy (great technical term, I know).  Bounding box queries like this come back as quick as text or string searches which is really thrilling:

http://s1:8983/solr/select?q=*:*&fq=geoloc:[1,-60 TO 5,-55]&shards=s2:8983/solr,s1:8983/solr,s3:8983/solr,s4:8983/solr

One issue I've seen involves precision in Solr result sets.  Values come back looking like raw Java doubles (i.e., a stored 3.9 is returned as something like 3.89999233).  The underlying data (3.9 in this case) is preserved, but Solr returns the modified value.  So just be careful about what you display.
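My workaround is simply rounding at display time.  A sketch - the decimal count (4 here) is a choice, not a requirement, and the raw value is made up to illustrate the effect:

```python
raw = 3.8999923  # the sort of value Solr hands back when the stored value is 3.9
display = round(raw, 4)
print(display)  # -> 3.9
```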

I'll try to get some benchmarks but here's a good brief with some numbers.

Monday, March 19, 2012

What to do when your free app doesn't take off

You write a great app, put it on the iOS App Store and/or Android Market (err, Google Play), quickly mature to a pay app, sit back, and enjoy your steady stream of secondary income.

The majority of cases don't turn out this way. So what happens when your idea doesn't work?

This isn't a post about talent acquisitions or well poisoning. Nor do I have enough experience to comment on pivoting startups or apps/sites that are not free. It's scoped specifically to individuals who independently offer free apps or web sites that, ahem, don't work out. I've got a little experience in this area, having published a free Android time management app which stores user data remotely. It didn't take off but I have some active users who seem to like it enough.

If your product is a standalone widget and you want to kill it off, sundowning immediately shouldn't be too controversial. However if you've actively sought users who have entrusted you with their data, the answer isn't so clear. Not only have they invested time to learn your app, they've actually uploaded their data and expect to be able to access it.

Let's look at the options:
  1. Immediately shut down and get out of Dodge: At best this is rude and at worst you've really screwed your former users as they've lost their data forever. Not a valid option for responsible developers.

  2. Hang a going-out-of-business sign: Even if everything else is free, there is an opportunity cost in keeping a 'failed' idea going (discussed more in #3). It's perfectly reasonable for someone to want to move on. The crucial ingredients here are sufficient notice and data reclamation options.
    • Build in at least one month of lead time before services get shut off. Two or three months would be more considerate.
    • Stop accepting new users (unpublish for apps, no more signups for web sites).
    • Notify users via email and/or web site banner.
    • If you have an app, consider publishing a final version that includes a conspicuous popup every time the app is launched informing them of the impending shutdown. Drive the point home any way you can.
    • In the last couple weeks, review your logs/database for people still uploading data and send them a friendly reminder.
    • Give users lots of options for downloading their data in convenient formats (CSV, etc). Even if you didn't have this feature before, you're going to need to build it to shut down gracefully.
    • Remain available even after shutdown for stragglers.

  3. Keep your product available in perpetuity: If it's not costing you anything, you could keep it going for the long term. There are disadvantages and risks! APIs change. Users will still seek support. You could get a bill one day for a service you consume.

    Additionally, there should be an expectation that someone is minding the store. Developers who go this route can never completely disconnect. Consider not accepting new users and being up front about your intent to stop iterating.
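For the data-reclamation bullet in option #2, the export itself can be dead simple.  A sketch assuming your rows are already dicts (the column names and sample data here are made up):

```python
import csv
import io

def rows_to_csv(rows, fieldnames):
    """Render a list of dicts as a CSV string a user can download."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()

# hypothetical user data from a time management app
rows = [{'task': 'standup', 'minutes': '15'},
        {'task': 'code review', 'minutes': '45'}]
print(rows_to_csv(rows, ['task', 'minutes']))
```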

Failed ideas don't have to be all bad - just be responsible and considerate to your users.

Sunday, March 18, 2012

OpenShift test drive

I'm looking for an all-in-one PaaS solution. Currently, my little Android app hosts its data on AWS SimpleDB (I need to swap this out with something relational) and uses GAE for authentication services. While it's been interesting to use these systems, I'd like a single solution.

OpenShift had a booth at PyCon 2012 plus they were handing out free tee shirts. Sold! The only code I'm interested in right now is Java and Python which they of course support (among others). OpenShift also comes with MySQL and MongoDB. I don't believe OpenShift is in production status so I don't know what the pricing structure will be. Whatever it is, I'm sure I'll stay firmly in the 'free' range just as with AWS and GAE. Maybe one day I'll code something that does real volume and I'll have to pay! ;-)

Getting started was a snap. Their control panel UI is still pretty new and could use some polish, but OpenShift is very straightforward, so it wasn't confusing. git does all the heavy lifting in terms of deploying code, at least for an Express app. They provide a very nice Django example which was much appreciated and worked out of the box. I'm interested in checking out Tornado, which is easily stood up in OpenShift. I use the requests module all the time (requests is a urllib2 replacement) and bringing it into my OpenShift app took like 10 characters of typing. I can even ssh to my virtual app, write to the filesystem, and fire up daemons at will!

I'm leaning towards porting my Android app to run on OpenShift. If I don't, I'm sure I'll find something else to use it for (probably something for my graduate studies). Check it out.

UPDATE 5/5/12: I used OpenShift to host my semester project for a grad school class I'm taking.  It's still up (for now, probably not forever).  The app attempts to process large amounts of Google+ content and rank attached links intelligently.

The only issue I had with OpenShift was a lack of support for python multiprocessing.  The issue is documented here.  When I discovered the problem, I hopped on IRC and was able to immediately talk to an OpenShift rep (on a Saturday, no less).  Talk about customer service!  Hopefully it'll be fixed one day.  I was able to use python threading to do what I needed to do but I can see how this would be a showstopper for some people.
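The workaround was plain threading.  A sketch of the fan-out pattern I used - the worker function here is a stand-in for the real Google+ processing:

```python
import threading

def process(item):
    # stand-in for the real per-item work (fetching a link, scoring it, etc.)
    return item * 2

def fan_out(items, num_threads=4):
    """Process items concurrently with threads instead of multiprocessing."""
    results = [None] * len(items)
    lock = threading.Lock()
    cursor = [0]  # shared index into items, guarded by the lock

    def worker():
        while True:
            with lock:
                i = cursor[0]
                if i >= len(items):
                    return
                cursor[0] = i + 1
            results[i] = process(items[i])  # each i is claimed by one thread

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(fan_out([1, 2, 3, 4, 5]))  # -> [2, 4, 6, 8, 10]
```

For I/O-bound work like HTTP fetching, threads sidestep the GIL well enough that this was a workable substitute.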

Overall, very impressed with OpenShift.

Friday, March 2, 2012

Running Jython apps using HBase

Jython is a great tool to have in your arsenal - I'm currently using it for prototyping, but it could do much more, I'm sure. After yum-installing jython onto a CentOS 6 machine (not recommended, as much newer versions exist), I used this great writeup to set up Jython and HBase:

$ export HBASE_HOME=/usr/bin/hbase
$ export JYTHON_HOME=/usr/bin/jython
$ export CLASSPATH=`$HBASE_HOME classpath`
$ alias pyhbase='HBASE_OPTS="-Dpython.path=$JYTHON_HOME" HBASE_CLASSPATH=/usr/share/java/jython.jar $HBASE_HOME org.python.util.jython'
$ pyhbase
Jython 2.2.1 on java1.6.0_30
>>> from org.apache.hadoop.hbase.client import HTable

Once the environment is set up, you can write an app. Let's say you have a table called 'test' with a column family named 'stuff' and column named 'value'.

import java.lang
from org.apache.hadoop.hbase import HBaseConfiguration, HTableDescriptor, HColumnDescriptor, HConstants
from org.apache.hadoop.hbase.client import HBaseAdmin, HTable, Get
from org.apache.hadoop.hbase.util import Bytes

conf = HBaseConfiguration()
admin = HBaseAdmin(conf)

tablename = "test"
table = HTable(conf, tablename)

row = 'some_row_key'
g = Get(Bytes.toBytes(row))
res = table.get(g)

val = res.getValue(Bytes.toBytes('stuff'), Bytes.toBytes('value'))

print Bytes.toString(val, 0, len(val))

The only weird behavior I couldn't explain is the need to use that three-argument overload in the final Bytes.toString() call. I posted a question on StackOverflow but didn't get any great feedback. My best guess is that Jython's method-overload resolution picks the wrong Bytes.toString() variant unless the three-argument version is forced.

Wednesday, January 25, 2012

Setting up Duplicity with GnuPG

Really enjoy the functionality of Duplicity. On CentOS:

sudo yum install duplicity

If you get "No package duplicity available.", you need to install EPEL. For CentOS 6:

sudo rpm -Uvh

Then try yum again.

To make a key using GnuPG:

gpg --gen-key

The defaults are fine. When the key is complete, make sure you copy down the key ID (the "12345678" in the pub line below) because you'll pass it to duplicity:

gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0 valid: 1 signed: 0 trust: 0-, 0q, 0n, 0m, 0f, 1u
pub 2048R/12345678 2012-01-26

You might need to export the key if another user will use it. In my case, I had to create the keys with one user but another user would execute the backups.

gpg --output secret --export-secret-keys
gpg --output public --export

Then the other user needs to:

gpg --import /path/to/secret
gpg --import /path/to/public

You can verify the keys are there by:

gpg --list-keys

If when using the key you get these errors:

gpg: : There is no assurance this key belongs to the named user
gpg: [stdin]: sign+encrypt failed: Unusable public key

You should (as the user experiencing this error):

gpg --edit-key [key]
> trust
// decide how much to trust it
> save

Now to actually use duplicity: it'll most likely be cron'd, so a shell script works nicely. I like the way Justin Hartman did it, so there's really no need to re-invent what he did. Just ignore the AWS stuff if you're not backing up there.
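For the record, such a script boils down to something like this sketch - the paths and key ID are placeholders, and duplicity reads the passphrase from the PASSPHRASE environment variable:

```shell
#!/bin/bash
# placeholders - substitute your own values
GPG_KEY="12345678"                  # from the 'pub 2048R/12345678' line above
SOURCE="/home/user/important-data"
TARGET="file:///mnt/backup"         # a local target; S3 URLs also work

export PASSPHRASE="your-gpg-passphrase"
duplicity --encrypt-key "$GPG_KEY" "$SOURCE" "$TARGET"
unset PASSPHRASE
```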

Saturday, January 7, 2012

Issues with new NIC on CentOS 6.0 server

A Dell PowerEdge went down with an E1410 error which I couldn't clear. The motherboard/NIC had to be replaced. This resolved the error but brought about an annoying networking situation.

I really like this nixCraft howto for simple Red Hat networking issues. After putting the new MAC address on the HWADDR line in /etc/sysconfig/network-scripts/ifcfg-eth0 and ifcfg-eth1 (eth1's MAC is just eth0's plus 1), I ran "/etc/init.d/network restart" expecting everything to come up. No such luck - I got fatal "Device eth0 does not seem to be present" errors. I verified the MAC addresses, swapped them between eth0 and eth1, and restarted the server. No joy.

Eventually someone smarter than me told me to look at eth2 and eth3. It turns out the NIC was binding to eth3. To resolve this, I commented out the HWADDR lines in ifcfg-eth0 and ifcfg-eth1 and restarted the server. Running "ifconfig eth0" showed the NIC now attached to eth0. I wanted this to always be the case, so I uncommented the HWADDR lines in ifcfg-eth0 and ifcfg-eth1 and restarted again. I probably could have just restarted the network service rather than the whole server and gotten the same results, but everything was good.
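In hindsight, the likely culprit is udev's persistent-net rules: on CentOS 6, udev remembers the old motherboard's MACs and keeps eth0/eth1 reserved for them, so the replacement card gets the next free names (eth2/eth3). The cleaner fix is to edit the rules file directly - a sketch, with placeholder MAC addresses:

```
# /etc/udev/rules.d/70-persistent-net.rules
# delete the stale entries for the old MACs, then point eth0/eth1 at the new ones
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="aa:bb:cc:dd:ee:f0", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="aa:bb:cc:dd:ee:f1", NAME="eth1"
```

After editing, a reboot (or udev reload) should bring the interfaces up under the expected names without touching HWADDR at all.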