Code Gouge: June 2012

I've been writing a lot of apps against so-called 'big data' persistence and indexing backends recently and wanted to draw attention to a simple yet performant Python strategy. For those who want to store data in HBase and index in Solr, I've found success with the happybase and solrpy modules.

happybase is a new Python module with a very clean API built on top of Thrift. You can tell the creator (a really nice guy named Wouter) has striven for maximum simplicity, since the API consists of a grand total of 3 classes: Connection, Table, and Batch! To use an example from the docs:

import happybase

connection = happybase.Connection('hostname')
table = connection.table('table-name')
table.put('row-key', {'family:qual1': 'value1',
                      'family:qual2': 'value2'})

Pretty simple, huh? 'Big data' often feels very Pythonic to me, so I love keeping values in dictionaries, which works well with the happybase API. I did my writes via Batch (if you're not using Batch, you should question whether HBase is a good fit for your use case) and got good, though unquantified, performance. If I have time, I'll go back and run some benchmarks, but throughput was in the ballpark of other tools I've used, like Lily, which uses Avro instead of Thrift.
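For the curious, here is a minimal sketch of what a Batch write can look like. It is not my production code: write_rows and example are hypothetical helper names, the hostname, table name, and column names are placeholders, and the batch_size of 1000 is an arbitrary guess rather than a tuned value.

```python
def write_rows(table, rows, batch_size=1000):
    """Send (row_key, data_dict) pairs to HBase through a happybase Batch.

    `table` is expected to be a happybase Table. The batch flushes itself
    every `batch_size` mutations and once more when the `with` block exits.
    Returns the number of rows handed to the batch.
    """
    count = 0
    with table.batch(batch_size=batch_size) as batch:
        for row_key, data in rows:
            batch.put(row_key, data)
            count += 1
    return count


def example():
    # Hypothetical usage: 'hostname', 'table-name' and the column names
    # are placeholders, not values from a real cluster.
    import happybase

    connection = happybase.Connection('hostname')
    table = connection.table('table-name')
    rows = [
        (b'row-key-1', {b'family:qual1': b'value1'}),
        (b'row-key-2', {b'family:qual1': b'value2'}),
    ]
    return write_rows(table, rows)
```

Passing the table in, rather than opening the connection inside the function, keeps the dictionary-in, HBase-out part easy to reuse across tables.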

On to solrpy. I've been bitten before when pushing large amounts of data over HTTP (particularly at index time), but I find solrpy performant. It works as you would expect; let's steal one of their examples:

import solr

# create a connection to a solr server
s = solr.Solr('http://example.org:8083/solr')

# add a document to the index
doc = dict(
    id=1,
    title='Lucene in Action',
    author=['Erik Hatcher', 'Otis Gospodnetić'],
    )
s.add(doc, commit=True)

# do a search
response = s.select('title:lucene')
for hit in response.results:
    print(hit['title'])

Again, notice that the data is passed as a dictionary! I send lots of docs to the Solr server at a time, so the Solr.add_many() function comes in handy. Hundreds of thousands of small/medium docs (all units relative, obviously) can be added in just a few minutes via a single persistent HTTP connection. When I profile the add_many() call, there is a sizable lag before docs start to flow to Solr, which I guess means the data is being aggregated locally. I wish it were streaming, but oh well.
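One possible workaround for that lag is to feed add_many() in smaller chunks, so each call has less to aggregate before anything goes over the wire. The sketch below is untested against a real index and rests on my guess about the cause of the lag; chunked and index_in_chunks are hypothetical helper names, and the chunk size of 5000 is arbitrary.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def index_in_chunks(conn, docs, chunk_size=5000):
    """Index `docs` (an iterable of dicts) via repeated add_many() calls.

    `conn` is expected to be a solr.Solr connection. A single commit is
    issued at the end instead of one per chunk. Returns the doc count.
    """
    total = 0
    for chunk in chunked(docs, chunk_size):
        conn.add_many(chunk)
        total += len(chunk)
    conn.commit()
    return total
```

With the connection from the example above, this would be called as index_in_chunks(s, docs), where docs can be a generator so the full set never has to sit in memory at once.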

These modules prove you can get your Python apps up and running very easily in an HBase/Solr environment. Have fun!

Saturday, June 30, 2012

Simple model for writing cloud apps in Python