
Saturday, June 30, 2012

Simple model for writing cloud apps in Python

I've been writing a lot of apps against so-called 'big data' persistence and indexing backends recently and wanted to draw attention to a simple yet performant Python strategy.  For those who want to store data in HBase and index in Solr, I've found success with the happybase and solrpy modules.

happybase is a new Python module with a very clean API that builds on Thrift.  You can tell the creator (a really nice guy named Wouter) has strived for maximum simplicity, since the API consists of a grand total of three classes:  Connection, Table, and Batch!  To use an example from the docs:

import happybase

connection = happybase.Connection('hostname')
table = connection.table('table-name')
table.put('row-key', {'family:qual1': 'value1',
                      'family:qual2': 'value2'})


Pretty simple, huh?  'Big data' often feels very Pythonic to me, so I love keeping values in dictionaries, which obviously works great with the happybase API.  I did my writes via Batch (if you're not using Batch, you should question whether HBase is a good fit for your use case) and got good, though unquantified, performance.  If I have time, I'll go back and run some benchmarks, but it was in the ballpark of other tools I've used, like Lily, which uses Avro instead of Thrift.
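For illustration, here's roughly what my batched writes looked like - a minimal sketch using happybase's batch() context manager; the row keys, column names, and batch_size are invented:

import happybase

connection = happybase.Connection('hostname')
table = connection.table('table-name')

# rows to write - invented data, purely for illustration
rows = [('row-%d' % i, {'family:qual1': 'value-%d' % i}) for i in range(10000)]

# Batch buffers puts client-side and flushes every batch_size mutations,
# with a final flush when the with-block exits
with table.batch(batch_size=1000) as b:
    for row_key, data in rows:
        b.put(row_key, data)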

On to solrpy.  I've been bitten before when moving large amounts of data over HTTP (particularly at index time), but I find solrpy performs well.  It works like you would expect - let's steal one of their examples:

import solr

# create a connection to a solr server
s = solr.Solr('http://example.org:8083/solr')

# add a document to the index
doc = dict(
    id=1,
    title='Lucene in Action',
    author=['Erik Hatcher', 'Otis Gospodnetić'],
    )
s.add(doc, commit=True)

# do a search
response = s.select('title:lucene')
for hit in response.results:
    print hit['title']

Again, notice data is passed using a dictionary!  I'm sending lots of docs to the Solr server at a time, so the Solr.add_many() function comes in handy.  Hundreds of thousands of small/medium docs (all units relative, obviously) can be added in just a few minutes via a single persistent HTTP connection.  When I profile the add_many() call, there is a sizable lag before docs start to flow to Solr, which I take to mean the data is being aggregated locally - I wish it streamed, but oh well.
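For the record, my add_many() usage was shaped roughly like this - a sketch with invented doc fields and counts:

import solr

s = solr.Solr('http://example.org:8083/solr')

# build the full list of docs up front - fields invented for illustration
docs = [dict(id=i, title='doc %d' % i) for i in xrange(100000)]

# one update request over the persistent connection, one commit at the end
s.add_many(docs, commit=True)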

These modules prove you can get your Python apps up and running very easily in an HBase/Solr environment.  Have fun!

Sunday, March 18, 2012

OpenShift test drive

I'm looking for an all-in-one PaaS solution. Currently, my little Android app hosts its data on AWS SimpleDB (I need to swap this out with something relational) and uses GAE for authentication services. While it's been interesting to use these systems, I'd like a single solution.

OpenShift had a booth at PyCon 2012 plus they were handing out free tee shirts. Sold! The only code I'm interested in right now is Java and Python which they of course support (among others). OpenShift also comes with MySQL and MongoDB. I don't believe OpenShift is in production status so I don't know what the pricing structure will be. Whatever it is, I'm sure I'll stay firmly in the 'free' range just as with AWS and GAE. Maybe one day I'll code something that does real volume and I'll have to pay! ;-)

Getting started was a snap. Their control panel UI is still pretty new and could use some polish, but OpenShift is very straightforward, so it wasn't confusing. git does all the heavy lifting in terms of deploying code, at least for an Express app. They provide a very nice Django example, which was much appreciated and worked out of the box. I'm also interested in checking out Tornado, which is easily stood up in OpenShift. I use the requests module all the time (requests is a urllib2 replacement), and bringing it into my OpenShift app took like 10 characters of typing. I can even ssh to my virtual app, write to the filesystem, and fire up daemons at will!
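If you're curious about those 10 characters: for me it was just declaring the dependency, something like the setup.py line below. This is a sketch, assuming the stock OpenShift Python app layout with a setup.py in the repo root; the app name is hypothetical:

# setup.py in the repo root of an OpenShift Python app (assumed layout)
from setuptools import setup

setup(
    name='myapp',                   # hypothetical app name
    version='1.0',
    install_requires=['requests'],  # the dependency gets installed on push
)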

I'm leaning towards porting my Android app to run on OpenShift. If I don't, I'm sure I'll find something else to use it for (probably something for my graduate studies). Check it out.

UPDATE 5/5/12: I used OpenShift to host my semester project for a grad school class I'm taking.  It's available (for now, probably not forever) at http://tornado-ryancutter.rhcloud.com/.  The app attempts to process large amounts of Google+ content and rank attached links intelligently.

The only issue I had with OpenShift was a lack of support for Python multiprocessing.  The issue is documented here.  When I discovered the problem, I hopped on IRC and was able to immediately talk to an OpenShift rep (on a Saturday, no less).  Talk about customer service!  Hopefully it'll be fixed one day.  I was able to use Python threading to do what I needed, but I can see how this would be a showstopper for some people.
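For anyone hitting the same wall, the threading fallback looked something like this - a minimal sketch where fetch_and_rank and the URL list are hypothetical stand-ins for my actual Google+ processing:

import threading

urls = ['http://example.com/page%d' % i for i in range(10)]  # hypothetical

def fetch_and_rank(url):
    pass  # per-URL work went here

# one thread per URL instead of one process per URL
threads = [threading.Thread(target=fetch_and_rank, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()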

Overall, very impressed with OpenShift.

Friday, March 2, 2012

Running Jython apps using HBase

Jython is a great tool to have in your arsenal - I'm currently using it for prototyping, but I'm sure it could do much more. After yum-installing Jython onto a CentOS 6 machine (not recommended, as much newer versions exist), I used this great writeup to set up Jython and HBase:

$ export HBASE_HOME=/usr/bin/hbase
$ export JYTHON_HOME=/usr/bin/jython
$ export CLASSPATH=`$HBASE_HOME classpath`
$ alias pyhbase='HBASE_OPTS="-Dpython.path=$JYTHON_HOME" HBASE_CLASSPATH=/usr/share/java/jython.jar $HBASE_HOME org.python.util.jython'
$ pyhbase
Jython 2.2.1 on java1.6.0_30
>>> from org.apache.hadoop.hbase.client import HTable
>>>

Once the environment is set up, you can write an app. Let's say you have a table called 'test' with a column family named 'stuff' and column named 'value'.

import java.lang
from org.apache.hadoop.hbase import HBaseConfiguration, HTableDescriptor, HColumnDescriptor, HConstants
from org.apache.hadoop.hbase.client import HBaseAdmin, HTable, Get
from org.apache.hadoop.hbase.util import Bytes

# standard HBase connection setup
conf = HBaseConfiguration()
admin = HBaseAdmin(conf)

tablename = "test"
table = HTable(conf, tablename)

# fetch a single row by its key
row = 'some_row_key'
g = Get(Bytes.toBytes(row))
res = table.get(g)

# pull the 'stuff:value' cell out of the Result
val = res.getValue(Bytes.toBytes('stuff'), Bytes.toBytes('value'))

print Bytes.toString(val, 0, len(val))

The only weird behavior I couldn't explain is the need to call that last Bytes.toString() with the three-argument overload. I posted a question on StackOverflow but didn't get any great feedback. I'm thinking it's something in Jython, maybe(?).

Friday, June 24, 2011

Remote backup of SimpleDB domains

So Amazon Web Services is pretty good but they're still lacking some of the small stuff. I'd like to see automated usage reports, warnings about suspicious traffic surges, and better backup options. I really don't understand why they don't assign a few dudes to build these things out because I know people have been begging for them for years.

Anyway, I coded a really simple Python script that backs up my domains. In no way is this a scalable solution - if you have a couple million items in your database, this is likely not a cost-effective option. I've got less than a thousand items today and I'm adding only a few hundred per week. The script is hosted on WebFaction, which is a great hosting company - for only a few bucks per month you get access to your own very capable virtual server (no more mooching off friends or using inferior free services). It also uses boto, a very mature Python library for all things AWS. I've added it to my crontab to run a few days per week, and it sends me an email when it's done. Of course, all usernames, passwords, and email addresses have been removed. So here it is:

#! /usr/bin/env python2.7

from boto.sdb.connection import SDBConnection
from datetime import date
from smtplib import SMTP
from email.mime.text import MIMEText
import os
import settings

conn = SDBConnection('access_key','secret_key')

# backup account domain
account_filename = 'backup/account-' + date.today().isoformat() + '.xml'
f = open(account_filename, 'w')
domain = conn.get_domain("account")
domain.to_xml(f)
f.close()
account_size = os.stat(account_filename)

# backup timelog domain
timelog_filename = 'backup/timelog-' + date.today().isoformat() + '.xml'
f = open(timelog_filename, 'w')
domain = conn.get_domain("timelog")
domain.to_xml(f)
f.close()
timelog_size = os.stat(timelog_filename)

msg = MIMEText("wrote " + account_filename + " (" + str(account_size.st_size) + " bytes)\nwrote " + timelog_filename + " (" + str(timelog_size.st_size) + " bytes)")
msg['Subject'] = 'Payroll Nanny backup report for ' + date.today().isoformat()
msg['From'] = 'my_app_address'
msg['To'] = 'my_email_address'

s = SMTP()
s.connect('smtp.webfaction.com')
s.login('my_webfaction_username', 'my_webfaction_password')
s.sendmail('my_app_address', 'my_email_address', msg.as_string())
s.quit()

Sunday, May 22, 2011

A solution in search of a problem

I don't believe in algorithmic trading methods, though I'm aware that most of the shares bought and sold every day trade because a computer decided to pull the trigger.

Be that as it may, the real reason for my foray into this world was to write a server-side Python program which utilizes a CouchDB storage solution, and an Android app which visualizes the data on a canvas. Oh yeah, and lots and lots of JSON.

I don't really have a name for these programs - I call the collection of Python scripts ZapDome, and the Android app is called WebFrenzy. I've open-sourced the Python stuff but, like I told a startup guy this week, please don't judge me by the excellence of this code - I'm a Python noob. I'll post the WebFrenzy code after I clean it up a bit; it's a real mess right now. Perhaps most importantly, the project has yielded a pretty clever (in my opinion) lightweight Java CouchDB library for Android devices, built specifically for interacting with Cloudant servers, called BarcaJolt. I'll blog about that later.

Anyway, my algorithm looks for significant volume trends throughout the day and compares those intraday prices with the closing price. So I collect data at the top of the hour from 10AM to 3PM EST, then sweep up the closing price at 5PM EST. If the volume is >= 125% or <= 75% of what it SHOULD be according to its moving volume average at that time, I flag it.

For example, if a stock averages 1,000,000 shares of volume a day and by noon EST it's already racked up 900,000 shares traded, that obviously meets my definition of significant volume. Let's say at noon EST this stock's price is $2.00, and at the end of the day it closes at $3.00. What if this happens often? That is, what if in hours when the volume is "high", the stock almost always posts a gain by the close?
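In code, the flagging rule is about this simple - a sketch where the moving-average input and the example numbers are just for illustration:

def flag_volume(volume_so_far, avg_volume_at_this_hour):
    # compare observed intraday volume against the moving average of
    # volume typically seen by this hour of the day
    ratio = float(volume_so_far) / avg_volume_at_this_hour
    if ratio >= 1.25:
        return 'high'
    if ratio <= 0.75:
        return 'low'
    return None  # nothing noteworthy this hour

# e.g. 900,000 shares by noon against an expected 500,000 gets flagged
print flag_volume(900000, 500000)   # prints 'high'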

Honestly, I don't really care what the algorithm is. I'm never going to use this data - I'm a value investor! But it's a fun hack and gets me experience in things I want to learn more about (Python, CouchDB, JSON, Android apps).

So here are a couple of screen caps of my Android app showing the behavior of two stocks I've been tracking over the last week (I've only pulled a week's worth of data). Obviously you can't pull any trends out of such a small data sample. The data is on an X/Y graph (Volume/Price), so the top-right quadrant shows instances where the stock's volume was high and the stock price rose before the close. More to follow.

[screen caps of the WebFrenzy volume/price graphs went here]

Sunday, May 8, 2011

Some basic couchdbkit 'views' (get it?)

We had to go to LA this weekend and I made my wife drive so I could learn up on CouchDB views. I had to re-read the Design Document and View sections of my O'Reilly book a couple times before it really clicked. Thankfully, the book is detailed enough that I didn't need to be looking at a database through Futon to get the point.

So today I hacked together a simple view in a Python script. This returns all my documents with a .symbol of 'C'. I'm not good at JavaScript (yet), so this is about as advanced as I want to get while I focus on Python and CouchDB:

design_doc = {
    '_id': '_design/test',
    'language': 'javascript',
    'views': {
        'allcitis': {
            "map": """function(doc) { if (doc.symbol == "C") { emit(doc._id, doc); }}"""
            }
        }
    }
frenzydb.save_doc(design_doc)


No problem - it worked the first time and I saw it in my Cloudant database.  Parsing the view results through couchdbkit took a little longer to figure out.  The Getting Started page ends with:

greets = Greeting.view('greeting/all')


It took me a little while to figure out that 'greets' isn't a list - it's a couchdbkit.client.ViewResults. As it says in the comments for couchdbkit/client.py, "It return an ViewResults object on which you could iterate, list, ...". Okay, great: throw it in a list. Here's a trivial snippet to get the average 'price' using the aforementioned view:

# materialize the ViewResults into a plain list of rows
citis = list(frenzydb.view('/test/allcitis'))

# average the 'price' field across all matching docs
total = 0
for citi in citis:
    total += float(citi['value']['price'])
print round(total / len(citis), 2)

Friday, May 6, 2011

Google App Engine is pretty great (if you go all the way)

So I'm starting to bang around another project. As I describe it on my Careers 2.0 profile:
Project is still in its infancy but it will have server (GAE-hosted Python program evaluating JSON feeds of stock data with Cloudant [CouchDB] backend) and mobile (Android app to visualize/manipulate the data from Cloudant) components. There really isn't a point behind this program and I don't necessarily condone algorithmic trading - it's just a fun hack.

Or so I thought. It doesn't look like I'll be using Google App Engine, because I don't want to do it the "GAE way" (not that there's anything wrong with that). Deploying an app to GAE is super simple, but I ran into problems when I started incorporating 3rd-party libraries (couchdbkit and restkit, to be exact). These libs are too small to be part of the standard runtime environment. It's interesting to note that even Django is implemented in a slightly different way on GAE.

I'll spare the details (though they are available in my stackoverflow question), but I was having trouble importing couchdbkit into my script. It was choking on things like sockets and resources, which made me think I was running afoul of the sandbox rules:
An App Engine application cannot:
  • write to the filesystem. Applications must use the App Engine datastore for storing persistent data. Reading from the filesystem is allowed, and all application files uploaded with the application are available.
  • open a socket or access another host directly. An application can use the App Engine URL fetch service to make HTTP and HTTPS requests to other hosts on ports 80 and 443, respectively.
  • spawn a sub-process or thread. A web request to an application must be handled in a single process within a few seconds. Processes that take a very long time to respond are terminated to avoid overloading the web server.
  • make other kinds of system calls.

I put a question on the Cloudant discussion board and got not 1 but 2 answers from Cloudant techs within an hour (these dudes are good - I don't even have a paying account and they are very responsive). Anyway, they tipped me off to this telling Quora answer. It looks like the Django/Python hosting space is about to blow up with Heroku-like solutions.

Both the Cloudant tech and the Django responder (a Django co-creator) had good things to say about WebFaction. It's not a free hosting service ($9.50 per month, less with longer-term contracts). After a quick Q&A session with a sales rep, I took the plunge. An hour later, my program was working perfectly on my WebFaction account. Success! The only thing that makes me nervous is having to do any sysadmin stuff at all - getting my scripts and installations to reference the right Python version (WebFaction comes with everything from 2.4 [default] up through 3.2) is about all I care to handle.

Sunday, May 1, 2011

Python + Couchdbkit = nice

So I'm working on a CouchDB project - you can read more about what I'm doing on the mobile side here. There will be a server-side component, and I want to get my feet wet with Python. I'm currently working through the tutorial, but like most people I learn more by doing.

There are several CouchDB libraries for Python but I went with Couchdbkit. This decision was influenced by a post on the Cloudant discussion board stating Couchdbkit and Cloudant should work well together. So far, I have to agree.

The installation instructions for Couchdbkit are pretty straightforward. For the record, I'm using Python 2.6.6. I did find I needed to use restkit for HTTP resources. I set my sights low - I just wanted to get the program from Couchdbkit's Getting Started page to work. Username/password redacted to protect the innocent:

#! /usr/bin/env python

import datetime
from couchdbkit import *
from restkit import BasicAuth

class Greeting(Document):
    author = StringProperty()
    content = StringProperty()
    date = DateTimeProperty()

server = Server('https://[username].cloudant.com/', filters=[BasicAuth('[username]', '[password]')])
db = server.get_or_create_db("greeting")

Greeting.set_db(db)
greet = Greeting(
    author="Benoit",
    content="Welcome to couchdbkit world",
    date=datetime.datetime.utcnow()
)
greet.save()

There are more exercises on the Getting Started page including an introduction to views. This worked as advertised with no need to change the supplied code.