Sunday, December 30, 2012
Solr 4.0 - solid geospatial now comes standard
Where SOLR-2155 required an additional JAR, Solr 4.0 ships with a solid geospatial solution. I've been using solr.SpatialRecursivePrefixTreeFieldType with pretty much standard options.
There was a lot of weirdness in Solr 3 which is now fixed. Namely, 'bounding box' queries were actually circles; now we get true bounding rectangles. Polygons are within reach, too: if you don't mind some LGPL in your app, you can include JTS and get polygon support with Solr 4.
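To make that concrete, here's a rough sketch (not from my actual code) of a bounding-rectangle filter against an RPT field from Python with solrpy. The 'geoloc' field name and host are made up, and solrpy should pass the extra fq parameter straight through. The rectangle syntax is Intersects(minX minY maxX maxY), i.e. lon/lat order:

import solr

s = solr.Solr('http://localhost:8983/solr')

# rectangle filter: Intersects(minX minY maxX maxY) -> lon/lat order
bbox = 'geoloc:"Intersects(-60 1 -55 5)"'
response = s.select('*:*', fq=bbox)
for hit in response.results:
    print hit['id']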
I don't have metrics to quantify performance yet. Once I get more data indexed, I'll try to post some numbers. Thanks David!
Sunday, September 23, 2012
Lily cluster setup - HDFS permission issue and solution
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=lily, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
When I started the Lily server as root it seemed to work, but that wasn't an optimal solution. It turned out my HDFS directory permissions weren't set properly, since Lily contacts HDFS as whatever user you start it as. I had to set the owner/group of /lily to lily:lily from the namenode:
[user@master01 ~]$ sudo su hdfs
[sudo] password for user:
bash-4.1$ hadoop fs -chown -R lily:lily /lily
This made the HDFS folder happy, and my datanodes were able to start the Lily service without becoming root.
In some cases, you might need to create the lily directory just like you do when setting up mapred. To do so:
[user@master01 ~]$ sudo su hdfs
[sudo] password for user:
bash-4.1$ hadoop fs -mkdir /lily
Friday, August 3, 2012
AWS Free Usage Tier experience
- I'm a very small fry.
- The AWS Free Usage Tier is pretty generous. My metered usage came nowhere near the free allowances:
- 0.052 Compute-Hours (25 free)
- 0.000529 GB-Month of storage (1 GB-Month free)
- 0.001 GB data transfer in (1 GB free)
- 0.002 GB data transfer out (1 GB free)
Saturday, June 30, 2012
Simple model for writing cloud apps in Python
happybase is a new Python module with a very clean API that talks to HBase over Thrift. You can tell the creator (a really nice guy named Wouter) has striven for maximum simplicity, since the API consists of a grand total of three classes: Connection, Table, and Batch! To use an example from the docs:
import happybase

connection = happybase.Connection('hostname')
table = connection.table('table-name')
table.put('row-key', {'family:qual1': 'value1',
                      'family:qual2': 'value2'})
Pretty simple, huh? 'Big data' often feels very Pythonic to me, so I love keeping values in dictionaries, which obviously works great with the happybase API. I did my writes via Batch (if you're not using Batch, you should question whether HBase is a good fit for your use case) and got good, though unquantified, performance. If I have time I'll go back and run some benchmarks, but it was in the ballpark of other tools I've used, like Lily, which uses Avro instead of Thrift.
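For reference, a minimal sketch of what those Batch writes look like (table and row names are made up; batch_size is optional):

import happybase

connection = happybase.Connection('hostname')
table = connection.table('table-name')

# Batch buffers puts client-side; with batch_size set it flushes every N mutations
with table.batch(batch_size=1000) as b:
    for i in xrange(100000):
        b.put('row-%d' % i, {'family:qual1': 'value-%d' % i})
# leaving the 'with' block sends whatever is still buffered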
On to solrpy. I've been bitten before trying to push large amounts of data over HTTP (particularly at index time), but solrpy has been performant. It works like you would expect - let's steal one of their examples:
import solr

# create a connection to a solr server
s = solr.Solr('http://example.org:8083/solr')

# add a document to the index
doc = dict(
    id=1,
    title='Lucene in Action',
    author=['Erik Hatcher', 'Otis Gospodnetić'],
)
s.add(doc, commit=True)

# do a search
response = s.select('title:lucene')
for hit in response.results:
    print hit['title']
Again, notice data is passed using a dictionary! I'm sending lots of docs to the Solr server at a time, so the Solr.add_many() function comes in handy. Hundreds of thousands of small/medium docs (all units relative, obviously) can be added in just a few minutes over a single persistent HTTP connection. When I profile the add_many() call there is a sizable lag before docs start to flow to Solr, which I guess means the data is being aggregated locally - I wish it streamed, but oh well.
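Roughly what that looks like (the documents and chunk size here are made up; I'm just chunking so no single request body gets huge):

import solr

s = solr.Solr('http://example.org:8083/solr')
docs = [dict(id=i, title='doc %d' % i) for i in xrange(100000)]

# send in chunks, commit once at the end
chunk = 10000
for start in xrange(0, len(docs), chunk):
    s.add_many(docs[start:start + chunk])
s.commit()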
These modules prove you can get your Python apps up and running very easily in an HBase/Solr environment. Have fun!
Thursday, May 10, 2012
Using geohashing in Solr
I grabbed the v1.0.5 JAR from SOLR-2155's github repo. It's a little confusing where to drop it but I decided on /usr/local/solr/solr/lib/ based on David's advice (see bottom of thread). The instructions on how to configure schema.xml and solrconfig.xml on the github page are spot-on. Obviously you have to restart Solr for these changes to take effect.
Then feeding in data is a snap. I'm using Lily, so I created a STRING field in Lily and wired it up to point to the solr2155.solr.schema.GeoHashField Solr field. When creating Lily records, I send in "lat,lon" string values. I haven't been able to benchmark Solr (or Lily) indexing, but there's no noticeable drop-off from my previous attempts with double, tdouble, or LatLonType.
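For illustration only (names made up), the value handed to that Lily field is just the two coordinates joined with a comma:

lat, lon = 3.9, -55.0
# the Lily STRING field wired to solr2155.solr.schema.GeoHashField gets "lat,lon"
geoloc_value = '%s,%s' % (lat, lon)   # '3.9,-55.0'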
Searching performance is great. Again, no benchmarks, but everything feels snappy (great technical term, I know). Bounding box queries like this come back as quickly as text or string searches, which is really thrilling:
http://s1:8983/solr/select?q=*:*&fq=geoloc:[1,-60 TO 5,-55]&shards=s2:8983/solr,s1:8983/solr,s3:8983/solr,s4:8983/solr
One issue I've seen is coordinate precision in Solr result sets. Returned values show Java-double-style round-off (e.g., a stored 3.9 comes back as something like 3.89999233). The underlying data (3.9 in this case) is preserved; it's the returned value that's slightly off, so just be careful about what you display.
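If the round-off bugs you at display time, one hedged workaround is to round the returned coordinates yourself; this assumes the hit carries a 'geoloc' value like '3.89999233,-55.00000108':

def display_coords(hit, places=4):
    # rounding is for display only - the indexed data is untouched
    lat, lon = [round(float(x), places) for x in hit['geoloc'].split(',')]
    return '%s,%s' % (lat, lon)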
I'll try to get some benchmarks but here's a good brief with some numbers.
Monday, March 19, 2012
What to do when your free app doesn't take off
- Immediately shut down and get out of Dodge: At best this is rude and at worst you've really screwed your former users as they've lost their data forever. Not a valid option for responsible developers.
- Hang a 'going out of business' sign: Even if everything else is free, there is an opportunity cost in keeping a 'failed' idea going (discussed more in the last option below). It's perfectly reasonable for someone to want to move on. The crucial ingredients here are sufficient notice and data reclamation options.
- Build in at least one month of lead time before services get shut off. Two or three months would be more considerate.
- Stop accepting new users (unpublish for apps, no more signups for web sites).
- Notify users via email and/or web site banner.
- If you have an app, consider publishing a final version that includes a conspicuous popup every time the app is launched informing them of the impending shutdown. Drive the point home any way you can.
- In the last couple weeks, review your logs/database for people still uploading data and send them a friendly reminder.
- Give users lots of options for downloading their data in convenient formats (CSV, etc). Even if you didn't have this feature before, you're going to need to build it to shut down gracefully (see the sketch after this list).
- Remain available even after shutdown for stragglers.
- Keep your product available in perpetuity: If it's not costing you anything, you could keep it going for the long term. There are disadvantages and risks! APIs change. Users will still seek support. You could get a bill one day for a service you consume.
Additionally, there should be an expectation that someone is minding the store. Developers who go this route can never completely disconnect. Consider not accepting new users and being up front about your intent to stop iterating.
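On the data-export point above: nothing fancy is required. Here's a minimal sketch, assuming your per-user data is already in hand as a list of dicts (all names made up):

import csv

def export_user_data(rows, path):
    # dump one user's records to CSV so they can take their data with them
    if not rows:
        return
    fields = sorted(rows[0].keys())
    with open(path, 'wb') as f:    # 'wb' for the csv module on Python 2
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)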
Sunday, March 18, 2012
OpenShift test drive
UPDATE 5/5/12: I used OpenShift to host my semester project for a grad school class I'm taking. It's available (for now, probably not forever) at http://tornado-ryancutter.rhcloud.com/. The app attempts to process large amounts of Google+ content and rank attached links intelligently.
The only issue I had with OpenShift was the lack of support for Python's multiprocessing module (the issue is documented here). When I discovered the problem I hopped on IRC and was able to immediately talk to an OpenShift rep (on a Saturday, no less). Talk about customer service! Hopefully it'll be fixed one day. I was able to use Python threading to do what I needed, but I can see how this would be a showstopper for some people.
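For illustration only (this isn't the project's actual code), here's the kind of swap involved - a small thread pool built on threading and Queue standing in for multiprocessing.Pool.map:

import threading
import Queue   # stdlib 'Queue' on Python 2, which is what this app ran

def run_in_threads(func, items, num_threads=4):
    # apply func to each item with a handful of worker threads;
    # results come back unordered
    jobs = Queue.Queue()
    results = Queue.Queue()

    def worker():
        while True:
            item = jobs.get()
            if item is None:        # sentinel: this worker is done
                break
            results.put(func(item))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for item in items:
        jobs.put(item)
    for _ in threads:
        jobs.put(None)              # one sentinel per worker
    for t in threads:
        t.join()
    return [results.get() for _ in items]

# usage: run_in_threads(fetch_url, urls), where fetch_url is your own function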
Overall, very impressed with OpenShift.
Friday, March 2, 2012
Running Jython apps using HBase
$ export HBASE_HOME=/usr/bin/hbase
$ export JYTHON_HOME=/usr/bin/jython
$ export CLASSPATH=`$HBASE_HOME classpath`
$ alias pyhbase='HBASE_OPTS="-Dpython.path=$JYTHON_HOME" HBASE_CLASSPATH=/usr/share/java/jython.jar $HBASE_HOME org.python.util.jython'
$ pyhbase
Jython 2.2.1 on java1.6.0_30
>>> from org.apache.hadoop.hbase.client import HTable
>>>
Once the environment is set up, you can write an app. Let's say you have a table called 'test' with a column family named 'stuff' and column named 'value'.
import java.lang
from org.apache.hadoop.hbase import HBaseConfiguration, HTableDescriptor, HColumnDescriptor, HConstants
from org.apache.hadoop.hbase.client import HBaseAdmin, HTable, Get
from org.apache.hadoop.hbase.util import Bytes

# pick up hbase-site.xml from the classpath
conf = HBaseConfiguration()
admin = HBaseAdmin(conf)

# open the 'test' table
tablename = "test"
table = HTable(conf, tablename)

# fetch a single row and read the stuff:value cell
row = 'some_row_key'
g = Get(Bytes.toBytes(row))
res = table.get(g)
val = res.getValue(Bytes.toBytes('stuff'), Bytes.toBytes('value'))

# note the three-argument toString() overload (see below)
print Bytes.toString(val, 0, len(val))
The only weird behavior I couldn't explain is needing the three-argument overload of Bytes.toString() in that last call. I posted a question on StackOverflow but didn't get any great feedback. I'm thinking it's something in Jython, maybe(?).
Wednesday, January 25, 2012
Setting up Duplicity with GnuPG
sudo yum install duplicity
gpg --gen-key
gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0 valid: 1 signed: 0 trust: 0-, 0q, 0n, 0m, 0f, 1u
pub 2048R/12345678 2012-01-26
.....
gpg --output secret --export-secret-keys
gpg --output public --export
gpg --import /path/to/secret
gpg --import /path/to/public
gpg --list-keys
gpg: : There is no assurance this key belongs to the named user
gpg: [stdin]: sign+encrypt failed: Unusable public key
gpg --edit-key [key]
> trust
// decide how much to trust it
> save