Saturday, April 20, 2013

Using Python and paramiko to move bits around

Automating the movement of files around a stack is one of those unglamorous jobs that just has to get done.  It's certainly possible to do this in bash, but Python makes it a little easier and more fun.

paramiko is an actively maintained Python module that assists with secure connections to remote hosts.  This use case favors the SFTPClient class, which lets an app connect to a remote machine and put files on it.  paramiko is LGPL and only requires Python 2.5+ and PyCrypto 2.1+.

It's probably best to create an account on the remote host with a very limited scope.  That is, it only has permission to modify the directory which you intend to write to.  Those creds will go in the Python script.  Don't run this as root or your normal user account!

In this simple example, all files in a particular local directory will be recursively copied to a remote host.  Enjoy!

Monday, February 18, 2013

Assembling ordered Lists using HashMap and Collections.sort()

LinkedHashMap won't work if that's what you were thinking....

I was presented with a use case having to assemble an ordered List of objects out of tens/hundreds of thousands of ordered elements.  Here's the List<EventObject> I need to do something with:

  • Keys are not unique.
  • The List<EventObject> needs to become a List<SummarizedObject>.
  • 1 or more EventObjects make up a SummarizedObject.
  • The order of List<EventObject> must be matched in the resulting List<SummarizedObject>.  Meaning, the first EventObject will definitely be in the first SummarizedObject and the last EventObject will definitely be in the last SummarizedObject.
  • Typically, for every 10k EventObject values 100 unique keys and 200 unique SummarizedObjects are created (give or take 1 order of magnitude).
  • I'm not concerned about concurrency at this time.
LinkedHashMap<String, List<SummarizedObject>> wouldn't work for me even though it maintains order because a single key could have several SummarizedObjects (hence the List of SummarizedObjects) whose order could interweave with values in different keys.  Bummer.

I kept coming back to the HashMap.  Being able to quickly determine a key's existence in the Map and (if it does exist) going straight to that key's List<SummarizedObject> was ideal.  So I dropped an order field in SummarizedObject, which looked like:
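As a rough idea of what such a class might look like, here is a minimal sketch.  Only the order field comes from the description above; the key, the eventIds payload, and the method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical shape of SummarizedObject; only the order field is taken from the post.
class SummarizedObject {
    private final String key;   // non-unique key shared with the events it summarizes
    private final int order;    // sequence number, used later to restore the global order
    private final List<String> eventIds = new ArrayList<>(); // illustrative payload

    SummarizedObject(String key, int order) {
        this.key = key;
        this.order = order;
    }

    String getKey() { return key; }

    int getOrder() { return order; }

    // Record one more event as belonging to this summary.
    void addEvent(String eventId) { eventIds.add(eventId); }

    // Read-only view so callers cannot reorder the absorbed events.
    List<String> getEventIds() { return Collections.unmodifiableList(eventIds); }
}
```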


The pseudo-code for getting the EventObjects into a HashMap<String, List<SummarizedObject>> is:
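A sketch of how that grouping could look in Java follows.  The matching rule is an assumption: an event joins an existing summary for its key when it arrives within gapMillis of that summary's last event, otherwise a new summary is created and stamped with the next order value.  EventObject's fields, EventGrouper, and gapMillis are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event: a non-unique key plus a timestamp.
class EventObject {
    final String key;
    final long timestamp;
    EventObject(String key, long timestamp) { this.key = key; this.timestamp = timestamp; }
}

// Hypothetical summary: remembers when it was created relative to the others.
class SummarizedObject {
    final String key;
    final int order;
    long lastTimestamp;
    int eventCount;
    SummarizedObject(String key, int order, long firstTimestamp) {
        this.key = key;
        this.order = order;
        this.lastTimestamp = firstTimestamp;
    }
    void add(EventObject e) { lastTimestamp = e.timestamp; eventCount++; }
}

class EventGrouper {
    // Walks the events in their original order and files each one under its key.
    static Map<String, List<SummarizedObject>> group(List<EventObject> events, long gapMillis) {
        Map<String, List<SummarizedObject>> byKey = new HashMap<>();
        int nextOrder = 0;
        for (EventObject e : events) {
            List<SummarizedObject> summaries = byKey.get(e.key);
            if (summaries == null) {
                summaries = new ArrayList<>();
                byKey.put(e.key, summaries);
            }
            // Linear scan for a summary that can absorb this event (assumed rule).
            SummarizedObject match = null;
            for (SummarizedObject s : summaries) {
                if (e.timestamp - s.lastTimestamp <= gapMillis) { match = s; break; }
            }
            if (match == null) {
                match = new SummarizedObject(e.key, nextOrder++, e.timestamp);
                summaries.add(match);
            }
            match.add(e);
        }
        return byKey;
    }
}
```

The HashMap lookup is constant time, so the only per-event cost that grows is the scan through a single key's summaries.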


So my last problem is to assemble all SummarizedObjects into a single, properly ordered List.  Enter Collections.sort(). Here's real code:
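A minimal sketch of that final step, assuming each SummarizedObject carries an int order field as described earlier.  SummaryAssembler and the field names are illustrative, not the original listing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical summary reduced to the two fields this step needs.
class SummarizedObject {
    final String key;
    final int order;
    SummarizedObject(String key, int order) { this.key = key; this.order = order; }
}

class SummaryAssembler {
    // Flattens every key's list into one list, then sorts it by the order field.
    static List<SummarizedObject> assemble(Map<String, List<SummarizedObject>> byKey) {
        List<SummarizedObject> all = new ArrayList<>();
        for (List<SummarizedObject> perKey : byKey.values()) {
            all.addAll(perKey);
        }
        Collections.sort(all, new Comparator<SummarizedObject>() {
            @Override
            public int compare(SummarizedObject a, SummarizedObject b) {
                return Integer.compare(a.order, b.order);
            }
        });
        return all;
    }
}
```

HashMap iteration order is arbitrary, which is fine here because the sort on order is what restores the interleaving across keys.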


Great performance and not that much code. And it can be optimized further by finding a better way to find the proper SummarizedObject in Step 5 of the above pseudo-code. Hope this helps someone!

Saturday, January 26, 2013

Marshaling POJOs with Jersey

When a web service has multiple consumers, you obviously don't want to duplicate a lot of code.  So what if you want to exchange data with both a Java API and a web site that uses jQuery?  Jersey can help you do both with JSON.  A lot of what I'll cover is in the Jersey docs.
  • Turn JSON support on in your servlet's web.xml:
<init-param>
  <param-name>com.sun.jersey.api.json.POJOMappingFeature</param-name>
  <param-value>true</param-value>
</init-param>
  • This example also uses compression so add the gzip filters too.
  • Annotate your POJO with JAXB.  Make sure you have an empty constructor.  I like to make the member variables private and force use of public getters.  I also elect to ignore unknown properties.
@XmlRootElement(name = "foo")
@XmlAccessorType(XmlAccessType.PUBLIC_METHOD)
@JsonIgnoreProperties(ignoreUnknown = true)
public class Foo {
    private String blah;

    Foo() {}
    
    @XmlElement()
    public String getBlah() { return blah; }
}
  • Set up your web service.  Since you're returning JSON, jQuery will be happy with basically no further consideration.  Note I didn't define a setBlah() above but I hope you get the picture.  
@GET @Path("/")
@Produces({MediaType.APPLICATION_JSON})
public Foo getFoo() {
  Foo foo = new Foo();
  foo.setBlah("Hello!");
  return foo;
}
  • Connect your WebResource properly.  Note the addition of the JacksonJsonProvider class.  The gzip filter is not required but you should probably use it.

ClientConfig config = new DefaultClientConfig();
config.getClasses().add(JacksonJsonProvider.class);
Client client = Client.create(config);
client.addFilter(new GZIPContentEncodingFilter(true)); 
WebResource webResource = client.resource(hostname);
  • Now request your POJO response (note the use of gzip here):

Foo foo = webResource.path("/")
               .header("Accept-Encoding", "gzip")
               .get(new GenericType<Foo>() {});
  • You can also pass POJOs to the web service using POSTs.  This snippet assumes you annotated class Bar in a similar manner to Foo and that the "/bar" POST endpoint accepts a Bar object.

Bar bar = new Bar("stuff");

Foo foo = webResource.path("/bar").post(new GenericType<Foo>() {}, bar);

Edited on 2/10/13 to include compression in example.

Sunday, January 6, 2013

"Multiple dex files" error with Eclipse + ADT

I haven't done much Android development in the last year, but this week I wanted to update my app a bit.  So I installed Eclipse Juno (4.2), grabbed the latest Android Development Tools (r21), and the latest AWS SDK for Android (1.4.4) since my app persists data on AWS.

These are all pretty big jumps from what I used to have - Eclipse Indigo, probably a single digit ADT, and AWS SDK 0.2.X IIRC.  I've only been using my System76 laptop for grad school work of late, not "real" coding obviously.  Almost everything worked fine.  Thanks go out to these teams for honoring backward compatibility!  The one problem I had was "Unable to execute dex: Multiple dex files define X" (X being something in the AWS package).

So off to StackOverflow.  I tried all the suggestions:
  • Opening and closing Eclipse a few times, doing project cleans
  • Reinstalling ADT
  • Deleting my /bin and /gen directories, then cleaning
  • Making sure my build path was legit - several people traced it back to having /bin in the build path
The AWS JARs I imported were core, debug, and sdb (my app uses SimpleDB - don't ask).  Since my testing only consisted of moving the .apk to Dropbox and making sure it worked on my Galaxy S3, I didn't need the debug JAR.  Once I removed it, everything worked okay.

Kinda perturbed this is still a problem - many SO posts and blogs I've seen are by people who appear to know what they're talking about, so I don't think it's always a silly oversight by junior developers.  I'm thinking it's a dependency problem - maybe AWS SDK 1.4.4 wasn't developed with ADT r21?  If that's the case, Eclipse and SDK providers make it really hard to grab older versions the way you would with a POM file.  If I needed to debug in this instance, I'd be in real trouble.

Sunday, December 30, 2012

Solr 4.0 - solid geospatial now comes standard

I have a fairly large corpus of events with spatial data which are indexed in Solr.  LatLonType won't cut it for me.  I'm a huge fan of David Smiley's work with SOLR-2155, which I used to scale my app in Solr 3.  David has embraced a spatial grid implementation, and I've found his modules outperform LatLonType in my use cases - other benchmarks support this.

Where SOLR-2155 required an additional JAR, Solr 4.0 ships with a solid geospatial solution.  I've been using solr.SpatialRecursivePrefixTreeFieldType with pretty much standard options.

There was a lot of weirdness in Solr 3 which is now fixed.  Namely, bounding boxes were actually circles - now we get simple bounding rectangles.  In fact, polygons are within our reach.  If you don't mind some LGPL in your app, you can include JTS and get it with Solr 4.

I don't have metrics to quantify performance yet.  Once I get more data indexed, I'll try to post some numbers.  Thanks David!

Sunday, September 23, 2012

Lily cluster setup - HDFS permission issue and solution

Really enjoy using Lily, a framework for easily working with HBase and Solr.  I hadn't set up a Lily cluster in a while and was perplexed by these kinds of errors when starting the Lily service from my datanodes:

Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=lily, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x

Starting the Lily server as root seemed to work, but that wasn't an optimal solution.  The real problem was that my HDFS directory permissions weren't set properly, since Lily contacts HDFS as whatever user you start it as.  I had to set the user/group of /lily to lily:lily from the namenode:

[user@master01 ~]$ sudo su hdfs
[sudo] password for user:
bash-4.1$ hadoop fs -chown -R lily:lily /lily

This made the hdfs folder happy and my datanodes were able to start the lily service without becoming root.

In some cases, you might need to create the lily directory just like you do when setting up mapred.  To do so:

[user@master01 ~]$ sudo su hdfs
[sudo] password for user:
bash-4.1$ hadoop fs -mkdir /lily

Friday, August 3, 2012

AWS Free Usage Tier experience


I've had an app in the wild for about 1.5 years.  I'm in pure maintenance mode, so other than replying to user emails, I check out error reports, reviews, and download stats every couple of months.  Occasionally, I'll peek into the persistence layer to see what the raw data is looking like (shout out to sdbtool!).  But it's been some time since I looked at how many resources I'm consuming on AWS.  I built this app thinking I would always fall in the Free Tier pricing level.

My AWS usage report tells me 2 things:
  • I'm a very small fry.
  • The AWS Free Usage Tier is pretty generous.

Before getting to the AWS stats, here's my use case.  I probably have only a couple hundred active users today.  I'm using AWS for SimpleDB persistence (no graph processing here!) so all the compute time is dedicated to CRUD with the database.  SimpleDB rows are attribute-value pairs (Strings) and I average around half-a-KB of data per row.  I have about 25k rows of data in my SimpleDB instance.

My SimpleDB usage for July 2012:
  • 0.052 Compute-Hours (25 free)
  • 0.000529 GB-Month of storage (1 GB-Month free)

I was confused by the GB-Month number.  I thought that meant how much data I'm storing in my SimpleDB instance but it must refer to the amount of new data.  So that's half a MB of new data coming in which translates to around 1,600 new items ('rows' in SimpleDB parlance) in my domains (SimpleDB's word for 'tables') in July.  In fact, the total amount of disk space is represented by the TimedStorage-ByteHrs field of the detailed usage report.

My AWS usage for July 2012:
  • 0.001 GB data transfer in (1 GB free)
  • 0.002 GB data transfer out (1 GB free)

Since I serve up much more data than I ingest, the data out number is my only concern.  That's 2 MB of data leaving AWS and my app is fielding about 18K Select requests for items (SimpleDB's term for 'rows') each month.  Note the request might not be for an entire item - it could be for 1 or more attributes ('columns').

Add these numbers up and I obviously have room to scale!  In fact, I could get almost a 500x increase in activity and still fall within the free usage tier.  Since I only coded this app for fun and experience, I probably won't grow but it's fun to know where I'm at.  It's really exciting to consider all the backend options out there for developers (GAE is similarly generous).
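The headroom figure can be double-checked with a little arithmetic.  The sketch below divides each free allowance by the July 2012 usage quoted above; the smallest ratio is the metric that would run out first.  FreeTierHeadroom and its method names are made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FreeTierHeadroom {
    // How many times current usage fits inside the free allowance.
    static double headroom(double used, double free) {
        return free / used;
    }

    // Usage and allowances copied from the July 2012 figures in this post.
    static Map<String, Double> july2012() {
        Map<String, Double> ratios = new LinkedHashMap<>();
        ratios.put("Compute-Hours", headroom(0.052, 25.0));
        ratios.put("GB-Month storage", headroom(0.000529, 1.0));
        ratios.put("GB transfer in", headroom(0.001, 1.0));
        ratios.put("GB transfer out", headroom(0.002, 1.0));
        return ratios;
    }

    // The metric with the least headroom is the first one to leave the free tier.
    static String limiting(Map<String, Double> ratios) {
        String name = null;
        double smallest = Double.MAX_VALUE;
        for (Map.Entry<String, Double> entry : ratios.entrySet()) {
            if (entry.getValue() < smallest) {
                smallest = entry.getValue();
                name = entry.getKey();
            }
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, Double> ratios = july2012();
        for (Map.Entry<String, Double> entry : ratios.entrySet()) {
            System.out.printf("%s: %.0fx%n", entry.getKey(), entry.getValue());
        }
        System.out.println("Limiting metric: " + limiting(ratios));
    }
}
```

With these inputs, compute time and outbound transfer are the tightest limits, at roughly 480x and 500x respectively.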