Monday, December 9, 2013

Pushing artifacts to your local repo using Maven's Ant plugin

If you have an Ant module that you need to integrate into an existing Maven project, you'll likely face a few challenges.  Let's just pretend you don't want to mess with Ant too much.

Despite the existence of the maven-antrun-plugin, Ant and Maven don't mesh perfectly together.  One issue is getting the built artifact into your local Maven repository.

Here's one technique: once the Ant build is complete, reach into the target directory and copy the artifact into your repo while taking care to roll the path and version.
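A minimal sketch of that idea - the coordinates (com.example:my-module:1.0.0), the Ant target name, and the jar filename are placeholders, not my actual build:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-antrun-plugin</artifactId>
  <version>1.7</version>
  <executions>
    <execution>
      <id>ant-build-and-copy</id>
      <phase>install</phase>
      <goals>
        <goal>run</goal>
      </goals>
      <configuration>
        <target>
          <!-- run the legacy Ant build -->
          <ant antfile="build.xml" target="dist"/>
          <!-- copy the artifact into the local repo, rolling the path and version by hand -->
          <!-- (Maven will warn about a missing .pom the first time it resolves this) -->
          <copy file="target/my-module-1.0.0.jar"
                tofile="${settings.localRepository}/com/example/my-module/1.0.0/my-module-1.0.0.jar"/>
        </target>
      </configuration>
    </execution>
  </executions>
</plugin>
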
You might want to adjust this according to your environment.  Perhaps adding more phases and breaking up the operations would be wise.

Saturday, November 30, 2013

Publishing an app for fun, experience, and no profit

The lure of passive income is well documented, but for many developers the costs outweigh the benefits.  Once you charge money for a web or mobile app, you take on a responsibility to customers - you have to fix bugs, keep backend services highly available, stand up a customer service process, etc.

I like tinkering in mobile apps but don't want that pressure.  I created an app to scratch an itch.  It's not great but it works.  It was published a couple years ago when not many apps with comparable functionality existed and has built up a group of steady users.  I recognize many Android apps are pretty low quality, but I'd put mine one level above that category (the UI is not pretty).

So why should developers self-publish apps that don't make any money?
  • It's fun!

    This is my primary reason.  Don't you like seeing your software being used?  My users' data gets stored remotely so the more rows I see added, the more I know people are using it.  Also, I love getting emails from people who just say "thanks".
  • Great technical experience

    I fancy myself as a full-stack engineer so this has been a great way to see something through from beginning to end.  I get to touch lots of different technologies - client code, security layers, database endpoints, etc.
  • Do something different

    I don't cook up mobile apps in my day job.  At least now I can talk reasonably intelligently about Android, AWS, and Google App Engine.
  • Get the first one out of the way

    Maybe one day I'll want to build something independently for profit.  Ever get the feeling you don't know what you don't know?  I feel like this experience will help me avoid that problem, at least partially.

Monday, November 11, 2013

Temporally faceting through event data

I've been struggling to scale a use case I have for a large corpus of event data in Solr.  It's kinda faceting but not really.  facet.range, while awesome, doesn't get me there because I need specificity, not bins.  Scale precludes simple sort.  Consider the following events of people visiting locations:

docID, person, location, date
1, "John", "StoreA", 2013-01-01
2, "John", "StoreB", 2013-06-30
3, "John", "StoreA", 2013-07-11
4, "Frank", "StoreC", 2013-02-01
5, "Kim", "StoreA", 2013-08-01

The first row could read as "John visited StoreA on 1/1/13".  I want to know data about each person's most recent trip to a particular location.

So let's say we facet on location "StoreA".  Here's the HTTP request (shown instead of SolrJ code for the purpose of this blog post):

http://localhost:8983/solr/events/select?
q=location:StoreA
&rows=0
&facet=true
&facet.limit=-1
&facet.field=person
&facet.mincount=1

This executes a facet search for person which only returns facet_counts for events with location = "StoreA".  For the above corpus, the applicable results are:

{ "John": 2, "Kim": 1 }

This is similar to what you'd expect on Amazon if you search on "television": you'll see a dozen or so Brands to drill into.  But my request isn't just "Show me everyone who has been to StoreA", it's "For everyone who has been to StoreA, show me the most recent visit with the document ID".  So what I'm really asking for is:

{ "John": ["2013-07-11", 3], "Kim": ["2013-08-01", 5] }

The scale (billions of records) and composition (people are usually seen at a small subset of locations) of my data makes it optimal to figure out who has been to StoreA before attempting to find their most recent visit.  Most people haven't been to StoreA so eliminating them as quickly as possible is key.

It's been most performant to process the facet results then fetch the document with the most recent date for each hit (2 in this example).  It would look something like this:

  1. Issue the facet query - this will give you a list of the people who have been to "StoreA".
  2. Concurrently fetch the info about each of these "StoreA" visitors.  I use ExecutorService pools quite a bit; the per-person query and a rough SolrJ sketch follow:

    http://localhost:8983/solr/events/select?
    q=person:John+AND+location:StoreA
    &sort=date+desc
    &rows=1
    &fl=docID,date
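A rough sketch of Step 2 in SolrJ 4.x - the core URL and field names follow the example above, and the pool size is arbitrary:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class RecentVisitFetcher {
    public static void main(String[] args) throws Exception {
        final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/events");

        // The people returned by the facet query in Step 1
        List<String> visitors = Arrays.asList("John", "Kim");

        ExecutorService pool = Executors.newFixedThreadPool(16);
        List<Future<SolrDocument>> futures = new ArrayList<Future<SolrDocument>>();
        for (final String person : visitors) {
            futures.add(pool.submit(new Callable<SolrDocument>() {
                public SolrDocument call() throws Exception {
                    // Same query as above: this person's most recent StoreA visit
                    SolrQuery q = new SolrQuery("person:" + person + " AND location:StoreA");
                    q.setSort("date", SolrQuery.ORDER.desc);
                    q.setRows(1);
                    q.setFields("docID", "date");
                    // The facet guaranteed at least one hit, so get(0) is safe here
                    return solr.query(q).getResults().get(0);
                }
            }));
        }

        for (Future<SolrDocument> f : futures) {
            SolrDocument doc = f.get();
            System.out.println(doc.getFieldValue("docID") + " -> " + doc.getFieldValue("date"));
        }
        pool.shutdown();
    }
}
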
Even if you tune the run time of the facet query down to a couple seconds and each of the fetches in Step 2 takes 10 ms, if you need to crunch through 10,000 of these matches, the response time isn't exactly instantaneous (over 5 seconds and probably closer to 10 seconds in this example).  Plus, that's a ton of queries to send downrange.  You quickly get into a need to stage results or at least cache them.  In fact, I've considered pre-processing schemes to focus my queries on people who have been seen at particular locations.

So I continue to experiment, trying to find the optimal solution.  There has been lots of facet work in Solr of late so I hope that iteration leads to more tools at my disposal.

Saturday, August 24, 2013

Manually editing Solr's clusterstate.json on Zookeeper

There will probably come a time when you want to do something not readily covered in Solr's APIs.  Manually editing the clusterstate.json is easy to do but should be approached with caution.

I wanted to drop all shards from a single host from my Solr cluster (didn't care about losing the data) and do a little spring cleaning.  To open a command-line interface with ZK:

/path/to/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181

Then simply:

get /clusterstate.json

Place that content into a local file.  After backing up the original content, make your edits.  For instance, to drop the dead nodes, delete JSON elements with "down" states:
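The exact structure varies a bit by Solr 4.x version, but it looks roughly like this - the collection, core, and host names below are placeholders, and core_node2 is the dead replica you would delete:

"myCollection": {
  "shards": {
    "shard1": {
      "range": "80000000-7fffffff",
      "state": "active",
      "replicas": {
        "core_node1": {
          "state": "active",
          "core": "myCollection_shard1_replica1",
          "node_name": "host1:8983_solr",
          "base_url": "http://host1:8983/solr",
          "leader": "true"
        },
        "core_node2": {
          "state": "down",
          "core": "myCollection_shard1_replica2",
          "node_name": "deadhost:8983_solr",
          "base_url": "http://deadhost:8983/solr"
        }
      }
    }
  }
}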


To upload your new clusterstate.json (no need to halt services):

/path/to/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181 set /clusterstate.json "`cat /local/path/to/clusterstate.json`"

Tuesday, July 9, 2013

Mismatched Solr 4.x versions with SolrCloud

I wanted to run a Solr 4.0 client against a more modern Solr server (like Solr 4.3.1 - shard splitting!).  While we should match client and server versions, it's not unheard of to float a little bit.  4.0 to 4.3.1 is more than a little bit, but I wanted to get Lily to work against 4.3.1 - not so much.

You'll likely get an error like this in the Solr ClusterState code:

org.apache.solr.common.cloud.ZkStateReader: Updating cluster state from ZooKeeper... 
Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map 

So something changed in how the cluster state is recorded in ZK between Solr versions.  I encountered this again by accident when I forgot to update the Solrj version in my app's POM (again, trying to run a 4.0 client against a 4.3+ server).

This is probably only an issue with SolrCloud so remaining Solr classic users likely aren't affected.  But if you're not using SolrCloud yet, you should take a look.

Sunday, May 5, 2013

My Weekend Project: trellminder (Trello email notifications)

Love trello!  My wife and I use it to coordinate what needs to be done around the house.  And I agree with the folks at trello that email notifications (and email in general) are lousy.  I don't want to see each time a card changes - the only thing I want is to be alerted when a task is about to be due or is past due.  Mainly so I don't get in trouble at home ;-)  So I thought it would be a fun weekend project.

I use CentOS at work so I'm most comfortable with Red Hat and I've played with OpenShift, their PaaS option, before.  This feels like a cron job, which OpenShift supports.  Trello has a nice API so this wasn't too hard to code in Python.  I didn't want to make my Trello board public of course but you can generate a token which allows read-only access to a board, once approved.

Once you get an account on OpenShift, create a python-2.6 app and add the cron cartridge:

rhc app create trellminder python-2.6
rhc cartridge add cron-1.4 -a trellminder

I use requests whenever possible - urllib2 should be avoided at all costs.  To pull requests into the OpenShift app, add "install_requires=['requests']" to setup.py in your root dir.

There was a very helpful question on the OpenShift forum which clued me in on how this script should run.  Mainly, I should use a shell script to activate the Python 2.6 virtualenv and then call the python script.  The use of a 'jobs.deny' file would force cron to call the .sh and ignore the .py.
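The due-date check itself is short.  Here's a stripped-down sketch (not the actual trellminder code - the board id, key, and token are placeholders, and it prints instead of emailing):

# Hypothetical sketch of the due-date check - not the real trellminder source.
import datetime
import requests

BOARD_ID = "your-board-id"
KEY = "your-trello-api-key"
TOKEN = "your-read-only-token"

def cards_needing_attention(hours_ahead=24):
    """Return (name, due) for cards due within hours_ahead or already past due."""
    url = "https://api.trello.com/1/boards/%s/cards" % BOARD_ID
    cards = requests.get(url, params={"key": KEY, "token": TOKEN}).json()
    soon = datetime.datetime.utcnow() + datetime.timedelta(hours=hours_ahead)
    flagged = []
    for card in cards:
        if card.get("due"):  # "due" is an ISO 8601 string or null
            due = datetime.datetime.strptime(card["due"], "%Y-%m-%dT%H:%M:%S.%fZ")
            if due <= soon:
                flagged.append((card["name"], due))
    return flagged

if __name__ == "__main__":
    for name, due in cards_needing_attention():
        print "%s is due %s" % (name, due)  # python-2.6 cartridge, so the print statement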

The actual source code is posted on my github repo.

Wednesday, May 1, 2013

Getting replicated Ehcache and iptables to play nice

Struggled with this a bit and thought others might find this useful.  If you're using RMI Replicated caching with Ehcache, you need to put a little thought into port security/strategy.  The sample ehcache.xml includes:
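From memory it's roughly the following - the multicast group address, port, and TTL may differ in your copy of the sample:

<cacheManagerPeerProviderFactory
    class="net.sf.ehcache.distribution.RMICacheManagerPeerProviderFactory"
    properties="peerDiscovery=automatic, multicastGroupAddress=230.0.0.1,
                multicastGroupPort=4446, timeToLive=1"/>

<cacheManagerPeerListenerFactory
    class="net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory"
    properties="port=40001, remoteObjectPort=40002"/>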


Using this config means you're going to have to poke holes in iptables for ports 40001 and 40002.  All that is pretty simple - the gotcha is if you're using automatic peer discovery, which needs multicast to work.  The docs call this out but it took me a while to realize I had to specifically allow it in iptables, as it is likely prohibited by default in most environments.

IBM has a nice post about how to do this.  So in this example (replicated Ehcache with automatic peer discovery), you'll need this in iptables on each host you expect to participate:

-A INPUT -m state --state NEW -m tcp -p tcp --dport 40001 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 40002 -j ACCEPT
-A INPUT -m pkttype --pkt-type multicast -j ACCEPT

Saturday, April 20, 2013

Using Python and paramiko to move bits around

Automating the movement of files around a stack is one of those unglamorous jobs that just has to get done.  It's certainly possible to do this in bash but Python makes it a little easier and more fun.

paramiko is an actively maintained Python module that assists with secure connections to remote hosts.  This use case favors the SFTPClient class, which easily allows an app to connect to and put files on a remote machine.  paramiko is LGPL and only requires Python 2.5+ and PyCrypto 2.1+.

It's probably best to create an account on the remote host with a very limited scope.  That is, it only has permission to modify the directory which you intend to write to.  Those creds will go in the Python script.  Don't run this as root or your normal user account!

In this simple example, all files in a particular local directory will be recursively copied to a remote host.  Enjoy!
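A bare-bones sketch of that (hostname, account, and directory names are placeholders, and error handling is minimal):

# Recursively push everything under LOCAL_DIR to REMOTE_DIR on the remote host.
import os
import paramiko

HOST = "remote.example.com"
USER = "filemover"          # the limited-scope account mentioned above
PASSWORD = "secret"
LOCAL_DIR = "/data/outbound"
REMOTE_DIR = "/incoming"

transport = paramiko.Transport((HOST, 22))
transport.connect(username=USER, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)

try:
    for root, dirs, files in os.walk(LOCAL_DIR):
        rel = root[len(LOCAL_DIR):].lstrip("/")
        remote_root = os.path.join(REMOTE_DIR, rel) if rel else REMOTE_DIR
        for d in dirs:
            try:
                sftp.mkdir(os.path.join(remote_root, d))
            except IOError:
                pass  # directory probably exists already
        for name in files:
            sftp.put(os.path.join(root, name), os.path.join(remote_root, name))
finally:
    sftp.close()
    transport.close()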

Monday, February 18, 2013

Assembling ordered Lists using HashMap and Collections.sort()

LinkedHashMap won't work if that's what you were thinking....

I was presented with a use case that required assembling an ordered List of objects out of tens/hundreds of thousands of ordered elements.  Here's what I need to do with the List<EventObject>:

  • Keys are not unique.
  • The List<EventObject> needs to become a List<SummarizedObject>.
  • 1 or more EventObjects make up a SummarizedObject.
  • The order of List<EventObject> must be matched in the resulting List<SummarizedObject>.  Meaning, the first EventObject will definitely be in the first SummarizedObject and the last EventObject will definitely be in the last SummarizedObject.
  • Typically, for every 10k EventObject values 100 unique keys and 200 unique SummarizedObjects are created (give or take 1 order of magnitude).
  • I'm not concerned about concurrency at this time.
LinkedHashMap<String, List<SummarizedObject>> wouldn't work for me even though it maintains order because a single key could have several SummarizedObjects (hence the List of SummarizedObjects) whose order could interweave with values in different keys.  Bummer.

I kept coming back to the HashMap.  Being able to quickly determine a key's existence in the Map and (if it does exist) going straight to that key's List<SummarizedObject> was ideal.  So I dropped an order field in SummarizedObject, which looked like:
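Trimmed down, something like this (the actual summary fields aren't important here):

public class SummarizedObject {
    // position relative to every other SummarizedObject, assigned from a
    // running counter as the EventObjects are processed in order
    private int order;

    public int getOrder() { return order; }
    public void setOrder(int order) { this.order = order; }

    // ... the actual summarized fields go here ...
}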


The pseudo-code for getting the EventObjects into a HashMap<String, List<SummarizedObject>> is:
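(Roughly, with the counter/stamping details being my reading of the approach:)

  1. Create the HashMap<String, List<SummarizedObject>> and an order counter starting at 0.
  2. Iterate the List<EventObject> in its original order.
  3. For each EventObject, look up its key in the Map.
  4. If the key is absent, put a new List containing a new SummarizedObject stamped with the current counter value, then increment the counter.
  5. If the key is present, scan that key's List<SummarizedObject> to find the SummarizedObject this event belongs to; if none matches, append a new one (stamp and increment the counter again).
  6. Fold the EventObject's data into the chosen SummarizedObject.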


So my last problem is to assemble all SummarizedObjects into a single, properly ordered List.  Enter Collections.sort(). Here's real code:
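(A sketch of that step, assuming the getOrder() accessor from the class above and the usual java.util imports.)

// Flatten every key's List<SummarizedObject> into a single list...
List<SummarizedObject> ordered = new ArrayList<SummarizedObject>();
for (List<SummarizedObject> perKey : map.values()) {
    ordered.addAll(perKey);
}

// ...then sort by the order field stamped during assembly.
Collections.sort(ordered, new Comparator<SummarizedObject>() {
    public int compare(SummarizedObject a, SummarizedObject b) {
        return a.getOrder() - b.getOrder();
    }
});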


Great performance and not that much code. And it can be optimized further by finding a better way to find the proper SummarizedObject in Step 5 of the above pseudo-code. Hope this helps someone!

Saturday, January 26, 2013

Marshaling POJOs with Jersey

When a web service has multiple consumers, you obviously don't want to duplicate a lot of code.  So what if you want to push data to/from both a Java API and a jQuery-based web page?  Jersey can help you do both with JSON.  A lot of what I'll cover is in the Jersey docs.
  • Turn JSON support on in your servlet's web.xml:
<init-param>
  <param-name>com.sun.jersey.api.json.POJOMappingFeature</param-name>
  <param-value>true</param-value>
</init-param>
  • This example also uses compression so add the gzip filters too.
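For Jersey 1.x those are init-params on the same servlet, something like this (double-check the filter class names against your Jersey version):

<init-param>
  <param-name>com.sun.jersey.spi.container.ContainerRequestFilters</param-name>
  <param-value>com.sun.jersey.api.container.filter.GZIPContentEncodingFilter</param-value>
</init-param>
<init-param>
  <param-name>com.sun.jersey.spi.container.ContainerResponseFilters</param-name>
  <param-value>com.sun.jersey.api.container.filter.GZIPContentEncodingFilter</param-value>
</init-param>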
  • Annotate your POJO with JAXB.  Make sure you have an empty constructor.  I like to make the member variables private and force use of public getters.  I also elect to ignore unknown properties.
@XmlRootElement(name = "foo")
@XmlAccessorType(XmlAccessType.PUBLIC_MEMBER)
@JsonIgnoreProperties(ignoreUnknown = true)
public class Foo {
    private String blah;

    Foo() {}
    
    @XmlElement()
    public String getBlah() { return blah; }
}
  • Set up your web service.  Since you're returning JSON, jQuery will be happy with basically no further consideration.  Note I didn't define a setBlah() above but I hope you get the picture.  
@GET @Path("/")
@Produces({MediaType.APPLICATION_JSON})
public Foo getFoo() {
  Foo foo = new Foo();
  foo.setBlah("Hello!");
  return foo;
}
  • Connect your WebResource properly.  Note the addition of the JacksonJsonProvider class.  The gzip filter is not required but you should probably use it.

ClientConfig config = new DefaultClientConfig();
config.getClasses().add(JacksonJsonProvider.class);
Client client = Client.create(config);
client.addFilter(new GZIPContentEncodingFilter(true)); 
WebResource webResource = client.resource(hostname);
  • Now request your POJO response (note the use of gzip here):

Foo foo = webResource.path("/")
               .header("Accept-Encoding", "gzip")
               .get(new GenericType<Foo>() {});
  • You can also pass POJOs to the web service using POSTs.  This snippet assumes you annotated class Bar in a similar manner to Foo and that the "/bar" POST endpoint accepts a Bar object.

Bar bar = new Bar("stuff");

Foo foo = webResource.path("/bar").post(new GenericType<Foo>() {}, bar);

Edited on 2/10/13 to include compression in example.

Sunday, January 6, 2013

"Multiple dex files" error with Eclipse + ADT

I haven't done much Android development in the last year but this week wanted to update my app a bit.  So I installed Eclipse Juno (4.2), grabbed the latest Android Developer Tools (r21), and the latest AWS SDK for Android (1.4.4) since my app persists data on AWS.

These are all pretty big jumps from what I used to have - Eclipse Indigo, probably a single digit ADT, and AWS SDK 0.2.X IIRC.  I've only been using my System76 laptop for grad school work of late, not "real" coding obviously.  Almost everything worked fine.  Thanks go out to these teams for honoring backward compatibility!  The one problem I had was "Unable to execute dex: Multiple dex files define X" (X being something in the AWS package).

So off to StackOverflow.  I tried all the suggestions:
  • Opening/closing Eclipse a few times, doing project cleans
  • Reinstalling ADT
  • Deleting my /bin and /gen directories, then cleaning
  • Making sure my build path was legit - several people traced it back to having /bin in the build path
The AWS JARs I imported were core, debug, and sdb (my app uses SimpleDB - don't ask).  Since my testing only consisted of moving the .apk to dropbox and making sure it worked on my Galaxy S3, I didn't need the debug JAR.  Once I removed it, everything worked okay.

Kinda perturbed this is still a problem - many SO posts and blogs I've seen are by people who appear to know what they're talking about so I don't think it's always a silly oversight by junior developers.  I'm thinking it's a dependency problem - maybe AWS SDK 1.4.4 wasn't developed with ADT r21?  If that was the case, Eclipse and SDK providers make it really hard to grab older versions like you would with a POM file.  If I needed to debug in this instance, I'd be in real trouble.