Code Gouge: 2014

Sunday, August 24, 2014

CU-Boulder's CAETE program - my experience

When I left the military in 2011, I intended to make use of my GI Bill benefits. They are generous enough to pay for a graduate degree and I felt like I really needed that to make up for lost time spent outside the software development industry.

I wasn't keen on having to physically attend a classroom, having unsatisfactorily tried that before. So even though there are a few options in San Diego, an Internet-based program was my preference. With a job and a family, time efficiency is critical.

University of Colorado Boulder has a solid engineering program with a department called Center for Advanced Engineering and Technology Education (CAETE). I believe it's being re-branded as Engineering Anywhere - can't tell what's going on with the name.

In any event, I wanted to pursue the Master of Engineering, Computer Science degree. You're given 6 years to complete 30 credits (basically 10 courses). I've gone with the coursework-only plan, which just requires some breadth in my classes.

I've taken 1 class per Fall/Spring semester for the last 3 years. I'm about to start my 4th year with a plan to take 1 class this Fall, 2 in the Spring, and 1 in the Summer of 2015 to complete the program.

Overall, I've been extremely satisfied. The instructors have been great and the course content has been interesting and relevant to me.

Lectures conducted during the day in Boulder are almost always posted that night in time for me to watch them. My strategy is to watch the lectures and do any required reading during the weekdays and leave the weekends open to homework, project work, and studying for tests. I would guess I've been spending approximately 12 hours a week doing schoolwork.

The infrastructure needed to stream the lectures has steadily improved since I started the program. When I had DSL with 1 Mbps download speeds, things didn't always work so well but it's better now with 10 Mbps. It's usually a split screen with the instructor on one side and slides on the other. Sometimes it gets difficult to see what the instructor is writing on a classroom whiteboard but I think I've had only 1 course with the problem more than a couple times. They usually stick with writing on the slide deck which comes through fine.

Tests are performed with a proctor and you're given a time window to get it done. Never had a problem with this. I recall having to travel for work a couple years ago and needing to move my test around by a day or so and it wasn't an issue.

I get the most out of hands-on experience so the projects have been the most interesting to me. Sometimes the project gets demo'd to the instructor or TA via Skype or Hangouts which gives us a chance to talk about it. Everything has been in Java or Python with the exception of the networking class which was all in C.

So I'd recommend this program to anyone thinking about getting their CS graduate degree from an Internet-based program. The coursework has been appropriate and engaging. I definitely think it's helped me improve my software development knowledge. Since the GI Bill is paying my way I don't feel like I know enough about the financial costs to comment on the tuition level.

Here's a list of the classes I've already taken with a couple notes:

Fall 2011 - CSCI 5448: Object Oriented Design and Analysis

The perfect class to knock the rust off my Java skills

Spring 2012 - CSCI 5828: Foundations of Software Engineering

Spent a lot of time on concurrency which is a topic I enjoy

Fall 2012 - CSCI 5608: Software Program Management
Spring 2013 - CSCI 5817: Database Systems
Fall 2013 - CSCI 5832: Natural Language Processing

Favorite course so far - very interesting topic I didn't know much about

Spring 2014 - CSCI 5273: Network Systems

This class required the most effort mainly due to my novice C knowledge. But it was rewarding to complete the projects.

Saturday, June 7, 2014

A SolrCloud backup scheme

SolrCloud needs a real backup mechanism - there's an open ticket for this but until then users will have to make do with ReplicationHandler. The HTTP API is really simple:

http://master_host:port/solr/replication?command=backup&location=/path/to/backup/dir

It takes a snapshot of the whole index at that path. This is obviously targeted for legacy Solr, not SolrCloud. So let's shoehorn this into SolrCloud by practicing with the following configuration in which a single collection has 4 shards spread over 4 hosts and each shard has a replica (so 2 total copies of each shard):

host1: mycollection_shard1_replica1, mycollection_shard4_replica2
host2: mycollection_shard2_replica1, mycollection_shard3_replica2
host3: mycollection_shard3_replica1, mycollection_shard2_replica2
host4: mycollection_shard4_replica1, mycollection_shard1_replica2

If each host is running a single Solr instance at port 8983, both replicas on that host are being served from the same Solr process. Each replication command therefore needs to declare which core it wants backed up. More on that below.

4 calls to ReplicationHandler's HTTP API are required to snapshot all of mycollection. It doesn't matter which replica gets hit but perhaps it's a good idea to distribute the work and to stagger the commands so the cloud isn't too bogged down. That is, instead of pummeling host1 and host2, perhaps each host should have to back up 1 shard. So 4 GETs:

http://host1:8983/solr/mycollection_shard1_replica1/replication?command=backup&location=/backups/060714/shard1

http://host2:8983/solr/mycollection_shard2_replica1/replication?command=backup&location=/backups/060714/shard2

http://host3:8983/solr/mycollection_shard3_replica1/replication?command=backup&location=/backups/060714/shard3

http://host4:8983/solr/mycollection_shard4_replica1/replication?command=backup&location=/backups/060714/shard4

Note the location path includes a date attribute. The HTTP call to /replication will return immediately but it might take several seconds or many minutes for the replication action to complete. You can write some fancy daemon to figure out when it actually completes but why not write to some shared slab of disk and forget about it?
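As a rough illustration, the four GETs above could be generated and fired from a small Python script. This is only a sketch: the host names, collection name, port and /backups root are the example values from this post, and the function names, the replica1 choice and the pause between calls are my own assumptions.

```python
import time
from datetime import date
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed layout, copied from the example above: one shard backed up per host.
SHARD_HOSTS = {
    "shard1": "host1",
    "shard2": "host2",
    "shard3": "host3",
    "shard4": "host4",
}


def build_backup_url(host, shard, day, collection="mycollection",
                     port=8983, backup_root="/backups"):
    """Return the ReplicationHandler backup URL for one shard's replica1 core."""
    core = f"{collection}_{shard}_replica1"
    # MMDDYY date directory, e.g. 060714 for June 7, 2014
    location = f"{backup_root}/{day.strftime('%m%d%y')}/{shard}"
    query = urlencode({"command": "backup", "location": location}, safe="/")
    return f"http://{host}:{port}/solr/{core}/replication?{query}"


def trigger_backups(day, pause_seconds=0):
    """Send one GET per shard; each call returns before the snapshot finishes."""
    for shard, host in sorted(SHARD_HOSTS.items()):
        with urlopen(build_backup_url(host, shard, day), timeout=30) as resp:
            resp.read()
        time.sleep(pause_seconds)  # stagger the requests across the cloud
```

Calling trigger_backups(date.today(), pause_seconds=60) from a daily cron job would kick off all four snapshots about a minute apart.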

The data files will be uncompressed so you'll probably want to compress them at some point. Rotation will be important because this isn't rsync - ReplicationHandler will write the whole thing each time. In case you have a compatible configuration, note there is a "numberToKeep" param available.
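Compression and rotation could be handled outside Solr with something like the following. It's a minimal sketch that assumes the dated directories live under one backup root and that a snapshot has finished writing before it gets compressed; the function names and the keep count are hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def compress_backup(backup_dir):
    """Pack one dated backup directory into <name>.tar.gz and remove the original."""
    backup_dir = Path(backup_dir)
    archive = backup_dir.parent / (backup_dir.name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(backup_dir, arcname=backup_dir.name)
    shutil.rmtree(backup_dir)
    return archive


def rotate_archives(backup_root, keep=7):
    """Delete all but the `keep` most recently modified archives."""
    archives = sorted(Path(backup_root).glob("*.tar.gz"),
                      key=lambda p: p.stat().st_mtime, reverse=True)
    removed = archives[keep:]
    for old in removed:
        old.unlink()
    return removed
```

For example, compress_backup("/backups/060714") followed by rotate_archives("/backups", keep=7) would leave a week of compressed snapshots on disk.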

Ultimately, your /backups/060714/ dir is left with 4 directories, each with the index files for a particular shard. Recovery is really easy - just drop those files into the appropriate data/index/ directory.

Saturday, May 17, 2014

Metering tweets (a weekend project)

This project will probably take me more than a weekend but Twitter's new mute feature got me thinking - I don't need to mute people I follow but I would like to eliminate a lot of their fluff. I can recall at least 10 people I had to unfollow because even though I thought they sent out good stuff, for every good tweet there were 5 worthless ones.

The idea is this: I'd like only a user's best x tweets every day to show up in my feed. Let's define quality by quantifying what happens with a tweet (replies, retweets, favorites, etc). Lots of hand waving here obviously and this part isn't fully baked. Let's ignore this problem for now :-)
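To make the idea concrete, here is one way the "best x tweets" selection might look. The weights are pure guesses and the Tweet class is a stand-in for whatever the Twitter API returns, so treat this as a sketch of the ranking step and not a settled quality metric.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    """Stand-in for an API tweet object; only the fields used for scoring."""
    text: str
    replies: int = 0
    retweets: int = 0
    favorites: int = 0


# Hypothetical weights: the quality metric is still undefined at this point.
WEIGHTS = {"replies": 2.0, "retweets": 1.5, "favorites": 1.0}


def score(tweet):
    """Weighted sum of the engagement counts for one tweet."""
    return (WEIGHTS["replies"] * tweet.replies
            + WEIGHTS["retweets"] * tweet.retweets
            + WEIGHTS["favorites"] * tweet.favorites)


def best_tweets(tweets, x=3):
    """Return the x highest-scoring tweets, best first."""
    return sorted(tweets, key=score, reverse=True)[:x]
```

A daily job would collect one user's tweets for the day, call best_tweets on them, and pass the result along to be re-sent.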

So every day an app combs through a user's tweets and picks out the best few tweets. How do I get those tweets into my feed? I'm sure as heck not going to build a new client, though that might be the smoothest implementation. I created a "shadow account" which I can send tweets from and if I follow it, content will show up in my timeline without having to build a separate client.

Everything I've described is fairly easy. Adding new users for the app to monitor, adjusting when the app evaluates tweets, how many to select, etc all require a web or mobile app. I'm not too keen on building that part. I'm hoping I'll know if this is a good idea before I get to this point - the app can be hard coded for now.

I'm hosting this on OpenShift and writing it in Python. I've loaded a cron cartridge and envision adding a script to the daily bin. Twitter's API is pretty serious, especially auth, so I'm taking time to learn that. I found a nice little Python library that abstracts the application-only auth mechanism so I'll roll with that for now (Edit: switched to tweepy). My repo is public so feel free to take a look!

Monday, March 17, 2014

"Illegal instant due to time zone offset transition" with Joda-Time

Thanks to Daylight Saving Time, you might run into this error while using the excellent Joda-Time library:

java.lang.IllegalArgumentException: Cannot parse "2014-03-09T02:01:47Z": Illegal instant due to time zone offset transition (America/Los_Angeles)

I'm getting this pulling UTC timestamps out of Solr. My app's input and output will always be UTC. The offending code:

So Joda is trying to turn this into a time that never existed in Los Angeles (we sprang forward from 01:59:59 to 03:00:00).

Most answers on the web related to this error correctly indicate you should probably use LocalDateTime. But what if you want to remain UTC? Just tell the formatter to not revert to the local time zone. This makes the rest of the app safer as well because it's not handling time values which were translated from UTC to local.

Monday, January 20, 2014

Modifying the Solr QueryParser

If you're doing development in Solr trunk and want to adjust the QueryParser, take a look at the JavaCC grammar file at lucene/solr/core/src/java/org/apache/solr/parser/QueryParser.jj. This isn't a tutorial about JavaCC - there are plenty of those out there.

Once your changes are complete, you'll need to generate the underlying classes again. Running ant from lucene/ or lucene/solr/ doesn't accomplish this. Instead, run 'ant javacc' from lucene/solr/core/.

That's it. Continue on with your normal build patterns - I often run 'ant example' from lucene/solr/ when debugging locally. It might make sense for parent builds to always run the javacc target in lucene/solr/core/ but I assume things are set up this way for a reason.

Saturday, January 4, 2014

Setting up a Solr dev environment

While I'm waiting for my next class to begin, I wanted to contribute some Solr code. I take so much from that project so it's only fair...

I use IntelliJ, SVN, and Maven for my non-Android development. Lucene/Solr has an SVN repo and it looks like some people use IntelliJ. The project seems to favor Ant over Maven, but looking through the Ant commands, there are hooks into Maven. Haven't had a chance to explore that much yet.

I run Mountain Lion so this is Mac OS X specific.

Good instructions exist on the Solr site:

http://wiki.apache.org/solr/HowToContribute

http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ

Checked out source:

svn co http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x branch_4x

Edit: Most development should occur in trunk. It'll get backported to the 4x branch as required.
Ran "ant idea" to generate the IntelliJ artifacts from branch_4x/ but got an error:

"branch_4x/lucene/common-build.xml:411: Ivy is not available"

No worries, downloaded Apache Ivy and put the jar where the docs said to:

sudo cp ~/Downloads/apache-ivy-2.3.0/ivy-2.3.0.jar /usr/share/ant/lib/
"ant idea" works now - took almost 8 mins probably due to my slow connection.
In IntelliJ 12 I ran "Open Project" on branch_4x/. Included at the end of "ant idea" are instructions to set up the project. I elected to create branch_4x/lucene/build.properties and included:

idea.jdk = project-jdk-name="1.6" project-jdk-type="JavaSDK"

Though it looks like we should be running Java 7? We'll see.

Edit: trunk is Java 7, branch_4x is Java 6.
So here's where it got a little tricky and it's mostly my fault for not reading carefully. Ran "ant example" from branch_4x/solr/.
Set up remote debugging in IntelliJ by creating a remote configuration. I took the docs' recommendation of localhost:5900. I then took the command line string and, from branch_4x/solr/example, ran:

java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5900 -jar start.jar

At first I resisted using example/start.jar because I thought it would run the stock example instead of my own build. That wasn't the case, duh.
From IntelliJ, debug the remote configuration that was just made. Browsed to localhost:8983 in a browser and it looked good.

Edit: You'll need to install Python3 as well to get "ant precommit" to complete successfully.