Code Gouge: November 2013

The lure of passive income is well documented, but for many developers the costs outweigh the benefits. Once you charge money for a web or mobile app, you take on a responsibility to customers: you have to fix bugs, keep backend services highly available, stand up a customer service process, and so on.

I like tinkering with mobile apps but don't want that pressure. I created an app to scratch an itch. It's not great but it works. It was published a couple of years ago, when not many apps with comparable functionality existed, and has built up a group of steady users. I recognize many Android apps are undesirable but put mine one level above that category (the UI is not pretty).

So why should developers self-publish apps that don't make any money?

It's fun!

This is my primary reason. Don't you like seeing your software being used? My users' data gets stored remotely so the more rows I see added, the more I know people are using it. Also, I love getting emails from people who just say "thanks".

Great technical experience

I fancy myself as a full-stack engineer so this has been a great way to see something through from beginning to end. I get to touch lots of different technologies - client code, security layers, database endpoints, etc.

Do something different

I don't cook up mobile apps in my day job. At least now I can talk reasonably intelligently about Android, AWS, and Google App Engine.

Get the first one out of the way

Maybe one day I'll want to build something independently for profit. Ever get the feeling you don't know what you don't know? I feel like this experience will help me avoid that problem, at least partially.

I've been struggling to scale a use case I have for a large corpus of event data in Solr. It's kinda faceting but not really. facet.range, while awesome, doesn't get me there because I need specificity, not bins. Scale precludes simple sort. Consider the following events of people visiting locations:

docID, person, location, date
1, "John", "StoreA", 2013-01-01
2, "John", "StoreB", 2013-06-30
3, "John", "StoreA", 2013-07-11
4, "Frank", "StoreC", 2013-02-01
5, "Kim", "StoreA", 2013-08-01

The first row could read as "John visited StoreA on 1/1/13". I want to know data about each person's most recent trip to a particular location.

So let's say we facet on location "StoreA". Here's the HTTP request (shown instead of SolrJ code for the purpose of this blog post):

http://localhost:8983/solr/events/select?
q=location:StoreA
&rows=0
&facet=true
&facet.limit=-1
&facet.field=person
&facet.mincount=1

This runs a facet query on person and returns facet counts only for events where location = "StoreA". For the above corpus, the applicable results are:

{ "John": 2, "Kim": 1 }

This is similar to what you'd expect on Amazon if you search on "television": you'll see a dozen or so Brands on which to drill into. But my request isn't just "Show me everyone who has been to StoreA", it's "For everyone who has been to StoreA, show me the most recent visit with the document ID". So what I'm really asking for is:

{ "John": ["2013-07-11", 3], "Kim": ["2013-08-01", 5] }
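To pin down the semantics, here is a minimal in-memory sketch in Python of the result I'm after, using the five sample events from the table above. It is an illustration of the desired output only, not a Solr query; the names `EVENTS` and `latest_visits` are made up for this example, and a plain loop like this would not survive billions of records.

```python
from datetime import date

# Sample events copied from the table above: (docID, person, location, date).
EVENTS = [
    (1, "John", "StoreA", date(2013, 1, 1)),
    (2, "John", "StoreB", date(2013, 6, 30)),
    (3, "John", "StoreA", date(2013, 7, 11)),
    (4, "Frank", "StoreC", date(2013, 2, 1)),
    (5, "Kim", "StoreA", date(2013, 8, 1)),
]


def latest_visits(events, location):
    """Map each person seen at `location` to [ISO date, docID] of their latest visit."""
    latest = {}
    for doc_id, person, loc, visited in events:
        if loc != location:
            continue  # people who never visited this location drop out here
        best = latest.get(person)
        if best is None or visited > best[0]:
            latest[person] = (visited, doc_id)
    return {person: [d.isoformat(), doc_id] for person, (d, doc_id) in latest.items()}


print(latest_visits(EVENTS, "StoreA"))
```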

The scale (billions of records) and composition (people are usually seen at a small subset of locations) of my data make it optimal to figure out who has been to StoreA before attempting to find their most recent visit. Most people haven't been to StoreA so eliminating them as quickly as possible is key.
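That first pass, finding who has been to StoreA, could be sketched in Python roughly like this. It rebuilds the facet request from earlier and flattens the facet section of a JSON response into a dictionary. A few assumptions: `wt=json` is added to ask for JSON, the response uses Solr's default flat `[name, count, name, count, ...]` layout for facet fields, and `sample_response` is hand-written from the sample corpus rather than captured from a real server. The helper names are made up for this example.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/events/select"


def facet_url(location):
    """Build the facet request: who has been seen at `location`?"""
    params = [
        ("q", f"location:{location}"),
        ("rows", 0),               # only the facet counts are needed
        ("facet", "true"),
        ("facet.limit", -1),       # no cap on the number of people returned
        ("facet.field", "person"),
        ("facet.mincount", 1),     # skip people with zero matching events
        ("wt", "json"),            # assumption: request a JSON response
    ]
    return f"{SOLR_SELECT}?{urlencode(params)}"


def facet_counts(response, field="person"):
    """Turn the flat [name, count, name, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written stand-in for the facet part of a response over the sample corpus.
sample_response = {"facet_counts": {"facet_fields": {"person": ["John", 2, "Kim", 1]}}}

print(facet_url("StoreA"))
print(facet_counts(sample_response))
```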

It's been most performant to process the facet results then fetch the document with the most recent date for each hit (2 in this example). It would look something like this:

1. Issue the facet query - this will give you a list of the people who have been to "StoreA".
2. Concurrently fetch the info about each of these "StoreA" visitors. I use ExecutorService pools quite a bit:

http://localhost:8983/solr/events/select?
q=person:John+AND+location:StoreA
&sort=date+desc
&rows=1
&fl=docID,date
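The concurrent fetch could look something like the sketch below. It is written in Python with `ThreadPoolExecutor` standing in for a Java ExecutorService pool, so treat it as an illustration of the pattern rather than my actual code. The `fetch_json` callable is a hypothetical hook for whatever HTTP client you use, `wt=json` is an added assumption, and the `canned` responses are hand-written so the example runs without a Solr server.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/events/select"


def latest_visit_url(person, location):
    """Build the per-person 'most recent visit' request shown above."""
    params = {
        # Values are not escaped for Solr query syntax; fine for simple tokens.
        "q": f"person:{person} AND location:{location}",
        "sort": "date desc",
        "rows": 1,
        "fl": "docID,date",
        "wt": "json",  # assumption: request a JSON response
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


def fetch_latest_visits(people, location, fetch_json, max_workers=8):
    """Run one query per person on a thread pool; return {person: doc or None}."""
    def fetch_one(person):
        docs = fetch_json(latest_visit_url(person, location))["response"]["docs"]
        return person, (docs[0] if docs else None)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch_one, people))


# Hand-written responses keyed by URL, standing in for a live Solr server.
canned = {
    latest_visit_url("John", "StoreA"): {
        "response": {"docs": [{"docID": 3, "date": "2013-07-11T00:00:00Z"}]}
    },
    latest_visit_url("Kim", "StoreA"): {
        "response": {"docs": [{"docID": 5, "date": "2013-08-01T00:00:00Z"}]}
    },
}

print(fetch_latest_visits(["John", "Kim"], "StoreA", canned.__getitem__))
```

Passing the fetcher in as an argument keeps the pool logic testable on its own; swapping in a real HTTP call only changes `fetch_json`.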

Even if you tune the run time of the facet query down to a couple seconds and each of the fetches in Step 2 takes 10 ms, if you need to crunch through 10,000 of these matches, the response time isn't exactly instantaneous (over 5 seconds and probably closer to 10 seconds in this example). Plus, that's a ton of queries to send down range. You quickly get into a need to stage results or at least cache them. In fact, I've considered pre-processing schemes to focus my queries to people who have been seen at particular locations.
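For a rough feel of that arithmetic, here is a back-of-the-envelope estimate in Python. The numbers from the paragraph above are the 10,000 matches and the 10 ms per fetch; the 2,000 ms facet time is my reading of "a couple seconds", and the worker counts are assumptions for illustration. It also assumes the pool parallelizes perfectly, which a real Solr instance under load will not.

```python
import math


def estimated_total_ms(matches, workers, facet_ms=2000, fetch_ms=10):
    """Facet time plus the fetches, run in rounds of `workers` at a time."""
    rounds = math.ceil(matches / workers)
    return facet_ms + rounds * fetch_ms


# Worker counts are illustrative assumptions, not measured settings.
for workers in (1, 16, 32):
    print(workers, "workers:", estimated_total_ms(10_000, workers), "ms")
```

With a single worker the fetches alone take 100 seconds; a pool of 16 to 32 workers lands in the 5 to 10 second range under these idealized assumptions.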

So I continue to experiment, trying to find the optimal solution. There has been lots of facet work in Solr of late so I hope that iteration leads to more tools at my disposal.

Saturday, November 30, 2013

Publishing an app for fun, experience, and no profit

Monday, November 11, 2013

Temporally faceting through event data