I've been struggling to scale a use case I have for a large corpus of event data in Solr. It's kinda
faceting but not really.
facet.range, while awesome, doesn't get me there because I need specific values, not bins, and the scale of the data rules out a simple
sort. Consider the following events of people visiting locations:
docID, person, location, date
1, "John", "StoreA", 2013-01-01
2, "John", "StoreB", 2013-06-30
3, "John", "StoreA", 2013-07-11
4, "Frank", "StoreC", 2013-02-01
5, "Kim", "StoreA", 2013-08-01
The first row could read as "John visited StoreA on 1/1/13". I want to know data about each person's most recent trip to a particular location.
So let's say we facet on location "StoreA". Here's the HTTP request (shown instead of
SolrJ code for the purpose of this blog post):
http://localhost:8983/solr/events/select?
q=location:StoreA
&rows=0
&facet=true
&facet.limit=-1
&facet.field=person
&facet.mincount=1
This runs a facet query on person, restricted to events where location is "StoreA", so facet_counts only covers those events. For the above corpus, the applicable results are:
{ "John": 2, "Kim": 1 }
This is similar to what you'd expect on Amazon if you search on "television": you'll see a dozen or so Brands to drill into. But my request isn't just "Show me everyone who has been to StoreA", it's "For everyone who has been to StoreA, show me the most recent visit with the document ID". So what I'm really asking for is:
{ "John": ["2013-07-11", 3], "Kim": ["2013-08-01", 5] }
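To make the target output concrete, here is a minimal in-memory sketch in Java that computes it for the five sample rows. It touches no Solr API at all; the `LatestVisitSketch` class, the `Event` type, and the `latestVisits` method are hypothetical names made up for illustration, and a single-pass scan like this is exactly what does not scale to billions of records.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LatestVisitSketch {
    // One row of the sample corpus (hypothetical type, not a Solr class).
    static final class Event {
        final int docId;
        final String person;
        final String location;
        final LocalDate date;

        Event(int docId, String person, String location, String isoDate) {
            this.docId = docId;
            this.person = person;
            this.location = location;
            this.date = LocalDate.parse(isoDate);
        }
    }

    static final List<Event> SAMPLE = Arrays.asList(
        new Event(1, "John", "StoreA", "2013-01-01"),
        new Event(2, "John", "StoreB", "2013-06-30"),
        new Event(3, "John", "StoreA", "2013-07-11"),
        new Event(4, "Frank", "StoreC", "2013-02-01"),
        new Event(5, "Kim", "StoreA", "2013-08-01"));

    // For each person seen at the location, keep only the event with the latest date.
    static Map<String, Event> latestVisits(List<Event> events, String location) {
        Map<String, Event> latest = new TreeMap<>();
        for (Event e : events) {
            if (!e.location.equals(location)) {
                continue; // most people never visited this location
            }
            latest.merge(e.person, e, (a, b) -> a.date.isBefore(b.date) ? b : a);
        }
        return latest;
    }

    public static void main(String[] args) {
        latestVisits(SAMPLE, "StoreA").forEach((person, e) ->
            System.out.println(person + ": [" + e.date + ", " + e.docId + "]"));
    }
}
```

For "StoreA" this keeps docID 3 for John and docID 5 for Kim, matching the result above.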
The scale (billions of records) and composition (people are usually seen at a small subset of locations) of my data make it optimal to figure out who has been to StoreA before attempting to find their most recent visit. Most people haven't been to StoreA, so eliminating them as quickly as possible is key.
It's been most performant to process the facet results then fetch the document with the most recent date for each hit (2 in this example). It would look something like this:
1. Issue the facet query - this will give you a list of the people who have been to "StoreA".
2. Concurrently fetch the info about each of these "StoreA" visitors. I use ExecutorService pools quite a bit:
http://localhost:8983/solr/events/select?
q=person:John+AND+location:StoreA
&sort=date+desc
&rows=1
&fl=docID,date
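The fan-out in the second step could be sketched roughly as follows. This is not my production code and it makes no Solr calls: `FanOutSketch`, `fetchAll`, and the `fetchLatest` function are hypothetical names, with `fetchLatest` standing in for whatever issues the rows=1, sort=date desc query above for one person. The pool size is also an assumption you would tune.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

class FanOutSketch {
    // Runs one "latest visit" lookup per person on a fixed-size pool and
    // collects the answers in the order the people were given.
    static <R> Map<String, R> fetchAll(List<String> people,
                                       Function<String, R> fetchLatest,
                                       int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // Submit everything first so the lookups overlap.
            Map<String, CompletableFuture<R>> pending = new LinkedHashMap<>();
            for (String person : people) {
                pending.put(person,
                    CompletableFuture.supplyAsync(() -> fetchLatest.apply(person), pool));
            }
            // Then wait for each result; join() rethrows failures unchecked.
            Map<String, R> results = new LinkedHashMap<>();
            for (Map.Entry<String, CompletableFuture<R>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // People returned by the facet query for StoreA in the sample corpus.
        List<String> visitors = Arrays.asList("John", "Kim");
        // Canned answers in place of real per-person queries.
        Map<String, String> canned = new TreeMap<>();
        canned.put("John", "2013-07-11 (docID 3)");
        canned.put("Kim", "2013-08-01 (docID 5)");
        System.out.println(fetchAll(visitors, canned::get, 4));
    }
}
```

Submitting every lookup before waiting on any of them is what lets the pool overlap the requests. A real version would also want timeouts and some handling for lookups that fail.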
Even if you tune the run time of the facet query down to a couple seconds and each of the fetches in Step 2 takes 10 ms, if you need to crunch through 10,000 of these matches, the response time isn't exactly instantaneous, even with the fetches running concurrently (over 5 seconds and probably closer to 10 seconds in this example). Plus, that's a ton of queries to send down range. You quickly get into a need to stage results or at least cache them. In fact, I've considered pre-processing schemes to focus my queries on people who have been seen at particular locations.
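The back-of-the-envelope math behind that estimate can be written down as a small Java sketch. The 2 second facet time, 10 ms fetch time, and 10,000 matches come from the example above; the concurrency levels (1, 16, 32) are my assumptions for illustration, and the model ignores queueing, network overhead, and contention on the Solr side. `LatencyEstimate` and `estimateSeconds` are hypothetical names.

```java
class LatencyEstimate {
    // Wall-clock estimate: facet time plus the fetches spread evenly
    // across `concurrency` workers, each fetch taking `fetchMillis`.
    static double estimateSeconds(double facetSeconds, int matches,
                                  double fetchMillis, int concurrency) {
        long waves = (matches + concurrency - 1) / concurrency; // ceiling division
        return facetSeconds + waves * fetchMillis / 1000.0;
    }

    public static void main(String[] args) {
        int[] assumedConcurrency = {1, 16, 32}; // assumed pool sizes
        for (int c : assumedConcurrency) {
            System.out.printf("concurrency=%d -> %.2f s%n",
                c, estimateSeconds(2.0, 10_000, 10.0, c));
        }
    }
}
```

Under this simple model the fetches alone take 100 seconds when run one at a time, so the 5 to 10 second range only holds with a few dozen requests in flight at once: about 8 seconds total at 16 and about 5 seconds at 32.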
So I continue to experiment, trying to find the optimal solution. There has been lots of facet work in Solr of late so I hope that iteration leads to more tools at my disposal.