Saturday, June 7, 2014

A SolrCloud backup scheme

SolrCloud needs a real backup mechanism - there's an open ticket for this but until then users will have to make do with ReplicationHandler.  Its HTTP API is really simple:
http://master_host:port/solr/replication?command=backup&location=/path/to/backup/dir
It takes a snapshot of the whole index and writes it to that path.  This is obviously aimed at legacy (non-cloud) Solr, not SolrCloud.  So let's shoehorn it into SolrCloud by practicing with the following configuration, in which a single collection has 4 shards spread over 4 hosts and each shard has one extra replica (so 2 total copies of each shard):

  • host1: mycollection_shard1_replica1, mycollection_shard4_replica2
  • host2: mycollection_shard2_replica1, mycollection_shard3_replica2
  • host3: mycollection_shard3_replica1, mycollection_shard2_replica2
  • host4: mycollection_shard4_replica1, mycollection_shard1_replica2
If each host is running a single Solr instance on port 8983, both of its replicas are served from the same Solr process.  Each replication command therefore needs to name the core whose data it wants backed up.  More on that below.

Four calls to ReplicationHandler's HTTP API are required to snapshot all of mycollection, one per shard.  It doesn't matter which replica of a shard gets hit, but it's probably a good idea to distribute the work and to stagger the commands so the cloud isn't too bogged down.  That is, instead of pummeling host1 and host2, have each host back up one shard.  So 4 GETs:
http://host1:8983/solr/mycollection_shard1_replica1/replication?command=backup&location=/backups/060714/shard1
http://host2:8983/solr/mycollection_shard2_replica1/replication?command=backup&location=/backups/060714/shard2 
http://host3:8983/solr/mycollection_shard3_replica1/replication?command=backup&location=/backups/060714/shard3
http://host4:8983/solr/mycollection_shard4_replica1/replication?command=backup&location=/backups/060714/shard4
Note the location path includes a date component (060714, i.e. MMDDYY).  The HTTP call to /replication will return immediately, but it might take anywhere from several seconds to many minutes for the backup itself to complete.  You can write some fancy daemon to figure out when it actually completes but why not write to some shared slab of disk and forget about it?
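To make the four GETs above less hand-rolled, here is a minimal Python sketch that builds the same URLs from a shard-to-host table and fires them one at a time with a pause in between.  The names ASSIGNMENTS, backup_urls and trigger_backups, and the 60 second stagger, are made up for illustration; only the URL shape comes from the examples above.

```python
import time
import urllib.request
from datetime import date
from urllib.parse import urlencode

# Hypothetical table: which host/replica backs up which shard (one per host).
ASSIGNMENTS = {
    "shard1": ("host1", "replica1"),
    "shard2": ("host2", "replica1"),
    "shard3": ("host3", "replica1"),
    "shard4": ("host4", "replica1"),
}


def backup_urls(collection, assignments, backup_root, day, port=8983):
    """Build one ReplicationHandler backup URL per shard."""
    stamp = day.strftime("%m%d%y")  # e.g. 060714, matching the paths above
    urls = []
    for shard, (host, replica) in sorted(assignments.items()):
        core = f"{collection}_{shard}_{replica}"
        query = urlencode(
            {"command": "backup", "location": f"{backup_root}/{stamp}/{shard}"},
            safe="/",  # keep the path readable instead of %2F-escaping it
        )
        urls.append(f"http://{host}:{port}/solr/{core}/replication?{query}")
    return urls


def trigger_backups(urls, stagger_seconds=60):
    """Issue the GETs one by one, sleeping between them to spread the load.

    Each call returns as soon as Solr accepts the command, not when the
    snapshot is finished.
    """
    for i, url in enumerate(urls):
        with urllib.request.urlopen(url, timeout=30) as response:
            print(url, "->", response.status)
        if i < len(urls) - 1:
            time.sleep(stagger_seconds)
```

Something like `trigger_backups(backup_urls("mycollection", ASSIGNMENTS, "/backups", date.today()))` from a nightly cron job would then cover the whole collection.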

The data files will be uncompressed so you'll probably want to compress them at some point.  Rotation will be important because this isn't rsync - ReplicationHandler will write the whole index each time.  If your configuration supports it, note there is a "numberToKeep" param available.
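Compression and rotation can be as simple as the following Python sketch.  It assumes one directory per day under the backup root (as in /backups/060714/), tars each one up, and keeps only the newest few archives.  The function names and the .tar.gz layout are illustrative choices, not anything Solr provides, and you should only compress a day's directory once its snapshots have actually finished writing.

```python
import shutil
import tarfile
from pathlib import Path


def compress_backup(day_dir):
    """Replace a day's backup directory with a .tar.gz next to it."""
    day_dir = Path(day_dir)
    archive = day_dir.with_name(day_dir.name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(day_dir, arcname=day_dir.name)
    shutil.rmtree(day_dir)  # the uncompressed copy is no longer needed
    return archive


def rotate(backup_root, keep):
    """Delete all but the `keep` most recently modified archives.

    Sorting is by modification time because MMDDYY names do not sort
    chronologically across a year boundary.
    """
    if keep < 0:
        raise ValueError("keep must be >= 0")
    archives = sorted(
        Path(backup_root).glob("*.tar.gz"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = archives[keep:]
    for path in stale:
        path.unlink()
    return stale
```

For example, `compress_backup("/backups/060714")` followed by `rotate("/backups", keep=7)` would leave roughly a week of compressed snapshots on disk.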

Ultimately, your /backups/060714/ dir is left with 4 directories, each with the index files for a particular shard.  Recovery is really easy - just drop those files into the appropriate data/index/ directory.
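A rough Python sketch of that recovery step, under a few assumptions: Solr is stopped (or the core unloaded) while the files are swapped, snapshot_dir is the directory that directly contains the index files (depending on your Solr version they may sit in a snapshot subdirectory under the location you passed), and index_dir is the core's data/index/ path on that host.  The restore_shard name and the "index.old" safety copy are illustrative only.

```python
import shutil
from pathlib import Path


def restore_shard(snapshot_dir, index_dir):
    """Copy a shard's snapshot files into a core's data/index directory.

    The existing index, if any, is moved aside to "<index_dir>.old" rather
    than deleted, so a bad restore can be undone by hand.
    Returns the number of files now in index_dir.
    """
    snapshot_dir = Path(snapshot_dir)
    index_dir = Path(index_dir)
    if not snapshot_dir.is_dir():
        raise FileNotFoundError(f"no snapshot at {snapshot_dir}")

    if index_dir.exists():
        aside = index_dir.with_name(index_dir.name + ".old")
        if aside.exists():
            shutil.rmtree(aside)  # drop the safety copy from a previous restore
        index_dir.rename(aside)

    # copytree creates index_dir (and any missing parents) itself
    shutil.copytree(snapshot_dir, index_dir)
    return sum(1 for p in index_dir.iterdir() if p.is_file())
```

You would run this once per shard on whichever host serves the replica being restored, then start Solr back up.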
