January 2014 elasticsearch

Elasticsearch Snapshot Restore Overview

Elasticsearch v1.0 has nearly arrived. I recently wrote about the new Elasticsearch Aggregations feature, but that's not the only new addition to Elasticsearch v1.0: The Snapshot and Restore API is another useful new feature.

Backing up your Elasticsearch cluster has until now been a bit of a pain. Elasticsearch 1.0 allows you to back up your cluster (or individual indices) easily, using the new Snapshot and Restore API. At launch only the "shared filesystem" repository is available, but plans are already in place to support Amazon S3, HDFS/Hadoop, Gluster, Google Compute Engine and Azure.

Just like the new Aggregations feature, it's available to preview in version 1.0-beta-2, so I've been having a play to see what we can expect from it.

Asgard: the cluster

Let's back up our cluster called "Asgard", which we created in my previous post: Using Elasticsearch on Amazon EC2.

Step 1: Snapshot

Firstly, we need to register a repository to contain the snapshot we will create. Like all functionality in Elasticsearch, it's easily accessible via the RESTful API. Simply make an HTTP PUT request to the _snapshot/your_repository_name endpoint.

$ curl -XPUT 'http://localhost:9200/_snapshot/asgard_backup' -d '{
    "type": "fs",
    "settings": {
        "compress" : true,
        "location": "/mnt2/elasticsearch/asgard/backup"
    }
}'

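Ensure location points to a valid path that is accessible from all nodes. On a multi-node cluster, that means a shared filesystem mounted at the same path on every node.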

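Here, we define a repository type of fs and the desired filesystem location in the JSON body of the request. This registers a repository called 'asgard_backup'. Note that the compress setting only applies to the snapshot's metadata files (index mappings and settings); the data files themselves are not compressed.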
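To confirm the repository was registered, it can be read back with a GET on the same endpoint. This is only a rough sketch: it assumes the node is still listening on localhost:9200, and the ES_HOST variable and the repo_url and show_repo helpers are names I've made up for illustration, not part of Elasticsearch.

```shell
# Base URL of any node in the cluster (assumption: same host as the examples above).
ES_HOST="${ES_HOST:-http://localhost:9200}"

# Hypothetical helper: build the URL for a named snapshot repository.
repo_url() {
    printf '%s/_snapshot/%s' "$ES_HOST" "$1"
}

# Read back the repository definition (its type and settings).
show_repo() {
    curl -s -XGET "$(repo_url "$1")?pretty"
}
```

Calling `show_repo asgard_backup` should echo back the type and settings that were sent in the PUT request.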

Now that we have a repository registered, we can take a snapshot of our cluster by calling:

$ curl -XPUT 'http://localhost:9200/_snapshot/asgard_backup/snapshot_1?wait_for_completion=false'

The wait_for_completion flag controls whether the request returns immediately (false) or waits for the snapshot to complete before returning its response (true).
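The snapshot call can also be limited to individual indices and told to block until it finishes. Here is a minimal sketch, assuming the "indices" and "include_global_state" body options described in the Snapshot and Restore documentation; the helper names are my own.

```shell
# Base URL of any node in the cluster (assumption: same host as the examples above).
ES_HOST="${ES_HOST:-http://localhost:9200}"

# Build the JSON body for a snapshot limited to a comma-separated list of indices.
snapshot_body() {
    printf '{"indices": "%s", "include_global_state": false}' "$1"
}

# Snapshot selected indices and wait for the snapshot to complete.
# $1 = repository, $2 = snapshot name, $3 = comma-separated indices
snapshot_indices() {
    curl -s -XPUT "$ES_HOST/_snapshot/$1/$2?wait_for_completion=true" \
        -d "$(snapshot_body "$3")"
}
```

For example, `snapshot_indices asgard_backup snapshot_2 company` would snapshot only the company index (snapshot_2 is just an illustrative name) and return once the snapshot has finished.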

The thing I'm most impressed about with snapshots is that they are incremental. Only the data changed since the last snapshot will be backed up - that's handy to keep the snapshot process quick, and the resulting dumps as small as possible.

Now we've added a snapshot called 'snapshot_1' to our 'asgard_backup' repository, we can see it by listing the contents of the repository.

$ curl -XGET 'http://localhost:9200/_snapshot/asgard_backup/_all'
{
  "snapshots" : [ {
    "snapshot" : "snapshot_1",
    "indices" : [ "company" ],
    "state" : "SUCCESS",
    "start_time" : "2014-01-03T15:38:47.086Z",
    "start_time_in_millis" : 1388763527086,
    "end_time" : "2014-01-03T15:38:47.132Z",
    "end_time_in_millis" : 1388763527132,
    "duration" : 46,
    "duration_in_millis" : "46ms",
    "failures" : [ ],
    "shards" : {
      "total" : 5,
      "failed" : 0,
      "successful" : 5
    }
  } ]
}
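When a snapshot is started with wait_for_completion=false, the "state" field in this listing is the thing to watch. The sketch below pulls that field out of the JSON with grep and sed. It assumes the pretty-printed layout shown above, and snapshot_states is a hypothetical helper name; a proper JSON parser would be more robust.

```shell
# Print the value of every "state" field found in a snapshot listing on stdin.
snapshot_states() {
    grep -o '"state" *: *"[A-Z_]*"' | sed 's/.*"\([A-Z_]*\)"$/\1/'
}
```

It can be fed straight from curl, for example `curl -s -XGET 'http://localhost:9200/_snapshot/asgard_backup/snapshot_1' | snapshot_states`, which for the listing above prints SUCCESS.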

Step 2: Restore

Restoring from a snapshot is easy, and a snapshot can be restored to any cluster. Indices that don't yet exist will be created, but an index that already exists will have to be closed first to avoid any potential data corruption. I randomly deleted about half the documents, then ran:

$ curl -XPOST 'http://localhost:9200/_snapshot/asgard_backup/snapshot_1/_restore'

This immediately returned a success message. Admittedly, this test only included a single index containing a mere 100 documents. I checked the logs to be sure:

[2014-01-03 15:50:38,205][INFO ][snapshots] [Thor] restore [asgard_backup:snapshot_1] is done

I checked the contents of my index again, and sure enough - the documents I had deleted were available again (and the index had automatically been reopened). Win. It's that easy. Another great improvement for Elasticsearch.
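For reference, the whole restore round trip can be sketched as a handful of small helpers: close the existing index, restore the snapshot, then count the documents. This assumes the standard close index and count endpoints and the same localhost node; the function names are hypothetical.

```shell
# Base URL of any node in the cluster (assumption: same host as the examples above).
ES_HOST="${ES_HOST:-http://localhost:9200}"

# Close an existing index so that it can be restored over.
close_index() {
    curl -s -XPOST "$ES_HOST/$1/_close"
}

# Build the restore URL: $1 = repository, $2 = snapshot name.
restore_url() {
    printf '%s/_snapshot/%s/%s/_restore' "$ES_HOST" "$1" "$2"
}

# Kick off a restore of the given snapshot.
restore_snapshot() {
    curl -s -XPOST "$(restore_url "$1" "$2")"
}

# Report how many documents an index currently holds.
doc_count() {
    curl -s -XGET "$ES_HOST/$1/_count?pretty"
}
```

In my case that would be `close_index company`, then `restore_snapshot asgard_backup snapshot_1`, and, once the log shows the restore is done, `doc_count company` to compare against the count from before the deletion.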

Elasticsearch version 1

So, there we have it: painless snapshot and restore functionality available in Elasticsearch v1.0. This is just a brief summary, so feel free to ask any questions in the comments below, or fire me a tweet @simplechris.