ElastiCache Redis is a powerful key-value (KV) store that offers both in-memory and persistent storage. When we persist data in ElastiCache Redis, we need to back it up regularly so that the data can be recovered after a failure. By default, there is no option to easily create a snapshot from a running ElastiCache Redis cluster. Instead, we need to install Redis on a self-managed EC2 node and point it at the ElastiCache Redis master for the snapshot process. This post explains, step by step, how to create a snapshot of a Redis cluster.
Redis Snapshot Architecture:
An ElastiCache cluster with a master Redis node of size cache.m1.large in AZ1, and a replication group with one read replica node of size cache.m1.large in AZ2
A self-managed Redis installed on EC2 in AZ1 for snapshots
Stage 1: Creating a snapshot of the ElastiCache Redis persistent store.
Step 1) Launch an EC2 instance in the same Availability Zone as the ElastiCache Redis master node.
If the Redis Snapshot EC2 node is temporary, or if replication/backup performance is critical, keep the Redis Snapshot node in the same AZ as the ElastiCache Redis master node.
If the Redis Snapshot node will run permanently and high availability of the backup mechanism is critical, run it in a different AZ from the master Redis node.
Step 2) Download and install the Redis software on your newly launched EC2 instance. Make sure the Redis version on the snapshot EC2 instance is compatible with the version of the ElastiCache Redis master.
$ wget http://redis.googlecode.com/files/redis-2.6.4.tar.gz
$ tar xzf redis-2.6.4.tar.gz
$ cd redis-2.6.4
$ make
Step 3) Start the Redis server on the newly launched EC2 instance. It listens on port 6379.
# src/redis-server
[1513] 10 Jan 10:25:45.083 * Max number of open files set to 10032
[1513] 10 Jan 10:25:45.085 # Server started, Redis version 2.6.4
Step 4) Open another terminal window and connect locally to the Redis Snapshot EC2 instance.
# src/redis-cli -h localhost -p 6379
redis localhost:6379>
Step 5) To synchronize the data between the local Redis Snapshot EC2 instance and the ElastiCache Redis master, run the following command on the Redis Snapshot EC2 instance.
SLAVEOF redissnapshot.qcdze2.0001.usw2.cache.amazonaws.com 6379
where "redissnapshot.qcdze2.0001.usw2.cache.amazonaws.com 6379" is the endpoint (host and port) of the ElastiCache Redis master node.
Bringing an additional Redis EC2 node into the infrastructure just for snapshots is not ideal. It adds cost and manual effort to set up, automate, and manage this Redis Snapshot EC2 instance. I hope the AWS ElastiCache team will release a mechanism that lets us take snapshots directly from the ElastiCache cluster itself.
Note: Before this step, ensure that the security group rules allow traffic between the Redis Snapshot EC2 instance and the ElastiCache Redis cluster.
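One way to confirm the security group note above before running SLAVEOF is a plain TCP reachability check from the Redis Snapshot EC2 instance. The following is only a minimal Python sketch: the helper name can_reach and the 3-second timeout are our own choices, and it only proves that the port accepts connections, not that replication will succeed.

```python
import socket


def can_reach(host, port=6379, timeout_seconds=3.0):
    """Return True if a TCP connection to host:port can be opened.

    A False result usually means a security group, routing, or DNS
    problem between this machine and the Redis endpoint.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


# Example call (endpoint taken from this post; run it on the snapshot node):
# can_reach("redissnapshot.qcdze2.0001.usw2.cache.amazonaws.com", 6379)
```

If the check returns False, revisit the security group rules before issuing SLAVEOF.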
Step 6) Once the synchronization between the Redis Snapshot node and the ElastiCache Redis master is complete, you can take a snapshot by running BGSAVE or SAVE on the snapshot EC2 node (BGSAVE saves in the background, while SAVE blocks the server until the dump is written). The snapshot file, named "dump.rdb", is created on disk.
Step 7) Optionally, detach the snapshot node from the ElastiCache cluster by running “SLAVEOF NO ONE” in redis-cli. This step is recommended if your Redis cluster holds only a few GB and you back up persistent data one to four times a day: you save costs by not running the Redis Snapshot EC2 instance around the clock, and can instead bring it up automatically whenever it is needed.
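Steps 5 to 7 lend themselves to automation. Below is a rough Python sketch of that flow, not a finished tool. It assumes a client object whose methods mirror the redis-py client (slaveof, info, bgsave, lastsave); that interface, the function name snapshot_once, and the polling and timeout values are all assumptions on our side. The sketch also does not handle the case where Redis has already started its own background save, in which case BGSAVE returns an error.

```python
import time


def snapshot_once(client, master_host, master_port=6379,
                  poll_seconds=5.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Attach to the master, wait for the sync, dump to disk, then detach.

    `client` is assumed to talk to the local Redis on the snapshot node
    and to expose redis-py style methods (an assumption of this sketch).
    """
    deadline = time.monotonic() + timeout_seconds

    # Step 5: SLAVEOF <master_host> <master_port>
    client.slaveof(master_host, master_port)

    # Wait until the replication link is up and the initial sync is over.
    while True:
        info = client.info("replication")
        link_up = info.get("master_link_status") == "up"
        syncing = int(info.get("master_sync_in_progress", 1)) != 0
        if link_up and not syncing:
            break
        if time.monotonic() > deadline:
            raise TimeoutError("initial sync did not finish in time")
        sleep(poll_seconds)

    # Step 6: BGSAVE, then wait until LASTSAVE changes.
    before = client.lastsave()
    client.bgsave()
    while client.lastsave() == before:
        if time.monotonic() > deadline:
            raise TimeoutError("background save did not finish in time")
        sleep(poll_seconds)

    # Step 7: SLAVEOF NO ONE (redis-py style: slaveof() with no arguments).
    client.slaveof()
```

Comparing LASTSAVE before and after BGSAVE is a simple way to tell that a new dump.rdb has been written, since BGSAVE itself returns immediately.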
We ran a small test to verify the steps above.
We loaded the ElastiCache Redis cluster with 1.8 GB of data, spanning a few hundred million KV records, each less than a few KB in size. Syncing the data took about 5 minutes on m1.large nodes in a same-AZ deployment.
If your Redis is heavily used, it is better to use larger instance sizes for both the ElastiCache cluster and the snapshot node. The bigger the nodes, the better the network and I/O performance and the lower the replication lag.
If you are running a small cluster holding a few GB, creating the Redis Snapshot node on demand will save costs. If your cluster is big, the data is critical, and the interval between backups is short, run the Snapshot EC2 node continuously for better performance.
The replication log, with timings, is shown below:
[1595] 10 Jan 11:03:06.676 # Server started, Redis version 2.6.4
[1595] 10 Jan 11:03:06.676 * The server is now ready to accept connections on port 6379
[1595] 10 Jan 11:03:18.018 * SLAVE OF redissnapshot.qcdze2.0001.usw2.cache.amazonaws.com:6379 enabled (user request)
[1595] 10 Jan 11:03:18.779 * Connecting to MASTER...
[1595] 10 Jan 11:03:18.975 * MASTER <-> SLAVE sync started
[1595] 10 Jan 11:03:18.975 * Non blocking connect for SYNC fired the event.
[1595] 10 Jan 11:03:18.977 * Master replied to PING, replication can continue...
[1595] 10 Jan 11:06:03.949 * MASTER <-> SLAVE sync: receiving 1943998490 bytes from master
[1595] 10 Jan 11:08:13.348 * MASTER <-> SLAVE sync: Loading DB in memory
[1595] 10 Jan 11:09:37.430 * MASTER <-> SLAVE sync: Finished with success
[1595] 10 Jan 11:09:37.441 * 100 changes in 300 seconds. Saving...
[1595] 10 Jan 11:09:39.970 * Background saving started by pid 1635
[1595] 10 Jan 11:16:44.860 # User requested shutdown...
[1595] 10 Jan 11:16:44.860 * Saving the final RDB snapshot before exiting.
[1595] 10 Jan 11:19:10.270 * DB saved on disk
[1595] 10 Jan 11:19:10.270 # Redis is now ready to exit, bye bye...
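The timestamps in the log above can be turned into per-phase durations. Here is a small Python sketch of that arithmetic. It assumes the Redis 2.6 log line layout shown in this post and that all lines fall on the same day. The phase names (before_transfer, transfer, load) are our own labels, not Redis terminology.

```python
import re

LOG = """\
[1595] 10 Jan 11:03:18.975 * MASTER <-> SLAVE sync started
[1595] 10 Jan 11:06:03.949 * MASTER <-> SLAVE sync: receiving 1943998490 bytes from master
[1595] 10 Jan 11:08:13.348 * MASTER <-> SLAVE sync: Loading DB in memory
[1595] 10 Jan 11:09:37.430 * MASTER <-> SLAVE sync: Finished with success
"""

# [pid] day month HH:MM:SS.mmm level message
LINE_RE = re.compile(
    r"^\[\d+\] \d{1,2} \w{3} (\d{2}):(\d{2}):(\d{2})\.(\d{3}) [*#] (.*)$"
)

# Log fragments that mark the start of each phase.
MARKERS = [
    ("sync_started", "sync started"),
    ("receiving", "sync: receiving"),
    ("loading", "sync: Loading DB in memory"),
    ("finished", "sync: Finished with success"),
]


def phase_times_ms(log_text):
    """Map each marker name to its time of day in milliseconds."""
    times = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        hours, minutes, seconds, millis, message = match.groups()
        stamp = ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)
        for name, needle in MARKERS:
            if needle in message and name not in times:
                times[name] = stamp
                break
    return times


def phase_durations_ms(times):
    """Durations between consecutive markers, plus the total, in milliseconds."""
    return {
        "before_transfer": times["receiving"] - times["sync_started"],
        "transfer": times["loading"] - times["receiving"],
        "load": times["finished"] - times["loading"],
        "total": times["finished"] - times["sync_started"],
    }


print(phase_durations_ms(phase_times_ms(LOG)))
```

For the run logged above, this works out to roughly 165 seconds before the first bytes arrive, 129 seconds receiving the dump, and 84 seconds loading it into memory, or about 378 seconds from sync start to finish.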
Test whether the same key values are present on both the master Redis node in the ElastiCache cluster and the Redis Snapshot EC2 instance. The steps below illustrate this:
# src/redis-cli -h redissnapshot.qcdze2.0001.usw2.cache.amazonaws.com -p 6379
redis redissnapshot.qcdze2.0001.usw2.cache.amazonaws.com:6379> get 100sent
"1001234567890123456789012345678901234567890abcdefghijklmopqrstuvwxyz"
redis redissnapshot.qcdze2.0001.usw2.cache.amazonaws.com:6379> get 10sent
"101234567890123456789012345678901234567890abcdefghijklmopqrstuvwxyz"
Important Observation:
Always sync the Redis Snapshot EC2 node with the ElastiCache Redis master node, not with the read slave in the Redis replication group. The following error is displayed when you try to sync with the slave:
#Connecting to MASTER...
#MASTER <-> SLAVE sync started
#Non blocking connect for SYNC fired the event.
#Master replied to PING, replication can continue...
# MASTER aborted replication with an error: ERR Can't SYNC with a slave
We tried this approach to see whether we could reduce the workload on the Redis master node by taking backups from the slave Redis nodes. As AWS has rightly named them, these are "Redis Read Replica Nodes", and you cannot use them as a replication source for snapshots.
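To avoid this error in an automated setup, the role of each candidate endpoint can be checked before running SLAVEOF against it. A minimal Python sketch follows. It assumes client objects with a redis-py style info("replication") call that returns a dict containing the role field. The helper names is_master and pick_sync_source are hypothetical.

```python
def is_master(client):
    """Return True if the node behind `client` reports role:master."""
    info = client.info("replication")
    return info.get("role") == "master"


def pick_sync_source(candidates):
    """Return the name of the first endpoint that is a master, else None.

    `candidates` maps an endpoint name to a client connected to it.
    """
    for name, client in candidates.items():
        if is_master(client):
            return name
    return None
```

Only the endpoint returned by pick_sync_source should then be passed to SLAVEOF.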
Stage 2: Launching an ElastiCache Redis cluster from the snapshot taken above.
Step 1) Upload the snapshot to an S3 bucket (redisaws) and grant open/download permission to the email ID aws-scs-s3-readonly@amazon.com. Note: The aws-scs-s3-readonly account is used exclusively for customers uploading Redis snapshot data from Amazon S3.
Step 2) Launch a new ElastiCache Redis cluster from the "dump.rdb" snapshot taken on the Redis Snapshot node in Stage 1, Step 6. Let's look at what RDB is in more detail.
RDB is a very compact, single-file, point-in-time representation of your Redis data, which makes RDB files perfect for backups. Being a single compact file, an RDB dump can be transferred to another AZ or AWS region quickly and is ideal during outages and DR. RDB also allows faster restarts of a Redis master with big datasets on new ElastiCache cluster nodes. Both of these points help us achieve a better RPO and RTO in a DR scenario.
RDB needs to fork() often in order to persist to disk using a child process. fork() can be time-consuming if the dataset is big, and may cause the Redis master to stop serving clients for some milliseconds, or even for a second if the dataset is very big and CPU performance is poor. Also, since the Redis Snapshot instance in this architecture does not accept reads or writes and is used only for replication and snapshots, it adds only minimal extra load on the master node during replication. Take both of these factors into account when sizing the ElastiCache Redis master node.
Note: RDB is NOT a good fit if you need to minimize the chance of data loss when the Redis master stops working. We suggest using the RDB model together with an ElastiCache Redis replication group for better availability and integrity.
Step 3) Once the ElastiCache cluster is launched, connect to the cluster endpoint and check that the same key values tested earlier exist in the restored data.
# src/redis-cli -h rds.qcdze2.0001.usw2.cache.amazonaws.com -p 6379
redis rds.qcdze2.0001.usw2.cache.amazonaws.com:6379> get 10sent
"101234567890123456789012345678901234567890abcdefghijklmopqrstuvwxyz"
redis rds.qcdze2.0001.usw2.cache.amazonaws.com:6379> get 100sent
"1001234567890123456789012345678901234567890abcdefghijklmopqrstuvwxyz"
As you can see from the above, the key values are the same, so the new ElastiCache node was created properly with the original data.
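The manual spot check of two keys can be extended to a list of sample keys. The Python sketch below keeps the comparison logic separate from the Redis access. The two getter arguments are assumptions that stand in for GET calls against the original master and the newly launched cluster, and the function name compare_keys is our own.

```python
def compare_keys(get_source, get_target, keys):
    """Return {key: (source_value, target_value)} for keys whose values differ.

    `get_source` and `get_target` are callables that take a key and return
    its value (or None if the key is missing) from each endpoint.
    """
    mismatches = {}
    for key in keys:
        source_value = get_source(key)
        target_value = get_target(key)
        if source_value != target_value:
            mismatches[key] = (source_value, target_value)
    return mismatches


# Stand-in data using the two keys checked in this post.
original = {
    "10sent": "101234567890123456789012345678901234567890abcdefghijklmopqrstuvwxyz",
    "100sent": "1001234567890123456789012345678901234567890abcdefghijklmopqrstuvwxyz",
}
restored = dict(original)

print(compare_keys(original.get, restored.get, ["10sent", "100sent"]))
```

An empty result means every sampled key matched. It does not prove that the whole dataset is identical, only the keys that were sampled.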
This post was co-authored with Senthil 8K Miles