On Monday, March 21, 2022, we observed erratic memory consumption across all Redis nodes on the MT1 cluster: usage was oscillating close to the VMs' memory limits.
The team started investigating and suspected we were hitting network limits on the primary node.
While we consulted our cloud provider's account manager to learn more about those network limits, we decided to force a failover of the primary node.
The failover proceeded very slowly, and during this window we observed customer impact: publish requests to the API failed to write to Redis because no Redis primary was available, and some customers saw error messages.
Upon further inspection of the metrics, we noticed there had also been earlier impact when other failovers occurred automatically: short (under one minute) bursts of errors around 17:20, 19:35, 20:00, and 21:00 UTC.
During this time, replication to the secondary Redis servers struggled to keep up; replicas that fell too far behind were forced into full syncs, and the extra bandwidth those syncs consumed fed the failure back into a loop.
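When a replica falls too far behind, Redis abandons partial resynchronization and performs a full sync, which itself consumes more bandwidth. A common mitigation is to enlarge the replication backlog and the replica output-buffer limits; the fragment below is an illustrative sketch, and the values are examples rather than the settings we used:

```
# Illustrative redis.conf fragment -- values are examples only.

# A larger backlog lets a lagging replica resume with a partial resync
# instead of triggering a full sync.
repl-backlog-size 256mb

# Allow replica output buffers to grow larger / longer before Redis
# disconnects a lagging replica (hard limit, soft limit, soft seconds).
client-output-buffer-limit replica 1gb 512mb 120
```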
When overall traffic slowed at around 21:00 UTC, replication to the secondary instances was able to keep up and the network throttling stopped.
We decided not to make any more changes that day and to implement a real fix, adding more Redis shards, the next morning.
To enable deployment of more shards, we first needed to complete a staged migration of WebSocket services to our Kubernetes cluster, which at that point was serving about 70% of WebSocket processes. While deploying these additional resources to Kubernetes, a second incident occurred.
When we increased the proportion of pods and traffic on the Kubernetes deployment, there was a sudden surge of requests to the Kubernetes API. Limits had not been set correctly for these services, and for some time the Kubernetes API became unavailable.
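The exact limits involved aren't shown here; as a hedged sketch, the kind of guardrail that was missing is the standard resource requests/limits stanza on a workload, which keeps a scaled-up service from starving the rest of the cluster. The service name and numbers below are hypothetical:

```yaml
# Hypothetical manifest fragment -- name and values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: websocket-service
spec:
  template:
    spec:
      containers:
        - name: websocket
          resources:
            requests:
              cpu: "500m"      # guaranteed share used for scheduling
              memory: "512Mi"
            limits:
              cpu: "2"         # hard ceiling per container
              memory: "1Gi"
```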
At the beginning of the deployment there were not enough Channels API endpoints to serve requests, and customers experienced contention and slow responses. From 12:02 to 12:18 UTC, some customers saw timeouts and errors when publishing events to the Channels API.
The team increased the limits and changed instance types on the cluster, which restored the Kubernetes API and allowed new deployments to proceed. While this issue did not cause additional customer impact, it had to be resolved before we could deploy the new Redis shards.
At around 14:00 UTC we observed the same erratic memory behavior on the Redis cluster, possibly due to the higher traffic typical at that time of day.
At around 16:00 UTC the Kubernetes control plane had returned to a healthy state and was scaled appropriately, allowing us to continue with the Redis shard deployment.
At around 16:40 UTC the new shards were added to the cluster and resharding began. Within minutes, Redis memory usage returned to its usual low and predictable levels.
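Conceptually, adding shards helps because each key hashes to a fixed slot, and only the slots handed to the new shards have to migrate. The sketch below is a simplified model of that rebalancing, not the actual mechanism: real Redis Cluster maps keys via CRC16 onto 16384 slots and migrates slots explicitly, and the shard counts here are illustrative.

```python
SLOTS = 16384  # Redis Cluster's fixed hash-slot count

def initial_assignment(num_shards):
    """Spread slots evenly over the initial shards."""
    return {slot: slot % num_shards for slot in range(SLOTS)}

def add_shards(assignment, old_n, new_n):
    """Migrate just enough slots from the old shards to the new ones so
    every shard owns roughly SLOTS / new_n slots; untouched slots (and
    the keys hashing to them) never move."""
    target = SLOTS // new_n
    # Slots each old shard can give away beyond its new target share.
    owned = {i: [s for s, o in assignment.items() if o == i]
             for i in range(old_n)}
    donors = iter(s for slots in owned.values() for s in slots[target:])
    result = dict(assignment)
    for shard in range(old_n, new_n):
        for _ in range(target):
            result[next(donors)] = shard
    return result

before = initial_assignment(3)
after = add_shards(before, 3, 5)
moved = sum(1 for s in range(SLOTS) if before[s] != after[s])
print(f"slots migrated: {moved} of {SLOTS}")
```

Because only the migrated slots carry data over the network, per-shard memory drops roughly in proportion to the new shard count while most keys stay put, which matches the quick recovery we saw once resharding started.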