Elevated error rates on MT1
Incident Report for Pusher

Post-mortem: Partial connection and message delivery failure on mt1 cluster

On Thursday, November 11th at 15:39 UTC, a Redis replication issue caused a portion of Channels messages published on the mt1 cluster to fail. The incident occurred while two engineers were trying to migrate Redis clusters, and it lasted until 16:20 UTC when the replication issue was resolved by reverting traffic to old Redis clusters. 


Redis is a core component in our infrastructure and we’ve recently developed a new version of the Pusher Redis cluster to remove some operational burden from our team. At the beginning of November 2021, we began to migrate all Redis clusters to the new version. Every migration prior to this incident had been fully successful.

Incident timeline

(All times are in UTC on the 11th of November 2021)

At 13:55 UTC, two engineers started the Redis migration by adding three new Redis replicas to each cluster. As with the migrations we had executed on other regions and clusters, the plan was to gradually shift traffic to the new replicas and then disconnect the old nodes from the cluster.

Starting at 15:39 UTC, one of the existing master nodes was unable to keep its replicas in sync and went into a “resync loop” while the engineers continued the migration on other Redis clusters.

At 15:49 UTC, our engineers began to shift traffic to the new nodes without realizing that the old master nodes had replication issues. A few minutes after shifting the traffic to new clusters, engineers noticed that the new master node was not able to keep replicas in sync.

Two other engineers responded to the incident and joined the migration team. At 16:20 UTC, the team switched traffic back to the old clusters and disconnected all new nodes. The replication process became operational again as soon as the excess load was removed from the clusters.

What was the impact on end-users?

Between 15:39 and 16:20 UTC users may have experienced:

  • Failure to publish a message
  • Messages arriving late (more than 900ms end-to-end latency)
  • Messages not being delivered (lost)

What was the root cause?

While the new Redis nodes had enough network bandwidth to support the usual load, we believe the master node couldn’t sustain the extra replicas connected during the migration, which led to bandwidth saturation on the master node. Rising traffic on the mt1 cluster at the time of the migration was also a contributing factor.
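The mechanics are straightforward to sketch: a Redis master streams every write to each connected replica, so outbound replication traffic grows roughly linearly with replica count. The figures below are illustrative assumptions, not measurements from the incident.

```python
# Rough model of a master's outbound replication bandwidth during the
# migration. The write rate is an illustrative assumption, not a
# measurement from the incident.

def replication_bandwidth_mbps(write_mbps: float, num_replicas: int) -> float:
    """Every byte written to the master is streamed to each replica,
    so outbound replication traffic scales linearly with replica count."""
    return write_mbps * num_replicas

normal = replication_bandwidth_mbps(200.0, 2)     # usual topology: 2 replicas
migration = replication_bandwidth_mbps(200.0, 5)  # migration: 5 replicas

print(normal, migration)  # 400.0 1000.0 -> 2.5x the outbound traffic
```

With the same write rate, the migration topology needs 2.5x the outbound bandwidth on the master, which is where saturation bites even when each individual node is adequately provisioned.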

What went wrong and how was it detected?

As we said earlier, Redis is one of the core components of our infrastructure, so we monitor all of its metrics closely and have real-time dashboards and alerts. However, on the day of the incident, the engineers responsible for migrating the Redis cluster asked the rest of the engineering team to ignore all Redis-related alerts during the migration: adding the new replicas, failing over, and eventually disconnecting the old replicas would each fire a few alerts along the way.

We always run Redis clusters with 3 nodes: 1 master and 2 replicas. All Redis clusters are monitored by Redis Sentinel, which provides high availability, leader election, replica selection, and more.
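For reference, a minimal Sentinel configuration for a cluster like this might look as follows; the cluster name, address, and timeouts are illustrative, not our production values.

```conf
# sentinel.conf sketch -- monitor one master; a quorum of 2 means two
# Sentinels must agree the master is down before a failover starts.
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
```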

During the migration, an engineer would first deploy 3 new Redis nodes (soon to replace all old nodes) for each Redis cluster. These 3 nodes are automatically added to the target cluster as replicas, which means that, during the migration, the old master node has to keep 5 replicas in sync instead of the usual 2.

The mt1 cluster comprises many Redis clusters. Our engineers deployed all the new nodes at once and then started migrating the Redis clusters one by one. That’s why there was a 2-hour gap between the time the new Redis nodes were deployed and the time the engineers attempted to shift traffic of the pub/sub cluster. Since the messaging cluster is also sharded (we have multiple pub/sub clusters on mt1), they decided to migrate those last. This proved to be the wrong decision, because around 15:00 UTC is when we normally expect a large traffic increase on the mt1 cluster.

While the two engineers responsible for the migration were busy migrating other clusters, they missed early signs of replication issues in our metrics. One metric that could have helped was the memory difference between master shards and their replicas, which shows when a master node is struggling to keep its replicas in sync.
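As an illustration, such a check could compare each replica’s replication offset (or used memory) against the master’s and flag nodes that fall too far behind. The threshold and the sample values below are assumptions for illustration, not figures from the incident.

```python
# Sketch of the early-warning check that could have surfaced the problem:
# compare the master's replication offset against each replica's.
# The threshold is an illustrative assumption.

LAG_THRESHOLD_BYTES = 50 * 1024 * 1024  # 50 MiB of un-replicated data

def lagging_replicas(master_offset: int, replica_offsets: dict) -> list:
    """Return the replicas whose replication offset trails the master by
    more than the threshold -- a sign the master can't keep them in sync."""
    return [name for name, offset in replica_offsets.items()
            if master_offset - offset > LAG_THRESHOLD_BYTES]

print(lagging_replicas(1_000_000_000,
                       {"replica-1": 999_900_000,    # in sync
                        "replica-4": 850_000_000}))  # ~143 MiB behind
# ['replica-4']
```

An alert driven by a check like this, exempted from the blanket “ignore Redis alerts” request, would have flagged the struggling master before the failover was issued.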

When everything is going well, engineers issue a failover that asks Redis Sentinel to elect one of the new nodes as the master. Our engineers did this without checking replication health on the old master, even though the old master had started experiencing replication failures exactly 10 minutes before they issued the failover. Only after the failover did the engineers watch the metrics on the new nodes and notice that the number of connected replicas was wrong: the connection between the master and one of the replicas kept dropping, and the pair was stuck in a replication loop. It was only at that point that they brought in other engineers and looked closely at the metrics of the old and new replicas.

What engineers first saw was that on the new master node, the `redis_connected_slaves` metric, which shows the number of connected replicas, was fluctuating. When the team looked into it further, they also saw that `redis_commands_processed_total` didn’t match between the replicas, and that `redis_slowlog_length` on the master node was rising while Redis used memory was capped at its maximum. This meant that the master was not able to keep replicas in sync, and was also struggling to transfer its output buffer.
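Taken together, these three signals make a simple combined health check. The sketch below is illustrative: in production the values would come from our metrics pipeline, and the slowlog threshold is an assumption, not a figure from the incident.

```python
# Hedged sketch of a combined replication-health check over the metrics
# named above. Values are passed in as plain numbers for illustration;
# the slowlog threshold is an assumption.

def replication_healthy(connected_slaves: int,
                        expected_replicas: int,
                        slowlog_length: int,
                        used_memory: int,
                        maxmemory: int) -> bool:
    """Flag trouble if replicas are missing, the slowlog is long,
    or used memory has hit the configured ceiling."""
    return (connected_slaves == expected_replicas
            and slowlog_length < 100          # illustrative threshold
            and used_memory < maxmemory)

# During the incident: missing replica, long slowlog, memory pinned
# at max -> unhealthy.
print(replication_healthy(4, 5, 250, 8_589_934_592, 8_589_934_592))  # False
```

Running a check like this against the *old* master before issuing the failover is exactly the step that was skipped.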

The team used the `replica-priority` Redis config to make sure that Sentinel elected the old master as the cluster master. Then they reconnected the old replicas and disconnected all of the new replicas, which resolved the issue immediately.
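For context, Sentinel prefers the replica with the lowest non-zero `replica-priority` when electing a new master, and a priority of 0 makes a node ineligible for promotion altogether. A sketch of the idea (the values are illustrative):

```conf
# On each new node: replica-priority 0 means Sentinel will never
# promote this node to master.
replica-priority 0

# On the old nodes: a low non-zero value makes them the preferred
# candidates for promotion.
replica-priority 1
```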

How will we ensure this doesn’t happen again?

There are a number of measures that we’ve decided to take:

  • Time Redis migrations based on live traffic, and account for how long the migration will take to finish.
  • Use fewer replicas during the migration, since the old nodes may not be able to cope with the extra load.
  • Use oversized instances on larger clusters during the migration. This increases cost on our side, but reduces the chance of disruption for our users.

Another lesson that we have learned is to keep an eye on old nodes for early signs of trouble while we are migrating; an improvement that we want to make on the new version of our Redis cluster is to have better monitoring and alerting on replication health. We currently have all the metrics available on our dashboards and we will be defining better alerts to detect replication issues.
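As a sketch of what such alerts might look like, using the exporter metric names mentioned earlier (the rule names, thresholds, and windows are illustrative assumptions, not our production rules):

```yaml
# Prometheus-style alerting rule sketch; names and thresholds
# are illustrative.
groups:
  - name: redis-replication
    rules:
      - alert: RedisReplicaMissing
        expr: redis_connected_slaves < 2
        for: 2m
      - alert: RedisSlowlogGrowing
        expr: delta(redis_slowlog_length[10m]) > 0
        for: 10m
```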

Posted Nov 23, 2021 - 14:31 UTC

This incident has been resolved.
Posted Nov 11, 2021 - 16:21 UTC
We are currently investigating this issue.
Posted Nov 11, 2021 - 16:16 UTC
This incident affected: Channels WebSocket client API and Channels presence channels.