Summary

On Monday 11 April 2022 at 10:42 UTC the API for publishing messages to Channels was rendered largely unavailable for 11 minutes on the mt1 cluster. Some messages may have been lost during this time.

Background

Redis is a core component in our infrastructure and we’ve recently developed a new version of our Redis group module to remove some operational burden from our team. Each group comprises one master and two replicas. At the beginning of November 2021, we began to migrate all Redis groups to the new version.

All nodes in these new Redis groups have a script that runs during node termination. The script determines whether the node is a Redis master, and executes a failover to one of the replicas if so.
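The termination hook described above can be sketched roughly as follows. This is an illustration, not our actual script: it assumes the node's role is read from the `role` field of Redis `INFO replication` output, and the names `parse_role`, `on_termination` and `trigger_failover` are hypothetical. How the failover is actually issued is left to the caller.

```python
from typing import Callable


def parse_role(info_replication: str) -> str:
    """Return the value of the `role` field from `INFO replication` output."""
    for line in info_replication.splitlines():
        line = line.strip()
        if line.startswith("role:"):
            return line.split(":", 1)[1]
    raise ValueError("no role field found in INFO replication output")


def on_termination(info_replication: str,
                   trigger_failover: Callable[[], None]) -> bool:
    """Trigger a failover if this node is the master.

    `trigger_failover` is a hypothetical callback standing in for whatever
    mechanism the deployment uses to promote a replica.
    Returns True if a failover was triggered, False otherwise.
    """
    if parse_role(info_replication) == "master":
        trigger_failover()
        return True
    return False
```

Note that this sketch returns as soon as the failover has been requested. It does not wait for the failover to finish.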

As part of our ongoing activities to harden our clusters to more reliably cope with increased traffic, we increased the size of the replication buffer in the Redis configuration, and applied that to each group in each cluster. However, for that to take effect we then needed to restart each of the Redis nodes. We had previously executed this successfully on test clusters and on one of our smaller production clusters.

Incident timeline

At 08:47 UTC one of our engineers started replacing nodes in the mt1 cluster, and one at a time successfully restarted both replica nodes within one of the mt1 Redis groups.

At 10:42 UTC the engineer repeated the process for the master node. The failover was initiated successfully, but the Redis process was terminated before the failover had completed.

At 10:53 UTC the Redis group had recovered after a period where both old and new masters were contending for leadership, and the new master was successfully handling all messages.

Impact on end-users

Between 10:42 UTC and 10:53 UTC users may have experienced:

Failure to publish a message,
Messages arriving late (more than 900ms end-to-end latency), and
Messages not being delivered (lost).

Root cause

The shutdown script running on the Redis nodes assumed that once a failover command had been successfully executed it was safe to terminate the node. In most cases this is correct, because failover happens quickly enough that it completes before the Redis process is killed during node termination. However, in our most heavily loaded clusters (such as mt1) this is not the case: the node terminates during failover, leaving the group in an inconsistent state that takes longer to recover from than if failover had not been initiated at all.

How will we ensure this does not happen again?

We are fixing the Redis shutdown script to check that failover has completed and that the node is no longer a master before terminating.
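A minimal sketch of the kind of check this implies is shown below. It is an assumption about the shape of the fix, not the deployed script: `wait_until_demoted` and `get_role` are hypothetical names, and `get_role` stands in for however the script reads the node's current role (for example, from the `role` field of `INFO replication`). The timeout and polling interval are illustrative values.

```python
import time
from typing import Callable


def wait_until_demoted(get_role: Callable[[], str],
                       timeout_s: float = 60.0,
                       poll_interval_s: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the node's role until it is no longer "master".

    Returns True once the node reports a non-master role, or False if
    the timeout elapses first. The caller should only let termination
    proceed when this returns True.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_role() != "master":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

A shutdown hook would call this after requesting the failover and treat a False result as a signal to delay termination or raise an alert, rather than killing a node that is still the master.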

Posted Apr 11, 2022 - 11:41 UTC

Resolved

From ~10:44 UTC to ~10:53 UTC we saw an increase in error rates returned for requests to the MT1 API. The issue has since been resolved and investigations into the root cause are underway.

Posted Apr 11, 2022 - 10:44 UTC