Between 11:34 and 19:30 on 18/03/2020, users requesting presence data or connection counts from the Pusher Channels HTTP API on cluster mt1 may have been served inaccurate or stale data. This was due to corrupted state caused by a misconfigured Redis Sentinel instance.
Pusher Channels stores presence channel state and channel subscriber state in Redis, spread across multiple shards. To keep this state reliably available, each shard runs with multiple replicas in different availability zones. Redis Sentinel is used to manage failovers between masters and replicas, and for configuration discovery.
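To make the configuration discovery step concrete, here is a minimal sketch, assuming redis-py and hypothetical Sentinel hosts and shard names (none of which appear in this report), of how an edge process might resolve the current master before writing:

```python
# Illustrative sketch, not Pusher's actual code: an edge process using
# Redis Sentinel for configuration discovery via redis-py. The Sentinel
# hosts and the shard name "channels-shard-0" are assumptions.
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("sentinel-a", 26379), ("sentinel-b", 26379), ("sentinel-c", 26379)],
    socket_timeout=0.5,
)

# Ask Sentinel which instance is currently the master for this shard;
# writes must go here. A broken Sentinel can hand back the wrong answer.
master = sentinel.master_for("channels-shard-0", socket_timeout=0.5)
master.hset("presence:channel-123", "user-456", "online")

# Reads can be served from a replica of the same shard.
replica = sentinel.slave_for("channels-shard-0", socket_timeout=0.5)
count = replica.hlen("presence:channel-123")
```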
(All times are in UTC.)
At 11:34 a Channels engineer identified a Redis Sentinel instance in a broken state. The affected Sentinel was directing a small subset of Channels edge processes to a replica instead of the master, causing their writes to fail. While attempting a repair, the engineer misconfigured the Redis Sentinel process, which caused writes intended for one of the Redis shards to reach a different shard instead.
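The report does not say exactly what was misconfigured; the following hypothetical sketch only illustrates the failure mode: if Sentinel's answer for one shard name resolves to another shard's master, writes silently cross shards. The hashing scheme, shard names, and addresses are all assumptions:

```python
# Illustrative sketch of the failure mode, not the actual configuration.
# Edge processes pick a shard by hashing the channel name, then ask
# Sentinel for that shard's master. If the answer for one shard name
# points at another shard's master, writes silently land on the wrong shard.
import hashlib

SHARDS = ["channels-shard-0", "channels-shard-1"]  # hypothetical names

def shard_for(channel: str) -> str:
    digest = hashlib.md5(channel.encode()).digest()
    return SHARDS[digest[0] % len(SHARDS)]

# Healthy: each shard name resolves to its own master.
healthy = {"channels-shard-0": "10.0.0.10", "channels-shard-1": "10.0.1.10"}

# Misconfigured: both names resolve to shard-0's master, so writes
# meant for shard-1 reach shard-0, leaving inconsistent state on both.
broken = {"channels-shard-0": "10.0.0.10", "channels-shard-1": "10.0.0.10"}

for channel in ("presence-lobby", "presence-game-7"):
    shard = shard_for(channel)
    print(channel, "->", shard, "-> master", broken[shard])
```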
At 11:52 the Redis shards were correctly configured and all writes were reaching the correct shard. However, the period of misconfiguration had left inconsistent data in both shards and significantly increased the memory usage of all Redis processes. The Redis memory allowance was raised until it approached the maximum the machines could safely provide.
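As an illustration of this mitigation, assuming redis-py and hypothetical hosts and limits, an operator might inspect memory usage and raise the maxmemory allowance like this:

```python
# Illustrative sketch: check memory pressure and raise the maxmemory
# allowance on each Redis process. Hosts and the 12 GB figure are
# hypothetical; the report gives no specific numbers.
import redis

for host in ("redis-shard-0", "redis-shard-1"):
    r = redis.Redis(host=host, port=6379)
    info = r.info("memory")
    print(host, "used:", info["used_memory_human"])
    # Raising the allowance only buys headroom until the machine's
    # physical memory becomes the limit, as happened here.
    r.config_set("maxmemory", 12 * 1024 ** 3)
```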
At 14:25 all Redis processes were migrated to machines with more memory available in order to stabilise the system. Some users were still receiving incorrect presence data and subscriber counts.
Between 15:26 and 17:40 a fix for the data discrepancies was prepared and validated in a test environment.
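The report does not describe the fix itself. Purely as an illustration of one way cross-shard inconsistencies like these can be repaired, the sketch below scans a shard for keys that the (hypothetical) routing function from the earlier sketch assigns elsewhere and deletes them, so that clients can rebuild correct state on the right shard:

```python
# Purely illustrative, not the fix Pusher applied: drop keys that the
# routing function says belong on a different shard. shard_for is the
# hypothetical function from the earlier sketch.
import redis

def purge_misrouted(shard_name: str, host: str) -> int:
    r = redis.Redis(host=host, port=6379)
    removed = 0
    # SCAN iterates keys incrementally without blocking the server.
    for key in r.scan_iter(match="presence:*"):
        channel = key.decode().split(":", 1)[1]
        if shard_for(channel) != shard_name:
            r.delete(key)
            removed += 1
    return removed
```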
Between 18:25 and 19:29 the fix was applied to the mt1 cluster and the incident was resolved.
We know operators make mistakes, and at the scale of Pusher's systems these mistakes can have severe consequences for many people.
It took us too long to fix this problem, and customers who depend on presence data suffered greatly. We are sorry for this and want to make sure we do better next time. Firstly, we need to make this less likely to happen again.
Longer term, we need to ensure that when something like this happens again, we can react faster. We already know how to solve this specific issue, but we know there are broader problems with our presence system. We will look into making it more resilient and self-healing.