On Monday, August 16th at 16:11 UTC, a Redis PUB/SUB shard replication issue caused around a third of Channels message publishes on the mt1
cluster to fail. The incident lasted until 16:43 UTC, when the replication issue was resolved.
Having investigated the root cause of the incident, we plan to tune Redis PUB/SUB replication to prevent a recurrence of this issue.
Timeline
All times are in UTC on the 16th of August 2021:
- 16:11 UTC Our Engineering on-call team was paged because a percentage of end-to-end test requests were failing for the Pusher Channels product. The team immediately started an internal investigation.
- 16:38 UTC After analysing the Pusher Channels monitoring stack and noticing high latencies when our services wrote to Redis, the team identified an issue with the replication process for one of our Redis PUB/SUB shards.
- 16:43 UTC The team increased the memory allocated to the Redis PUB/SUB shard, and its replication process started succeeding again.
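
As an illustration of the kind of check that can surface a replication problem like the one in the timeline above, here is a minimal sketch in Python. It is not the tooling our team used. It assumes you already have the text returned by the Redis `INFO replication` command from a replica, and it relies only on the standard `role`, `master_link_status` and `master_sync_in_progress` fields of that output. The function names and the sample data are hypothetical.

```python
def parse_info(raw):
    """Parse the text returned by the Redis INFO command into a dict of strings."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_link_ok(info):
    """Return True if a replica reports a healthy, fully synced link to its master.

    Assumption: `info` was parsed from INFO output taken on a replica.
    """
    return (
        info.get("role") == "slave"
        and info.get("master_link_status") == "up"
        and info.get("master_sync_in_progress") == "0"
    )


# Hypothetical sample output from a replica whose sync is not completing.
SAMPLE_INFO = """# Replication
role:slave
master_host:10.0.0.1
master_port:6379
master_link_status:down
master_sync_in_progress:1
"""

if __name__ == "__main__":
    print(replica_link_ok(parse_info(SAMPLE_INFO)))
```

A check along these lines could be run periodically against each replica and alert when it returns `False` for longer than an ordinary resynchronisation would take. What counts as too long is an assumption that depends on the deployment.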