mt1 high latency for messages
Incident Report for Pusher
Postmortem

On Monday, August 16th at 16:11 UTC, a Redis PUB/SUB shard replication issue caused around a third of Channels message publishes on the mt1 cluster to fail. The incident lasted until 16:43 UTC, when the replication issue was resolved.

Having investigated the root cause of the incident, we have planned mitigations around tuning Redis PUB/SUB replication to prevent a recurrence of this issue in the future.
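The postmortem does not name the specific settings involved, but as an illustration of the kind of tuning in question, the sketch below uses the redis-py client to inspect and raise the replica output buffer limits on a PUB/SUB shard. The host name and limit values are hypothetical; a replica whose output buffer exceeds these limits is disconnected and forced to resync, which is one common way replication fails under memory pressure.

```python
# Hypothetical sketch of one possible tuning step: raising the replica output
# buffer limits on a PUB/SUB shard so a temporarily slow replica is not
# disconnected mid-replication. Host name and limit values are illustrative;
# the postmortem does not say which parameters will actually be tuned.
import redis

r = redis.Redis(host="redis-pubsub-shard-1.internal", port=6379)

# Current limits: Redis disconnects a replica (forcing a full resync) when its
# output buffer exceeds the hard limit, or stays over the soft limit for the
# configured number of seconds.
print(r.config_get("client-output-buffer-limit"))

# Raise the replica class limits (hard 512 MB, soft 256 MB over 60 s). The
# "replica" class name requires Redis 5 or newer.
r.config_set("client-output-buffer-limit", "replica 512mb 256mb 60")

# Persist the change to the config file so it survives a restart.
r.config_rewrite()
```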

Timeline

All times are in UTC on the 16th of August 2021:

  • 16:11 UTC Our engineering on-call team was paged because a percentage of end-to-end test requests for the Pusher Channels product were failing; the team immediately started an internal investigation.
  • 16:38 UTC After analysing the Pusher Channels monitoring stack and noticing high latencies when writing to Redis from our services, the team identified an issue with the replication process for one of our Redis PUB/SUB shards (a sketch of such a health check follows this timeline).
  • 16:43 UTC The team increased the memory allocated to the Redis PUB/SUB shard, and its replication process started succeeding again.
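As a hypothetical illustration of the kind of check that points at a struggling replica (the host name and thresholds below are assumptions, not Pusher's actual monitoring code), a small redis-py script can read the replication and memory sections of INFO:

```python
# Hypothetical illustration: checking replication health on a Redis PUB/SUB
# shard with redis-py. The host name is made up; this is not Pusher's
# monitoring code.
import redis

shard = redis.Redis(host="redis-pubsub-shard-1.internal", port=6379)

repl = shard.info("replication")   # role, connected replicas, offsets
mem = shard.info("memory")         # used_memory, maxmemory

# On the primary, each attached replica reports its state and offset. A state
# other than "online", or an offset falling far behind master_repl_offset,
# means replication is not keeping up.
for i in range(repl.get("connected_slaves", 0)):
    replica = repl[f"slave{i}"]
    behind = repl["master_repl_offset"] - replica["offset"]
    print(f"replica {replica['ip']}:{replica['port']} "
          f"state={replica['state']} bytes_behind={behind}")

# Memory pressure on the shard is a common reason for replication to fail,
# which is consistent with the 16:43 fix of allocating more memory.
print("used_memory:", mem["used_memory_human"], "/ maxmemory:", mem["maxmemory_human"])
```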
Posted Aug 18, 2021 - 14:44 UTC

Resolved
The implemented fix resolved the incident.
Posted Aug 16, 2021 - 17:10 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 16, 2021 - 16:49 UTC
Investigating
We are currently investigating an issue with high latencies for messages on mt1.
Posted Aug 16, 2021 - 16:42 UTC
This incident affected: Channels REST API and Channels WebSocket client API.