Partial outage in the MT1 cluster affecting customers using presence and presence webhooks between 05:44 and 06:35 UTC

Incident Report for Pusher

Postmortem

Summary

On March 17th at 05:48 UTC, our alerting system paged on-call engineers about a problem with webhooks and an increase in API errors on MT1. API requests and webhooks related to presence channel members and channel existence were affected for 50% of apps using these features until the incident was resolved at 06:37 UTC.

The affected Redis shard cluster is responsible only for specific features: API requests and webhooks related to presence channel members and channel existence. There are two Redis shard clusters in MT1. One of them went down due to problems in the underlying infrastructure. Bringing this shard cluster back up resolved the incident.

Incident Timeline

At 05:44 UTC, a single Redis shard cluster on MT1 went down.
At 06:04 UTC, an engineer investigating the issue identified a possible solution and began implementing a resolution.
At 06:31 UTC, all nodes for the affected Redis shard cluster were replaced.
At 06:37 UTC, all affected systems were operational and the incident was resolved.

Root Cause

A single Redis shard cluster repeatedly failed to complete full synchronisation from the primary instance to the replicas because of insufficient disk space on the primary instance. At some point, full synchronisation requests from two replicas coincided in a way that caused the memory of the primary instance to grow continuously until it ran out of memory. At this point, the primary instance went down.

How will we ensure this does not happen again?

Check the disk space provisioned for Redis nodes and make sure they have sufficient disk space to support full synchronisation.
Add monitoring to get alerted when the disk space on Redis nodes is running low.
Investigate why partial synchronisation failed in the first place, which resulted in the need for full synchronisation.
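
As an illustration of the first two action items, the following is a minimal sketch of a disk-space check, not the tooling we actually run. It assumes that a full synchronisation writes a snapshot to disk that is no larger than the in-memory dataset, and that a 20% safety margin is adequate; the function names, the headroom value, and the data directory path are all hypothetical.

```python
import shutil


def has_room_for_full_sync(free_bytes: int, dataset_bytes: int,
                           headroom: float = 1.2) -> bool:
    """Return True if free disk space can hold a snapshot of the dataset.

    dataset_bytes is used as a conservative upper bound on snapshot size
    (an assumption; real snapshots are often smaller). headroom is a
    hypothetical safety multiplier, not a Redis setting.
    """
    if free_bytes < 0 or dataset_bytes < 0:
        raise ValueError("sizes must be non-negative")
    return free_bytes >= dataset_bytes * headroom


def check_data_dir(path: str, dataset_bytes: int,
                   headroom: float = 1.2) -> bool:
    """Check the filesystem that holds `path` (a hypothetical Redis data dir)."""
    free_bytes = shutil.disk_usage(path).free
    return has_room_for_full_sync(free_bytes, dataset_bytes, headroom)


if __name__ == "__main__":
    # Example: an 8 GiB dataset on the filesystem of the current directory.
    ok = check_data_dir(".", 8 * 2**30)
    print("enough disk for full sync" if ok else "ALERT: disk space too low")
```

A check like this could run periodically on each node and feed an alert, with the dataset size taken from whatever memory metric is already being collected.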

Posted Mar 20, 2023 - 14:33 UTC

Resolved

Between 05:44 UTC and 06:35 UTC, the MT1 cluster experienced a partial outage that affected 50% of clients using the Channels presence feature and Channels presence webhook events. During this time, customers who attempted to query presence information from our API may have received a 500 error.

We will provide an incident report with more information as soon as possible.

Posted Mar 17, 2023 - 06:00 UTC