MT1 Outage - Partial connection and message delivery failure on mt1 cluster

Incident Report for Pusher

Postmortem

On June 7th, from 14:18 to 14:26 UTC, users of our largest Channels cluster, MT1, experienced a major outage with the publish API.

We are still working on a resolution for the root cause of this issue. Because this was a significant incident, we want our reporting to you to be as accurate as possible, so for now we are still assessing the total impact of the underlying issues.

Below, we give a short summary of the specific incident that occurred between 14:18 and 14:26 UTC on June 7th. We will follow up with more details of the underlying issue and wider impact as soon as possible.

The incident on the 7th

All times are in UTC

At 13:49, one of the six shards became unavailable and message delivery started to be delayed.

Between 13:54 and 13:57 publish requests targeting this shard were failing. This amounted to up to 0.13% of the total publish requests.

At 14:10 an engineer was paged. The engineer identified the problematic shard. The root cause was identified as a Redis instance with high memory usage.

At 14:16 an engineer triggered a failover of the problematic Redis instance. This process was expected to complete within seconds, but because of the high data volume, replication took minutes.

Between 14:16 and 14:26 the API servers responsible for publishing were unable to connect to the failing Redis shard. This caused a health check to fail, and Kubernetes restarted the pods. However, the pods could not start while the shard was unavailable. As a result, publishing was completely unavailable.
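
The cascade described above can be modelled with a minimal Python sketch. It is an illustration under stated assumptions, not the actual Channels code: `pod_is_healthy`, `publishable_fraction` and `fraction_if_isolated` are hypothetical names, and the all-or-nothing check is an assumed simplification of the behaviour described, in which one unreachable shard is enough to fail the health check.

```python
from typing import List


def pod_is_healthy(shard_up: List[bool]) -> bool:
    # Assumed all-or-nothing check: a single unreachable shard fails it.
    return all(shard_up)


def publishable_fraction(shard_up: List[bool]) -> float:
    """Fraction of shards that can accept publishes through this pod."""
    if not pod_is_healthy(shard_up):
        # The pod is restarted and cannot start again, so it serves no shard.
        return 0.0
    return sum(shard_up) / len(shard_up)


def fraction_if_isolated(shard_up: List[bool]) -> float:
    """Fraction that would remain if the failure stayed confined to its shard."""
    return sum(shard_up) / len(shard_up)


one_shard_down = [True] * 5 + [False]  # one of six shards is unreachable
print(publishable_fraction(one_shard_down))  # 0.0
print(fraction_if_isolated(one_shard_down))  # five sixths, about 0.83
```

Under these assumptions, a failure that should affect one sixth of publishes removes all publishing capacity for as long as the shard stays down.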

At 14:26 the Redis failover procedure completed and the incident was resolved.

What went wrong?

Two independent issues caused this incident to escalate:

Extremely high data volumes on a single shard caused the Redis instance to fail and also made replication and recovery slow.
The failure of one shard caused a health check to fail, which in turn made all shards unavailable.

The former issue is part of a wider problem which we have identified and are working to resolve. We are still assessing the total impact of this issue, but we are aware that it was caused by a specific pattern of inappropriate usage that we had not anticipated and failed to safeguard against. We will follow up with further details once we have finished our investigation and put full mitigations in place.

The latter issue caused publishing to become completely unavailable, whereas the impact was initially limited to one sixth of channels on the MT1 cluster. We are currently working on a fix for this, which will allow the service to continue in the event of a partial failure.
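
One way to picture "continuing in the event of a partial failure" is a publish path that rejects only the requests routed to the unavailable shard. The Python sketch below is a hedged illustration, not the fix we are building: the CRC32-based `shard_for` routing rule, the `ShardUnavailable` exception, the in-memory `outbox` and the channel names are all hypothetical stand-ins.

```python
import zlib
from typing import List, Tuple

SHARD_COUNT = 6  # six shards, as in the incident described above


class ShardUnavailable(Exception):
    """Raised when the shard that owns a channel cannot be reached."""


def shard_for(channel: str) -> int:
    # Hypothetical routing rule: a stable hash of the channel name.
    return zlib.crc32(channel.encode("utf-8")) % SHARD_COUNT


def publish(channel: str, message: str, shard_up: List[bool],
            outbox: List[List[Tuple[str, str]]]) -> int:
    """Publish to the owning shard; fail only if that one shard is down."""
    shard = shard_for(channel)
    if not shard_up[shard]:
        raise ShardUnavailable(f"shard {shard} is unavailable")
    outbox[shard].append((channel, message))  # stand-in for the real write
    return shard


# Usage: mark the shard that owns "orders" as down, then publish to a few channels.
down = shard_for("orders")
shard_up = [i != down for i in range(SHARD_COUNT)]
outbox: List[List[Tuple[str, str]]] = [[] for _ in range(SHARD_COUNT)]
for channel in ("orders", "chat", "presence", "alerts"):
    try:
        publish(channel, "hello", shard_up, outbox)
        print(f"{channel}: published")
    except ShardUnavailable as err:
        print(f"{channel}: rejected ({err})")
```

In this sketch a down shard produces errors only for the channels it owns, so the blast radius stays at roughly one sixth of channels instead of the whole cluster.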

Posted Jun 10, 2022 - 15:23 UTC

Resolved

This incident has been resolved. We will publish more details soon.

Posted Jun 07, 2022 - 14:39 UTC

Investigating

We are currently investigating this issue.

Posted Jun 07, 2022 - 14:25 UTC

This incident affected: Channels REST API.