MT1 Outage

Incident Report for Pusher

Postmortem

Public postmortem: MT1 outage - 4th Aug 2022

On 4th August 2022, from 14:53 to 15:47 UTC, users of our largest Channels cluster, MT1, experienced a major outage. During this period the API was unavailable, so users were unable to publish messages using Pusher’s API.

The root cause was determined to be a lack of resilience in our systems: in particular, the lack of proper mitigation of Redis failovers.

Below, we give a short summary of the specific incident that occurred between 14:53 and 15:47 UTC on August 4th.

Incident timeline

All timestamps are in UTC

14:53 An engineer performed a maintenance operation on one of the MT1 Redis clusters to resize the Redis instance. This resulted in a failure of the Redis clusters backing our MT1 Channels cluster.

14:53:30 All the API pods died once the Redis failure was detected.

14:54 The api.pusherapp.com endpoint went down and started returning HTTP 503 status codes.

14:59 The on-call engineer was paged and an incident began.

15:07 The engineers fighting the incident discovered the failure of the API pods.

15:10 The Redis instances in the MT1 cluster recovered from the failure.

15:11 api.pusherapp.com came back online and messages started to come through.

15:16 API pods were scaled up to twice the normal number to compensate for the increased influx of messages.

15:16-15:47 The API pods came online in an inconsistent manner, with degraded performance during this period.

15:47 The API pods reached the desired number and performance was restored.

What was the impact on end-users?

Between 14:53 and 15:47 UTC, users may have experienced:

Failure to publish a message

What went wrong?

Several things went wrong and together caused the outage:

The maintenance operation caused a failover in one of the Redis clusters backing MT1, as was expected. The Redis failover, however, affected all other Redis clusters in MT1, which was not expected. Our engineers are investigating why this happened.
The unavailability of Redis caused all of the API pods to fail simultaneously. This made clear that our systems did not handle a Redis failure well enough. Our engineers are investigating how to improve resilience so that, when such a failure occurs, the failover is handled more gracefully.
The simultaneous failure of all the API pods made restarting the pods difficult even when Redis became available again. The newly started pods were overwhelmed with requests and were killed. The number of pods started eventually caught up with demand, but this took some time.
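
As an illustration of handling a dependency failure more gracefully, the sketch below retries a failing operation with capped exponential backoff and jitter instead of letting the process die on the first error. It is not Pusher's implementation: the names call_with_backoff and DependencyUnavailable, the retry limits, and the use of ConnectionError to stand in for a Redis failure are all assumptions made for the example.

```python
import random
import time


class DependencyUnavailable(Exception):
    """Raised when the backing store is still unavailable after all retries."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying on ConnectionError with capped, jittered backoff.

    `operation` is a hypothetical zero-argument callable that talks to the
    backing store. `sleep` and `rng` are injectable so the function is testable.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            # Exponential backoff capped at max_delay, scaled by a random
            # factor so that many pods do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt)) * rng()
            sleep(delay)
    raise DependencyUnavailable(f"gave up after {max_attempts} attempts")
```

The jitter matters when many pods lose the same dependency at once: without it, every pod would retry at the same instants and could overload the dependency just as it recovers.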

How will we ensure this doesn’t happen again?

A chain of events caused this outage. Our engineers are investigating each event in isolation and will make changes to increase the resilience of each element, thus making the system more robust in general. The steps we have identified as fixes to be undertaken immediately include:

Fix readiness detection settings for the pods and likely adjust the moment when the API pod reports itself as unready.
Tune the pods to allow them to start up in a more flexible way and prevent them from being started and then killed by an overflow of requests.
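
The two fixes above can be sketched together as follows. This is one possible approach under stated assumptions, not the change Pusher actually made: a pod reports itself unready (but stays alive) while its dependency is down, and it rejects requests beyond a fixed in-flight limit with a 503 rather than accepting more work than it can handle. AdmissionController, handle_request, and the max_in_flight parameter are hypothetical names introduced for the example.

```python
import threading


class AdmissionController:
    """Tracks dependency health and caps in-flight requests (hypothetical helper)."""

    def __init__(self, max_in_flight):
        self._max_in_flight = max_in_flight
        self._in_flight = 0
        self._dependency_ok = False  # unready until a health check succeeds
        self._lock = threading.Lock()

    def set_dependency_ok(self, ok):
        """Record the result of the latest dependency health check."""
        with self._lock:
            self._dependency_ok = ok

    def is_ready(self):
        """What a readiness endpoint would report: unready, but still alive."""
        with self._lock:
            return self._dependency_ok and self._in_flight < self._max_in_flight

    def try_acquire(self):
        """Admit one request if the dependency is up and capacity remains."""
        with self._lock:
            if not self._dependency_ok or self._in_flight >= self._max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle_request(controller, work):
    """Return (status, body); shed load with a 503 instead of overloading the pod."""
    if not controller.try_acquire():
        return 503, "service unavailable, retry later"
    try:
        return 200, work()
    finally:
        controller.release()
```

For example, with `controller = AdmissionController(max_in_flight=100)`, a freshly started pod would answer 503 until `controller.set_dependency_ok(True)` is called, and would then serve at most 100 requests concurrently while the remaining pods come online.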

Posted Aug 16, 2022 - 13:32 UTC

Resolved

This incident has been resolved.

Posted Aug 04, 2022 - 15:48 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Aug 04, 2022 - 15:20 UTC

Update

We are continuing to investigate this issue.

Posted Aug 04, 2022 - 15:08 UTC

Investigating

We are currently investigating this issue.

Posted Aug 04, 2022 - 15:08 UTC

This incident affected: Channels REST API.