Elevated error rates on MT1 cluster

Incident Report for Pusher

Postmortem

Between 14:30 and 15:30 UTC on the 26th of January, clients of the Channels API endpoint may have received 502 or 504 responses. The issue continued through the day until it was finally resolved at 23:11 UTC.

Note that the incident details on the status page have been edited to reflect the actual start of the incident. The incident was originally posted at 22:50. We want to apologise for the late acknowledgement and resolution of this incident. We know these issues happen and that our customers expect us to handle them efficiently.

A timeline of the incident

At 14:34 our monitoring system alerts on an elevated rate of 500 responses on the MT1 cluster. An engineer investigates and identifies a single node with high CPU load. The engineer replaces the node and continues to monitor, as the issue seems to be resolved.

The issue is escalated to engineering management, which (incorrectly) does not classify it as an incident requiring a public post on the status page.

Between 20:19 and 21:46 the same alarm triggers twice, in short bursts. These alarms auto-resolve before an engineer investigates.

At 22:12, as a response to a customer inquiry, a Support Engineer escalates the issue again. The inquiry also mentions the EU cluster.

Between 22:12 and 23:00 several Engineers investigate. Initially, the investigation centres around the EU cluster, which is not experiencing issues.

Eventually, the issue is confirmed on the MT1 cluster and a problem is identified: the capacity of an autoscaling group has reached its limit. The limit is raised, and capacity then increases automatically.
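As an illustration of the kind of check that could surface this condition earlier, here is a minimal sketch in Python. It is not our actual tooling: the `ScalingGroup` type, the function names and the 20% headroom threshold are all hypothetical, and the values would in practice come from whatever API the infrastructure provider exposes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScalingGroup:
    """Hypothetical snapshot of an autoscaling group's settings."""
    name: str
    desired_capacity: int  # instances the autoscaler currently wants
    max_size: int          # configured upper limit of the group


def headroom_ratio(group: ScalingGroup) -> float:
    """Fraction of the configured limit still unused (0.0 means at the limit)."""
    if group.max_size <= 0:
        return 0.0
    return max(0.0, (group.max_size - group.desired_capacity) / group.max_size)


def groups_near_limit(groups: List[ScalingGroup], min_headroom: float = 0.2) -> List[str]:
    """Return the names of groups whose remaining headroom is below min_headroom.

    The 0.2 default is an assumed value for illustration only.
    """
    return [g.name for g in groups if headroom_ratio(g) < min_headroom]
```

A group whose desired capacity equals its maximum has zero headroom and would be reported, while a group at 40% of its maximum would not.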

What was the impact on end users?

Only a small fraction of requests (less than 0.5% at peak bursts) failed, and these requests were immediately retried. While end users will not have been impacted, we take this issue very seriously. Our monitoring systems trigger at an early stage to prevent an actual outage.

Some customers who publish at a relatively high rate (e.g. thousands per second) reported the 500 responses in their internal monitoring systems. At the same time, they reported no impact on their end users. The main concern here is communication, where we set the bar high and need to adjust.

Why was the response so slow?

On the surface, the simplest explanation for the slow response is human error. Two mistakes were made initially:

the initial investigation incorrectly concluded that a single node was at fault
the incident was not publicly acknowledged from the start

More than a story of human error, this was a problem of a low signal-to-noise ratio. Ideally, an elevated error rate should automatically alert Engineers and customers so everyone can take appropriate action. This was not feasible, as noise was making these alerts untrustworthy. Noise also made it difficult to identify the issue.
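One common way to make an error-rate alert more trustworthy is to page only when the rate stays above a threshold in several recent measurement windows, rather than on a single spike. The sketch below illustrates that idea in Python. It is an assumption-laden example, not a description of our alerting: the function names, the 0.1% threshold and the "at least two breaching windows" rule are all hypothetical.

```python
from typing import List, Tuple


def error_rate(failed: int, total: int) -> float:
    """Fraction of failed requests in one window; 0.0 when there was no traffic."""
    if total == 0:
        return 0.0
    return failed / total


def should_page(
    windows: List[Tuple[int, int]],
    threshold: float = 0.001,  # assumed value: 0.1% of requests failing
    min_breaches: int = 2,     # assumed value: breaches needed before paging
) -> bool:
    """Decide whether to page, given recent (failed, total) request counts.

    Each tuple covers one measurement window, for example one minute.
    Paging requires at least min_breaches windows above the threshold,
    so a single isolated spike does not page anyone.
    """
    breaches = sum(
        1 for failed, total in windows if error_rate(failed, total) > threshold
    )
    return breaches >= min_breaches
```

With these assumed defaults, two windows at roughly 0.5% errors would page, while a single such window among otherwise clean ones would not. The trade-off is that stricter rules reduce noise but delay the first page.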

How will we ensure this doesn’t happen again?

We don’t want our customers to tell us we have problems before we acknowledge them. It’s embarrassing for us and a terrible experience for the customer.

Since the incident, we have reviewed the relevant metrics and made sure the alerts page the right people immediately. We’ve also made the relevant metric more prominent in our monitoring dashboards, with a clear indication of violations.

Ensuring an efficient monitoring and incident response system is hard, continuous work, and this incident is a stark reminder of that.

Posted Jan 29, 2021 - 12:42 UTC

Resolved

This incident has been resolved.

Posted Jan 26, 2021 - 23:11 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 26, 2021 - 23:00 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Jan 26, 2021 - 14:30 UTC

This incident affected: Channels REST API.