Dropped connections on mt1
Incident Report for Pusher

At 18:00 UTC on the 17th of December, 50% of all WebSocket connections on MT1 were abruptly closed. Clients immediately started reconnecting. Upon re-connection, 50% of the connection attempts got 502 responses. All clients successfully re-connected within a few minutes.

Note that the incident details on the status page have been edited to reflect the actual start time of the incident. The incident update was originally posted at 18:24, but engineers began working on mitigations immediately in response to the incident.

A timeline of the incident

On Dec. 16th, 13:00, engineers gradually switched traffic for the MT1 WebSocket endpoint towards a new Kubernetes deployment, eventually targeting 20%. The traffic was directed by weighted DNS.

On Dec. 17th, 13:00, after reviewing key metrics for the deployment, engineers started switching more traffic to the new deployment, targeting 70%.

At 15:00, one of the Kubernetes worker nodes was marked unhealthy. The ingress pods on the node had been killed by the OOM killer. This did not trigger an alert.

Between 15:00 and 17:45, ingress pods on more nodes were killed by the OOM killer until 60% of capacity remained.

At 18:00, all the remaining ingress pods were oomkilled and all nodes were marked unhealthy. All WebSocket connections were closed and the load balancer started returning 502.

Between 18:00 and 18:05, clients kept attempting to re-connect until they were successful. DNS was still pointing 30% of the traffic to a healthy deployment. Engineers updated the DNS weights to move 100% towards the healthy deployment.

Some connections (< 10%) fell back to sockJS and the sockJS endpoints became overwhelmed for a short period. Latencies went up and more capacity was added. Latencies returned to normal levels after 15 minutes.

At 18:15 the incident was resolved.

What was the impact on end users?

50% of clients on MT1 lost connection for up to 5 minutes. Connections were re-tried immediately, and some clients failed to re-connect to WebSockets and fell back to sockJS. The sockJS clients may have experienced high latencies for a period of up to 15 minutes.

What went wrong?

The root cause of this incident was a mis-configured deployment, with insufficient memory and no liveness probe. This meant ingress pods were killed and not re-started. As more pods were killed, connections were concentrated on the remaining pods until all were killed at the same time.
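The concentration effect described above can be illustrated with a small simulation. This is only a rough sketch under stated assumptions, not a model of the actual ingress setup: it uses a hypothetical per-pod connection limit as a stand-in for the memory limit, assumes killed pods are never re-started, and assumes their connections are spread evenly over the surviving pods. The function name and all numbers are made up for illustration.

```python
def simulate_cascade(connections, capacity):
    """Return the number of healthy pods after each round of failures.

    connections: per-pod connection counts at the start (hypothetical).
    capacity: connections a pod can hold before it is killed
              (a stand-in for the memory limit; an assumption).
    Killed pods are never re-started, mirroring a deployment
    without a working restart mechanism.
    """
    pods = list(connections)
    history = [len(pods)]
    while pods:
        survivors = [c for c in pods if c <= capacity]
        orphaned = sum(c for c in pods if c > capacity)
        if orphaned == 0:
            break  # every remaining pod is within its limit
        if survivors:
            # Connections from killed pods re-connect evenly to survivors.
            share = orphaned / len(survivors)
            survivors = [c + share for c in survivors]
        pods = survivors
        history.append(len(pods))
    return history


if __name__ == "__main__":
    # One overloaded pod is enough to take down the whole deployment.
    print(simulate_cascade([1100, 950, 800, 700, 600], capacity=1000))
```

With these made-up numbers, the healthy pod count goes from 5 to 4 to 2 to 0: each failure pushes more load onto fewer pods, so the later failures arrive faster and closer together than the first.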

More importantly, this deployment did not have sufficient alerts to catch the problem before impacting users. At minimum, engineers should be paged when public-facing nodes are marked unhealthy for an extended period.

How will we ensure this doesn’t happen again?

We are now ensuring that engineers are paged on unhealthy (public) nodes. We are also going through any additional early indicators to discuss when engineers should be paged.
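As a rough illustration of the paging rule, the check below selects nodes that have been unhealthy for longer than a threshold. It is a minimal sketch, not our actual alerting configuration: the function name, the node names, the input format, and the 5-minute threshold are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def nodes_to_page(unhealthy_since, now, threshold=timedelta(minutes=5)):
    """Return names of nodes unhealthy for at least `threshold`.

    unhealthy_since: dict mapping node name to the time the node was
                     first marked unhealthy, or None if it is healthy
                     (hypothetical input format).
    threshold: how long a node may stay unhealthy before paging
               (5 minutes is an assumed value).
    """
    return sorted(
        name
        for name, since in unhealthy_since.items()
        if since is not None and now - since >= threshold
    )


if __name__ == "__main__":
    now = datetime(2021, 12, 17, 15, 10)
    state = {
        "node-a": datetime(2021, 12, 17, 15, 0),  # unhealthy for 10 minutes
        "node-b": None,                           # healthy
        "node-c": datetime(2021, 12, 17, 15, 8),  # unhealthy for 2 minutes
    }
    print(nodes_to_page(state, now))
```

Waiting for a threshold rather than paging on the first unhealthy mark avoids noise from short-lived blips, while still catching a node that stays down for hours.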

We are ensuring liveness checks are correctly configured for our Kubernetes deployments. In this case we were hesitant to add liveness probes as we didn’t want Kubernetes to kill pods holding thousands of active connections. Unfortunately, the OOM killer did the job instead.

Posted Jan 04, 2022 - 16:19 UTC

This incident has been resolved.
Posted Dec 17, 2021 - 19:02 UTC
A large portion of the dropped connections fell back to sockJS connections, and we ran out of sockJS capacity for a short period of time. We've added more capacity and are seeing latencies coming back to normal.
Posted Dec 17, 2021 - 18:59 UTC
A large number of connections were dropped on the mt1 cluster. We are investigating the issue.
Posted Dec 17, 2021 - 18:00 UTC
This incident affected: Channels WebSocket client API.