Between 15 September 15:00 UTC and 16 September 16:00 UTC, clients may have experienced intermittent issues connecting to the Channels API. Some clients may have seen a total service disruption between 12:00 and 15:00 UTC on 16 September.
Please note that the incident details on the status page have been edited to reflect the actual start time of the incident; it was originally posted at 14:54 on 16 September. We would like to apologise for the delayed acknowledgement and resolution of this incident. We know that when these issues occur our customers expect us to handle them quickly and efficiently, and this remains a major priority for our team.
(All times are in UTC)
On 15 Sep. at 13:12 an engineer began increasing traffic to a new Kubernetes deployment as part of a larger, gradual roll-out. At this point the servers in the deployment hit a configured connection limit and stopped accepting new connections. Affected clients would have seen 1006 (abnormal closure) errors on WebSocket connections until they eventually reconnected to another server (not on k8s) or to our fallback SockJS infrastructure.
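To illustrate the client behaviour described above, here is a minimal sketch of fallback logic: after repeated abnormal WebSocket closures (close code 1006), a client gives up on the primary transport and switches to the SockJS fallback. The function name and retry threshold are illustrative assumptions, not our actual client code.

```python
MAX_WS_RETRIES = 3  # assumed retry budget before falling back (hypothetical)

def next_transport(close_codes: list[int]) -> str:
    """Decide which transport to try next, given the WebSocket close
    codes observed so far. 1006 indicates an abnormal closure."""
    abnormal = sum(1 for code in close_codes if code == 1006)
    if abnormal >= MAX_WS_RETRIES:
        return "sockjs"    # exhausted retries: switch to the fallback
    return "websocket"     # keep retrying the primary transport
```

For example, `next_transport([1006, 1006, 1006])` returns `"sockjs"`, which is why rejected connections ended up shifting load onto the fallback infrastructure.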
Between 15 Sep. at 13:12 and 16 Sep. at 12:00 engineers gradually shifted more traffic to the new deployment, until 70% of EU traffic was directed towards it.
On 16 Sep. at 12:10 a majority of our traffic fell back to the SockJS infrastructure, which was reaching capacity. This caused major degradation, and latencies rose to levels we consider unacceptable for our service.
On 16 Sep. at 14:20 an engineer was contacted directly by a user via a shared Slack channel. The user shared details about high latencies from the SockJS endpoints. The engineer confirmed that the SockJS infrastructure had reached capacity and started a manual intervention to add more capacity.
Between 14:25 and 14:38 more capacity was added to the SockJS infrastructure and latencies dropped.
On 16 Sep. at 15:00 the incident was resolved.
Between 12:00 and 15:00 on 16 Sep. up to 50% of the connections on the EU cluster were affected, experiencing either multiple retries or a complete failure to connect.
While our engineers were monitoring the roll-out, they were mainly focused on the metrics for the new deployment. These metrics looked healthy overall, and it was not clear that traffic was being rejected at such a high rate.
To get the full picture, engineers would have needed to monitor three separate systems at the same time: the new and old deployments, as well as the fallback infrastructure.
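One way to make that full picture visible is a single acceptance-rate view aggregated across all three systems, so that rejections in any one of them show up in one number. This is a hypothetical sketch with made-up metric names and figures, not our actual monitoring setup.

```python
def overall_acceptance_rate(metrics: dict[str, dict[str, int]]) -> float:
    """metrics maps a system name to {"accepted": n, "rejected": n}.
    Returns the fraction of connection attempts accepted across all systems."""
    accepted = sum(m["accepted"] for m in metrics.values())
    rejected = sum(m["rejected"] for m in metrics.values())
    total = accepted + rejected
    return accepted / total if total else 1.0

# Illustrative numbers: each system looks tolerable in isolation,
# but the combined view reveals a large rejection rate.
rate = overall_acceptance_rate({
    "k8s_new":  {"accepted": 900, "rejected": 600},  # new deployment at its limit
    "legacy":   {"accepted": 300, "rejected": 0},
    "fallback": {"accepted": 200, "rejected": 400},  # SockJS over capacity
})
```

A dashboard built on a combined metric like this would have surfaced the high rejection rate without engineers having to watch three systems separately.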
Throughout the incident our synthetic tests were flapping: triggering and then auto-resolving. This was misinterpreted as intermittent network issues. Our synthetic tests mimic end-user clients, but they do not cover every possible client configuration. Critically, we did not have synthetic tests targeted exclusively at our fallback infrastructure.
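A synthetic test aimed exclusively at the fallback infrastructure could be as simple as the sketch below: probe the fallback endpoint directly and alert when either the status or the latency is out of budget. The endpoint URL, latency budget, and injected `probe` function are all assumptions for illustration; real synthetic tests would open actual SockJS connections.

```python
FALLBACK_ENDPOINT = "https://sockjs.example.com/health"  # hypothetical URL
LATENCY_BUDGET_MS = 500  # assumed acceptable latency, for illustration

def check_fallback(probe) -> bool:
    """probe(url) -> (status_code, latency_ms). The prober is injected so
    the check itself stays testable. Returns True if the fallback is healthy."""
    status, latency_ms = probe(FALLBACK_ENDPOINT)
    return status == 200 and latency_ms <= LATENCY_BUDGET_MS
```

Because this check targets only the fallback path, it would have fired steadily (rather than flapping) once the SockJS infrastructure approached capacity.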
We don’t want our customers to tell us we have problems before we acknowledge them. It’s embarrassing for us and a terrible experience for our users.
Ironically, this incident happened as part of a bigger effort to improve our infrastructure and alerting. We are in the process of moving our legacy infrastructure to a more modern and resilient system based on Kubernetes. We expect to significantly improve stability and performance once this migration is complete. Big changes like these can be risky, and unfortunately on this occasion we saw what’s at stake.
We identified the root cause as a connection limit configured on the k8s deployment, which caused the pods to reject traffic beyond a certain number of connections. Our metrics in this area were also insufficient. We are fixing both issues.
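The failure mode itself is easy to reproduce in miniature: a server with a hard connection cap silently refuses connections once the cap is reached, which the client observes only as an abnormal close. This sketch is illustrative; the class name and cap value are made up, not the actual configured limit.

```python
class ConnectionLimiter:
    """Toy model of a per-pod connection cap."""

    def __init__(self, max_connections: int):
        self.max_connections = max_connections
        self.active = 0

    def try_accept(self) -> bool:
        """Accept a new connection if under the cap; otherwise reject it.
        A rejected client typically sees an abnormal close (e.g. 1006)."""
        if self.active >= self.max_connections:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        """A connection closed normally, freeing a slot."""
        self.active -= 1
```

The subtlety is that a pod at its cap still looks healthy on its own metrics (existing connections are fine), which is exactly why the rejections were not obvious during the roll-out.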
More importantly, we are improving our synthetic tests and the escalation mechanisms around them so we can detect these issues and fix them well before they affect users.