Elevated 5XX and 4XX errors on mt1 cluster
Incident Report for Pusher
Postmortem

From 2019-12-12 to 2020-01-14, clients using Keep-Alive when making HTTP requests to the Channels REST API on our mt1 cluster could receive incorrect HTTP responses.

The main feature impacted by this issue was the REST query API: requests for channel metadata may have had incorrect responses. Core pub/sub message delivery was not affected: messages published via the REST API were still delivered, although the HTTP responses to those publish requests may have been incorrect. Client-side features, including presence and client events, were not affected. Webhooks were not affected.
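
As a rough illustration of the affected surface, here is a sketch assuming the pusher-http-python client; the credentials and channel names below are placeholders, not values from this incident:

```python
# Sketch of the affected API surface, assuming the pusher-http-python client.
# The app_id, key, secret and channel names below are placeholders.
import pusher

pusher_client = pusher.Pusher(
    app_id="123456",
    key="app-key",
    secret="app-secret",
    cluster="mt1",
)

# REST query API: fetch channel metadata. These are the requests whose
# responses could have been incorrect during the incident.
occupied = pusher_client.channels_info("presence-", ["user_count"])
one_channel = pusher_client.channel_info("presence-chatroom", ["user_count"])

# Publishing via the REST API: delivery of the message itself was unaffected,
# though the HTTP response to this call could also have been incorrect.
pusher_client.trigger("my-channel", "my-event", {"message": "hello"})
```

Only requests sent over reused Keep-Alive connections were exposed to the misaligned responses described below.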

The underlying issue was that, under some conditions, our application server processes would send a single 4XX response multiple times, leaving all subsequent responses on that connection misaligned with their requests. This bug existed before 2019-12-12, but was masked because a proxy in front of those processes did not use Keep-Alive. That proxy was removed on mt1 on 2019-12-12, exposing the issue to end users. The underlying issue was diagnosed on 2020-01-13 and fixed on 2020-01-14.
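
To make the failure mode concrete, here is a minimal, self-contained sketch (plain Python sockets, not our actual server or proxy code) in which a toy server writes the same 413 response twice for the first request on a Keep-Alive connection; every response the client reads after that is off by one:

```python
# Toy reproduction of the failure mode, not Pusher's actual server code:
# the server writes one 413 response twice, so every response the client
# reads after that on the same Keep-Alive connection is off by one.
import socket
import threading


def faulty_server(listener):
    conn, _ = listener.accept()
    first = True
    buf = b""
    while True:
        # Read one request; requests in this sketch have no body, so the
        # header terminator is enough to delimit them.
        while b"\r\n\r\n" not in buf:
            data = conn.recv(4096)
            if not data:
                conn.close()
                return
            buf += data
        _, buf = buf.split(b"\r\n\r\n", 1)

        if first:
            # The bug: a single 4XX response is written twice.
            resp = b"HTTP/1.1 413 Payload Too Large\r\nContent-Length: 0\r\n\r\n"
            conn.sendall(resp + resp)
            first = False
        else:
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")


def read_one_response(conn, buf):
    # Read headers, then exactly Content-Length body bytes; anything left
    # over belongs to the *next* response on the connection.
    while b"\r\n\r\n" not in buf:
        buf += conn.recv(4096)
    head, rest = buf.split(b"\r\n\r\n", 1)
    length = 0
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    while len(rest) < length:
        rest += conn.recv(4096)
    return head.split(b"\r\n")[0].decode(), rest[length:]


listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=faulty_server, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
leftover = b""
for i, path in enumerate(["/channels/a", "/channels/b"], start=1):
    request = f"GET {path} HTTP/1.1\r\nHost: example\r\nConnection: keep-alive\r\n\r\n"
    client.sendall(request.encode())
    status, leftover = read_one_response(client, leftover)
    print(f"request {i} ({path}) -> {status}")
```

Running this prints a 413 for both requests, even though the server intended a 200 for the second one; the unread 200 would instead be returned for whatever request the client sends next.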

Timeline

(All times in UTC.)

  • 2019-12-12 onwards: As part of a larger infrastructure project, engineers begin gradually shifting REST API traffic on the mt1 cluster towards servers on new infrastructure. The new infrastructure sits behind a different set of proxies, all of which respect Keep-Alive.
  • 2020-01-08 onwards: Some reports of HTTP requests receiving 413 error responses when they should not.
  • 2020-01-13: An engineer succeeds in reproducing the issue and discovers the root cause.
  • 2020-01-14 09:45: An engineer attempts to mitigate the issue by deploying a new proxy in front of the processes on the new infrastructure.
  • 2020-01-14 14:00: Metrics report elevated 5XX responses. An engineer diagnoses these as originating from the new proxy’s interaction with the faulty application processes.
  • 2020-01-14 15:00: Engineers begin shifting traffic away from the new infrastructure, back to the previous infrastructure serving REST API traffic on the mt1 cluster.
  • 2020-01-14 19:30: No REST API traffic is reaching the faulty processes on the new infrastructure. The incident is resolved.

How will we ensure it doesn’t happen again?

The primary problem in this incident was our slow response: once we had identified the issue, it took too long to roll back. We have identified two major reasons for this, each with its own planned mitigation:

  • When we identified the issue, we attempted to “fix forward” instead of roll back. To mitigate this future risk, we will implement a formal incident policy to “prefer roll back over fix forward”.
  • When we decided to roll back, our old infrastructure had insufficient capacity to serve all mt1 REST traffic. We first had to increase its capacity, which slowed the rollback. To mitigate this future risk, we will keep the old infrastructure at full capacity until it is destroyed.
Posted Jan 27, 2020 - 10:15 UTC

Resolved
A postmortem will follow.
Posted Jan 14, 2020 - 19:38 UTC
Monitoring
All HTTP API traffic on mt1 is being directed to working instances.

A small number of requests are still reaching the faulty instances. We believe this is due to clients not respecting the 60-second DNS TTL. Please clear your DNS cache.

We are monitoring our 4XX and 5XX rates on this cluster, and our traffic rates.

A postmortem will follow.
Posted Jan 14, 2020 - 19:07 UTC
Update
DNS is now directing all traffic to working instances. Around 10% of requests are still being received by faulty instances. We believe some of these are due to Keep-Alive; we are closing those connections to prompt clients to move over to working instances.

(Some other requests may be due to clients not respecting the 60s TTL; in this case users may need to clear their DNS caches.)
Posted Jan 14, 2020 - 18:48 UTC
Update
No users should be experiencing bad 5XX responses. Some users may still be experiencing 4XX responses. We are continuing to migrate traffic away from bad instances. We expect all traffic to be migrated within 60 minutes.
Posted Jan 14, 2020 - 18:34 UTC
Update
We are continuing to move traffic back to instances unaffected by the 4XX and 5XX issues.
Posted Jan 14, 2020 - 18:09 UTC
Identified
We have identified the cause of the 4XX and 5XX responses. We are rolling back a failed deployment. We expect to see reduced error rates as traffic migrates.

We'll update this with more details soon.
Posted Jan 14, 2020 - 17:47 UTC
Update
We are continuing to investigate this issue.
Posted Jan 14, 2020 - 17:21 UTC
Investigating
We're seeing an increase in 5XX and 4XX errors on the mt1 cluster and we're currently investigating the cause.
Posted Jan 14, 2020 - 15:22 UTC
This incident affected: Channels REST API.