Partial Webhook Outage For All Multi-Tenant Clusters Excluding MT1

Incident Report for Pusher

Postmortem

Summary

On August 17th, at 09:33 UTC, on-call engineers were notified by the Customer Support team that there was a problem with webhooks on SockJS servers on all multi-tenant clusters excluding MT1. No webhooks were delivered on these clusters for SockJS.

As part of an ongoing effort to ensure a more stable experience for our customers, we have been moving our applications from self-managed Kubernetes clusters to EKS (Elastic Kubernetes Service). On August 1st we began gradually moving SockJS traffic to EKS.

During this procedure, a variable was misconfigured on the SockJS clusters, which disabled webhooks for these clusters. On August 17th, following discovery of the issue, we rolled back all clusters to the old SockJS infrastructure.

Incident Timeline

At 08:33 UTC, the customer support team reported an issue regarding webhooks not being sent for SockJS/XHR connections on the eu cluster.

At 08:37 UTC, an engineer investigating the issue identified a possible solution and began implementing a resolution.

At 09:30 UTC, the customer support team reported that the issue was still not fixed. The investigating engineer decided to escalate the priority of the report internally.

At 09:42 UTC, the customer support team determined that the problem was present on all the clusters except mt1.

At 09:56 UTC, engineers determined that the issue was linked to a rollout of new EKS SockJS infrastructure. They decided to move the traffic off the new SockJS EKS infrastructure and back to the old infrastructure.

At 10:05 UTC, the customer support team reported that the issue was resolved on the eu cluster. All other clusters were also rolled back to the old infrastructure and testing confirmed that webhooks were once again functional for all clusters.

Root Cause

In the process of moving SockJS/XHR servers to EKS, a variable was misconfigured, which disabled writing to Kinesis. Kinesis is the basis of the webhooks process, and hence no webhooks were sent for these clusters.
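
As an illustration of how this class of misconfiguration can be caught before a server takes traffic, here is a minimal sketch of a startup configuration check. The variable names WEBHOOKS_ENABLED and KINESIS_STREAM_NAME are hypothetical; this report does not name the variable that was misconfigured, and the sketch is not the actual Pusher implementation.

```python
# Hypothetical variable names, chosen for illustration only.
REQUIRED_VARS = ("WEBHOOKS_ENABLED", "KINESIS_STREAM_NAME")
TRUTHY = ("1", "true", "yes")


def validate_webhook_config(env):
    """Return a list of problems that would silently disable webhook publishing.

    `env` is any mapping of variable names to string values, e.g. os.environ.
    An empty list means the configuration looks sane.
    """
    problems = []

    # A missing or blank variable is treated as an error rather than a default.
    for name in REQUIRED_VARS:
        if not env.get(name, "").strip():
            problems.append(f"{name} is missing or empty")

    # A present but non-truthy flag would turn webhooks off without any error.
    enabled = env.get("WEBHOOKS_ENABLED", "").strip().lower()
    if enabled and enabled not in TRUTHY:
        problems.append(
            f"WEBHOOKS_ENABLED is {enabled!r}, so webhooks would be disabled"
        )

    return problems
```

A server could call validate_webhook_config(os.environ) at startup and refuse to start if the returned list is not empty, so that a bad value fails loudly during rollout instead of silently dropping webhooks.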

How will we ensure this does not happen again?

We are reviewing all the variables required to ensure the proper operation of these servers. We are also adding webhooks to the list of tests that need to pass before a cluster is considered operational. This will allow us to detect issues earlier and prevent them from making it into the production environment.
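
A webhook check of this kind might look roughly like the sketch below: trigger an action that should produce a webhook, then poll a test receiver until the webhook arrives or a timeout expires. The callables trigger_event and webhook_received are hypothetical stand-ins for whatever test client and receiver are used; they are assumptions for illustration, not part of any Pusher API.

```python
import time


def webhook_smoke_test(trigger_event, webhook_received, timeout_s=30.0,
                       poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Return True if a triggered test action results in a delivered webhook.

    trigger_event():          hypothetical callable that performs a
                              webhook-generating action (for example over a
                              fallback transport) and returns a unique marker.
    webhook_received(marker): hypothetical callable that returns True once the
                              test receiver has seen a webhook for that marker.
    """
    marker = trigger_event()
    deadline = clock() + timeout_s

    while True:
        if webhook_received(marker):
            return True
        if clock() >= deadline:
            # No webhook within the timeout: the cluster should not be
            # considered operational.
            return False
        sleep(poll_s)
```

Running one such check per connection protocol (WebSocket, SockJS, XHR streaming) would cover the case in this incident, where only the fallback protocols were affected.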

Posted Sep 05, 2022 - 09:04 UTC

Resolved

We have now resolved the issue for all clusters and webhooks will be correctly fired for all clients.

Posted Aug 17, 2022 - 10:46 UTC

Identified

We have identified an issue preventing the correct delivery of webhooks for actions triggered by clients connected via our fallback connection protocols. Clients connected via SockJS or XHR Streaming will not trigger webhooks correctly on the following clusters:
us2, us3, eu, ap1, ap2, ap3, ap4, sa1.

Any clients connected via websockets, the primary connection protocol, are not affected.
Any clients connected to mt1, using any protocol, are not affected.

A fix has been identified and will be rolled out to each cluster.

Posted Aug 17, 2022 - 09:54 UTC

This incident affected: Channels Webhooks.