On August 17th, at 09:33 UTC, on-call engineers were notified by the Customer Support team that there was a problem with webhooks on SockJS servers on all multi-tenant clusters except MT1. No webhooks were delivered on these clusters for SockJS.
As part of an ongoing effort to ensure a more stable experience for our customers, we have been moving our applications from self-managed Kubernetes clusters to EKS (Elastic Kubernetes Service). On August 1st we began gradually moving SockJS traffic to EKS.
During this procedure, a variable was misconfigured on the SockJS clusters, which disabled webhooks for these clusters. On August 17th, following discovery of the issue, we rolled back all clusters to the old SockJS infrastructure.
At 08:33 UTC, the customer support team reported an issue regarding webhooks not being sent for SockJS/XHR connections on the eu cluster.
At 08:37 UTC, an engineer investigating the issue identified a possible solution and began implementing a resolution.
At 09:30 UTC, the customer support team reported that the issue was still not fixed. The investigating engineer decided to escalate the priority of the report internally.
At 09:42 UTC, the customer support team determined that the problem was present on all the clusters except mt1.
At 09:56 UTC, engineers determined that the issue was linked to the rollout of the new EKS SockJS infrastructure. They decided to move the traffic off the new SockJS EKS infrastructure and back to the old infrastructure.
At 10:05 UTC, the customer support team reported that the issue was resolved on the eu cluster. All other clusters were also rolled back to the old infrastructure and testing confirmed that webhooks were once again functional for all clusters.
In the process of moving SockJS/XHR servers to EKS, a variable was misconfigured, which disabled writing to Kinesis. Kinesis is the basis of the webhooks process, so no webhooks were sent for these clusters.
We are reviewing all the variables required to ensure the proper operation of these servers. We are also adding webhooks to the list of tests which need to pass before a cluster is considered operational. This will allow us to detect issues earlier and prevent them from reaching the production environment.
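
As an illustration of the kind of variable review described above, the following is a minimal sketch of a pre-rollout configuration check. It is not our actual tooling: the variable names KINESIS_WRITES_ENABLED and KINESIS_STREAM_NAME and the function find_config_problems are hypothetical, chosen only to show how a disabled Kinesis write path could be caught before a cluster takes traffic.

```python
from typing import List, Mapping

# Hypothetical variable names; the real SockJS server configuration
# is not described in this report.
REQUIRED_VALUES = {
    "KINESIS_WRITES_ENABLED": "true",  # webhooks depend on writes to Kinesis
}
REQUIRED_NONEMPTY = ("KINESIS_STREAM_NAME",)


def find_config_problems(env: Mapping[str, str]) -> List[str]:
    """Return a human-readable list of configuration problems (empty if none)."""
    problems = []

    # Variables that must be set to one specific value.
    for name, expected in REQUIRED_VALUES.items():
        actual = env.get(name)
        if actual is None:
            problems.append(f"{name} is not set")
        elif actual.strip().lower() != expected:
            problems.append(f"{name} is {actual!r}, expected {expected!r}")

    # Variables that must be present with any non-empty value.
    for name in REQUIRED_NONEMPTY:
        if not env.get(name, "").strip():
            problems.append(f"{name} is missing or empty")

    return problems
```

A check like this could be run against os.environ when a server starts, or in the deployment pipeline, with any non-empty result blocking the rollout. It complements, rather than replaces, an end-to-end test that confirms a webhook is actually delivered.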