On Thursday, March 24 at 11:55 UTC, we saw an increase in HTTP 500 errors from the Channels REST API. The incident lasted 40 minutes, during which time some requests for new messages failed.
(All times below are in UTC on 24 March 2022.)
At 11:50, we started observing warnings related to latency.
At 11:55, we noticed an increase in the error rate on the Channels API. Our incident responders raised an incident, and we observed increased CPU utilisation on our Redis nodes.
At 12:45, the system was stable again.
Between 11:50 and 12:30, users may have experienced elevated latency and failed requests for new messages.
A few days before the incident, we significantly scaled up the MT1 Kubernetes cluster, which increased load on the Kubernetes control plane. To manage this, our engineers made additional changes to the horizontal and vertical autoscalers and to the API server. In the meantime, we kept the old socket VMs running alongside the new setup so that we could route traffic back to those instances in the event of any problems.
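To illustrate the kind of horizontal autoscaler change described above, here is a minimal sketch of a Kubernetes HorizontalPodAutoscaler manifest. The Deployment name `socket` and all the numeric targets are assumptions for illustration, not the actual values we used:

```yaml
# Hypothetical HPA for the socket workload (names and numbers are illustrative).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: socket
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: socket
  minReplicas: 4          # floor kept high enough to absorb baseline traffic
  maxReplicas: 40         # ceiling raised as part of the scale-up
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out before nodes saturate
```

Raising `maxReplicas` like this lets the cluster schedule more socket pods under load, which is what happened on March 24.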
On March 24, as traffic on the MT1 cluster increased, Kubernetes scheduled more socket pods to accommodate the load while all of the older socket VMs continued running. The combined load from both sets of instances reached our Redis cluster and caused latency problems.
To eliminate the extra load on the Redis cluster, we removed some of the legacy socket VMs, keeping the rest running as part of our migration strategy.
We have added capacity to the MT1 cluster both vertically and horizontally. The cluster now routes most of its traffic to our new infrastructure, which we have made more scalable and resilient.