On Saturday, March 30th, from 04:15 UTC until 11:30 UTC, Pusher Channels experienced a partial outage and higher than usual latencies on our US2 cluster. The incident was triggered by a single high-volume customer sending requests that were causing authentication errors. These errors are collated and stored in our error logging system, which can be viewed on the Pusher Dashboard.
The exceptionally high number of errors overwhelmed the logging system, causing writes to the log system to back up on our socket processes. This in turn meant the socket processes took longer to process messages and began to fail health checks. The failing health checks caused pods to restart, which resulted in connections being dropped.
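To illustrate the failure mode and the kind of safeguard that prevents it, here is a minimal sketch (hypothetical names, not our production code) of a bounded, non-blocking error log buffer: if the log store slows down or the error rate spikes, records are dropped and counted rather than stalling the socket path.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// ErrorLogger buffers error records in a bounded channel so that the
// hot path (socket message processing) never blocks on the log store.
// All names here are illustrative only.
type ErrorLogger struct {
	records chan string
	dropped atomic.Int64
}

func NewErrorLogger(buffer int) *ErrorLogger {
	l := &ErrorLogger{records: make(chan string, buffer)}
	go l.flush()
	return l
}

// Log enqueues an error record without blocking. If the buffer is full
// (for example because the log store is slow, or a single client is
// generating errors at a very high rate), the record is dropped and
// counted instead of stalling the caller.
func (l *ErrorLogger) Log(record string) {
	select {
	case l.records <- record:
	default:
		l.dropped.Add(1)
	}
}

// flush drains the buffer to a (simulated) slow log store.
func (l *ErrorLogger) flush() {
	for record := range l.records {
		time.Sleep(10 * time.Millisecond) // stand-in for a slow write
		_ = record
	}
}

func main() {
	logger := NewErrorLogger(1000)
	for i := 0; i < 10000; i++ {
		logger.Log(fmt.Sprintf("authentication error for request %d", i))
	}
	fmt.Printf("dropped %d records rather than blocking the socket path\n", logger.dropped.Load())
}
```

The important property is that a slow or overwhelmed logging backend degrades into lost error records, not into slower message processing and failed health checks.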
We identified the issue and took steps to mitigate it, switching off non-critical components to relieve pressure on the system. When it became clear that these steps were having little impact, we deployed a hotfix to disable the error logging system. During the deployment of this hotfix, the system hit a limit on the number of registered targets on our cloud provider's load balancer. This caused the rolling deployment to take much longer than expected, as it waited for old socket processes to be fully drained and terminated.
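The hotfix effectively acted as a kill switch on the error logging path. Purely as an illustration (again hypothetical, not our actual code), such a switch turns error logging into a no-op while the rest of the socket process keeps running:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// errorLoggingEnabled is a hypothetical kill switch: flipping it to
// false makes logError a no-op without touching the surrounding
// socket processing code.
var errorLoggingEnabled atomic.Bool

func logError(record string) {
	if !errorLoggingEnabled.Load() {
		return // logging disabled: drop the record immediately
	}
	fmt.Println("logged:", record)
}

func main() {
	errorLoggingEnabled.Store(true)
	logError("authentication error for request 1")

	// With the switch off, error records are discarded and the
	// logging system is taken out of the critical path entirely.
	errorLoggingEnabled.Store(false)
	logError("authentication error for request 2") // silently dropped
}
```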
After coordinating with our cloud provider we were able to increase this limit, allowing the deployment to complete and the incident to be resolved.
There were two main windows of impact: from 04:15 until 05:40 UTC, and from 07:15 until 09:15 UTC.
From 09:15 until 11:30 UTC the service was operating normally and we saw connections ramp back up.