On August 4th, from 14:53 to 15:47 UTC, users of our largest Channels cluster, MT1, experienced a major outage: Pusher's API was unavailable, so users were unable to publish messages.
The root cause was a lack of resilience in our systems, in particular the absence of proper mitigation for Redis failovers.
Below is a short summary of the incident, which occurred between 14:53 and 15:47 UTC on August 4th. All timestamps are in UTC.
At 14:53, an engineer performed a maintenance operation to resize one of the MT1 Redis instances. This caused a failure of the Redis clusters backing our MT1 Channels cluster.
14:53:30 All the API pods died after detecting the failed Redis (a sketch of more resilient handling follows the timeline).
14:54 The api.pusherapp.com endpoint went down and started returning HTTP 503 status codes.
14:59 The on-call engineer was paged and the incident response began.
15:07 The engineers responding to the incident discovered the failure of the API pods.
15:10 The Redis instances in the MT1 cluster recovered from the failure.
15:11 api.pusherapp.com came back online and messages started to come through.
15:16 The API pods were scaled up to twice their normal number to compensate for the increased influx of messages.
15:16-15:47 The API pods came online unevenly, with degraded performance throughout this period.
15:47 The API pods reached the desired count and performance was restored.
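The pod failures above came down to treating a Redis failover as fatal. For illustration only, here is a minimal Python sketch (using the redis-py client; the host name, port, and timeouts are assumptions, not our actual code) of the kind of mitigation we mean: reconnecting with exponential backoff instead of exiting when Redis becomes unreachable.

```python
import time

import redis

REDIS_HOST = "redis.mt1.internal"  # hypothetical host name, for illustration
MAX_BACKOFF = 30  # seconds


def get_redis_connection():
    """Block until Redis is reachable, backing off between attempts."""
    backoff = 0.5
    while True:
        client = redis.Redis(host=REDIS_HOST, port=6379, socket_timeout=2)
        try:
            client.ping()  # raises an error while the failover is in progress
            return client
        except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
            # Exiting here is what took all the API pods down at 14:53:30.
            # Sleeping and retrying keeps the pod alive until the new
            # primary is promoted.
            time.sleep(backoff)
            backoff = min(backoff * 2, MAX_BACKOFF)
```

Keeping the process alive through a failover means the pods are ready the moment Redis recovers, rather than having to be rescheduled and scaled back up, which is what stretched recovery out to 15:47.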
Between 14:53 and 15:47 UTC, users may have been unable to publish messages through the API, receiving 503 errors, and may have seen degraded API performance while capacity was restored.
Several things went wrong to cause this outage. Our engineers are investigating each contributing event in isolation and will make changes to increase the resilience of each element, making the system more robust overall. The steps we have identified as fixes to be undertaken immediately include: