On Friday 2019-12-06, a severe incident completely stopped some users from publishing events via Pusher Channels for up to 30 minutes. We pride ourselves on delivering a reliable service and know that our customers expect this from us. We apologize for this incident, and below we offer a detailed explanation of what happened and the steps we are taking to prevent it from happening again.
(All times in UTC.)
At 13:35, two engineers started rolling out a new version of Pusher Channels, intended to report metrics to a new internal metrics service. This was deemed a low-risk change.
The new application version required new configuration to determine the address of the metrics service. This configuration is deployed to the instances independently of the application version. The deployment required the engineers to set the new configuration version on the instances, then run the deploy script against all instances, which starts new application processes from these configured versions.
Immediately after the deploy script was started, some instances became unhealthy.
Once load balancers re-routed traffic away from the unhealthy instances, the error rate on most clusters was low.
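The rerouting step can be sketched as follows: a load balancer polls each instance's health check and only routes traffic to instances that pass. (A minimal sketch with hypothetical names, not Pusher's actual load-balancing code.)

```python
# Sketch of health-check-based routing (hypothetical, for illustration only).
# A load balancer keeps routing traffic only to instances that pass
# their health check, which limits user-facing errors on redundant clusters.

def route_candidates(instances, is_healthy):
    """Return the subset of instances that currently pass their health check."""
    return [inst for inst in instances if is_healthy(inst)]

# Example: instances whose new application processes crashed fail their check.
instances = ["node-1", "node-2", "node-3"]
unhealthy = {"node-2"}
healthy = route_candidates(instances, lambda i: i not in unhealthy)
# healthy == ["node-1", "node-3"]
```

On clusters with enough redundancy, the surviving instances absorb the load; on clusters with lower redundancy there may be no healthy instances left, which is what made the later outage complete.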
At 13:40, an engineer began rolling back. This required re-running the deployment with the previous application version.
Importantly, this procedure did not roll back the configuration version. Engineers assumed that the previous application version would be compatible with the new configuration version, because the application would just ignore the additional configuration for the new metrics service.
Engineers expected error rates to fall, but by 13:48 some clusters with lower redundancy had a complete outage.
At 13:55, we discovered why the rollback was failing: no instances of the application were running, because the previous application version was incompatible with the new configuration version. Instead of ignoring the new configuration, the application refused to start when given configuration it did not expect.
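The startup failure can be illustrated with a strict configuration loader: rather than silently ignoring unknown settings, it fails fast. (A minimal sketch with hypothetical key names, not Pusher's actual configuration schema.)

```python
# Sketch: strict configuration validation (hypothetical keys).
# The old application version only knows about "port"; the rolled-out
# config also contains "metrics_address", so the old version refuses to start.

KNOWN_KEYS = {"port"}

def load_config(config: dict) -> dict:
    """Accept only recognized keys; fail fast on anything unexpected."""
    unknown = set(config) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"unexpected configuration keys: {sorted(unknown)}")
    return config

load_config({"port": 4443})  # old config: the old version starts fine
try:
    load_config({"port": 4443, "metrics_address": "10.0.0.5:8125"})
except ValueError as e:
    print(e)  # old application version + new config: refuses to start
```

Failing fast on unknown configuration is a defensible policy in isolation; the problem here was that the rollback procedure did not account for it.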
Immediately after, an engineer began rolling back the configuration version. This required setting the previous configuration version on the instances and re-running the deployment.
It was assumed this would complete the rollback. However, on a small number of instances, the deployment refused to run due to a safety check: it detected that another deployment was happening simultaneously.
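A concurrent-deployment guard like the one that fired can be sketched with a lock file and a force flag. (A hypothetical mechanism for illustration; the actual safety check may work differently.)

```python
# Sketch: a deployment lock that blocks concurrent runs (hypothetical).
import os

LOCK_PATH = "/tmp/deploy.lock"

def acquire_deploy_lock(force: bool = False) -> bool:
    """Refuse to deploy if another deployment appears to be in progress.

    A stale lock file left behind by an interrupted deployment makes this
    check fire spuriously; force=True bypasses it, which is effectively
    what the engineers had to do during the incident.
    """
    if os.path.exists(LOCK_PATH) and not force:
        return False  # another deployment (possibly stale) holds the lock
    with open(LOCK_PATH, "w") as f:
        f.write(str(os.getpid()))
    return True
```

The failure mode is the classic one for lock files: the check cannot distinguish a live deployment from a dead one, so an aborted run leaves the lock behind and blocks the very rollback that would fix things.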
By 14:20, engineers understood that this safety check was firing spuriously, and how to bypass it.
Between 14:20 and 14:55, multiple engineers manually connected to every unhealthy instance and forcibly restarted the application processes.
Bad application code happens. But our deployment processes should ensure that these bad versions don’t reach end-users, and that if they do, it is quick and simple to roll back. In this incident, our deployment processes failed on both counts.
As an immediate mitigation, we will ensure that deployments are made cluster by cluster, with manual verification at each step.
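A staged, cluster-by-cluster rollout with a verification gate can be sketched as below (hypothetical `deploy` and `verify` hooks, not our actual tooling):

```python
# Sketch: staged rollout with a verification gate between clusters
# (hypothetical deploy/verify callables, for illustration only).

def staged_rollout(clusters, deploy, verify):
    """Deploy one cluster at a time; halt at the first failed verification."""
    completed = []
    for cluster in clusters:
        deploy(cluster)
        if not verify(cluster):  # e.g. a human checking error-rate dashboards
            return completed, cluster  # stop: blast radius is one cluster
        completed.append(cluster)
    return completed, None

done, failed = staged_rollout(
    ["cluster-a", "cluster-b", "cluster-c"],
    deploy=lambda c: None,            # placeholder for the real deploy step
    verify=lambda c: c != "cluster-b",  # simulate a bad deploy on cluster-b
)
# done == ["cluster-a"], failed == "cluster-b":
# the bad version never reaches "cluster-c"
```

The trade-off is speed for containment: rollouts take longer, but a bad version can take down at most one cluster before a human stops it.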
We recognize that incidents are not the fault of the individual operator, but a problem with what the operator is required to do. In this incident, both the deployment and roll-back were made complicated by requiring two changes to be applied at the same time. Humans are inherently bad at such intricate, repetitive tasks and are bound to make mistakes. That’s why we want to change our deployments to be immutable. In an immutable infrastructure, code changes are shipped with their environment and rollbacks are easier to automate.
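With immutable releases, "rollback" becomes a single pointer swap: each release bundles application and configuration together, so the two can never be rolled back out of sync. A minimal sketch of the idea, with hypothetical version names:

```python
# Sketch: immutable releases bundling code + config (hypothetical model).
# Rolling back re-points "current" at an earlier bundle, so the application
# version and its matching configuration always travel together.

releases = {
    "v41": {"app": "channels-1.41", "config": {"port": 4443}},
    "v42": {"app": "channels-1.42",
            "config": {"port": 4443, "metrics_address": "10.0.0.5:8125"}},
}
current = "v42"  # the bad deploy

def rollback(to: str) -> dict:
    """Point 'current' at an earlier, known-good release bundle."""
    global current
    current = to
    return releases[current]

bundle = rollback("v41")
# bundle["app"] == "channels-1.41", paired with the config it was tested with
```

Under this model, the incompatibility that caused this incident cannot arise: there is no separate configuration version to forget to roll back.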
Note: this postmortem is a work in progress. We will provide further updates with more specific action items, as they are decided upon.