On Friday 2019-12-06, a severe incident completely stopped some users from publishing events via Pusher Channels for up to 30 minutes. We pride ourselves on delivering a reliable service and know that our customers expect this from us. We apologize for this incident, and below we offer a detailed explanation of what happened and the steps we are taking to prevent it from happening again.
(All times in UTC.)
At 13:35, two engineers started rolling out a new version of Pusher Channels, intended to report metrics to a new internal metrics service. This was deemed a low-risk change.
The new application version required new configuration to determine the address of the metrics service. This configuration is deployed to the instances independently of the application version. The deployment required the engineers to set the new configuration version on the instances, then run the deploy script against all instances, which starts new application processes from these configured versions.
Immediately after the deploy script was started, some instances became unhealthy.
Once load balancers re-routed traffic away from the unhealthy instances, the error rate on most clusters was low.
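The rerouting step can be sketched as follows: a load balancer polls each instance's health check and only routes traffic to instances that pass. (A minimal sketch with hypothetical names, not Pusher's actual load-balancing code.)

```python
# Sketch of health-check-based routing (hypothetical, for illustration only).
# A load balancer keeps routing traffic only to instances that pass
# their health check, which limits user-facing errors on redundant clusters.

def route_candidates(instances, is_healthy):
    """Return the subset of instances that currently pass their health check."""
    return [inst for inst in instances if is_healthy(inst)]

# Example: instances whose new application processes crashed fail their check.
instances = ["node-1", "node-2", "node-3"]
unhealthy = {"node-2"}
healthy = route_candidates(instances, lambda i: i not in unhealthy)
# healthy == ["node-1", "node-3"]
```

On clusters with enough redundancy, the surviving instances absorb the load; on clusters with lower redundancy there may be no healthy instances left, which is what made the later outage complete.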
At 13:40, an engineer began rolling back. This required re-running the deployment with the previous application version.
Importantly, this procedure did not roll back the configuration version. Engineers assumed that the previous application version would be compatible with the new configuration version, because the application would just ignore the additional configuration for the new metrics service.
Engineers expected error rates to fall, but by 13:48 some clusters with lower redundancy had a complete outage.
At 13:55, we discovered why the rollback was failing: no instances of the application were running, because the previous application version was incompatible with the new configuration version. Instead of ignoring the new configuration, the application refused to start when given configuration it did not expect.
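The startup failure can be illustrated with a strict configuration loader: rather than silently ignoring unknown settings, it fails fast. (A minimal sketch with hypothetical key names, not Pusher's actual configuration schema.)

```python
# Sketch: strict configuration validation (hypothetical keys).
# The old application version only knows about "port"; the rolled-out
# config also contains "metrics_address", so the old version refuses to start.

KNOWN_KEYS = {"port"}

def load_config(config: dict) -> dict:
    """Accept only recognized keys; fail fast on anything unexpected."""
    unknown = set(config) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"unexpected configuration keys: {sorted(unknown)}")
    return config

load_config({"port": 4443})  # old config: the old version starts fine
try:
    load_config({"port": 4443, "metrics_address": "10.0.0.5:8125"})
except ValueError as e:
    print(e)  # old application version + new config: refuses to start
```

Failing fast on unknown configuration is a defensible policy in isolation; the problem here was that the rollback procedure did not account for it.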
Immediately after, an engineer began rolling back the configuration version. This required setting the previous configuration version on the instances and re-running the deployment.
It was assumed this would complete the rollback. However, on a small number of instances, the deployment refused to run due to a safety check: it detected that another deployment was happening simultaneously.
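A concurrent-deployment guard like the one that fired can be sketched with a lock file and a force flag. (A hypothetical mechanism for illustration; the actual safety check may work differently.)

```python
# Sketch: a deployment lock that blocks concurrent runs (hypothetical).
import os

LOCK_PATH = "/tmp/deploy.lock"

def acquire_deploy_lock(force: bool = False) -> bool:
    """Refuse to deploy if another deployment appears to be in progress.

    A stale lock file left behind by an interrupted deployment makes this
    check fire spuriously; force=True bypasses it, which is effectively
    what the engineers had to do during the incident.
    """
    if os.path.exists(LOCK_PATH) and not force:
        return False  # another deployment (possibly stale) holds the lock
    with open(LOCK_PATH, "w") as f:
        f.write(str(os.getpid()))
    return True
```

The failure mode is the classic one for lock files: the check cannot distinguish a live deployment from a dead one, so an aborted run leaves the lock behind and blocks the very rollback that would fix things.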
By 14:20, engineers understood that this safety check was firing spuriously, and how to bypass it.
Between 14:20 and 14:55, multiple engineers manually connected to every unhealthy instance and forcibly restarted the application processes.
Bad application code happens. But our deployment processes should ensure that these bad versions don’t reach end-users, and that if they do, it is quick and simple to roll back. In this incident, our deployment processes failed on both counts.
As an immediate mitigation, we will ensure that deployments are made cluster by cluster, with manual verification at each step.
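A staged, cluster-by-cluster rollout with a verification gate can be sketched as below (hypothetical `deploy` and `verify` hooks, not our actual tooling):

```python
# Sketch: staged rollout with a verification gate between clusters
# (hypothetical deploy/verify callables, for illustration only).

def staged_rollout(clusters, deploy, verify):
    """Deploy one cluster at a time; halt at the first failed verification."""
    completed = []
    for cluster in clusters:
        deploy(cluster)
        if not verify(cluster):  # e.g. a human checking error-rate dashboards
            return completed, cluster  # stop: blast radius is one cluster
        completed.append(cluster)
    return completed, None

done, failed = staged_rollout(
    ["cluster-a", "cluster-b", "cluster-c"],
    deploy=lambda c: None,            # placeholder for the real deploy step
    verify=lambda c: c != "cluster-b",  # simulate a bad deploy on cluster-b
)
# done == ["cluster-a"], failed == "cluster-b":
# the bad version never reaches "cluster-c"
```

The trade-off is speed for containment: rollouts take longer, but a bad version can take down at most one cluster before a human stops it.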
We recognize that incidents are not the fault of the individual operator, but a problem with what the operator is required to do. In this incident, both the deployment and roll-back were made complicated by requiring two changes to be applied at the same time. Humans are inherently bad at such intricate, repetitive tasks and are bound to make mistakes. That’s why we want to change our deployments to be immutable. In an immutable infrastructure, code changes are shipped with their environment and rollbacks are easier to automate.
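With immutable releases, "rollback" becomes a single pointer swap: each release bundles application and configuration together, so the two can never be rolled back out of sync. A minimal sketch of the idea, with hypothetical version names:

```python
# Sketch: immutable releases bundling code + config (hypothetical model).
# Rolling back re-points "current" at an earlier bundle, so the application
# version and its matching configuration always travel together.

releases = {
    "v41": {"app": "channels-1.41", "config": {"port": 4443}},
    "v42": {"app": "channels-1.42",
            "config": {"port": 4443, "metrics_address": "10.0.0.5:8125"}},
}
current = "v42"  # the bad deploy

def rollback(to: str) -> dict:
    """Point 'current' at an earlier, known-good release bundle."""
    global current
    current = to
    return releases[current]

bundle = rollback("v41")
# bundle["app"] == "channels-1.41", paired with the config it was tested with
```

Under this model, the incompatibility that caused this incident cannot arise: there is no separate configuration version to forget to roll back.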
Note: this postmortem is a work in progress. We will provide further updates with more specific action items, as they are decided upon.