On June 7th, from 14:18 to 14:26 UTC, users of our largest Channels cluster, MT1, experienced a major outage with the publish API.
We are still investigating the root cause of this issue. As this was a significant incident, we want to be as accurate as possible in our reporting, and for now we are still assessing the total impact of the underlying issues.
Below is a short summary of the specific incident that occurred between 14:18 and 14:26 on June 7th. We will follow up with more details of the underlying issue and its wider impact as soon as possible.
All times are in UTC.
At 13:49, one shard out of six became unavailable, and delivery of messages started to be delayed.
Between 13:54 and 13:57, publish requests targeting this shard were failing. These amounted to up to 0.13% of total publish requests.
At 14:10, an engineer was paged and identified the problematic shard. The root cause was a Redis instance with high memory usage.
At 14:16, an engineer triggered a failover of the problematic Redis instance. This process was expected to complete within seconds, but because of the high data volume, replication took several minutes.
Between 14:16 and 14:26, the API servers responsible for publishing were unable to connect to the failing Redis shard. This caused a health check to fail and the Kubernetes scheduler to restart the pods. However, the pods could not start while the shard was unavailable, so publishing was completely unavailable.
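The failure mode here can be sketched in a few lines. This is an illustrative model only, not our actual service code: the health check treated any unreachable shard as a fatal condition, so one bad shard out of six caused every publish pod to fail its check and restart, rather than continuing to serve the five healthy shards.

```python
# Illustrative sketch of the health-check flaw; shard representation and
# function names are assumptions for the example, not the real implementation.

def strict_healthcheck(shards):
    """Fails if ANY shard is unreachable: one bad shard takes the pod down."""
    return all(shard["reachable"] for shard in shards)

def degraded_healthcheck(shards):
    """Passes while the pod can still serve at least one shard."""
    return any(shard["reachable"] for shard in shards)

# Six shards, one of which (shard 3) is unreachable.
shards = [{"id": i, "reachable": i != 3} for i in range(6)]

print(strict_healthcheck(shards))    # False: the scheduler restarts the pod
print(degraded_healthcheck(shards))  # True: the pod keeps serving 5/6 shards
```

With the strict check, the Kubernetes scheduler saw every publish pod as unhealthy, even though five of the six shards were still fully functional.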
At 14:26 the Redis failover procedure completed and the incident was resolved.
Two independent issues caused this incident to escalate:

1. A Redis instance reached high memory usage, making its shard unavailable and forcing a failover that took minutes rather than seconds.
2. The publish API's health check failed while this single shard was unreachable, so the Kubernetes scheduler restarted pods that could not start, making publishing completely unavailable rather than partially degraded.
The former issue is part of a wider problem which we have identified and are working to resolve. We are still assessing the total impact of this issue, but we are aware that it was caused by a specific pattern of inappropriate usage that we had not anticipated and failed to safeguard against. We will follow up with further details once we have finished our investigation and put full mitigations in place.
The latter issue caused publishing to become completely unavailable, whereas the impact was initially limited to 1/6th of channels on the mt1 cluster. We are currently working on a fix that will allow the service to continue operating in the event of a partial failure.
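The intended behaviour after the fix can be sketched as follows. This is a minimal illustration under assumed details (the channel-to-shard mapping and all names are hypothetical): a publish targeting an unavailable shard is rejected, but publishes to channels on the remaining healthy shards continue to succeed, so a single shard failure no longer takes down the whole publish API.

```python
# Illustrative sketch of partial-failure tolerance; the routing scheme and
# names are assumptions for the example, not the actual implementation.

NUM_SHARDS = 6

class ShardUnavailable(Exception):
    pass

def shard_for(channel):
    # Deterministic channel -> shard mapping (illustrative only).
    return sum(channel.encode()) % NUM_SHARDS

def publish(channel, message, shard_status):
    """Fail only requests targeting an unavailable shard."""
    shard = shard_for(channel)
    if not shard_status[shard]:
        # Reject this request, but keep serving channels on healthy shards.
        raise ShardUnavailable(f"shard {shard} unavailable")
    return {"shard": shard, "message": message}
```

With one shard marked unavailable, only publishes routed to that shard fail; the rest of the cluster keeps accepting publish requests.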