Increased error rates in the US2 cluster API
Incident Report for Pusher
Postmortem

Elevated error rate in the US2 cluster - Channels API

Summary

Between 16:44 UTC on March 16th and 00:03 UTC on March 17th, we experienced an increased error rate in the US2 cluster. This resulted in higher than normal publish latency, and a portion of traffic received 5XX and timeout errors. The main cause was identified as bandwidth saturation, but a bug in the monitoring dashboard, as well as difficulty adding more resources, slowed the investigation and resolution.

The issue resurfaced on March 17th between 15:43 and 21:23, affecting up to 1% of the US2 cluster's traffic.

Note: All times mentioned in this postmortem report are in UTC unless otherwise specified.

Timeline

On March 16th, at 16:44, we received notifications about an increased error rate in the US2 cluster. The on-call engineers were quickly paged and began investigating.

It was observed that publish latency had increased and a portion of the traffic was receiving 5XX and timeout errors.

Our engineers initially suspected an issue with Redis, as some containers were restarting due to connection issues. However, after checking the Redis health dashboard, they found no abnormalities.

It was also confirmed that the ongoing issue was not related to recent deployments.

Further investigation indicated that the problem lay within our Redis clusters. Every Channels cluster uses multiple Redis clusters for different responsibilities, and engineers observed that the application was reporting connectivity errors for two different Redis clusters.

At 17:45, a manual failover was performed in one of the Redis clusters, resulting in slight improvement, but the issue remained unresolved.

At 18:10, more engineers joined the incident response team and discovered that the problem was due to bandwidth saturation, as seen on the AWS monitoring dashboard. Our internal monitoring dashboard was underreporting bandwidth values due to a bug in byte conversion, which had prevented the incident responders from identifying the problem earlier.
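The exact conversion error is not shown here, but as a purely illustrative example of how this class of bug hides saturation, the hypothetical snippet below reports a rate in Mbit/s while forgetting to convert bytes to bits, making every value appear eight times lower than it really is:

```python
# Hypothetical illustration of a byte-conversion bug of the kind described
# above; this is not the actual dashboard code.

def mbits_per_sec_buggy(bytes_per_period: float, period_s: float) -> float:
    # Bug: treats the byte count as if it were already bits, so the
    # reported rate is 8x lower than the real throughput.
    return bytes_per_period / period_s / 1_000_000


def mbits_per_sec_fixed(bytes_per_period: float, period_s: float) -> float:
    # Fix: convert bytes to bits before scaling to megabits per second.
    return bytes_per_period * 8 / period_s / 1_000_000


# Example: 750 MB transferred in 60 s is 100 Mbit/s, not 12.5 Mbit/s.
print(mbits_per_sec_buggy(750_000_000, 60))  # 12.5 -> looks comfortably idle
print(mbits_per_sec_fixed(750_000_000, 60))  # 100.0 -> close to saturation
```

A dashboard built on the first function would show a node running at a fraction of its baseline even while the node was saturating its link.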

At 19:14 UTC, a decision was made to resize the affected Redis nodes to larger instance types. Once all instances had been refreshed, containers stopped reporting connectivity issues in their logs, but problems still persisted in our messaging Redis cluster.

The incident response team quickly decided to add additional shards to the messaging Redis cluster. However, this was the first time resharding had been required in our AWS EKS environment, as we had recently migrated the US2 cluster from our self-managed Kubernetes infrastructure. Engineers had to introduce the necessary parameters into the ConfigMaps for our EKS cluster, which took some time.

Resharding is a slow, multi-phase process, but it allows us to make changes to a production cluster without losing any writes. A long grace period to drain socket connections ensures that operations like this do not impact a large portion of our traffic.
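Our internal resharding tooling is not shown here, but as a rough sketch of the general pattern that lets a Redis deployment move data without losing writes, native Redis Cluster migrates hash slots one at a time while both the source and target nodes keep serving traffic (the node IDs, addresses, and slot number below are placeholders):

```python
# Simplified sketch of native Redis Cluster slot migration, shown only to
# illustrate the "move data while still serving writes" pattern; this is not
# our internal tooling. Requires redis-py; addresses and IDs are placeholders.
import redis

SLOT = 1234
SOURCE = redis.Redis(host="redis-source.internal", port=6379)
TARGET = redis.Redis(host="redis-target.internal", port=6379)
SOURCE_ID = "<source-node-id>"
TARGET_ID = "<target-node-id>"

# Phase 1: mark the slot as moving. Both nodes keep serving reads and writes,
# redirecting clients with ASK replies for keys that have already moved.
TARGET.execute_command("CLUSTER", "SETSLOT", SLOT, "IMPORTING", SOURCE_ID)
SOURCE.execute_command("CLUSTER", "SETSLOT", SLOT, "MIGRATING", TARGET_ID)

# Phase 2: copy keys in batches until the slot is empty on the source node.
while True:
    keys = SOURCE.execute_command("CLUSTER", "GETKEYSINSLOT", SLOT, 100)
    if not keys:
        break
    SOURCE.execute_command(
        "MIGRATE", "redis-target.internal", 6379, "", 0, 5000, "KEYS", *keys
    )

# Phase 3: finalise ownership so the whole cluster routes the slot to the target.
for node in (SOURCE, TARGET):
    node.execute_command("CLUSTER", "SETSLOT", SLOT, "NODE", TARGET_ID)
```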

Preparations for resharding were completed by 21:15 UTC, and the first phase of resharding started. By 00:03 UTC, resharding was completed, and everything was operational.

On March 17th, while the team was still investigating the first incident, the issue resurfaced at 15:43 UTC, but on a much smaller scale, affecting up to 1% of the traffic. The team quickly responded by adding more capacity to the cluster.

Root cause

Redis nodes in the affected cluster had been operating beyond their baseline network bandwidth during peak hours for some time, but we had not been proactively monitoring this. AWS provides burst network performance, and evidently we had always had enough network I/O credits for instances to use burst bandwidth. These instances also earn network I/O credits during the US2 cluster's off-peak hours. We never intended to rely on I/O credits, but we believe this combination of factors is why the capacity issue went unnoticed until now.

The incident occurred when we experienced an unusually high volume of traffic. Instance bursting is best-effort, even when the instance has credits available: because burst bandwidth is a shared resource, AWS cannot guarantee that it can be allocated, and we believe that is precisely what happened to a few of our Redis nodes that night.
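As a rough illustration (not the exact diagnostics used during this incident), instances running the AWS ENA driver expose per-interface counters such as bw_in_allowance_exceeded and bw_out_allowance_exceeded, which increment whenever traffic is shaped for exceeding the instance's network allowance. A minimal check, assuming a Linux host and an interface named eth0, could look like this:

```python
# Rough illustration: read the ENA "allowance exceeded" counters exposed via
# ethtool. Growing non-zero values indicate the instance was shaped for
# exceeding its network allowance. Assumes Linux, the ENA driver, and an
# interface named eth0; not the exact diagnostics used during this incident.
import subprocess

ALLOWANCE_COUNTERS = (
    "bw_in_allowance_exceeded",
    "bw_out_allowance_exceeded",
    "pps_allowance_exceeded",
)


def read_allowance_counters(interface: str = "eth0") -> dict:
    output = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True
    ).stdout
    counters = {}
    for line in output.splitlines():
        name, _, value = line.strip().partition(":")
        if name.strip() in ALLOWANCE_COUNTERS:
            counters[name.strip()] = int(value)
    return counters


if __name__ == "__main__":
    for name, value in read_allowance_counters().items():
        print(f"{name}: {value}")
```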

Further investigation by our engineering team led to a new theory: one of our high-volume customers may be triggering an edge-case bug in our system through a unique combination of factors in their use case, following a recent change they made to their application. We need additional data to confirm this theory. Our next step is to isolate their traffic, reducing load on the US2 cluster, and continue the investigation. We will provide timely updates as we make progress.

How will we ensure this does not happen again?

To prevent similar incidents in the future, we are taking several steps.

First, we will implement monitoring to identify nodes that regularly exceed their network baseline during peak hours. This will allow us to perform right-sizing exercises more frequently and resize instances with similar capacity issues. Our long-term goal is to automate this process to ensure continuous optimization.
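The implementation of that monitoring is still to be decided; a minimal sketch of the idea, pulling NetworkOut from CloudWatch with boto3 and comparing the peak against a per-instance-type baseline (the instance types, baseline table, and 80% threshold below are illustrative placeholders), could look like this:

```python
# Minimal sketch: flag EC2 instances whose peak network throughput approaches
# their baseline bandwidth. The baseline table, threshold, and 5-minute
# CloudWatch period are illustrative placeholders, not production values.
from datetime import datetime, timedelta, timezone

import boto3

# Baseline bandwidth in Mbit/s per instance type (placeholder values).
BASELINE_MBPS = {"r5.xlarge": 1250, "r5.2xlarge": 2500}


def peak_network_out_mbps(instance_id: str, hours: int = 24) -> float:
    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkOut",  # bytes transferred per period
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=["Maximum"],
    )
    peak_bytes = max((p["Maximum"] for p in stats["Datapoints"]), default=0.0)
    return peak_bytes * 8 / 300 / 1_000_000  # bytes per period -> Mbit/s


def needs_right_sizing(instance_id: str, instance_type: str) -> bool:
    # Alert when peak usage exceeds 80% of the instance type's baseline.
    return peak_network_out_mbps(instance_id) > 0.8 * BASELINE_MBPS[instance_type]
```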

Second, we will fix the byte-conversion bug in our Redis dashboard so that network performance is reported accurately. We will also add native AWS metrics to our Grafana dashboards to provide better visibility into our infrastructure's network performance.

Posted Mar 28, 2023 - 15:27 UTC

Resolved
This incident has been resolved. We will be sharing a detailed incident report in the near future.
Posted Mar 17, 2023 - 00:36 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 16, 2023 - 22:47 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 16, 2023 - 21:43 UTC
Update
We have made a configuration change on the cluster but this did not resolve the increased error rates. We are continuing to investigate the issue.
Posted Mar 16, 2023 - 20:20 UTC
Investigating
We are aware of the increased error rates in the US2 cluster API. Our team is currently investigating this issue.
Posted Mar 16, 2023 - 17:27 UTC
This incident affected: Channels REST API.