Increased latency - US2 cluster

Incident Report for Pusher

Postmortem

Root Cause Analysis: Increased Latency and Message Delivery Failures – US2 Cluster

Incident Date: October 19, 2025

Duration: 00:44 UTC – 08:17 UTC

Status: Resolved

Summary

On October 19, 2025, customers using Pusher experienced increased latency and message delivery failures. These issues primarily affected the US2 cluster, with intermittent impact also observed in the MT1 and US3 clusters. The incident resulted in delayed or undelivered messages for many applications.

Latency stabilized at 08:17 UTC after mitigation actions were completed.

Impact

Between 00:44 UTC and 08:17 UTC, multiple customers experienced:

  • Delayed or failed message delivery across affected clusters
  • Degraded performance in connection establishment and publishing

The most significant and prolonged impact occurred in the US2 cluster, while MT1 and US3 clusters saw elevated latency for a shorter period before stabilizing.

Root Cause

The primary cause of the incident was IP address saturation within the subnet assigned to the public Pusher clusters.

When traffic levels increased, the US2 cluster was unable to scale out further because the available IP addresses in its subnet were fully utilized. This IP scaling limitation prevented the creation of additional instances needed to handle the load.
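
For illustration only, the following is a minimal sketch of how subnet headroom can be surfaced before a scale-out is attempted, assuming an AWS-style environment (the report does not name the provider); the subnet IDs and threshold are hypothetical:

    # Illustrative only: report the free IP addresses in the subnets a
    # cluster scales into, so exhaustion is visible before a scale-out fails.
    import boto3

    SUBNET_IDS = ["subnet-0abc1234", "subnet-0def5678"]  # hypothetical IDs
    MIN_FREE_IPS = 50                                    # assumed alert threshold

    ec2 = boto3.client("ec2")
    for subnet in ec2.describe_subnets(SubnetIds=SUBNET_IDS)["Subnets"]:
        free = subnet["AvailableIpAddressCount"]
        status = "OK" if free >= MIN_FREE_IPS else "LOW: scale-out may fail"
        print(f"{subnet['SubnetId']} ({subnet['CidrBlock']}): {free} free IPs [{status}]")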

Secondary factors included temporary network saturation and capacity limits at our cloud provider, which amplified the latency in the early stages of the incident.

Detection and Response

The issue was first detected through a combination of customer reports and internal monitoring alerts showing elevated response times and connection errors.

The timeline of actions was as follows:

  • 00:44 UTC – Monitoring alerted the team to increased latency across MT1, US2, and US3 clusters.
  • 03:11 UTC – Engineers identified subnet capacity as a contributing factor; mitigations began.
  • 04:03 UTC – All clusters were scaled out to distribute traffic; latency began to improve in MT1 and US3 clusters.
  • 08:17 UTC – Manual intervention allowed US2 to scale successfully, restoring normal latency.

Resolution

To restore service, the engineering team:

  • Scaled out the MT1 and US3 clusters to handle the increased traffic load
  • Manually intervened to scale out the US2 cluster once the subnet capacity constraint was identified
  • Monitored all clusters to confirm sustained stability

Once additional capacity was provisioned and traffic levels subsided, latency returned to normal and remained stable.

Preventative Actions

To prevent recurrence, Pusher has initiated the following actions:

  • Rate Limits: Re-evaluating how rate limits are implemented to better mitigate contention from neighboring customers on the shared clusters (a brief sketch of one possible approach follows this list).
  • Subnet Expansion: Increasing the size of the subnets assigned to shared clusters to ensure sufficient IP availability for future scaling events.
  • Load Balancer Enhancements: Implementing load balancer sharding to better distribute connections (see the sharding sketch after this list).
  • Capacity Planning Improvements: Enhancing internal monitoring and alerting for subnet and IP utilization thresholds.
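
As one possible shape for the rate-limit work above, here is a minimal per-application token-bucket sketch; the class, parameters, and limits are illustrative assumptions, not Pusher's actual implementation:

    # Illustrative token-bucket limiter for per-app publish rates.
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: int):
            self.rate = rate_per_sec        # tokens refilled per second
            self.capacity = burst           # maximum burst size
            self.tokens = float(burst)
            self.updated = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill based on elapsed time, capped at the burst capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    # Example: allow a hypothetical app 100 publishes/sec with a burst of 200.
    limiter = TokenBucket(rate_per_sec=100, burst=200)
    if not limiter.allow():
        pass  # reject or queue the publish instead of forwarding it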
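
Similarly, the load balancer sharding item could take the form of stable, key-based shard assignment; the endpoints below are hypothetical and the hashing scheme is only one option:

    # Illustrative shard selection: route each connection to one of several
    # load balancer endpoints by hashing its app key, so the same app
    # consistently lands on the same shard.
    import hashlib

    SHARD_ENDPOINTS = [
        "ws-shard-0.example.com",
        "ws-shard-1.example.com",
        "ws-shard-2.example.com",
    ]

    def shard_for(app_key: str) -> str:
        digest = hashlib.sha256(app_key.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(SHARD_ENDPOINTS)
        return SHARD_ENDPOINTS[index]

    print(shard_for("hypothetical-app-key"))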

Next Steps and Commitment

We recognize that message latency and delivery reliability are critical to our customers’ applications. Our team is continuing a full review of cluster capacity management and provider configuration to improve resilience under high traffic conditions.

We apologize for the disruption this incident caused and appreciate your patience while we worked to resolve it. Ensuring reliability and transparency remains our highest priority.

Posted Oct 23, 2025 - 14:17 UTC

Resolved

Latency has remained stable over the past three hours. Our investigation identified multiple contributing factors to the earlier latency increase, including network saturation and capacity limits at our cloud providers affecting the US2 cluster. The US3 cluster issue was resolved quickly through autoscaling, while the US2 cluster experienced scaling delays that required manual intervention. We are implementing corrective measures to prevent recurrence and improve system resilience.
Posted Oct 19, 2025 - 08:17 UTC

Update

We have scaled out all three clusters to handle the increased traffic. We anticipate that latency will gradually improve.
Posted Oct 19, 2025 - 04:03 UTC

Identified

The issue has been identified and we are applying mitigations. We are seeing improved latency in main and us3 clusters. us2 still has elevated latency.
Posted Oct 19, 2025 - 03:11 UTC

Investigating

Increased latency and degraded performance on clusters main, us2 and us3.
Posted Oct 19, 2025 - 00:44 UTC
This incident affected: Channels WebSocket client API and Channels presence channels.