Elevated API Errors in AP4 Cluster

Incident Report for Pusher

Postmortem

Root Cause Analysis: Elevated API Errors and Outage in AP4 Cluster

Incident Date: October 20–21, 2025

Status: Resolved

Summary

Between October 20 and October 21, 2025, customers using the AP4 cluster experienced elevated API errors, latency, and message publishing failures. The issue primarily affected the Channels API, preventing customers from publishing new messages and leading to degraded real-time functionality for end-users.
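
For context, the operation most affected was server-side publishing through the Channels REST API. The sketch below is a minimal illustration, not Pusher guidance: it assumes the `pusher` Node SDK (where `trigger` returns a promise), placeholder credentials, and a hypothetical retry policy that a customer might use to ride out transient API errors like those seen during this incident.

```typescript
import Pusher from "pusher";

// Placeholder credentials; real values come from the Pusher dashboard.
const pusher = new Pusher({
  appId: "APP_ID",
  key: "APP_KEY",
  secret: "APP_SECRET",
  cluster: "ap4", // the affected cluster
  useTLS: true,
});

// Publish with simple exponential backoff so transient API errors
// do not drop messages outright. The retry policy is an assumption.
async function publishWithRetry(
  channel: string,
  event: string,
  data: unknown,
  maxAttempts = 3
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await pusher.trigger(channel, event, data);
      return;
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      // Back off 500 ms, 1 s, 2 s, ... before retrying.
      await new Promise((r) => setTimeout(r, 500 * 2 ** (attempt - 1)));
    }
  }
}

publishWithRetry("orders", "order-created", { id: 123 }).catch(console.error);
```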

During system recovery, and while mitigations from a previous incident on October 18 were being implemented, a misconfigured Redis container in the AP4 cluster failed to start correctly, preventing the caching operations needed to serve API requests. The misconfiguration went undetected by proactive monitoring, which delayed full recovery until October 21 at 17:43 UTC.

Impact

Throughout the incident period, customers in the AP4 cluster experienced:

  • High API error rates when attempting to publish messages through the Channels API
  • Failed or delayed message delivery for connected clients
  • Temporary downtime for end-customer applications relying on real-time messages

Other clusters remained operational, though some minor latency was observed in isolated regions due to dependencies on shared services.

Root Cause

This incident resulted from a chain of events involving both external and internal factors:

  1. **Major AWS Outage (October 20)** – A large-scale AWS outage in the US-East region disrupted multiple dependent systems, impacting several Pusher clusters.
  2. **Misconfigured Redis Container (October 21)** – As systems in the AP4 cluster attempted to scale during recovery, one of the backend Redis cache containers failed to start due to a misconfigured environment variable. This prevented Redis from initializing properly, causing API operations to fail or time out (see the illustrative sketch after this list).
  3. **Monitoring Gap** – Existing monitoring did not capture the Redis startup failure because the specific failure mode occurred after initialization checks had passed. This delayed internal detection until API error rates increased and customer impact was observed.
  4. **Delayed Customer Communication** – Initial updates to customers were delayed while the team triaged the issue and verified the failure pattern, prolonging the time before external notification.
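
The specific environment variable involved has not been published. Purely as an illustrative sketch, with hypothetical variable names, a container entrypoint can validate its Redis settings and fail fast with a clear error before the service takes traffic, rather than surfacing the problem later as API timeouts:

```typescript
// Illustrative only: validate required Redis settings before the service
// starts accepting work. Variable names here are hypothetical.
const REQUIRED = ["REDIS_HOST", "REDIS_PORT"] as const;

function validateRedisConfig(): { host: string; port: number } {
  for (const name of REQUIRED) {
    if (!process.env[name]) {
      // Failing fast at startup surfaces the misconfiguration immediately.
      throw new Error(`Missing required environment variable: ${name}`);
    }
  }
  const port = Number(process.env.REDIS_PORT);
  if (!Number.isInteger(port) || port <= 0 || port > 65535) {
    throw new Error(`REDIS_PORT is not a valid port: ${process.env.REDIS_PORT}`);
  }
  return { host: process.env.REDIS_HOST!, port };
}

const redisConfig = validateRedisConfig();
console.log(`Redis configured at ${redisConfig.host}:${redisConfig.port}`);
```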

Detection and Response

The issue was detected through a combination of monitoring alerts showing elevated error rates and customer reports of publishing failures.
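
As an illustration of the detection signal (not Pusher's actual alerting stack), an elevated-error-rate check over a sliding window might look like the sketch below; the window size, minimum sample count, and threshold are assumptions:

```typescript
// Illustrative sliding-window error-rate alert; thresholds are assumptions,
// not Pusher's monitoring configuration.
type Sample = { timestamp: number; isError: boolean };

const WINDOW_MS = 5 * 60 * 1000; // look at the last 5 minutes
const ERROR_RATE_THRESHOLD = 0.05; // alert above 5% errors

const samples: Sample[] = [];

function recordRequest(isError: boolean): void {
  const now = Date.now();
  samples.push({ timestamp: now, isError });
  // Drop samples that have fallen outside the window.
  while (samples.length && samples[0].timestamp < now - WINDOW_MS) {
    samples.shift();
  }
  const errors = samples.filter((s) => s.isError).length;
  const rate = errors / samples.length;
  // Require a minimum sample count so a few failures don't trip the alert.
  if (samples.length >= 100 && rate > ERROR_RATE_THRESHOLD) {
    console.warn(`Elevated API error rate: ${(rate * 100).toFixed(1)}%`);
  }
}
```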

Timeline of Events:

  • October 20 – AWS outage began, affecting multiple Pusher clusters and leading to increased delays and errors.
  • October 20, evening UTC – Pusher clusters began recovery as AWS services were restored.
  • October 21, 15:06 UTC – Internal monitoring detected elevated API errors in AP4; engineers began investigation. The failure was identified as distinct from the prior AWS outage.
  • October 21, 15:09 UTC – Root cause identified as a failed Redis caching container.
  • October 21, 15:27 UTC – Restoration of Redis connections underway.
  • October 21, 15:29 UTC – Fix implemented; cluster began gradual recovery.
  • October 21, 17:39 UTC – Full stabilization confirmed across AP4 nodes.
  • October 21, 17:43 UTC – Incident marked resolved after sustained recovery.

Resolution

To restore full functionality, the engineering team:

  • Corrected the Redis container configuration that was preventing startup
  • Restarted and validated cache services across all AP4 nodes
  • Confirmed API endpoints were fully operational and message publishing resumed
  • Monitored latency and error metrics to confirm sustained stability

Preventative Actions

To reduce recurrence risk and improve detection and response, Pusher is implementing the following:

  • Enhanced Redis Monitoring: Extending monitoring coverage to detect Redis startup and post-initialization failures (illustrated by the sketch after this list).
  • Customer Communication Enhancements: Improving internal escalation and communication processes to ensure faster external updates.
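
On the monitoring item, one hedged illustration of a check that goes beyond startup success is a periodic round-trip through the cache itself, shown here with the ioredis client; the host, key name, TTL, and probe interval are assumptions rather than Pusher's internal tooling:

```typescript
import Redis from "ioredis";

// Illustrative post-init health probe: host/port and cadence are assumptions.
const redis = new Redis({ host: "redis.internal", port: 6379 });

async function redisRoundTripHealthy(): Promise<boolean> {
  try {
    // Exercise the full read/write path, not just process liveness.
    await redis.ping();
    await redis.set("healthcheck:probe", Date.now().toString(), "EX", 30);
    return (await redis.get("healthcheck:probe")) !== null;
  } catch {
    return false;
  }
}

// Run the probe periodically and emit a signal monitoring can alert on.
setInterval(async () => {
  if (!(await redisRoundTripHealthy())) {
    console.error("redis health probe failed: cache round-trip unsuccessful");
  }
}, 30_000);
```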

Next Steps and Commitment

We recognize the importance of reliable API performance for our customers. Our teams are conducting a full review of caching dependencies and configuration management across all clusters to prevent similar incidents.

We sincerely apologize for the disruption caused by this event and appreciate your patience as we worked through a complex multi-day recovery scenario. Pusher remains committed to transparency, reliability, and continuous improvement in service resilience.

Posted Oct 23, 2025 - 14:14 UTC

Resolved

This incident has been resolved.
Posted Oct 21, 2025 - 17:43 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 21, 2025 - 17:39 UTC

Identified

We are currently investigating an issue affecting the Channels API. A large number of customers may be unable to publish new messages through the API in the AP4 cluster.
Posted Oct 21, 2025 - 17:32 UTC

Monitoring

A fix has been implemented and we are monitoring. The cluster will take a few more minutes to fully stabilize across all nodes.
Posted Oct 21, 2025 - 15:29 UTC

Update

The team continues to make progress restoring the cache services. We anticipate full resolution in the next 10-15 minutes.
Posted Oct 21, 2025 - 15:27 UTC

Identified

The team has identified the issue with one of our backend caching servers. The team is working to restore the connections to the server.
Posted Oct 21, 2025 - 15:09 UTC

Investigating

We're experiencing an elevated level of API errors and latency in AP4 and are currently looking into the issue.
Posted Oct 21, 2025 - 15:06 UTC

This incident affected: Channels REST API, Channels WebSocket client API, and Channels Stats Integrations.