Problems With REST API and Websocket Latencies

Incident Report for Pusher

Postmortem

We experienced an issue whereby our ELB began removing otherwise healthy machines from the load balancing set. Concentrating all traffic on the remaining machines caused them to become significantly overloaded and report as unhealthy to the load balancer.

This caused a cascading failure: each machine removed pushed more traffic onto those remaining, which in turn became overloaded and were removed as well.
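The feedback loop described above can be illustrated with a toy model. This is a minimal sketch, not a reconstruction of our infrastructure: the function name `simulate_cascade`, the load and capacity figures, and the assumption that one overloaded machine is removed per healthcheck round are all hypothetical.

```python
def simulate_cascade(total_load, capacity_per_machine, machines, initially_removed):
    """Return the number of in-service machines after each healthcheck round.

    Toy model (all assumptions): load is spread evenly across in-service
    machines, and whenever the per-machine load exceeds capacity, one more
    machine fails its healthcheck and is removed in the next round.
    """
    in_service = machines - initially_removed
    history = [in_service]
    while in_service > 0:
        load_per_machine = total_load / in_service
        if load_per_machine <= capacity_per_machine:
            break  # remaining machines can absorb the traffic; no cascade
        in_service -= 1  # an overloaded machine is marked unhealthy
        history.append(in_service)
    return history


if __name__ == "__main__":
    # Hypothetical figures: 10 machines, each able to serve 100 units of load.
    print(simulate_cascade(800, 100, 10, initially_removed=2))
    print(simulate_cascade(800, 100, 10, initially_removed=3))
```

In this model, removing two machines leaves the set stable, while removing a third tips the remaining machines over capacity and the set drains to zero. The point is only that a small number of initial removals can be enough to start the loop once headroom is exhausted.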

We loosened the parameters of our healthchecks to halt the cascade and replaced all existing machines behind the ELB with fresh ones, as newly added machines were not being marked unhealthy by the ELB.
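To show why loosening healthcheck parameters can halt a cascade, here is a small sketch of threshold-based healthchecking. It assumes a machine is marked unhealthy after a configured number of consecutive checks exceed a timeout, which is how load balancer healthchecks commonly work. The function name, the sample response times, and the parameter values are hypothetical; this report does not state which parameters were changed or to what values.

```python
def is_marked_unhealthy(response_times, timeout, unhealthy_threshold):
    """Return True if the machine would be removed from the load balancing set.

    A check fails when its response time exceeds `timeout` (seconds). The
    machine is marked unhealthy once `unhealthy_threshold` consecutive checks
    have failed; any passing check resets the counter.
    """
    consecutive_failures = 0
    for response_time in response_times:
        if response_time > timeout:
            consecutive_failures += 1
            if consecutive_failures >= unhealthy_threshold:
                return True
        else:
            consecutive_failures = 0
    return False


if __name__ == "__main__":
    # Hypothetical healthcheck response times (seconds) from a loaded machine.
    samples = [1.0, 6.0, 7.0, 2.0, 6.5, 3.0]
    print(is_marked_unhealthy(samples, timeout=5, unhealthy_threshold=2))   # strict
    print(is_marked_unhealthy(samples, timeout=5, unhealthy_threshold=3))   # more failures tolerated
    print(is_marked_unhealthy(samples, timeout=10, unhealthy_threshold=2))  # longer timeout
```

With the strict settings, the slow but still functioning machine is removed. With either a higher threshold or a longer timeout it stays in service, which keeps its share of traffic from being pushed onto the others.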

We will be reviewing our healthchecking strategy and investigating the root cause of the initial removals.

We apologise for the interruption to your service.

Posted Mar 03, 2015 - 18:50 UTC

Resolved

The issues with the REST API and socket latencies have been resolved.
Posted Mar 03, 2015 - 18:08 UTC

Monitoring

Service is now much improved, but we are monitoring the situation.
Posted Mar 03, 2015 - 17:02 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Mar 03, 2015 - 16:56 UTC

Update

Service has greatly recovered, but a small percentage of users may still be experiencing intermittent failures.
Posted Mar 03, 2015 - 16:56 UTC

Investigating

We are currently investigating issues with WebSocket latencies and the REST API, which began at 16:05.
Posted Mar 03, 2015 - 16:19 UTC
This incident affected: Channels REST API and Channels WebSocket client API.