We experienced an issue whereby our ELB began removing otherwise healthy machines from the load balancing set. Concentrating all traffic on the remaining machines caused them to become significantly overloaded and report as unhealthy to the load balancer.
This caused a cascading failure: each machine removed put more traffic on those remaining, which in turn led to further removals.
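As a rough illustration of this feedback loop, the sketch below simulates a pool of identical machines sharing a fixed amount of traffic. It is a toy model, not a reconstruction of the incident: the function name, the load and capacity figures, and the rule that one overloaded machine fails its healthcheck per round are all assumptions made for the example.

```python
def simulate_cascade(total_load, capacity_per_machine, machines, initially_removed):
    """Return the number of machines in service after each round.

    Toy model (all parameters hypothetical): traffic is spread evenly,
    and while the per-machine load exceeds capacity, one more machine
    fails its healthcheck and is removed each round.
    """
    in_service = machines - initially_removed
    history = [in_service]
    while in_service > 0:
        load_per_machine = total_load / in_service
        if load_per_machine <= capacity_per_machine:
            break  # the remaining machines can absorb the traffic
        in_service -= 1  # an overloaded machine is marked unhealthy
        history.append(in_service)
    return history


# Losing one machine out of ten is absorbed by the rest.
print(simulate_cascade(800, 100, 10, 1))
# Losing three pushes the rest over capacity, and the pool drains.
print(simulate_cascade(800, 100, 10, 3))
```

In this model the pool is stable as long as the remaining machines have headroom. Once the initial removals push per-machine load past capacity, every further removal makes the overload worse, so nothing stops the cascade until the pool is empty or the healthcheck rule changes.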
We loosened the parameters of our healthchecks to halt the cascade and replaced all existing machines behind the ELB with fresh ones, as newly added machines were not being marked unhealthy by the ELB.
We will be reviewing our healthchecking strategy and investigating the root cause of the initial removals.
We apologise for the interruption to your service.