Issues in us2 cluster

Incident Report for Pusher

Postmortem

Our us2 cluster had issues on 2018-05-31 from approximately 07:14 to 09:03. The root cause was an outage in AWS: the us-east-2 region lost all connectivity to the Internet for approximately 30 minutes. Internal networking in the region was unaffected, so, with one exception, our us2 cluster became available again when the connection between us-east-2 and the Internet was restored.

The one exception to automatic recovery was our webhook feature. Webhook events are placed on a queue and consumed by a webhook sender. The network outage triggered a bug in our webhook sender that caused it to stop sending webhooks, and it resumed only after we manually restarted it. Webhook events were not lost, but were delayed by up to two hours.
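For illustration only, here is a minimal sketch of how a queue consumer can be written so that a network failure delays delivery rather than halting it. This is not our webhook sender's code and does not describe the actual bug: the `WebhookSender` class, the `send` callable, and the backoff values are all hypothetical. It assumes that delivery failures surface as `OSError` (which includes `ConnectionError` and timeouts).

```python
import time
from collections import deque


class WebhookSender:
    """Hypothetical queue consumer, shown only as an illustration."""

    def __init__(self, events, send, max_backoff=60.0, sleep=time.sleep):
        self.queue = deque(events)      # pending webhook events, oldest first
        self.send = send                # callable delivering one event; may raise OSError
        self.max_backoff = max_backoff  # upper bound on the retry delay, in seconds
        self.sleep = sleep              # injectable so the delay can be stubbed out

    def drain(self):
        """Deliver every queued event in order, retrying on network errors."""
        backoff = 1.0
        delivered = 0
        while self.queue:
            event = self.queue[0]       # peek: the event stays queued until it is sent
            try:
                self.send(event)
            except OSError:
                # Network failure: wait and retry instead of leaving the loop.
                self.sleep(backoff)
                backoff = min(backoff * 2, self.max_backoff)
                continue
            self.queue.popleft()        # remove only after a successful send
            backoff = 1.0
            delivered += 1
        return delivered
```

Because an event is removed from the queue only after a successful send, an outage in this sketch delays events without dropping them, and delivery resumes without a restart once the network returns.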

Posted Jun 22, 2018 - 18:29 UTC

Resolved

This incident has been resolved.

Posted May 31, 2018 - 10:07 UTC

Monitoring

The "us2" Channels cluster is fully operational again. We're monitoring it.

Posted May 31, 2018 - 08:07 UTC

Update

Messages are now being delivered. Other features may still be affected. We're investigating.

Posted May 31, 2018 - 07:56 UTC

Update

We have traced this to underlying problems in AWS EC2 region us-east-2. Many instances are gone or unavailable in this region. See http://status.aws.amazon.com/ - "12:36 AM PDT We are currently investigating connectivity issues in the US-EAST-2 Region."

Posted May 31, 2018 - 07:40 UTC

Investigating

Messages may not be delivered on our us2 cluster. Features relying on message delivery are unavailable. We're investigating.

Posted May 31, 2018 - 07:30 UTC

This incident affected: Channels REST API, Channels WebSocket client API, Channels Stats Integrations, Channels presence channels, and Channels Webhooks.