On Wednesday, July 7th at 16:17 there was a network issue at Heroku, which caused the Beams API to become unavailable. The incident lasted until 16:46, when the network issue was resolved.
During this time, Beams users could not publish notifications, and any requests to our API would have timed out.
Note that the incident start time on this page was modified to reflect the start time of the Heroku incident.
All times are in UTC on the day of July 7th.
At 16:17, a problem at Heroku caused all routing of requests to the Beams API to stop: https://status.heroku.com/incidents/2300. Our engineers were not aware of this at the time.
At 16:19 synthetic tests start failing and alert engineers through Slack.
At 16:24 2 engineers are paged and 3 engineers immediately start investigating. They observe a complete drop in incoming traffic.
At 16:30, without any indication of problems, an engineer rolls back the latest deployment, with no effect.
At 16:46 an engineer opens an incident on the public status page.
At 16:46 incoming traffic starts to come back to normal levels and synthetic tests start passing.
At 17:02 Heroku creates an incident on their status page.
After the incident, we identified two main issues:
It took us 27 minutes to update the status page after the first synthetic tests started failing. The synthetic tests emulate real user traffic, so their failure is a clear indication that the API is unavailable. We are changing our internal procedures to make sure the status page is updated immediately.
The root cause of the incident was a network problem in a Heroku deployment in a single region. We know our customers depend on our services being available, even through incidents like these. To mitigate this issue, we will start investigating multi-region deployments.
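The intervals quoted in this report can be checked against the timeline with a short calculation. The sketch below is illustrative only: the event names and the helper functions `at` and `minutes_between` are hypothetical, and the timestamps are copied from the timeline above.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    # Only the time of day matters here; all events fall on the same day (UTC).
    return datetime.strptime(hhmm, "%H:%M")


# Timestamps copied from the timeline in this report.
# The event names are hypothetical labels chosen for this sketch.
timeline = {
    "heroku_issue_starts": at("16:17"),
    "synthetic_tests_fail": at("16:19"),
    "engineers_paged": at("16:24"),
    "status_page_updated": at("16:46"),
    "traffic_recovers": at("16:46"),
}


def minutes_between(start: str, end: str) -> int:
    # Whole minutes elapsed between two named timeline events.
    delta = timeline[end] - timeline[start]
    return int(delta.total_seconds() // 60)


detection_delay = minutes_between("heroku_issue_starts", "synthetic_tests_fail")
status_page_delay = minutes_between("synthetic_tests_fail", "status_page_updated")
outage_duration = minutes_between("heroku_issue_starts", "traffic_recovers")

print(f"Detection delay:   {detection_delay} min")
print(f"Status page delay: {status_page_delay} min")
print(f"Outage duration:   {outage_duration} min")
```

With the times listed above, this gives a detection delay of 2 minutes, a status page delay of 27 minutes, and an outage duration of 29 minutes.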