Increased latency on MT1 and US2 Clusters - Starting at 15:15 UTC

Incident Report for Pusher

Postmortem

During a migration to new infrastructure, we encountered a bug that caused high latency and timeout errors when publishing messages to the API.

Engineers were notified at 15:31 UTC and began investigating. At 15:45, we decided to roll back the migration, which reduced latencies and restored normal service by 16:48.

This migration was a routine task that had already been completed on other clusters without any issues. On the MT1 and US2 clusters, however, the sidecar proxy container that we deploy with our application failed to report connection errors because of a bug in its readiness probe. As a result, unhealthy pods remained in service, which reduced the effective capacity of the clusters. We have since identified and implemented a fix to prevent this from recurring.
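To illustrate the class of failure described above, here is a minimal sketch of a readiness check that treats any upstream connection failure as "not ready". This is not our sidecar's actual code: the function names `tcp_upstream_ok` and `readiness_status`, the status bodies, and the choice of a plain TCP connect as the health signal are all assumptions made for illustration.

```python
import socket
from typing import Callable, Tuple


def tcp_upstream_ok(host: str, port: int, timeout: float = 1.0) -> bool:
    """Hypothetical check: True if a TCP connection to the upstream opens."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def readiness_status(check_upstream: Callable[[], bool]) -> Tuple[int, str]:
    """Map an upstream check to the HTTP status a readiness endpoint returns.

    Any failure, including an unexpected exception inside the check itself,
    is reported as 503 so the pod is marked unready rather than left in
    service.
    """
    try:
        healthy = check_upstream()
    except Exception:
        healthy = False
    return (200, "ready") if healthy else (503, "upstream unreachable")
```

The point of the sketch is the failure mode: a probe that returns success when the check errors out, or that never exercises the upstream connection at all, keeps a pod marked as ready even though it cannot serve requests.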

While we have identified and corrected the readiness probe issue in our proxy sidecar container, we are still investigating the root cause of the connection issue that occurred in those clusters.

We will not attempt another migration until all the issues have been resolved and confirmed via rigorous testing.

Posted Mar 02, 2023 - 17:57 UTC

Resolved

This incident has been resolved.

Posted Feb 28, 2023 - 17:06 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 28, 2023 - 16:53 UTC

Identified

A latency increase was observed on the MT1 and US2 clusters after a deployment. Engineers are actively working on a fix or rollback.

Posted Feb 28, 2023 - 15:50 UTC

This incident affected: Channels REST API.