During a migration to new infrastructure, we encountered a bug causing high latency and timeout errors when publishing messages to the API.
Engineers were notified at 15:31 and began investigating. At 15:45, we decided to roll back the migration, which reduced latencies and restored normal service by 16:48.
Although this migration was a routine task that had been completed successfully on other clusters without incident, on the mt1 and us2 clusters the sidecar proxy container we deploy alongside our application failed to report connection errors due to a bug in its readiness probe. As a result, unhealthy pods continued to receive traffic, reducing effective cluster capacity. We have since identified and deployed a fix to prevent this issue from recurring.
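To illustrate the class of bug described above: a readiness probe should start failing when the sidecar detects upstream connection errors, so that the orchestrator removes the pod from rotation. The sketch below is purely illustrative and is not our proxy's actual code; the endpoint path, port, and `connection_errors` counter are assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

# Hypothetical counter the proxy would increment whenever an
# upstream connection attempt fails.
connection_errors = 0
_lock = threading.Lock()

def record_connection_error():
    """Called by the (hypothetical) proxy on each failed connection."""
    global connection_errors
    with _lock:
        connection_errors += 1

class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":
            self.send_response(404)
            self.end_headers()
            return
        with _lock:
            healthy = connection_errors == 0
        # Return 503 when errors have been observed, so the
        # orchestrator marks the pod unready and stops routing
        # traffic to it. The buggy probe effectively always
        # returned 200, keeping unhealthy pods in rotation.
        self.send_response(200 if healthy else 503)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep probe traffic out of the request log

if __name__ == "__main__":
    HTTPServer(("", 8079), ReadyHandler).serve_forever()
```

A container orchestrator polling `/ready` against this handler would see the pod flip to unready as soon as a connection error is recorded, which is the behavior the fix restores.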
While we have corrected the readiness probe issue in our proxy sidecar container, we are still actively investigating the root cause of the connection errors that occurred in those clusters.
We will not attempt another migration until all issues have been resolved and confirmed via rigorous testing.