On March 17th, at 05:48 UTC, our alerting system paged the on-call engineers about a problem with webhooks and increased API errors on MT1. API requests and webhooks related to presence channel members and channel existence were affected for 50% of apps using these features until the incident was resolved at 06:37 UTC.
The affected Redis shard cluster is responsible only for specific features: API requests and webhooks related to presence channel members and channel existence. There are two Redis shard clusters in MT1. One of these shard clusters went down due to problems in the underlying infrastructure. Bringing it back up resolved the incident.
A single Redis shard cluster repeatedly failed to complete full synchronisation from the primary instance to the replicas because of insufficient disk space on the primary instance. At some point, full synchronisation requests from two replicas coincided in a way that caused the memory of the primary instance to grow continuously until it ran out of memory. At this point, the primary instance went down.
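The two conditions described above (too little disk space for a full synchronisation, and memory growth while replicas are being served) can be expressed as a simple pre-flight check. The following is a minimal sketch and not the tooling used during this incident: the `PrimaryStats` fields, the `full_sync_warnings` function, and both threshold values are hypothetical and chosen only for illustration. In practice the inputs would have to come from the host and from Redis metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrimaryStats:
    """Hypothetical snapshot of a primary instance; all values in bytes."""
    free_disk_bytes: int
    dataset_bytes: int
    used_memory_bytes: int
    memory_limit_bytes: int
    replica_output_buffer_bytes: int


def full_sync_warnings(
    stats: PrimaryStats,
    disk_headroom: float = 1.2,      # assumed safety factor, not a Redis default
    memory_threshold: float = 0.8,   # assumed alerting threshold, not a Redis default
) -> List[str]:
    """Return warnings for conditions that could make a full sync risky."""
    warnings = []

    # A disk-backed full sync writes a snapshot of the dataset to disk first,
    # so free space should comfortably exceed the dataset size.
    if stats.free_disk_bytes < stats.dataset_bytes * disk_headroom:
        warnings.append("disk: not enough free space for a full sync snapshot")

    # Memory used by the dataset plus data buffered for replicas should stay
    # well below the instance's memory limit.
    projected = stats.used_memory_bytes + stats.replica_output_buffer_bytes
    if projected > stats.memory_limit_bytes * memory_threshold:
        warnings.append("memory: dataset plus replica buffers near the limit")

    return warnings
```

A check of this kind could run periodically and page before a replica requests a full synchronisation, rather than after the primary instance has already failed.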