From 2019-12-12 to 2020-01-14, clients using Keep-Alive when making HTTP requests to the Channels REST API on our mt1 cluster could receive incorrect HTTP responses.
The main feature impacted by this issue was the REST query API: requests for channel metadata may have received incorrect responses. Core pub/sub message delivery was not affected, though REST API requests may have received incorrect responses. Client-side features, including presence and client events, were not affected. Webhooks were not affected.
The underlying issue was that, under some conditions, our application server processes would send a single 4XX response multiple times, resulting in all subsequent responses on the connection being out of alignment with requests. This bug existed before 2019-12-12, but was masked because a proxy in front of those processes did not use Keep-Alive. This proxy was removed on mt1 on 2019-12-12, exposing this potential issue to end users. The underlying issue was diagnosed on 2020-01-13 and fixed on 2020-01-14.
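To illustrate why one duplicated response misaligns everything that follows on a Keep-Alive connection, here is a minimal sketch. It is a simplified model, not our server code: the function `simulate_keep_alive` and the request and response strings are hypothetical. It assumes the client pairs each request with the next response read from the connection, which is how HTTP/1.1 persistent connections match responses to requests.

```python
from collections import deque


def simulate_keep_alive(requests, duplicated_index=None):
    """Pair each request with the next response read from one connection.

    `requests` is a list of (request, intended_response) pairs.
    If `duplicated_index` is set, the response at that position is written
    to the connection twice, modelling the duplicated 4XX response.
    (Hypothetical helper for illustration only.)
    """
    wire = deque()  # responses queued on the connection, in send order
    for i, (_, response) in enumerate(requests):
        wire.append(response)
        if i == duplicated_index:
            wire.append(response)  # the bug: the same response is sent again

    # The client has no request IDs to match on; it takes responses in order.
    return [(request, wire.popleft()) for request, _ in requests]


traffic = [
    ("GET /a", "200 a"),
    ("GET /missing", "404"),
    ("GET /b", "200 b"),
    ("GET /c", "200 c"),
]

for request, received in simulate_keep_alive(traffic, duplicated_index=1):
    print(request, "->", received)
```

In this model, every request after the duplicated 404 receives the response intended for the request before it. A proxy that opens a fresh connection per request would discard the stray response along with the connection, which is consistent with the bug having been masked until the proxy was removed.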
(All times in UTC.)
mt1 cluster towards servers on new infrastructure. This cluster has a different set of proxies, all of which respect Keep-Alive.
mt1 cluster.
The primary issue in this incident was slow response. Once we identified the issue, it took too long to roll back. We have identified two major reasons, each with its own planned mitigation:
mt1 REST traffic. We had to first increase capacity, and this slowed the rollback. To mitigate this future risk, we will keep the old infrastructure at full capacity until it is destroyed.