Inconsistencies in presence state/connection counts
Incident Report for Pusher
Postmortem

Overview

Between 11:34 and 19:30 UTC on Mar 18, 2020, users requesting presence data or connection counts from the Pusher Channels HTTP API on the mt1 cluster may have been served inaccurate or stale data. This was due to corrupted state caused by a misconfigured Redis Sentinel instance.

Background

Pusher Channels stores presence channel state and channel subscriber state in Redis, sharded across multiple instances. To keep this state reliably available, each shard runs with multiple replicas in different availability zones. Redis Sentinel manages failovers between masters and replicas and provides configuration discovery.
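To make the roles concrete, here is a minimal sketch of how an edge process might discover the current master through Sentinel using redis-py. The hostnames, ports, key names, and the service name "presence-shard-0" are illustrative, not our actual topology:

    from redis.sentinel import Sentinel

    # Ask the Sentinel group where the current master for a shard lives.
    sentinel = Sentinel(
        [("sentinel-a", 26379), ("sentinel-b", 26379), ("sentinel-c", 26379)],
        socket_timeout=0.5,
    )

    # Writes (e.g. recording a presence member) must reach the master...
    master = sentinel.master_for("presence-shard-0", socket_timeout=0.5)
    master.hset("presence:channel-42", "socket-1", "{}")

    # ...while reads can be served from a replica.
    replica = sentinel.slave_for("presence-shard-0", socket_timeout=0.5)
    members = replica.hgetall("presence:channel-42")

If a Sentinel hands out the wrong address, every client that trusts it inherits the mistake, which is what makes Sentinel misconfiguration so dangerous.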

Timeline

(All times are in UTC)

At 11:34 a Channels engineer identified a Redis Sentinel instance in a broken state. The affected Sentinel was directing a small subset of Channels edge processes to a replica instance instead of the master, causing writes to fail. While attempting to perform a repair, the engineer misconfigured the Redis Sentinel process. This caused writes intended for one of the Redis shards to reach a different shard instead.
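For context on why the broken Sentinel caused write failures: Redis replicas reject writes by default, so any edge process pointed at a replica fails immediately. A hedged illustration with redis-py (the host name is illustrative):

    import redis

    # An edge process handed a replica address by a broken Sentinel will
    # fail on every write, because replicas are read-only by default.
    conn = redis.Redis(host="shard-0-replica", port=6379)
    try:
        conn.hset("presence:channel-42", "socket-1", "{}")
    except redis.exceptions.ReadOnlyError:
        # Redis replies: "READONLY You can't write against a read only replica."
        pass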

At 11:52 the Redis shards were correctly configured and all writes were reaching the correct shard. However, the period of misconfiguration led to data inconsistencies in both shards and a significant increase in the memory usage of all Redis processes. The Redis memory allowance was increased until it approached the maximum acceptable on the machines.
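Raising the memory allowance is an online operation in Redis; a sketch of the kind of change involved (the host and the 8gb figure are illustrative, not the values used during the incident):

    import redis

    # Raise the memory ceiling on a Redis process without a restart.
    conn = redis.Redis(host="shard-0-master", port=6379)
    conn.config_set("maxmemory", "8gb")
    print(conn.config_get("maxmemory"))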

At 14:25 all Redis processes were migrated to machines with more memory available in order to stabilise the system. Some users were still receiving incorrect presence data and subscriber counts.

Between 15:26 and 17:40 a solution to the data discrepancies was prepared and tested in a test environment.

Between 18:25 and 19:29 the solution was applied to the mt1 cluster and the incident was resolved.

How will we ensure this doesn’t happen again?

We know operators make mistakes and, at the scale of the Pusher systems, these mistakes can have severe consequences for many people.

It took us too long to fix this problem and customers depending on presence suffered greatly. We are sorry for this and want to make sure we do better next time. Firstly, we need to make this less likely to happen again:

  • We have used Redis Sentinel in anger for some time now. We know that Sentinel can end up in a “split-brain” state, and (surprisingly) this happens with some frequency. We are tweaking some parameters to make this less likely in the future; a sketch of the kind of tuning involved follows this list.
  • We will make misconfiguration harder by improving our internal procedures and applying more automation.
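As a sketch of the tuning mentioned above, Sentinel's failover behaviour can be adjusted at runtime with the SENTINEL SET command. The master name and the values below are illustrative, not our production settings:

    import redis

    # Connect directly to one Sentinel (26379 is the Sentinel default port).
    sentinel = redis.Redis(host="sentinel-a", port=26379)

    # Require more Sentinels to agree before declaring the master down.
    sentinel.execute_command("SENTINEL", "SET", "presence-shard-0",
                             "quorum", "3")

    # Wait longer before treating an unresponsive master as failed,
    # reducing the chance of a spurious failover and a split brain.
    sentinel.execute_command("SENTINEL", "SET", "presence-shard-0",
                             "down-after-milliseconds", "10000")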

Longer term, we need to ensure that if something like this happens again, we can react faster. We already know how to solve this specific issue, but we know there are broader problems with our presence system. We will look into making this system more resilient and self-healing.
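One possible shape of a self-healing pass, shown only as a sketch: periodically reconcile stored presence members against live connections and drop anything stale. The key layout ("presence:*") and the liveness check are assumptions for illustration, not our actual schema:

    import redis

    conn = redis.Redis(host="shard-0-master", port=6379)

    def connection_alive(socket_id: str) -> bool:
        # Placeholder liveness check; a real system would consult the
        # edge tier that owns the WebSocket connections.
        return conn.exists(f"socket:{socket_id}") == 1

    # Scan presence hashes and remove members with no live connection.
    for key in conn.scan_iter(match="presence:*"):
        for socket_id in conn.hkeys(key):
            if not connection_alive(socket_id.decode()):
                conn.hdel(key, socket_id)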

Posted Mar 23, 2020 - 14:34 UTC

Resolved
This incident has been resolved.
Posted Mar 19, 2020 - 09:23 UTC
Monitoring
All known stale data has been removed. All presence data, channel existence data, and connection/subscription counting data should now be accurate and realtime.
Posted Mar 18, 2020 - 19:34 UTC
Update
Our new approach to deleting the stale data is succeeding. We have a new clean-up job running. Estimated time to completion is ~60 minutes.
Posted Mar 18, 2020 - 18:30 UTC
Update
Some stale data still remains which was not removed by this job. We have identified the cause of this stale data. We are trying a new approach to remove this stale data.
Posted Mar 18, 2020 - 17:43 UTC
Update
We have started running a job to remove any remaining stale data.
Posted Mar 18, 2020 - 15:51 UTC
Update
You may still be experiencing the following issues on our mt1 cluster:

- presence channels may report users that are no longer members
- channel existence queries/webhooks may report channels as having subscribers when they do not

This is because we still have stale data related to presence and channel existence. This data was not fully cleaned up by the previous job.

We have manually identified the remaining stale data, and we are preparing a new job to clean up this remaining stale data.
Posted Mar 18, 2020 - 15:27 UTC
Update
The job has completed, but we believe we may still have inconsistencies in data related to the state of end user connections and subscriptions, including presence data. We are currently identifying any further inconsistencies.
Posted Mar 18, 2020 - 14:58 UTC
Update
We still have inconsistencies in presence membership and connection counts.

A job is running to eliminate these.

Metrics project that this job will be complete within 60 minutes.
Posted Mar 18, 2020 - 14:12 UTC
Identified
We're currently experiencing issues with a database which stores presence membership information and connection counts.

A solution has been identified and is being deployed now.
Posted Mar 18, 2020 - 12:19 UTC
This incident affected: Channels REST API, Channels WebSocket client API, and Channels presence channels.