For approximately 31 hours, stats were not exported for customers using our Datadog and Librato integrations. The cause was a bug in the failover logic of our “Exporter” component. We fixed the bug and added alarms to detect this problem in the future. No stats were lost during this outage.
|Aug 7, 20:06||Our Exporter component (which sends stats to Datadog and Librato) experiences connectivity issues. Exporter runs in a master/slave setup, and the connectivity issues trigger a failover. The master stops performing its master operations but does not relinquish the master role. Exporter reads stats from a queue in a Kafka cluster; from this point, that queue grows.|
|Aug 8, 17:40-45||We manually force failovers, but the service does not fail over correctly.|
|Aug 8, 18:15||We identify and patch a bug in our failover logic. From this point, the stats queue in Kafka is being processed.|
|Aug 9, 01:45||We restart one of the Kafka nodes, which was reporting out-of-memory errors.|
|Aug 9, 03:45||The stats queue is fully processed. Exporter is exporting real-time data again.|
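
The Exporter code itself is not part of this report, so the following is only a minimal sketch of the class of bug described in the timeline: a master that stops doing its work without releasing the master role, which leaves the slave unable to take over. Every name here (`Node`, `Cluster`, `buggy_step_down`, `step_down`, `try_promote`, `backlog_alarm`) is hypothetical and chosen for illustration. The alarm function is likewise an assumed example of a backlog check, not the alarm we actually deployed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    name: str
    exporting: bool = False


class Cluster:
    """Hypothetical master/slave pair sharing a single leadership slot."""

    def __init__(self, master: Node, slave: Node) -> None:
        self.nodes = [master, slave]
        self.leader: Optional[Node] = master
        master.exporting = True

    def buggy_step_down(self, node: Node) -> None:
        # Stops the work but keeps the role: nothing is exported,
        # and the leadership slot is still occupied.
        node.exporting = False

    def step_down(self, node: Node) -> None:
        # Stops the work AND releases the role in the same operation.
        node.exporting = False
        if self.leader is node:
            self.leader = None

    def try_promote(self, candidate: Node) -> bool:
        # A standby may only take over when the leadership slot is free.
        if self.leader is not None:
            return False
        self.leader = candidate
        candidate.exporting = True
        return True

    def is_exporting(self) -> bool:
        return self.leader is not None and self.leader.exporting


def backlog_alarm(lag_samples: Sequence[int], threshold: int) -> bool:
    """Fire when the newest backlog sample exceeds the threshold and the
    backlog has not shrunk between the oldest and newest sample."""
    if len(lag_samples) < 2:
        return False
    return lag_samples[-1] > threshold and lag_samples[-1] >= lag_samples[0]
```

In this sketch, `buggy_step_down` followed by `try_promote` leaves the cluster with a leader that exports nothing, while `step_down` frees the slot so the standby can be promoted. A backlog check such as `backlog_alarm` is one way the growing queue could have been noticed independently of the failover state.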