Librato/Datadog integration issue
Incident Report for Pusher
Postmortem

Overview

For approximately 31 hours, stats were not exported for customers using our Datadog/Librato integrations. This was due to a bug in the failover logic in our “Exporter” component. We fixed the bug and added alarms to detect this problem in future. No stats were lost during this outage.

Timeline

Time (UTC) Event
Aug 7, 20:06 Our Exporter component (which sends stats to Datadog and Librato) experiences connectivity issues. Exporter runs in a master/slave setup and reads stats from a queue in a Kafka cluster. The connectivity issues trigger a failover, but the master stops doing its master work without giving up the master role. From this point, the stats queue in Kafka is growing.
Aug 8, 17:40-45 We manually force failovers, but the service does not fail over correctly.
Aug 8, 18:15 We identify and patch a bug in our failover logic. From this point, the stats queue in Kafka is being processed.
Aug 9, 01:45 We restart one of the Kafka nodes, which was reporting out-of-memory errors.
Aug 9, 03:45 The stats queue is fully processed. Exporter is exporting realtime data again.

What went wrong?

  • We did not have appropriate alarms for our on-call engineer, which meant the problem was not identified promptly.
  • We had a couple of alarms that would trigger if the number of exports was 0. However, the services weren't reporting the metric at all, so the alarms sat in a no-data state instead of firing.
  • Bugs in the service that only became apparent under load.
  • A bug in the automatic failover of the Exporter component.
  • One of the Kafka nodes was unhealthy and had to be restarted. However, our Kafka dashboard didn't reveal any issues.
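
The no-data failure mode above can be illustrated with a minimal sketch. It assumes an alarm that evaluates a window of "exports per minute" samples; the names `AlarmState`, `evaluate_export_alarm`, and `should_page` are hypothetical and are not taken from the real alarm configuration. The point is only that an empty window is a third state, distinct from a count of 0, and that someone has to decide whether it pages.

```python
from enum import Enum
from typing import Sequence


class AlarmState(Enum):
    OK = "ok"
    ALARM = "alarm"
    NO_DATA = "no_data"


def evaluate_export_alarm(samples: Sequence[float]) -> AlarmState:
    """Classify one evaluation window of a hypothetical export-count metric."""
    if not samples:
        # The service reported nothing at all: neither OK nor a zero count.
        return AlarmState.NO_DATA
    if sum(samples) == 0:
        # The service is reporting, and it reports zero exports.
        return AlarmState.ALARM
    return AlarmState.OK


def should_page(state: AlarmState, page_on_no_data: bool = True) -> bool:
    """Decide whether to alert on-call; missing data pages by default (an assumption)."""
    if state is AlarmState.ALARM:
        return True
    return state is AlarmState.NO_DATA and page_on_no_data
```

With `page_on_no_data=False`, a silent service produces `NO_DATA` and nobody is paged, which matches the situation described in the list above.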

How was it fixed?

  • Fixing the bug in the failover logic: publishing to Librato or Datadog didn't include any timeouts.
  • Adding extra alarming and improving visibility. We’re currently monitoring our new alarms for false positives. When we are confident in these alarms, we will use them for on-call alerts.
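
As a rough illustration of why a publish call without a timeout is dangerous, the sketch below bounds how long a caller waits on a blocking publish. It is not the Exporter's code: `publish_with_timeout`, `PublishTimeout`, and `hung_publish` are hypothetical names, and the publish function is a stand-in for a network call to Librato or Datadog.

```python
import concurrent.futures
import time
from typing import Callable


class PublishTimeout(Exception):
    """Raised when a publish call does not finish within its deadline."""


def publish_with_timeout(publish: Callable[[], None], timeout_s: float) -> None:
    """Run a blocking publish call, but stop waiting after timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(publish)
    try:
        future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise PublishTimeout(f"publish exceeded {timeout_s}s") from None
    finally:
        # Do not block on the worker; a stalled call keeps running in the background.
        executor.shutdown(wait=False)


def hung_publish() -> None:
    """Hypothetical stand-in for a network call that stalls."""
    time.sleep(0.5)
```

This wrapper only stops the caller from waiting; the stalled call itself is not cancelled. In a real client it is usually better to set the timeout on the HTTP or socket call directly. What the caller does on `PublishTimeout` (retry, or give up the master role) is outside this sketch.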
Posted Aug 13, 2018 - 09:00 UTC

Resolved
An issue with our Librato and Datadog integrations meant that stats were not exported. This has since been resolved, and extra alerting has been put in place. This did not impact any core functionality. Sorry for any inconvenience!
Posted Aug 08, 2018 - 16:51 UTC
This incident affected: Channels Stats Integrations.