Stats not displayed on dashboard.pusher.com

Incident Report for Pusher

Postmortem

Stats graphs on dashboard.pusher.com are backed by a "logs" table in a MySQL database. Entries in this table are populated from a Kafka instance by a component called "stats-forwarder". At 01:56 UTC on Nov 09, 2018, the stats-forwarder stopped forwarding. We are still investigating the root cause; we have improved our logging to aid investigation.

A separate component called "stats-forwarder-production-tests" checks whether stats-forwarder is making progress. It correctly identified the issue, but the on-call alert attached to it was misconfigured, so we were not paged and did not find out until 14:45 UTC. This alert is now fixed.
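To illustrate the kind of progress check described here, the following is a minimal sketch and not the actual stats-forwarder-production-tests code. It assumes the forwarder periodically reports the last Kafka offset it has written; the names `ProgressSample`, `is_stalled` and `max_idle_seconds` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProgressSample:
    timestamp: float  # seconds since epoch when the sample was taken
    offset: int       # last offset the forwarder reported as written (assumed metric)


def is_stalled(samples: Sequence[ProgressSample], now: float,
               max_idle_seconds: float = 300.0) -> bool:
    """Return True if the offset has not advanced within max_idle_seconds."""
    if not samples:
        # No reports at all is treated the same as no progress.
        return True
    # Start from the oldest sample, then move forward to the most recent
    # sample at which the offset actually increased.
    last_advance = samples[0].timestamp
    for previous, current in zip(samples, samples[1:]):
        if current.offset > previous.offset:
            last_advance = current.timestamp
    return now - last_advance > max_idle_seconds
```

A check like this is only useful if a positive result reliably reaches whoever is on call, which is the part that failed in this incident.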

Once we were aware of the issue, we restarted stats-forwarder, which started repopulating the logs table. However, we found that stats-forwarder was dropping 5% of the stats. These stats were being dropped because the connection pool from stats-forwarder to MySQL was unbounded in size, so the instance that stats-forwarder runs on eventually could not open any new connections. This bug was only exhibited under high load, due to the large backlog of stats in Kafka. We have fixed this bug by bounding the connection pool size in stats-forwarder.
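As an illustration of what bounding a connection pool means, here is a minimal sketch. It is not the stats-forwarder implementation; `BoundedPool`, `connect` and `max_size` are hypothetical names, and the connection object is only assumed to have a `close()` method. A semaphore caps the number of connections in use, so under a large backlog callers wait for a free connection instead of opening new ones until the machine runs out of ports.

```python
import queue
import threading
from contextlib import contextmanager


class BoundedPool:
    """Connection pool that never holds more than max_size connections."""

    def __init__(self, connect, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._connect = connect  # hypothetical factory returning a new connection
        self._slots = threading.BoundedSemaphore(max_size)
        self._idle = queue.LifoQueue()  # connections ready for reuse

    @contextmanager
    def connection(self, timeout=None):
        # Wait for a free slot rather than opening an extra connection.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no connection available within timeout")
        try:
            try:
                conn = self._idle.get_nowait()
            except queue.Empty:
                conn = self._connect()
            try:
                yield conn
            except BaseException:
                # Assumption: a connection that saw an error is not reused.
                conn.close()
                raise
            else:
                self._idle.put(conn)
        finally:
            self._slots.release()
```

A caller would wrap each write in `with pool.connection(timeout=...) as conn:`. When every slot is busy the caller blocks or times out, which turns load into backpressure rather than dropped writes.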

Those 5% of stats are now lost, resulting in slightly lower usage reported in the dashboard for 2018-11-09. We apologize for the loss of these stats, and for the delayed display of stats on the dashboard.

Posted Nov 14, 2018 - 17:32 UTC

Resolved

Stats have been re-aggregated and are now fully populated in the dashboards.

Sorry for the inconvenience. We will update this issue with a postmortem once we have more details.

Posted Nov 09, 2018 - 22:37 UTC

Update

Stats are being re-aggregated. This will be visible in historical stats on the dashboard. This should take no more than four hours.

Posted Nov 09, 2018 - 17:13 UTC

Update

We have identified the problem. The microservice responsible for writing stats data to MySQL was misconfigured with no bound on the number of connections it could open. This led to the machine making the writes hitting its ephemeral port limit. We have since bounded the number of connections and are now catching up to realtime.

We expect a small amount of data loss for customer-visible stats. We are not expecting to recover this data. Customers may see undercounted messages or connections.

This did not affect message delivery, limits or stats exports to our Application Performance Metrics Providers (Datadog and Librato).

We're sorry for any inconvenience!

Posted Nov 09, 2018 - 16:46 UTC

Identified

Our stats-forwarder component stopped forwarding metrics to our dashboard. We have restarted it. Stats should be realtime again within two hours. Historical logs are missing from the dashboard due to failed aggregations, which we will re-run.

Posted Nov 09, 2018 - 15:09 UTC

Investigating

We are currently investigating this issue.

Posted Nov 09, 2018 - 14:56 UTC

This incident affected: Channels Stats Integrations.