From approximately 09:14 UTC to 10:06 UTC on January 30, 2020, the data in our quick stats dashboards was temporarily unavailable. The cause was a connection issue with some of our servers that lasted approximately 2 minutes.
Scope of Impact
During the incident window, customers could not load data in our quick stats dashboards (Console Grids, Buyer Monitoring Workflow, Monetization Dashboard), except for a relatively small number of cached requests, which still succeeded. All data centers were affected.
Timeline (UTC)
2020-01-30 09:14:08: A vendor bug and an incorrect rack switch configuration cause all interfaces on the rack switch to turn off and on repeatedly
2020-01-30 09:14:25: Backend databases register network errors
2020-01-30 09:14:26: Incident Started: Quick stats requests start to fail
2020-01-30 09:15:47: Rack switch resumes normal operation
2020-01-30 09:23:27: First alert for the backend databases is received
2020-01-30 09:30:00: Databases restarted
2020-01-30 09:32:31: Quick stats requests begin to succeed again
2020-01-30 09:59:03: Backend jobs for the Buyer Monitoring Workflow restarted
2020-01-30 10:06:00: Incident Resolved: All services confirmed back online and running
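The intervals implied by the timeline above can be worked out directly from its timestamps. The following is a minimal sketch in Python; the event names (`switch_flap_start`, `first_alert`, and so on) are labels chosen here for illustration and do not come from any internal tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the timeline above; the keys are illustrative labels.
events = {
    "switch_flap_start": "2020-01-30 09:14:08",
    "requests_fail": "2020-01-30 09:14:26",
    "switch_recovered": "2020-01-30 09:15:47",
    "first_alert": "2020-01-30 09:23:27",
    "requests_succeed": "2020-01-30 09:32:31",
    "resolved": "2020-01-30 10:06:00",
}
times = {name: datetime.strptime(stamp, FMT) for name, stamp in events.items()}


def elapsed(start, end):
    """Return the time between two named timeline events as a timedelta."""
    return times[end] - times[start]


# How long the rack switch interfaces were unstable.
network_outage = elapsed("switch_flap_start", "switch_recovered")
# How long requests were failing before the first database alert arrived.
time_to_first_alert = elapsed("requests_fail", "first_alert")
# How long quick stats requests were failing.
request_outage = elapsed("requests_fail", "requests_succeed")
# Full incident window, from first failure to confirmed resolution.
incident_duration = elapsed("requests_fail", "resolved")

print("Network outage:      ", network_outage)       # 0:01:39
print("Time to first alert: ", time_to_first_alert)  # 0:09:01
print("Request outage:      ", request_outage)       # 0:18:05
print("Incident duration:   ", incident_duration)    # 0:51:34
```

The network interruption itself lasted under two minutes, while roughly nine minutes passed between the first failed requests and the first alert.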
Root Cause
The incident was triggered by a vendor bug that caused all interfaces on the affected rack switch to restart. Because the uplink configuration of that rack switch was also incorrect, the bounce in the switch interfaces caused further connection issues for the servers in the affected rack. Although the network connection recovered in approximately 2 minutes, the interruption caused various components to go down.
Resolution
Once the network connection recovered, our engineers restarted the affected backend databases and jobs, which brought all services back online.
Next Steps
- Fix the affected rack switch uplink configuration
- Fix the interface restart bug
- Set shorter monitoring intervals for faster escalation
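To illustrate why shorter monitoring intervals lead to faster escalation, here is a rough sketch of the worst-case detection delay for a fixed-interval health check. The function name, the alerting model (an alert fires after a number of consecutive failed checks), and all values shown are assumptions made for illustration; they do not describe our actual monitoring configuration.

```python
def worst_case_detection_seconds(check_interval_s: int, failures_required: int) -> int:
    """Upper bound on the delay between a failure starting and an alert firing.

    Hypothetical model: checks run on a fixed interval, and an alert fires
    after `failures_required` consecutive failed checks. A failure that
    begins just after a check waits almost one full interval for its first
    failed check, then (failures_required - 1) further intervals.
    Check execution time and notification delivery time are ignored.
    """
    if check_interval_s <= 0 or failures_required <= 0:
        raise ValueError("interval and failure count must be positive")
    return check_interval_s * failures_required


# Illustrative values only: compare three candidate check intervals,
# each requiring 3 consecutive failures before alerting.
for interval in (300, 60, 15):
    delay = worst_case_detection_seconds(interval, failures_required=3)
    print(f"interval={interval:>3}s -> worst-case detection {delay}s")
```

Under this model, detection delay scales linearly with the check interval, so reducing the interval (or the number of failures required) shortens the time before engineers are notified, at the cost of more frequent checks and a higher chance of alerts on brief blips.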
The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.
We have identified and patched the following issue:
We are monitoring our systems closely, and will provide an update as soon as the issue has been fully resolved.