Incident Summary
From approximately 20:40 to 22:05 UTC on Monday, November 23rd, the impression bus went partially down in most datacenters, resulting in a high percentage of blanks and timeouts.
Scope of Impact
During the incident window, customers experienced a high percentage of blanks and timeouts in all datacenters except NYM.
Timeline (UTC)
2020-11-23 19:34:00: Incident started, first crashes occur in SIN datacenter
2020-11-23 19:37:00: Alert received for SIN datacenter crashes
2020-11-23 19:42:00: Incident escalated
2020-11-23 20:06:00: Underlying problem identified
2020-11-23 20:21:00: Initial fix implemented
2020-11-23 20:39:00: Additional alert received indicating that other datacenters were experiencing issues
2020-11-23 21:02:00: Engineering releases a hotfix
2020-11-23 22:21:00: Impression bus starts to recover
2020-11-24 02:14:00: Incident resolved
Cause Analysis
The incident was caused by the removal of a Prebid Server object, which exposed an old and erroneous impression bus assumption and resulted in the observed datacenter crashes.
Resolution Steps
Our engineers resolved the issue by stopping the source of the data that triggered the crashes while correcting the underlying error in a new release.
Next Steps
- Schedule additional training to help identify and prevent this issue type in the future.
The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.