INCIDENT SUMMARY
From approximately 15:00 UTC to ~23:00 UTC on Monday, May 22, 2023, a planned release introduced a bug that resulted in a data stall and a subsequent bidder outage in datacenters NYM2, AMS3, and FRA1.
INCIDENT IMPACT
• Nature of Impact: Most console bidder objects received no updates from 15:00 UTC to ~23:00 UTC (ending at 22:30 UTC for NYM2/AMS3 and 23:15 UTC for FRA1/LAX1/SIN3). The subsequent bidder outage significantly reduced console bidding in the affected datacenters during the following windows:
• FRA1:
19:52-20:35 UTC - 95% outage
• AMS3:
19:52-20:40 UTC - 80% outage
21:37-22:32 UTC - 80% outage
• NYM2:
19:52-20:50 UTC - 80% outage
20:50-21:05 UTC - 40% outage
21:37-22:40 UTC - 80% outage
• Timeframe: ~8 hours, from 15:00 UTC to ~23:00 UTC on Monday, May 22, 2023.
• Scope: Global
• Components: Console Bidder
TIMELINE (UTC)
• 2023-05-22 15:00: The updates to most console bidder objects stopped.
• 2023-05-22 17:47: Escalated to an engineer by our automated monitoring tool.
• 2023-05-22 18:51: IM Ticket Created.
• 2023-05-22 19:00: The 4-hour data SLA was breached.
• 2023-05-22 19:54: First bidder outage occurred in datacenters NYM2, AMS3, and FRA1.
• 2023-05-22 20:30: The Engineers began the process of reverting the related commit.
• 2023-05-22 21:37: Second bidder outage occurred in datacenters NYM2 and AMS3.
• 2023-05-22 22:30: Console bidding fully recovered and updates to bidder objects started to propagate gradually.
• 2023-05-23 02:30: All bidder objects were fully up to date. Engineers continued to monitor stability.
• 2023-05-23 03:03: Incident Resolved. Clients were notified accordingly.
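The relationship between the 15:00 stall and the 19:00 SLA breach in the timeline above can be checked with simple timestamp arithmetic. The following is a minimal sketch, not the monitoring tool's actual logic; the function names and the `DATA_SLA` constant are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the data SLA is a fixed 4-hour staleness limit, as stated in the timeline.
DATA_SLA = timedelta(hours=4)


def sla_breach_time(last_update: datetime, sla: timedelta = DATA_SLA) -> datetime:
    """Return the moment at which data last updated at `last_update` breaches the SLA."""
    return last_update + sla


def is_sla_breached(last_update: datetime, now: datetime, sla: timedelta = DATA_SLA) -> bool:
    """Return True once the data has been stale for at least the SLA duration."""
    return now - last_update >= sla


# Updates stopped at 15:00 UTC on 2023-05-22.
last_update = datetime(2023, 5, 22, 15, 0, tzinfo=timezone.utc)
breach = sla_breach_time(last_update)
print(breach.isoformat())  # 2023-05-22T19:00:00+00:00
```

Under this model, the escalation at 17:47 UTC happened before the breach, while the first bidder outage at 19:54 UTC happened after it.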
CAUSE ANALYSIS
The issue was caused by a planned release that introduced a bug, which resulted in a data stall and a subsequent bidder outage in datacenters NYM2, AMS3, and FRA1. As a consequence, most console bidder objects received no updates from 15:00 UTC to ~23:00 UTC (ending at 22:30 UTC for NYM2/AMS3 and 23:15 UTC for FRA1/LAX1/SIN3), which breached the 4-hour data SLA.
RESOLUTION STEPS
The issue was resolved by reverting the changes to a stable version and then deploying a hotfix that addressed the bug.
FOLLOW-UP ITEMS
We revised and enhanced the release process so that new features are tested on a smaller set of instances, to evaluate their performance and functionality, before a full release is deployed. This is intended to prevent similar incidents in the future.
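The staged approach described above can be sketched as a simple canary gate: deploy to a small subset of instances, check their health, and only then proceed to the full fleet. This is an illustrative sketch under stated assumptions, not the actual release tooling. The `CanaryResult` fields, the 5% subset size, and the 0.95 bid-rate threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CanaryResult:
    """Health observed on one canary instance (hypothetical fields)."""
    instance: str
    object_updates_flowing: bool  # are bidder object updates still arriving?
    bid_rate_ratio: float         # canary bid rate divided by the pre-release baseline


def select_canaries(instances: List[str], percent: int = 5) -> List[str]:
    """Pick a small subset of instances (at least one) to receive the release first."""
    count = max(1, len(instances) * percent // 100)
    return instances[:count]


def canary_passes(results: List[CanaryResult], min_bid_rate_ratio: float = 0.95) -> bool:
    """Allow the full release only if every canary is updating and bidding normally."""
    if not results:
        return False  # no evidence means no promotion
    return all(
        r.object_updates_flowing and r.bid_rate_ratio >= min_bid_rate_ratio
        for r in results
    )


fleet = [f"bidder-{i}" for i in range(40)]  # hypothetical instance names
canaries = select_canaries(fleet)

# A release that stalls object updates on a canary is blocked before full rollout.
stalled = [CanaryResult(name, object_updates_flowing=False, bid_rate_ratio=1.0) for name in canaries]
print(canary_passes(stalled))  # False
```

Checking object updates separately from bid rate matters here because, in this incident, bidding continued for several hours after updates had already stopped.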
Update - We have identified the following issue:
Ad Serving and Bidding has been restored. Our engineers are actively working towards a resolution to both issues and we will provide an update as soon as possible. Thank you for your patience.
We have identified the following issue:
Our engineers are actively working towards a resolution to both issues and we will provide an update as soon as possible. Thank you for your patience.
We are monitoring the following issue:
Ad Serving and Bidding has been restored. Our engineers are actively working towards a resolution to object update issues, and we will provide an update as soon as possible. Thank you for your patience.
We are investigating the issue and will provide an update as soon as more information is available. Thank you for your patience.
We have identified the following issue:
Our engineers are actively working towards a resolution, and we will provide an update as soon as possible. Thank you for your patience.