Delay in updates to some objects. Bidding and Adserving Degraded
Incident Report for Xandr
Postmortem

INCIDENT SUMMARY
From approximately 15:00 UTC to ~23:00 UTC on Monday, May 22, 2023, a planned release introduced a bug that resulted in a data stall and a subsequent bidder outage in datacenters NYM2, AMS3, and FRA1.

INCIDENT IMPACT
• Nature of Impact: Lack of updates for the majority of console bidder objects from 15:00 UTC to ~23:00 UTC (22:30 UTC for NYM2/AMS3 and 23:15 UTC for FRA1/LAX1/SIN3). The subsequent bidder outage significantly reduced console bidding in the affected datacenters during the following time windows:
• FRA1:
19:52-20:35 UTC - 95% outage
• AMS3:
19:52-20:40 UTC - 80% outage
21:37-22:32 UTC - 80% outage
• NYM2:
19:52-20:50 UTC - 80% outage
20:50-21:05 UTC - 40% outage
21:37-22:40 UTC - 80% outage
• Timeframe: ~8 hours, from 15:00 UTC to ~23:00 UTC on Monday, May 22, 2023.
• Scope: Global
• Components: Console Bidder
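As a rough illustration of the impact figures above, the sketch below totals the outage minutes per datacenter. The window times are copied from the list; the names `OUTAGES`, `minutes`, and `total_outage_minutes` are hypothetical and not part of any Xandr tooling.

```python
from datetime import datetime

# Outage windows (UTC, same day) per datacenter, copied from the impact list.
OUTAGES = {
    "FRA1": [("19:52", "20:35")],
    "AMS3": [("19:52", "20:40"), ("21:37", "22:32")],
    "NYM2": [("19:52", "20:50"), ("20:50", "21:05"), ("21:37", "22:40")],
}


def minutes(start: str, end: str) -> int:
    """Length of a same-day HH:MM window in whole minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def total_outage_minutes(windows) -> int:
    """Sum the lengths of all windows for one datacenter."""
    return sum(minutes(start, end) for start, end in windows)


for dc, windows in OUTAGES.items():
    print(dc, total_outage_minutes(windows))
# FRA1 43
# AMS3 103
# NYM2 136
```

These totals count every window equally regardless of the reported outage percentage.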

TIMELINE (UTC)
• 2023-05-22 15:00: The updates to most console bidder objects stopped.
• 2023-05-22 17:47: Escalated to an engineer by our automated monitoring tool.
• 2023-05-22 18:51: IM Ticket Created.
• 2023-05-22 19:00: Data SLA of 4hrs was breached.
• 2023-05-22 19:54: First bidder outage occurred in datacenters NYM2, AMS3, and FRA1.
• 2023-05-22 20:30: Engineers began reverting the related commit.
• 2023-05-22 21:37: Second bidder outage occurred in datacenters NYM2 and AMS3.
• 2023-05-22 22:30: Console bidding fully recovered and updates to bidder objects started to propagate gradually.
• 2023-05-23 02:30: All bidder objects were fully up to date. The engineers continued to monitor stability.
• 2023-05-23 03:03: Incident Resolved. Clients were notified accordingly.
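The relationship between the first and fourth timeline entries can be checked with a small sketch: if updates stop at 15:00 UTC and the data SLA is 4 hours, the breach falls at 19:00 UTC. The function names below are hypothetical and only illustrate the arithmetic; they do not describe how Xandr's monitoring actually works.

```python
from datetime import datetime, timedelta, timezone

DATA_SLA = timedelta(hours=4)  # 4-hour data SLA stated in the timeline


def sla_breach_time(last_update: datetime, sla: timedelta = DATA_SLA) -> datetime:
    """Moment at which data last refreshed at `last_update` exceeds the SLA."""
    return last_update + sla


def is_stale(last_update: datetime, now: datetime, sla: timedelta = DATA_SLA) -> bool:
    """True once the age of the data has reached the SLA."""
    return now - last_update >= sla


last_update = datetime(2023, 5, 22, 15, 0, tzinfo=timezone.utc)  # updates stopped
escalated = datetime(2023, 5, 22, 17, 47, tzinfo=timezone.utc)   # escalation

print(sla_breach_time(last_update).isoformat())  # 2023-05-22T19:00:00+00:00
print(escalated - last_update)                   # 2:47:00
print(is_stale(last_update, escalated))          # False
```

By this arithmetic the escalation came 2 hours 47 minutes after updates stopped, before the SLA was breached.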

CAUSE ANALYSIS
A planned release introduced a bug that caused a data stall and a subsequent bidder outage in datacenters NYM2, AMS3, and FRA1. The data stall left the majority of console bidder objects without updates from 15:00 UTC to ~23:00 UTC (22:30 UTC for NYM2/AMS3 and 23:15 UTC for FRA1/LAX1/SIN3), which breached the 4-hour data SLA.

RESOLUTION STEPS
Engineers reverted the modifications to a stable version and additionally deployed a hotfix, which addressed the issue.

FOLLOW-UP ITEMS
Revised and enhanced the release process so that new features are tested on a smaller set of instances before a full release is deployed, as a way to evaluate their performance and functionality. This is intended to prevent similar incidents in the future.
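A minimal sketch of what testing on a smaller set of instances first could look like is shown below. Everything in it is an assumption made for illustration: the `staged_rollout` function, the `deploy` and `healthy` callbacks, the instance names, and the canary size of 2 are hypothetical and do not describe Xandr's actual release tooling.

```python
from __future__ import annotations

from typing import Callable, Sequence


def staged_rollout(
    instances: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    canary_size: int = 2,
) -> list[str]:
    """Deploy to a small canary subset first; continue only if all canaries are healthy.

    Returns the instances that received the release. If the result is shorter
    than `instances`, the rollout stopped at the canary stage and the caller
    should revert those canaries.
    """
    canaries = list(instances[:canary_size])
    rest = list(instances[canary_size:])

    for inst in canaries:
        deploy(inst)
    if not all(healthy(inst) for inst in canaries):
        return canaries  # halt before touching the rest of the fleet
    for inst in rest:
        deploy(inst)
    return canaries + rest


# Hypothetical fleet of 40 instances where the first canary reports unhealthy.
fleet = [f"bidder-{i:02d}" for i in range(40)]
deployed: list[str] = []
touched = staged_rollout(fleet, deployed.append, lambda inst: inst != "bidder-00")
print(len(touched))  # 2
```

In this sketch a failed health check limits the release to the two canary instances instead of the whole fleet.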

Posted May 25, 2023 - 20:09 UTC

Resolved
This incident has been resolved.
Posted May 23, 2023 - 03:01 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 23, 2023 - 02:00 UTC
Update

We have identified the following issue:

  • Component(s): API, UI
  • Impact(s):
    • Updates/changes to some objects may take longer to take effect globally
  • Not Impacted:
    • Reporting
  • Geolocation(s): Global (Global)

Ad Serving and Bidding have been restored. Our engineers are actively working towards a resolution to the object update issue, and we will provide an update as soon as possible. Thank you for your patience.

Posted May 23, 2023 - 00:28 UTC
Identified

We have identified the following issue:

  • Component(s): API, UI, Ad Serving, Bidding
  • Impact(s):
    • Updates/changes to some objects may take longer to take effect globally
    • Ad Serving and Bidding are degraded in the AMS and NYM datacenters
  • Not Impacted:
    • Reporting
  • Geolocation(s): Global (Global)

Our engineers are actively working towards a resolution to both issues and we will provide an update as soon as possible. Thank you for your patience.

Posted May 22, 2023 - 22:13 UTC
Monitoring

We are monitoring the following issue:

  • Component(s): API, UI
  • Impact(s):
    • Updates/changes to some objects may take longer to take effect
  • Not Impacted:
    • Reporting
  • Geolocation(s): Global (Global)

Ad Serving and Bidding have been restored. Our engineers are actively working towards a resolution to the object update issue, and we will provide an update as soon as possible. Thank you for your patience.

Posted May 22, 2023 - 21:36 UTC
Investigating

We are investigating the issue and will provide an update as soon as more information is available. Thank you for your patience.

Posted May 22, 2023 - 20:45 UTC
Update

We have identified the following issue:

  • Component(s): API, UI, Bidding, Ad Serving
  • Impact(s):
    • Updates/changes to some objects may take longer to take effect
    • Bidding and Ad Serving may be degraded
  • Not Impacted:
    • Reporting
  • Geolocation(s): Global (Global)

Our engineers are actively working towards a resolution, and we will provide an update as soon as possible. Thank you for your patience.

Posted May 22, 2023 - 20:24 UTC
Identified

We have identified the following issue:

  • Component(s): API, UI
  • Impact(s):
    • Updates not immediately reflected for some objects
  • Not Impacted:
    • Ad Serving
    • Bidding
  • Geolocation(s): Global (Global)

Our engineers are actively working towards a resolution, and we will provide an update as soon as possible. Thank you for your patience.

Posted May 22, 2023 - 19:18 UTC