Ad Serving Degraded
Incident Report for Xandr
Postmortem

Incident Summary

From approximately 00:32 UTC to 03:30 UTC on Tuesday, Jan 09, 2024, a platform-wide ad-serving outage impacted all regions, services, and subscriptions.



Incident Impact

  • Nature of Impact(s): Changes were not reflected for certain objects, particularly sell-side objects and creatives.
  • Incident Duration: ~2.97 hours (2 h 58 min), from 00:32 UTC to 03:30 UTC on Tuesday, Jan 09, 2024.
  • Scope: Global
  • Components: Ad Serving



Timeframe (UTC)

  • 2024-01-09 00:32: Incident started.
  • 2024-01-09 01:35: Impact started.
  • 2024-01-09 01:54: Escalated to an engineer; root cause identified.
  • 2024-01-09 02:24: Impact mitigated.
  • 2024-01-09 03:30: Incident resolved.
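
The intervals implied by the timestamps above can be checked with a short calculation. This is a minimal sketch: the dictionary keys and the helper `minutes_between` are illustrative names chosen here, not part of any Xandr tooling, and the only inputs are the five timestamps listed in the timeline.

```python
from datetime import datetime

# Timestamps copied from the Timeframe (UTC) list; the key names are illustrative.
TIMELINE = {
    "incident_started": "2024-01-09 00:32",
    "impact_started": "2024-01-09 01:35",
    "root_cause_identified": "2024-01-09 01:54",
    "impact_mitigated": "2024-01-09 02:24",
    "incident_resolved": "2024-01-09 03:30",
}


def minutes_between(start_key, end_key, timeline=TIMELINE):
    """Return the whole minutes elapsed between two timeline entries."""
    fmt = "%Y-%m-%d %H:%M"
    start = datetime.strptime(timeline[start_key], fmt)
    end = datetime.strptime(timeline[end_key], fmt)
    return int((end - start).total_seconds() // 60)


# Full incident window: 00:32 to 03:30 is 178 minutes (2 h 58 min).
incident_minutes = minutes_between("incident_started", "incident_resolved")
# Customer-facing impact: 01:35 to 02:24 is 49 minutes.
impact_minutes = minutes_between("impact_started", "impact_mitigated")
# Time from impact start to root cause identification: 19 minutes.
root_cause_minutes = minutes_between("impact_started", "root_cause_identified")
```

Dividing 178 minutes by 60 gives roughly 2.97 hours for the full incident window, while the impact itself lasted under an hour before mitigation.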




Root Cause

A code change shipped as part of a planned release, intended to reduce memory usage and accelerate batch processing, inadvertently introduced a bug: the new code failed to accommodate the distinctive characteristics of mediated creatives. When an internal user created a mediated creative on our test seat, that code path was activated and applications crashed globally. The crashes degraded ad-serving performance and led to a widespread outage on the platform during the specified impact window.
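
The failure mode described above, where one unexpected object takes down an entire batch job, is commonly guarded against by isolating failures per item. The following is a generic sketch of that pattern and not Xandr's actual code: the `Creative` class, its fields, `process_creative`, and `process_batch` are all hypothetical, and it assumes for illustration that a mediated creative lacks a field the batch path expects.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Creative:
    # Hypothetical shape; real creative objects are not described in this report.
    creative_id: int
    is_mediated: bool
    content: Optional[str]  # assumption: mediated creatives carry no inline content


def process_creative(creative: Creative) -> str:
    """Hypothetical per-item step that assumes `content` is always present."""
    # Raises AttributeError when content is None (the unhandled case).
    return creative.content.strip().lower()


def process_batch(creatives: List[Creative]) -> Tuple[Dict[int, str], List[int]]:
    """Process each item independently so one bad item cannot abort the batch."""
    processed: Dict[int, str] = {}
    quarantined: List[int] = []
    for creative in creatives:
        try:
            processed[creative.creative_id] = process_creative(creative)
        except Exception:
            # Set the failing item aside for review instead of crashing the process.
            quarantined.append(creative.creative_id)
    return processed, quarantined


batch = [
    Creative(1, False, " Banner "),
    Creative(2, True, None),  # the item that would crash an unguarded loop
    Creative(3, False, "Video"),
]
processed, quarantined = process_batch(batch)
```

In this sketch the mediated creative ends up in `quarantined` while the other two items are processed normally. Catching a broad `Exception` is a deliberate trade-off here: it favors keeping the service running over failing fast, so quarantined items would need separate alerting and follow-up.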



Resolution

The issue was resolved by deactivating the creatives with erroneous content to prevent further crashes and by updating the application configuration to correct the processing of mediated creatives. After the fix was developed and tested, all instances were upgraded with the new configuration and restarted, which resolved the issue.



Immediate measures taken to mitigate the Incident

  • To enhance our system's stability, our team has conducted a thorough analysis of whether the relevant codebase can be proactively updated to prevent memory errors while processing mediated creatives.
  • Furthermore, the team reviewed the sequence of events to identify why the validations impeded start-up in the data center during the rollout. The aim is to prevent a similar incident in the future.



Posted Feb 14, 2024 - 11:57 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Jan 09, 2024 - 20:16 UTC

Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Jan 09, 2024 - 16:07 UTC

Investigating

We are currently investigating the following issue:

  1. Component(s): Ad Serving
  2. Impact(s):
    • Updates not reflected for some objects, especially sell-side and creative
  3. Not Impacted:
    • Ad Serving
    • UI
    • API
  4. Geolocation(s): North America - East Coast (NYM2)

Status: Ad Serving has now been restored.

Posted Jan 09, 2024 - 03:28 UTC