Ad Serving Down
Incident Report for Xandr
Postmortem

Incident Summary

From approximately 18:20 UTC to 22:51 UTC on Tuesday, Apr 23, 2024, a platform-wide ad-serving outage impacted all regions, services, and subscriptions.



Incident Impact

  • Nature of Impact(s):
    • All end users globally experienced a degraded experience, with increased latency during page loads.
    • Ad serving was unavailable globally for ~4 hours.
    • Decrease in requests sent to bidder customers.
    • Drop in delivery on external supply.
    • Users were unable to preview and audit creatives.
  • Incident Duration: ~4.5 hours (4 hours 31 minutes), from 18:20 UTC to 22:51 UTC on Tuesday, Apr 23, 2024.
  • Scope: Global
  • Components: Ad Serving




Timeframe (UTC)

  • 2024-04-23 18:20: Incident started.
  • 2024-04-23 18:26: Escalated to engineer.
  • 2024-04-23 20:41: Root cause identified.
  • 2024-04-23 22:51: Incident resolved.
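
The intervals between these milestones can be checked with a short calculation. The sketch below uses only the four timestamps listed above; the dictionary keys and helper names are illustrative and are not part of any Xandr tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) copied from the timeline above; the keys are illustrative labels.
timeline = {
    "started": "2024-04-23 18:20",
    "escalated": "2024-04-23 18:26",
    "root_cause": "2024-04-23 20:41",
    "resolved": "2024-04-23 22:51",
}


def minutes_between(start_key: str, end_key: str) -> int:
    """Return the whole minutes elapsed between two timeline entries."""
    start = datetime.strptime(timeline[start_key], FMT)
    end = datetime.strptime(timeline[end_key], FMT)
    return int((end - start).total_seconds() // 60)


def as_hours_minutes(minutes: int) -> str:
    """Format a minute count as 'Hh MMm'."""
    hours, rest = divmod(minutes, 60)
    return f"{hours}h {rest:02d}m"


durations = {
    "time to escalate": minutes_between("started", "escalated"),
    "time to root cause": minutes_between("started", "root_cause"),
    "root cause to resolution": minutes_between("root_cause", "resolved"),
    "total duration": minutes_between("started", "resolved"),
}

for label, minutes in durations.items():
    print(f"{label}: {as_hours_minutes(minutes)}")
```

Run as written, this reports 0h 06m to escalate, 2h 21m to identify the root cause, a further 2h 10m to resolve, and 4h 31m in total.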



Root Cause

The issue stemmed from a code change shipped in a scheduled release of our client-side SDK. The release fixed a previously undetected SDK error that had kept two HTTP endpoints from being used. At 18:20 UTC, the new SDK version was deployed, and traffic began reaching those two endpoints. That traffic exercised a code path in our ad-serving application that contained a latent bug. The bug caused unrecoverable memory corruption, which resulted in a platform-wide ad-serving outage lasting ~4 hours. As a result, users could not preview or audit creatives, the volume of requests sent to bidder customers dropped, and page-load latency increased during the impact window.



Resolution

The issue was resolved in two steps:

  • Once the issue was identified, all traffic to the two endpoints associated with the bug was blocklisted at the load balancing layer. This prevented further requests from reaching those endpoints and mitigated the impact of the error.
  • A configuration adjustment and a hotfix for the ad-serving application were then deployed to deactivate the problematic endpoint. This allowed the application to resume normal operation in subsequent runs and ensured that all affected services were fully restored.
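
As a rough illustration of the first step, the sketch below shows how a path blocklist placed in front of a backend might reject requests before they reach faulty code. It is a minimal sketch under stated assumptions, not the actual load balancer configuration: the class, the `route` function, the endpoint paths, and the choice of a 503 response are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

Response = tuple[int, str]  # (HTTP status code, body)


def normalize(path: str) -> str:
    """Drop the query string and any trailing slash so variants match."""
    return path.split("?", 1)[0].rstrip("/") or "/"


@dataclass
class EndpointBlocklist:
    """Hypothetical set of endpoint paths that must not reach the backend."""

    blocked_paths: set[str] = field(default_factory=set)

    def block(self, path: str) -> None:
        self.blocked_paths.add(normalize(path))

    def is_blocked(self, path: str) -> bool:
        return normalize(path) in self.blocked_paths


def route(path: str, blocklist: EndpointBlocklist,
          backend: Callable[[str], Response]) -> Response:
    """Reject blocklisted paths up front; forward everything else."""
    if blocklist.is_blocked(path):
        # 503 is an assumed choice; the report does not say what was returned.
        return 503, "endpoint temporarily disabled"
    return backend(path)


# Usage with placeholder paths (the real endpoint names are not published).
blocklist = EndpointBlocklist()
blocklist.block("/hypothetical/endpoint-a/")
blocklist.block("/hypothetical/endpoint-b")


def healthy_backend(path: str) -> Response:
    return 200, f"served {path}"


print(route("/hypothetical/endpoint-a?x=1", blocklist, healthy_backend))
print(route("/other", blocklist, healthy_backend))
```

Filtering at the edge in this way needs no application redeploy, which is one reason such a step would typically come before a hotfix.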



Immediate measures taken to mitigate the Incident

  • To improve system stability, our team has analyzed options for quickly and proactively addressing overloaded load balancers by redistributing traffic evenly across the remaining load balancers.
  • We have investigated potential network issues that may exacerbate latency problems, and we continue to monitor for a sustained decrease in latency over time.
  • Our team has also investigated the root cause of the high latency, including potential network issues and client-side connection management. We have implemented a process to objectively evaluate our platform for stability and scalability concerns. The results will lead to improved engineering procedures and more rigorous release protocols, aimed at ensuring greater stability and preventing similar incidents in the future.

Posted May 02, 2024 - 14:27 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Apr 24, 2024 - 17:15 UTC

Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Apr 23, 2024 - 22:49 UTC

Identified

We have identified the cause of the issue, and our engineers are actively working towards a resolution. We will provide an update as soon as possible. Thank you for your patience.

Posted Apr 23, 2024 - 21:58 UTC

Investigating

We are currently investigating the following issue:

  1. Component(s): Ad Serving
  2. Impact(s):
    • Decrease in requests sent to Bidder customers
    • Unable to preview creatives
    • Drop in delivery on external supply
    • Creative audit is down
  3. Not Impacted:
    • UI
    • API
  4. Geolocation(s): Global

Status: We will provide an update as soon as more information is available. Thank you for your patience.

Posted Apr 23, 2024 - 19:32 UTC