Spike in Delivery on Line Items

Incident Report for Xandr

Postmortem



Incident Summary

From approximately 21:15 UTC on Monday, Sep 11 to 05:00 UTC on Tuesday, Sep 12, 2023, users experienced an unexpected increase in spend on a small number of line items.



Incident Impact

  • Nature of Impact(s): Increase in delivery across a small number of line items
  • Timeframe: ~7.75 hours, from 21:15 UTC on Monday, Sep 11 to 05:00 UTC on Tuesday, Sep 12, 2023
  • Scope: Global
  • Components: Budget Delivery



Timeframe (UTC)

  • 2023-09-11 21:15: Incident Started.
  • 2023-09-12 00:25: Escalated to Engineer.
  • 2023-09-12 00:25: Changes rolled back.
  • 2023-09-12 05:00: Incident Resolved.

 


Cause Analysis

The issue was caused by a planned release intended to increase memory resources and improve the performance of the relevant endpoints. During the rollout, the endpoints failed to handle concurrent requests, which resulted in traffic drops and inaccurate daily budget calculations. As a result, users experienced an unexpected increase in spend during the impact window.
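The report does not describe the internals of the budget system, so the following is only an illustrative sketch of how lost spend updates could lead to overdelivery. It assumes a budget check that relies on recorded spend; `LineItemBudget`, `simulate`, and the `drop_every` parameter are hypothetical names, not part of any Xandr API.

```python
from dataclasses import dataclass


@dataclass
class LineItemBudget:
    """Hypothetical daily budget tracker; not Xandr's actual data model."""
    daily_budget: float
    recorded_spend: float = 0.0

    def record_spend(self, amount: float, delivered: bool = True) -> None:
        # delivered=False models a spend report that a failing endpoint dropped.
        if delivered:
            self.recorded_spend += amount

    def can_serve(self) -> bool:
        # Serving continues while recorded spend is below the daily budget.
        return self.recorded_spend < self.daily_budget


def simulate(daily_budget: float, cost: float, drop_every: int,
             max_requests: int = 10_000) -> float:
    """Return actual spend when every drop_every-th spend report is lost.

    drop_every=0 means no reports are lost.
    """
    budget = LineItemBudget(daily_budget)
    actual_spend = 0.0
    for i in range(1, max_requests + 1):
        if not budget.can_serve():
            break
        actual_spend += cost  # the impression is served and billed
        dropped = drop_every > 0 and i % drop_every == 0
        budget.record_spend(cost, delivered=not dropped)
    return actual_spend
```

Under these assumptions, `simulate(100.0, 1.0, 0)` stops at the budget of 100, while `simulate(100.0, 1.0, 2)` keeps serving until actual spend reaches 199, because only half of the spend is ever recorded.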



Resolution Steps

The issue was resolved by reverting the changes to a stable version and triggering the jobs needed to reprocess the data and update budgets for all line items.

 


Follow-up Items

  • To reduce the risk of similar incidents and unintended failures, engineers reviewed the existing tests and added unit and integration test coverage.
  • Reviewed the relevant alerts and metrics in order to add appropriate threshold parameters.
  • Created a staging instance that replicates production traffic, so that new features can be tested with less risk of disruption and similar incidents can be avoided in the future.
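As a rough illustration of the kind of threshold parameter mentioned in the follow-up items, the sketch below flags a delivery spike when observed spend exceeds expected spend by a configurable ratio. The function name, its parameters, and the default ratio of 1.5 are assumptions made for this example and do not reflect the actual alert configuration.

```python
def delivery_spike_alert(observed_spend: float,
                         expected_spend: float,
                         threshold_ratio: float = 1.5) -> bool:
    """Return True when observed spend exceeds expected spend by threshold_ratio.

    Hypothetical check; the real alerts and thresholds are not described in the report.
    """
    if expected_spend <= 0:
        # Any spend against a zero expectation is treated as anomalous.
        return observed_spend > 0
    return observed_spend / expected_spend > threshold_ratio
```

A ratio-based check like this is easy to tune per line item, though a production alert would likely also need a minimum spend floor to avoid noise on very small budgets.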


Posted Sep 26, 2023 - 18:24 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Sep 12, 2023 - 21:33 UTC

Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Sep 12, 2023 - 16:50 UTC

Investigating

We are currently investigating the following issue:

  1. Component(s): Ad Serving, Bidding
  2. Impact(s):
    • Increase in delivery across all line items
  3. Not Impacted:
    • UI
    • API
  4. Geolocation(s): Global

Status: We will provide an update as soon as more information is available. Thank you for your patience.

Posted Sep 12, 2023 - 16:27 UTC