Some Line Items Using Lifetime Budgets not Pacing from 04:00 UTC
Incident Report for Xandr
Postmortem

Lifetime-paced Line Items and Insertion Orders not spending since 04:00 UTC

Incident Summary 

From approximately 03:04 UTC to 18:00 UTC on Wednesday, Oct 30, 2024, users experienced a significant drop in spend for certain buy-side objects.

Incident Impact 

  • Nature of Impact(s):
    • Reduced or no spend across some lifetime-paced Line Items (LIs) and Insertion Orders (IOs) that were nearing their budget limits.
    • Daily-paced objects were unaffected.
  • Incident Duration: ~14.93 hours (03:04 UTC to 18:00 UTC on Wednesday, Oct 30, 2024).
  • Scope: Global
  • Components:
    • Ad Serving
    • Buy-side pages

Timeframe (UTC) 

  • 2024-10-30 03:04: Incident started.
  • 2024-10-30 11:16: Issue detected.
  • 2024-10-30 13:45: Escalated to engineering.
  • 2024-10-30 15:48: Root cause identified.
  • 2024-10-30 18:00: Impact mitigated.

Root Cause

The root cause was an issue in the application responsible for replicating streaming spend data to downstream tables. The application unexpectedly re-read older data and wrote it back into the database, inflating the calculated spend. The budget distribution controller then used these inflated totals when computing daily budgets: wherever the inflated spend exceeded the allocated lifetime budget, the daily budget was set to zero, halting spend for the affected lifetime-paced objects during the impact window.
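
To illustrate the failure mode, here is a minimal sketch of a lifetime-pacing budget calculation; the function and variable names are hypothetical, not Xandr's actual controller logic. Because the daily budget is derived from the remaining lifetime budget, duplicated spend rows can push the remaining budget to zero and halt delivery:

```python
from datetime import date

def daily_budget(lifetime_budget: float, recorded_spend: float,
                 today: date, end_date: date) -> float:
    """Derive today's budget from the remaining lifetime budget."""
    remaining = lifetime_budget - recorded_spend
    days_left = max((end_date - today).days + 1, 1)
    # Clamped at zero: if duplicated rows inflate recorded_spend past
    # the lifetime budget, the daily budget becomes 0 and spend halts.
    return max(remaining / days_left, 0.0)

# Accurate spend: the object keeps pacing toward its lifetime budget.
print(daily_budget(10_000, 9_000, date(2024, 10, 30), date(2024, 10, 31)))   # 500.0
# Inflated spend (duplicated rows): daily budget drops to zero.
print(daily_budget(10_000, 11_500, date(2024, 10, 30), date(2024, 10, 31)))  # 0.0
```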

 

 

Resolution  

The issue was resolved through a series of targeted measures:

  • The affected pods were manually restarted, and the inflated spend data was manually removed from the impacted tables to correct the discrepancies (see the sketch below).
  • To mitigate high memory usage and stabilize performance, the relevant controller was also restarted, allowing the application to catch up by reprocessing data in the budget pipeline. This cleared the inflated spend figures and let objects resume spending, with recovery confirmed by alerts tracking member-level spend trends.

These measures restored normal application operation in subsequent runs, with typical processing rates and a stable state for the impacted controller.
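
As an illustration of the data-cleanup step, the sketch below deduplicates re-read spend events so that per-object totals are no longer inflated. The row layout and names are hypothetical, not the actual schema:

```python
# Hypothetical spend rows: (event_id, object_id, spend). Re-reading old
# stream data produces duplicate event_ids that inflate spend totals.
rows = [
    ("evt-1", "li-42", 120.0),
    ("evt-2", "li-42", 80.0),
    ("evt-1", "li-42", 120.0),  # duplicate introduced by re-read data
]

def dedupe(rows):
    """Keep only the first occurrence of each event_id."""
    seen, clean = set(), []
    for event_id, object_id, spend in rows:
        if event_id not in seen:
            seen.add(event_id)
            clean.append((event_id, object_id, spend))
    return clean

total = sum(spend for _, obj, spend in dedupe(rows) if obj == "li-42")
print(total)  # 200.0 rather than the inflated 320.0
```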

 

 

Immediate measures taken to mitigate the incident

To ensure the stability and reliability of our platform, our team has implemented a series of immediate measures to address the incident and minimize any further impact:

  • We have analyzed options for proactive alerting on high volumes of messages being read from and written to the application’s queue, enabling us to detect potential bottlenecks in real time.
  • To improve visibility and traceability, we have enhanced our logging and introduced a committed-offset metric for data synchronization. This will speed up diagnostics and confirm that synchronization is proceeding as expected (see the sketch after this list).
  • To ensure stability across all deployments, we have re-released all data syncs, mitigating any instability from prior issues and promoting smoother data flow. Additionally, a hotfix was deployed to properly commit offsets, ensuring accurate data synchronization.
  • We have also updated the application’s configuration to expedite topic cleanup, limiting data retention. This reduces the chance of old data being re-read and prevents unnecessary reprocessing.
  • We have reviewed the relevant alerts and added new alerts with better thresholds to our automated monitoring tool to track the controller queue, allowing prompt identification and resolution of processing delays. We have also introduced validations that check for data discrepancies and verify synchronization integrity; if inconsistencies are found, an alert immediately notifies the team for further investigation. Finally, we have updated our runbooks with detailed instructions for reprocessing discrepant data, ensuring swift resolution of any future data-synchronization issues.
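
As referenced in the list above, the sketch below shows one way a committed-offset metric can back an alert on synchronization lag. The names and threshold are hypothetical assumptions, not the actual monitoring code:

```python
LAG_ALERT_THRESHOLD = 10_000  # assumed threshold; tuned per deployment

def committed_offset_lag(log_end_offset: int, committed_offset: int) -> int:
    """Gap between the newest message and the last committed offset.

    If offsets are not committed properly, this gap keeps growing even
    while the application reads, signaling re-reads or stalls.
    """
    return max(log_end_offset - committed_offset, 0)

def check_partitions(offsets: dict) -> list:
    """Return alert messages for partitions whose lag breaches the threshold."""
    alerts = []
    for partition, (log_end, committed) in offsets.items():
        lag = committed_offset_lag(log_end, committed)
        if lag > LAG_ALERT_THRESHOLD:
            alerts.append(f"{partition}: committed-offset lag {lag} "
                          f"exceeds {LAG_ALERT_THRESHOLD}")
    return alerts

print(check_partitions({
    "spend-sync-0": (1_250_000, 1_249_400),  # healthy: lag 600
    "spend-sync-1": (1_250_000, 1_180_000),  # alerting: lag 70,000
}))
```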

 

These efforts will strengthen our engineering procedures and release protocols, with the aim of ensuring greater stability and averting similar incidents in the future.

Posted Nov 19, 2024 - 16:07 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Oct 30, 2024 - 23:53 UTC

Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Oct 30, 2024 - 18:32 UTC

Update

We are continuing to work on a fix for this issue.

Posted Oct 30, 2024 - 18:06 UTC

Identified

We have identified the cause of the issue, and our engineers are actively working towards a resolution. We will provide an update as soon as possible. Thank you for your patience.

Posted Oct 30, 2024 - 17:47 UTC
Investigating

We are currently investigating the following issue:

  1. Component(s): Ad Serving, Buy-side pages
  2. Impact(s):
    • Drop in Line Item Pacing for LIs Using Lifetime Budgets
  3. Not Impacted:
    • API
  4. Geolocation(s): Global

Status: We will provide an update as soon as more information is available. Thank you for your patience.

Posted Oct 30, 2024 - 12:21 UTC