LLD and Reporting: Partial Impressions and Clicks Data
Incident Summary
From approximately 10:07 UTC on Monday, Jul 15 to 17:47 UTC on Thursday, Jul 18, 2024, users experienced delays when retrieving certain LLD feeds and reporting data.
Incident Impact
- Nature of Impact(s):
- The billing report for the period from 07/15 hour 09 to 07/18 hour 12 did not reflect the associated buyer charges.
- Increased latency, occasional timeouts, and errors were observed in the affected services during the impact window.
- Some data may have been incomplete or inaccurate until reprocessing was completed.
- Data was temporarily unavailable through the impacted services.
- Report retrieval times were slower than usual.
- Incident Duration: ~79 hours 40 minutes (10:07 UTC on Monday, Jul 15 to 17:47 UTC on Thursday, Jul 18, 2024).
- Scope: Global
- Components:
- Log Level Data
- Reporting
- Billing
Timeframe (UTC)
- 2024-07-15 10:07: Incident started.
- 2024-07-18 13:55: Issue detected.
- 2024-07-18 13:56: Escalated to engineer.
- 2024-07-18 14:17: Root cause identified.
- 2024-07-18 17:47: Incident resolved.
Root Cause
The issue stemmed from a code change intended to introduce new discount logic in the billing pipeline, specifically applying a discount to a designated term_id. The change inadvertently caused incorrect buyer charge calculations, which resulted in incomplete buyer charge data for auctions and stale reporting data, impairing users' ability to access the Reporting and LLD feeds during the impact window.
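For illustration only, the sketch below shows how a term_id-scoped discount is typically applied at the row level. The names and values (apply_term_discount, DISCOUNTED_TERM_ID, the 10% rate) are hypothetical assumptions and not taken from the actual pipeline code, which is not included in this report; the point is that the discount must be scoped strictly to the designated term_id, whereas the incident-causing change effectively produced incorrect buyer charges for rows the new logic should not have affected.

```python
from dataclasses import dataclass

# Hypothetical values for illustration; the real pipeline code is not shown here.
DISCOUNTED_TERM_ID = 42          # the designated term_id targeted by the new logic
DISCOUNT_RATE = 0.10             # assumed discount rate

@dataclass
class AuctionCharge:
    auction_id: str
    term_id: int
    buyer_charge: float          # gross charge before any discount

def apply_term_discount(charge: AuctionCharge) -> float:
    """Apply the discount only to the designated term_id.

    A correct implementation scopes the discount to the target term_id and
    leaves every other row's buyer charge untouched.
    """
    if charge.term_id == DISCOUNTED_TERM_ID:
        return charge.buyer_charge * (1 - DISCOUNT_RATE)
    return charge.buyer_charge   # all other term_ids pass through unchanged
```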
Resolution
- Once the issue was identified, the application was rolled back to the previous stable version. Following the rollback, reporting behaviour was confirmed to be functioning as expected and the feeds were retrieved successfully in the subsequent hours.
- Additionally, the necessary jobs were manually triggered so that the reporting numbers were corrected in the following processing cycles (a sketch of this backfill is shown below). The system then resumed normal operations as subsequent runs processed successfully.
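As a rough illustration of the manual backfill, the sketch below iterates over the affected hourly partitions (07/15 hour 09 through 07/18 hour 12 UTC, per the impact section) and resubmits one job per partition. trigger_reporting_job is a hypothetical placeholder; the actual job-scheduling interface is not described in this report.

```python
from datetime import datetime, timedelta

# Hypothetical helper standing in for the real job-submission interface.
def trigger_reporting_job(hour: datetime) -> None:
    print(f"resubmitting reporting/billing job for partition {hour:%Y-%m-%d %H}:00 UTC")

# Affected window from the impact section: 2024-07-15 09:00 to 2024-07-18 12:00 UTC.
start = datetime(2024, 7, 15, 9)
end = datetime(2024, 7, 18, 12)

hour = start
while hour <= end:
    trigger_reporting_job(hour)   # each affected hourly partition is reprocessed once
    hour += timedelta(hours=1)
```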
Immediate Measures Taken to Mitigate the Incident
- Enabled automated cross-data-centre checks to ensure comprehensive inspection of dashboards and alerts, and integrated new alerts with tuned threshold parameters into our monitoring tools for improved oversight.
- Revisited existing runbooks with a focus on cluster resource management and cluster-specific job configurations. The team also optimized the purge jobs to run over defined block periods rather than successive hours (see the sketch after this list), improving efficiency and reducing the chance of errors during the purge process.
- Expanded the resource pool for critical applications, including storage expansion to retain 2x the data on the clusters, and optimized data retention policies to reduce reliance on cold storage.
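As referenced above, the sketch below illustrates the idea of running purge jobs over defined block periods instead of launching one run per successive hour. The 6-hour block size and all function names are assumptions for illustration only; the report does not specify the actual block period.

```python
from datetime import datetime, timedelta
from typing import Iterator, List

BLOCK_HOURS = 6   # assumed block size; not specified in the report

def hourly_partitions(start: datetime, end: datetime) -> Iterator[datetime]:
    """Yield each hourly partition in [start, end]."""
    hour = start
    while hour <= end:
        yield hour
        hour += timedelta(hours=1)

def purge_blocks(start: datetime, end: datetime) -> List[List[datetime]]:
    """Group hourly partitions into fixed blocks so each purge run covers a
    defined block period instead of being launched for every successive hour."""
    blocks: List[List[datetime]] = []
    current: List[datetime] = []
    for hour in hourly_partitions(start, end):
        current.append(hour)
        if len(current) == BLOCK_HOURS:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

# Example: one purge run per 6-hour block over a day's partitions.
for block in purge_blocks(datetime(2024, 7, 15, 0), datetime(2024, 7, 15, 23)):
    print(f"purge run covering {block[0]:%H}:00-{block[-1]:%H}:59 UTC")
```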
These measures will lead to enhanced engineering procedures and more rigorous release protocols, aimed at ensuring greater stability and preventing similar incidents in the future.