LLD and Reporting: Partial Impressions and Clicks Data
Incident Summary
From approximately 10:07 UTC on Monday, Jul 15 to 17:47 UTC on Thursday, Jul 18, 2024, users experienced delays when retrieving certain LLD feeds and reporting data.
Incident Impact
- Nature of Impact(s):
- The billing report for the period from 07/15 hour 09 to 07/18 hour 12 did not reflect the associated buyer charges.
- Increased latency, occasional timeouts, and errors were observed in the affected services during the impact window.
- Some data may have been incomplete or inaccurate until reprocessing was completed.
- Data was temporarily unavailable through the impacted services.
- Report retrieval times were slower than usual.
- Incident Duration: ~79 hours 40 minutes (10:07 UTC on Monday, Jul 15 to 17:47 UTC on Thursday, Jul 18, 2024).
- Scope: Global
- Components:
- Log Level Data
- Reporting
- Billing
Timeframe (UTC)
- 2024-07-15 10:07: Incident started.
- 2024-07-18 13:55: Issue detected.
- 2024-07-18 13:56: Escalated to engineer.
- 2024-07-18 14:17: Root cause identified.
- 2024-07-18 17:47: Incident resolved.
Root Cause
The issue stemmed from a code change intended to introduce new discount logic in the billing pipeline, specifically applying a discount to a designated term_id. The change inadvertently caused incorrect buyer charge calculations, which resulted in incomplete buyer charge data for auctions and stale reporting data, impairing users' ability to access the Reporting and LLD feeds during the impact window.
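For illustration only, the sketch below shows how a term_id-scoped discount is typically applied at the row level. The names and values (apply_term_discount, DISCOUNTED_TERM_ID, the 10% rate) are hypothetical assumptions and not taken from the actual pipeline code, which is not included in this report; the point is that the discount must be scoped strictly to the designated term_id, whereas the incident-causing change effectively produced incorrect buyer charges for rows the new logic should not have affected.

```python
from dataclasses import dataclass

# Hypothetical values for illustration; the real pipeline code is not shown here.
DISCOUNTED_TERM_ID = 42          # the designated term_id targeted by the new logic
DISCOUNT_RATE = 0.10             # assumed discount rate

@dataclass
class AuctionCharge:
    auction_id: str
    term_id: int
    buyer_charge: float          # gross charge before any discount

def apply_term_discount(charge: AuctionCharge) -> float:
    """Apply the discount only to the designated term_id.

    A correct implementation scopes the discount to the target term_id and
    leaves every other row's buyer charge untouched.
    """
    if charge.term_id == DISCOUNTED_TERM_ID:
        return charge.buyer_charge * (1 - DISCOUNT_RATE)
    return charge.buyer_charge   # all other term_ids pass through unchanged
```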
Resolution
- Once the issue was identified, the application was rolled back to the previous stable version. Following the rollback, reporting behaviour was confirmed to be functioning as expected and the feeds were retrieved successfully in the subsequent hours.
- Additionally, the necessary jobs were manually triggered so that the reporting numbers were corrected in the following processing cycles (a sketch of this backfill is shown below). The system then resumed normal operations as subsequent runs processed successfully.
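As a rough illustration of the manual backfill, the sketch below iterates over the affected hourly partitions (07/15 hour 09 through 07/18 hour 12 UTC, per the impact section) and resubmits one job per partition. trigger_reporting_job is a hypothetical placeholder; the actual job-scheduling interface is not described in this report.

```python
from datetime import datetime, timedelta

# Hypothetical helper standing in for the real job-submission interface.
def trigger_reporting_job(hour: datetime) -> None:
    print(f"resubmitting reporting/billing job for partition {hour:%Y-%m-%d %H}:00 UTC")

# Affected window from the impact section: 2024-07-15 09:00 to 2024-07-18 12:00 UTC.
start = datetime(2024, 7, 15, 9)
end = datetime(2024, 7, 18, 12)

hour = start
while hour <= end:
    trigger_reporting_job(hour)   # each affected hourly partition is reprocessed once
    hour += timedelta(hours=1)
```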
Immediate Measures Taken to Mitigate the Incident
- Enabled automated cross-data-centre checks to ensure comprehensive inspection of dashboards and alerts, and integrated new alerts with tuned threshold parameters into our monitoring tools for improved oversight.
- Revisited existing runbooks with a focus on cluster resource management and cluster-specific job configurations. The team also optimized the purge jobs to run over defined block periods rather than successive hours (see the sketch after this list), improving efficiency and reducing the chance of errors during the purge process.
- Expanded the resource pool for critical applications, including storage expansion to retain 2x the data on the clusters, and optimized data retention policies to reduce reliance on cold storage.
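As referenced above, the sketch below illustrates the idea of running purge jobs over defined block periods instead of launching one run per successive hour. The 6-hour block size and all function names are assumptions for illustration only; the report does not specify the actual block period.

```python
from datetime import datetime, timedelta
from typing import Iterator, List

BLOCK_HOURS = 6   # assumed block size; not specified in the report

def hourly_partitions(start: datetime, end: datetime) -> Iterator[datetime]:
    """Yield each hourly partition in [start, end]."""
    hour = start
    while hour <= end:
        yield hour
        hour += timedelta(hours=1)

def purge_blocks(start: datetime, end: datetime) -> List[List[datetime]]:
    """Group hourly partitions into fixed blocks so each purge run covers a
    defined block period instead of being launched for every successive hour."""
    blocks: List[List[datetime]] = []
    current: List[datetime] = []
    for hour in hourly_partitions(start, end):
        current.append(hour)
        if len(current) == BLOCK_HOURS:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

# Example: one purge run per 6-hour block over a day's partitions.
for block in purge_blocks(datetime(2024, 7, 15, 0), datetime(2024, 7, 15, 23)):
    print(f"purge run covering {block[0]:%H}:00-{block[-1]:%H}:59 UTC")
```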
These measures will lead to enhanced engineering procedures and more rigorous release protocols, aimed at ensuring greater stability and preventing similar incidents in the future.