Network Analytics/Video Analytics Reporting and LLD Feed Delays

Incident Report for Xandr

Postmortem




Incident Summary

From approximately 16:11 UTC on Tuesday, Sep 19 to 23:12 UTC on Wednesday, Sep 20, 2023, delivery of LLD data was delayed and reporting data was not accessible through the UI.



Incident Impact

  • Nature of Impact(s): Data latency. As a consequence, reports and feeds for some hours were delivered late.
  • Timeframe: ~31 hours (31 hours 1 minute). 16:11 UTC on Tuesday, Sep 19 to 23:12 UTC on Wednesday, Sep 20, 2023.
  • Scope: Global
  • Components: Reporting and LLD
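
The length of the impact window follows directly from the two timestamps reported for this incident. As a minimal sketch (the variable names are illustrative only), the arithmetic can be checked like this:

```python
from datetime import datetime, timezone

# Start and end of the impact window, as stated in this report (UTC).
start = datetime(2023, 9, 19, 16, 11, tzinfo=timezone.utc)
end = datetime(2023, 9, 20, 23, 12, tzinfo=timezone.utc)

window = end - start
total_seconds = int(window.total_seconds())

# Split the window into whole hours and leftover minutes.
hours, remainder = divmod(total_seconds, 3600)
minutes = remainder // 60

print(f"{hours} h {minutes} min")  # 31 h 1 min
```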

 


Timeframe (UTC)

  • 2023-09-19 16:11: Incident started. The application failed its health check.
  • 2023-09-19 17:08: The scheduler was restarted because it had become unresponsive (hung).
  • 2023-09-19 19:31: An IM was created.
  • 2023-09-19 21:30: The scheduler achieved stability on the node that had previously experienced a failure.
  • 2023-09-20 23:12: Incident Resolved.



Cause Analysis

The issue was caused by ongoing database query locks. It originated in a non-responsive scheduler, which failed to schedule jobs and process messages from the queue. Further investigation revealed that threads were stuck holding a backlog of unacknowledged messages because database operations had no timeout and could wait indefinitely, which caused lock contention. As a result, delivery of LLD data was delayed and reporting data was not accessible through the UI during the impact window.

 

Resolution Steps

The issue resolved on its own once no backlog remained in the messaging queue for the system to process. The code running on the failed node then regained stability.



Follow-up Items

  • Inspected the relevant jobs and added metrics around DB operations to enable slow query logging.
  • Revisited and enabled duration timeouts around DB operations. This should prevent similar occurrences that could otherwise result in a similar IM.
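
To illustrate the two follow-up items above, here is a minimal sketch of a bounded lock wait combined with slow-operation logging. It is not the production fix: the report does not name the database or the client library, so SQLite (from the Python standard library) stands in, and the `jobs` table, the `timed_write` helper, and both thresholds are hypothetical.

```python
import logging
import os
import sqlite3
import tempfile
import time

log = logging.getLogger("db_ops")

SLOW_OP_SECONDS = 0.2  # hypothetical threshold for slow-operation logging


def timed_write(conn, sql, params=()):
    """Run one write, log it if it was slow, and report whether it succeeded."""
    start = time.monotonic()
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.OperationalError as exc:
        # Raised when the lock cannot be acquired within the connection timeout.
        log.warning("write gave up: %s", exc)
        return False
    finally:
        elapsed = time.monotonic() - start
        if elapsed >= SLOW_OP_SECONDS:
            log.warning("slow DB operation (%.2fs): %s", elapsed, sql)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "jobs.db")

    # One connection takes the write lock and keeps it, simulating contention.
    holder = sqlite3.connect(path, isolation_level=None)
    holder.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT)")
    holder.execute("BEGIN IMMEDIATE")
    holder.execute("INSERT INTO jobs (state) VALUES ('running')")

    # The worker waits at most 0.5 s for the lock instead of indefinitely.
    worker = sqlite3.connect(path, timeout=0.5)
    insert = "INSERT INTO jobs (state) VALUES (?)"
    blocked = timed_write(worker, insert, ("queued",))

    holder.execute("COMMIT")  # the lock is released here
    retried = timed_write(worker, insert, ("queued",))

    worker.close()
    holder.close()
    return blocked, retried
```

Calling `demo()` returns a pair of flags: the first write fails once the timeout expires rather than hanging, and the retry succeeds after the lock is released. The point of the design is that a failed, logged operation leaves the thread free to retry or surface an error, whereas an unbounded wait keeps it stuck holding its messages.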

Posted Oct 06, 2023 - 13:49 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Sep 20, 2023 - 06:38 UTC

Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Sep 20, 2023 - 02:45 UTC

Investigating

We are currently investigating the following issue:

  1. Component(s): Reporting, LLD
  2. Impact(s):
    • Some data latency. As a consequence, reports or feeds for some hours will be late.
  3. Not Impacted:
    • Ad Serving
    • Bidding
  4. Geolocation(s): Global

Status: We will provide an update as soon as more information is available. Thank you for your patience.

Posted Sep 19, 2023 - 22:37 UTC