Failure in Buyside API for AMS3 Datacenter

Incident Report for Xandr

Postmortem



Incident Summary

From approximately 20:30 UTC to 22:15 UTC on Thursday, Sep 14, 2023, users were unable to access buy-side API and UI pages.



Incident Impact

  • Nature of Impact(s):
    • Page load failures and errors in user interface
    • Latency, timeouts and errors in API
  • Timeframe: ~1.75 hours (20:30 UTC to 22:15 UTC on Thursday, Sep 14, 2023)
  • Scope: Western Europe (AMS3)
  • Components:
    • Buy-side pages
    • Partner/Deal pages
    • API


Timeframe (UTC)


  • 2023-09-14 20:30: Incident Started.
  • 2023-09-14 20:30: Escalated to Engineer.
  • 2023-09-14 22:15: Incident Resolved.



Cause Analysis

The issue was caused by out-of-memory (OOM) errors at runtime in the API containers: the memory limits set on those containers were too low once hyperthreading was introduced in the datacentre. As a result, the pods were OOM-killed, and no pods remained operational long enough to handle the traffic load.



Resolution Steps

The issue was resolved by increasing the memory limit of the API pods to accommodate the higher memory demand in a hyperthreaded environment. The engineering team also worked together to optimize the server configuration so that it handles hyperthreading without requiring additional memory.
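The relationship between hyperthreading and memory demand described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the API server sizes its worker pool from the number of logical CPUs, so that enabling hyperthreading doubles the worker count; the report does not state the actual mechanism. All function names and figures below are hypothetical and are not taken from the incident.

```python
import math


def estimated_memory_mib(logical_cpus, workers_per_cpu, per_worker_mib, base_mib):
    """Estimate container memory when the worker pool scales with logical CPUs.

    Assumption: one fixed-size pool of workers per logical CPU plus a fixed
    base footprint. Real servers may scale differently.
    """
    workers = logical_cpus * workers_per_cpu
    return base_mib + workers * per_worker_mib


def max_workers_within_limit(limit_mib, per_worker_mib, base_mib, headroom=0.2):
    """Largest worker count that fits under the memory limit.

    A fraction of the limit (headroom) is kept free for spikes.
    """
    usable = limit_mib * (1 - headroom) - base_mib
    return max(0, math.floor(usable / per_worker_mib))


# Hypothetical figures, for illustration only.
LIMIT_MIB = 4096       # container memory limit
PER_WORKER_MIB = 200   # memory per worker
BASE_MIB = 512         # fixed footprint of the process

# 16 physical cores without hyperthreading, 32 logical CPUs with it.
before = estimated_memory_mib(16, 1, PER_WORKER_MIB, BASE_MIB)
after = estimated_memory_mib(32, 1, PER_WORKER_MIB, BASE_MIB)

fits_before = before <= LIMIT_MIB   # the old worker count fits the limit
fits_after = after <= LIMIT_MIB     # the doubled worker count does not

# Option 1: raise the limit to cover the new demand.
# Option 2: cap the worker count so the existing limit still holds.
capped_workers = max_workers_within_limit(LIMIT_MIB, PER_WORKER_MIB, BASE_MIB)
```

Under these assumed numbers, the two remediation paths in the sketch correspond to the two steps in the resolution: raising the memory limit, or pinning the worker count explicitly instead of deriving it from the logical CPU count.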



Follow-up Items

  • Revisit and review the resource pool for applications on the failed-over cluster for possible tuning.
  • Inspect the relevant alerts and set better thresholds so that similar incidents are detected and averted in the future.
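As a rough sketch of what threshold-based alerting for this failure mode could look like, the example below classifies a container's memory usage relative to its limit and flags a shortage of ready pods. The function names, the 75% and 90% ratios, and the minimum ready-pod ratio are illustrative assumptions, not the thresholds used in production.

```python
def memory_alert_level(usage_mib, limit_mib, warn_ratio=0.75, critical_ratio=0.9):
    """Classify memory usage relative to the container limit.

    The ratios are hypothetical defaults; tune them to the workload.
    """
    if limit_mib <= 0:
        raise ValueError("limit_mib must be positive")
    ratio = usage_mib / limit_mib
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


def too_few_ready_pods(ready, desired, min_ready_ratio=0.5):
    """Return True when the share of ready pods drops below the minimum."""
    if desired <= 0:
        raise ValueError("desired must be positive")
    return ready / desired < min_ready_ratio


# Example readings against a hypothetical 4096 MiB limit.
levels = [memory_alert_level(u, 4096) for u in (2000, 3200, 3900)]
```

Alerting on the usage-to-limit ratio, rather than on absolute usage, keeps the threshold meaningful when the limit itself is changed, as it was during this incident.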


Posted Oct 06, 2023 - 18:26 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Sep 14, 2023 - 22:51 UTC

Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Sep 14, 2023 - 22:23 UTC

Investigating

We are currently investigating the following issue:

  1. Component(s): Buy-side pages, Partner/Deal pages, API
  2. Impact(s):
    • Page load failures and errors in user interface
    • Latency, timeouts and errors in API
  3. Not Impacted:
    • Ad Serving
    • Bidding
  4. Geolocation(s): Western Europe (AMS3)

Status: We will provide an update as soon as more information is available. Thank you for your patience.

Posted Sep 14, 2023 - 21:22 UTC