Intermittent service interruption in LiveRamp-Based Targeting and Capping for US West Coast Ad Requests
Incident Report for Xandr
Postmortem

INCIDENT SUMMARY

From approximately 04:40 UTC on Thursday, May 25 to 13:40 UTC on Friday, May 26, 2023, 100% of requests were timing out on the Azure US West LiveRamp Prod Cluster.


INCIDENT IMPACT

  • Nature of Impact(s):
    • Some frequency caps were intermittently violated.
    • Drop in delivery on external supply.
  • Timeframe: ~18 hours 6 minutes of impact in total, across two windows between 04:40 UTC on Thursday, May 25 and 13:40 UTC on Friday, May 26, 2023.
    • The incident was first resolved at 14:44 UTC on 2023-05-25.
    • The issue recurred and the incident was reopened; it was resolved a second time at 13:40 UTC on 2023-05-26.
  • Scope: North America – West Coast (LAX1)
  • Components:
    • Ad Serving
    • Bidding
    • User data
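
The total impact time quoted in this report can be reproduced by summing the two impact windows rather than measuring from the first alert to the final resolution. The following is a minimal sketch, assuming the window boundaries are the start, first-resolution, recurrence, and final-resolution timestamps given in this report; the names `WINDOWS` and `total_impact` are illustrative only.

```python
from datetime import datetime, timedelta

# Impact windows (UTC), taken from the timestamps in this report.
# Assumption: impact covers exactly these two start/end pairs.
WINDOWS = [
    (datetime(2023, 5, 25, 4, 40), datetime(2023, 5, 25, 14, 44)),
    (datetime(2023, 5, 26, 5, 38), datetime(2023, 5, 26, 13, 40)),
]


def total_impact(windows):
    """Sum the length of each (start, end) window."""
    return sum((end - start for start, end in windows), timedelta())


impact = total_impact(WINDOWS)
hours, remainder = divmod(int(impact.total_seconds()), 3600)
minutes = remainder // 60
print(f"{hours} h {minutes:02d} min")  # prints "18 h 06 min"
```

The gap between the first resolution and the recurrence is deliberately excluded, which is why the total is well below the roughly 33 hours that separate the first alert from the final resolution.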


TIMELINE (UTC)

  • 2023-05-25 04:40: Incident Started. Team received alerts from our internal monitoring tool for an increase in latency between the LAX1 datacenter and Azure.
  • 2023-05-25 05:22: Escalated to Engineer. Engineering team began investigating the latency issue, and vendors were engaged for further troubleshooting.
  • 2023-05-25 10:33: IM Ticket Created.
  • 2023-05-25 14:44: Incident Resolved. Issue was attributed to our vendors and resolved on its own.
  • 2023-05-26 05:38: Issue recurred and the IM was reopened.
  • 2023-05-26 13:40: The issue was resolved once the circuit maintenance activity was completed by our vendors.


CAUSE ANALYSIS

The issue was attributed to a maintenance activity that our vendor was carrying out on its internal circuits. As a result, response times between the LAX1 datacenter and Azure infrastructure increased, which had a significant impact on latency-sensitive traffic and caused all requests to time out on the Azure US West LiveRamp Prod Cluster.

 

RESOLUTION STEPS

The issue resolved on its own once our vendor’s carrier completed the maintenance activity. Latency then returned to normal and the network was fully restored.

 

FOLLOW-UP ITEMS

The Engineering team continues to follow up with the vendor for a detailed RCA.

Posted Jun 05, 2023 - 12:52 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted May 25, 2023 - 23:20 UTC
Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted May 25, 2023 - 15:13 UTC
Investigating

We are currently investigating the following issue:

  • Component(s): Bidding, Ad Serving, User data
  • Impact(s):
    • Some frequency caps may be intermittently violated
    • Drop in delivery on external supply
  • Not Impacted:
    • UI
    • API
  • Geolocation(s): North America - West Coast (LAX1)

We will provide an update as soon as more information is available. Thank you for your patience.

Posted May 25, 2023 - 11:42 UTC