Several ALIs didn't start because one of our controllers lost its connection
Incident Report for Xandr
Postmortem

Incident Summary

From approximately 08:00 UTC on Sunday, November 22nd, 2020 to 13:00 UTC on Tuesday, November 24th, 2020, one of the two optimization controllers in the "Discovery" ecosystem stopped publishing data to bidders. For odd-numbered line items using a CPA or CPC optimization goal, this may have resulted in no delivery, underdelivery, or poorer than typical performance.

Scope of Impact

During the incident window, Line Items that were using optimization stopped receiving Discovery updates, affecting Augmented Line Items using a CPA or CPC optimization goal. Other line items, such as newly created ones or those with restricted delivery, may not have delivered.

Timeline (UTC)

2020-11-22 08:56: Incident Started.

2020-11-24 11:57: Incident Escalated.

2020-11-24 13:00: Routed to the appropriate Engineering team; rerouting of the underlying application began.

2020-11-24 14:00: Incident Resolved.

Cause Analysis

The incident was caused by the Discovery optimization controller losing its connection to an internal system that maintains data. As a result of this connection loss, data updates were no longer sent to Xandr bidders, causing delays in the optimization process.

Resolution Steps

Our engineers resolved the issue by switching out the primary datacenter on which the Discovery optimization controller relied. This allowed for optimization updates to be published as expected.

Next Steps

  • Identify why the controller's connection was lost, and why the instance couldn't recover the connection after losing it.
  • Add further escalation paths for optimization issues.
Posted Nov 26, 2020 - 08:07 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Nov 24, 2020 - 17:32 UTC
Update

One of our Discovery Controllers lost its connection and stopped publishing records and updating the discovery state, so LIs using CPC or prospecting CPA optimization may have under-delivered.

Posted Nov 24, 2020 - 16:18 UTC
Monitoring

We are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Nov 24, 2020 - 14:48 UTC
Investigating

We are currently investigating the following issue:

  • Component(s): Ad Serving
  • Impact(s):
    • New line items prevented from delivering
  • Severity: Major Outage
  • Datacenter(s): Global

We will provide an update as soon as more information is available. Thank you for your patience.

Posted Nov 24, 2020 - 14:43 UTC