QPS Limits Not Being Respected
Incident Report for Xandr
Postmortem

Incident Summary

From approximately 18:30 UTC to 23:00 UTC on October 17, 2019, QPS limits were disregarded for some bidders, causing the volume of bid requests to surge, which some bidders were not capable of handling.

Scope of Impact

During the incident window, bidders may have seen a spike in the number of bid requests they received, which exceeded the QPS limits they had set. Some bidders had to stop bidding in order to protect their infrastructure from becoming overwhelmed.

Timeline (UTC)

2019-10-17 18:30: External bidders started seeing their QPS limits exceeded.
2019-10-17 19:00: The root cause of the issue was identified.
2019-10-17 19:12: Release of the code rollback to resolve the bug began.
2019-10-17 21:28: The impbus revert was 70% released, resolving the issue for affected clients.
2019-10-17 23:00: Release of the fix was fully complete.

Cause Analysis

This incident was caused by a change in our impbus code that was intended to improve QPS accuracy. The change did result in QPS limits being respected for some traffic, but it had an adverse effect on other types of traffic, resulting in a spike of requests for certain clients.

Resolution Steps

A rollback of this change resolved the issue. Deployment of the fix started at 19:12 UTC and was confirmed to be working within 30 minutes. 70% of impbuses had been reverted by 21:30 UTC, and deployment was fully complete at 23:00 UTC.

Next Steps

We will implement additional QPS alerts to help prevent this issue from recurring.
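
As an illustration only, a QPS alert of this kind could compare the request rate observed for a bidder against that bidder's configured limit over a sliding window. The sketch below is a minimal example under stated assumptions and does not describe Xandr's actual alerting; the class name QpsLimitAlert, the 10-second window, and the 10% tolerance are all hypothetical.

```python
from collections import deque


class QpsLimitAlert:
    """Hypothetical per-bidder check: flags when observed QPS exceeds the limit.

    All names and default values here are illustrative assumptions.
    """

    def __init__(self, qps_limit, window_seconds=10, tolerance=1.1):
        self.qps_limit = qps_limit            # bidder's configured QPS limit
        self.window_seconds = window_seconds  # sliding window length in seconds
        self.tolerance = tolerance            # allowed overshoot factor before alerting
        self._timestamps = deque()            # send times of recent bid requests

    def record_request(self, now):
        """Record one bid request sent to the bidder at time `now` (seconds)."""
        self._timestamps.append(now)
        self._evict(now)

    def _evict(self, now):
        """Drop timestamps that have fallen out of the sliding window."""
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()

    def observed_qps(self, now):
        """Average requests per second over the current window."""
        self._evict(now)
        return len(self._timestamps) / self.window_seconds

    def should_alert(self, now):
        """True when observed QPS exceeds the limit by more than the tolerance."""
        return self.observed_qps(now) > self.qps_limit * self.tolerance
```

In this sketch, a bidder with a limit of 100 QPS would trigger the alert only once the windowed average rises above 110 QPS, which leaves room for short bursts while still catching a sustained overshoot.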

Posted Nov 06, 2019 - 19:44 UTC

Resolved

This incident has been resolved, and all QPS should return to expected levels within the next hour. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Oct 17, 2019 - 21:35 UTC
Identified

We have identified the following issue:

  • Component(s): Ad Serving
  • Impact(s):
    • QPS Limits are being exceeded for requests sent to External Bidders & Real-Time Data Providers
  • Severity: Minor Outage
  • Datacenter(s): Global

Our engineers are actively working towards a resolution, and we will provide an update as soon as possible. Thank you for your patience.

Posted Oct 17, 2019 - 18:53 UTC