Requests to Real-Time Data Providers Dropped

Incident Report for Xandr

Postmortem

Incident Summary

From approximately 22:00 UTC on Friday, January 8th, 2021, to 02:00 UTC on Saturday, January 9th, 2021, the Impression Bus stopped sending requests to all data providers in all datacenters. From approximately 02:00 to 12:00 UTC on Saturday, January 9th, requests were gradually restored to all data providers in all datacenters until the issue was resolved.

Scope of Impact

During the incident window, data providers did not receive requests from the Xandr Impression Bus. Clients leveraging data provider segments for buying may have experienced drops in delivery of buy-side objects targeting those segments.

Timeline (UTC)

2021-01-08 22:15: Incident Started.

2021-01-09 00:41: Incident Escalated and Engineering Notified.

2021-01-09 01:45: Problematic data provider causing the incident identified.

2021-01-09 01:50: Potential mitigation tested: Impression Bus rolled back to older configuration.

2021-01-09 03:55: All Impression Buses rolled back.

2021-01-09 04:37: First 10% of Impression Buses resume sending requests to data providers.

2021-01-09 13:50: Incident Resolved: Mitigation rollback complete. All data providers receiving full request stream.

Cause Analysis

The incident was caused by the Impression Bus procedure that applies activation, deactivation, and data provider toggling for bidder objects. This procedure responded improperly to a series of changes made to a specific bidder object, which led to incorrect accounting for all data providers and stopped all requests to them.
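This report does not describe the actual Impression Bus implementation. Purely as an illustration of this class of defect, the Python sketch below contrasts two hypothetical ways of tracking whether a data provider should receive requests. All class names, event names, and the counter-based design are assumptions made for the example. The first applies each change as a delta, so a repeated deactivation can drive the count wrong; the second stores the set of active bidders, so replaying the same change has no further effect.

```python
class DeltaAccounting:
    """Hypothetical: counts active bidders by applying +1/-1 per event."""

    def __init__(self):
        self.active_count = 0

    def apply(self, bidder_id, event):
        # Fragile: a duplicate event shifts the counter a second time.
        self.active_count += 1 if event == "activate" else -1

    def provider_enabled(self):
        return self.active_count > 0


class StateAccounting:
    """Hypothetical: tracks the set of active bidders, so events are idempotent."""

    def __init__(self):
        self.active_bidders = set()

    def apply(self, bidder_id, event):
        if event == "activate":
            self.active_bidders.add(bidder_id)
        else:
            # discard() is a no-op if the bidder is already inactive.
            self.active_bidders.discard(bidder_id)

    def provider_enabled(self):
        return len(self.active_bidders) > 0


# A series of changes in which bidder 2 is deactivated twice.
events = [
    (1, "activate"),
    (2, "activate"),
    (2, "deactivate"),
    (2, "deactivate"),
]

delta, state = DeltaAccounting(), StateAccounting()
for bidder_id, event in events:
    delta.apply(bidder_id, event)
    state.apply(bidder_id, event)

# Bidder 1 is still active, but only the state-based version reflects that.
delta_enabled = delta.provider_enabled()
state_enabled = state.provider_enabled()
```

In this sketch, the delta-based version ends with a count of zero and would stop sending requests even though one bidder remains active, while the state-based version keeps the provider enabled.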

Resolution Steps

Our engineers resolved the issue by deactivating the offending bidder object and rolling back the Impression Bus to a previous version.

Next Steps

Implement additional monitoring to identify drops in requests to data providers.
Address the underlying Impression Bus logic that introduced this defect.
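
As a rough idea of what the monitoring step could look like, the Python sketch below flags data providers whose current request rate has fallen well below a baseline. It is not Xandr's monitoring; the function name, the per-provider rate dictionaries, and the 50% threshold are all assumptions chosen for illustration.

```python
def find_dropped_providers(baseline_rps, current_rps, min_ratio=0.5):
    """Return providers whose request rate fell below min_ratio of baseline.

    baseline_rps and current_rps are hypothetical dicts mapping a data
    provider name to requests per second. A provider missing from
    current_rps is treated as receiving no requests.
    """
    dropped = []
    for provider, expected in baseline_rps.items():
        if expected <= 0:
            # No meaningful baseline, so a drop cannot be judged.
            continue
        observed = current_rps.get(provider, 0.0)
        if observed / expected < min_ratio:
            dropped.append(provider)
    return sorted(dropped)


# Example with made-up numbers: provider_a has stopped receiving requests.
baseline = {"provider_a": 100.0, "provider_b": 200.0, "provider_c": 0.0}
current = {"provider_a": 0.0, "provider_b": 190.0}

alerts = find_dropped_providers(baseline, current)
all_dropped = find_dropped_providers(baseline, {})
```

A check of this kind, run per datacenter, would also surface the case seen in this incident, where every provider with a baseline drops to zero at once.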

Posted Jan 14, 2021 - 17:23 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Jan 11, 2021 - 14:01 UTC

Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Jan 09, 2021 - 14:14 UTC

Identified

We have identified the cause of the issue, and our engineers are actively working towards a resolution. We will provide an update as soon as possible. Thank you for your patience.

Posted Jan 09, 2021 - 02:13 UTC

Investigating

We are currently investigating the following issue:

Component(s): Userdata
Impact(s):
- Real-time data provider segment targeting failing
Severity: Minor Outage
Datacenter(s): Global

We will provide an update as soon as more information is available. Thank you for your patience.

Posted Jan 09, 2021 - 01:13 UTC