API / UI errors on line items and creatives
Incident Report for Xandr
Postmortem

Incident Summary

From approximately 07:48 to 10:11 UTC on Tuesday, September 7th, 2021, clients encountered errors in the UI when working with line items and creatives.

Scope of Impact

During the incident window, customers began seeing errors in the UI and were unable to save new or edited creatives and line items.

Timeline (UTC)

2021-09-07 07:48: Incident Started: concurrent, high-cost requests caused database timeouts, which led to an increase in error responses to customers
2021-09-07 09:08: Incident Escalated to engineers
2021-09-07 10:11: Incident Resolved: back-end API services restarted, resetting database connections

Cause Analysis

The incident was caused by an increase in concurrent, high-cost queries, which led to intermittent latency and errors. Our API's design currently prevents proper throttling of high-cost calls that tie up database connections. As available database resources decreased, other API calls began to fail or time out. This ultimately resulted in a high level of intermittent errors that made our platform unstable.
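
To illustrate the kind of throttling the cause analysis refers to, the following is a minimal sketch in Python, not a description of our actual services. It assumes high-cost calls can be identified up front and routed through a limiter that caps how many of them may hold database connections at once. The names HighCostLimiter and TooManyHighCostRequests are hypothetical and chosen only for this example.

```python
import threading


class TooManyHighCostRequests(Exception):
    """Raised when the high-cost budget is used up (hypothetical name).

    A caller could translate this into a "try again later" response.
    """


class HighCostLimiter:
    """Caps how many high-cost handlers may run at the same time (hypothetical)."""

    def __init__(self, max_concurrent):
        # Each slot stands for one high-cost call allowed to hold a
        # database connection. The limit is an assumption to be tuned.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, handler, *args, **kwargs):
        # Reject immediately instead of queueing, so excess high-cost calls
        # never wait while holding or competing for connections.
        if not self._slots.acquire(blocking=False):
            raise TooManyHighCostRequests("high-cost request limit reached")
        try:
            return handler(*args, **kwargs)
        finally:
            # Always return the slot, even if the handler raised.
            self._slots.release()
```

With a limit set below the size of the database connection pool, low-cost calls would keep some connections available even when many high-cost calls arrive at once. Rejecting rather than queueing is one possible design choice; a bounded wait would be an alternative.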

Resolution Steps

Our engineers resolved the issue by restarting multiple back-end API services, which freed up the database connections and allowed requests to return successfully.

Next Steps

  • Mitigate the impact of concurrent, high-cost requests to our API services
  • Upgrade our database servers to Ubuntu 20.04, which offers improved performance
  • Develop a new API to provide deltas of large / high-cost database sources
  • Upgrade one of our back-end apps to add functionality that increases observability across several services
Posted Sep 16, 2021 - 15:37 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Sep 07, 2021 - 11:37 UTC
Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Sep 07, 2021 - 10:22 UTC
Investigating

We are currently investigating the following issue:

  • Component(s): Buy-side pages, API, Creative pages
  • Impact(s):
    • Page load failures and errors in user interface
    • Unable to save/edit objects
    • Unable to upload or preview creatives
    • Latency, timeouts and errors in API
  • Severity: Major Outage
  • Datacenter(s): Global

We will provide an update as soon as more information is available. Thank you for your patience.

Posted Sep 07, 2021 - 09:23 UTC