Order Home not loading Line Items/IO with error message being displayed
Incident Report for Xandr
Postmortem

Incident Summary

From approximately 07:43 to 09:09 UTC on Tuesday, July 28, 2020, clients were unable to access line items and insertion orders via the Order Home dashboard.

Scope of Impact

During the incident window, customers could not load line items or insertion orders via the Order Home dashboard and instead encountered a "Gateway Timeout" error.

Timeline (UTC)

2020-07-28 07:43: Incident started
2020-07-28 08:26: Incident escalated to on-call engineer
2020-07-28 09:09: Incident resolved

Cause Analysis

A backend process that reads line item and insertion order IDs from a queue and updates their status stopped processing, causing the queue to grow exponentially until it reached its maximum capacity. After several hours of attempting to process at maximum capacity, the resulting resource load caused our caching mechanisms to fail, and the system stopped responding to user requests.

Resolution Steps

Our engineers resolved the issue by switching the production environment over to the backup database cluster. The primary database cluster was then restarted and re-indexed, after which we confirmed that the issue was resolved.

Next Steps

  • Set up an alert that fires when any of our queues reaches its maximum capacity and stays there for a defined period of time
  • Monitor status updates of the database
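
As an illustration of the first item, a sustained-capacity check could look roughly like the sketch below. This is a minimal example under stated assumptions, not a description of our production monitoring: the class name, the capacity of 1000, and the 300-second hold time are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedCapacityAlert:
    """Fires when queue depth stays at or above a limit for a minimum duration.

    All names and values here are hypothetical and for illustration only.
    """

    capacity: int        # assumed maximum queue depth
    hold_seconds: float  # assumed time the queue must stay full before alerting
    _full_since: Optional[float] = None  # time of the first consecutive "full" sample

    def observe(self, depth: int, now: float) -> bool:
        """Record one depth sample taken at time `now` (in seconds).

        Returns True if the alert should fire for this sample.
        """
        if depth < self.capacity:
            # The queue dropped below the limit, so reset the timer.
            self._full_since = None
            return False
        if self._full_since is None:
            # First sample of a new "full" streak.
            self._full_since = now
        return now - self._full_since >= self.hold_seconds


# Example: (timestamp in seconds, observed queue depth)
samples = [(0, 400), (60, 1000), (120, 1000), (240, 1000),
           (360, 1000), (420, 200), (480, 1000)]
alert = SustainedCapacityAlert(capacity=1000, hold_seconds=300)
fired = [alert.observe(depth, now) for now, depth in samples]
```

Requiring the condition to hold for a period, rather than alerting on a single full reading, is intended to avoid paging on brief spikes that drain on their own while still catching a stalled consumer.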
Posted Jul 31, 2020 - 16:07 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Jul 28, 2020 - 11:18 UTC
Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Jul 28, 2020 - 09:22 UTC
Investigating

We are currently investigating the following issue:

  • Component(s): Order Home Dashboard
  • Impact(s):
    • Order Home not loading Line Items/IO
  • Severity: Major Outage
  • Datacenter(s): Global

We will provide an update as soon as more information is available. Thank you for your patience.

Posted Jul 28, 2020 - 08:49 UTC