Database lag limiting access to created objects
Incident Report for Xandr
Postmortem

Incident Summary

From approximately 12:10 UTC on Saturday, October 12th to 18:35 UTC on Monday, October 14th, users may have been unable to access created objects or see recent changes to objects because of delays in database replication.

Scope of Impact

During the incident window, users may have had trouble accessing objects they created or edited in the UI/API, and may have seen unexpected UI/API errors.

Timeline (UTC)

2019-10-12 12:10: Incident started: a table alteration applied IDs in a different order on some replica databases.
2019-10-13 20:09: Replication stalled on the replica databases where IDs were in a different order.
2019-10-13 22:13: Issue escalated to an incident and investigation proceeded.
2019-10-13 22:25: Engineering began to shift traffic to in-sync database hosts.
2019-10-13 22:45: Critical traffic shift completed.
2019-10-14 18:53: Client impact of incident mitigated.
2019-10-15 12:04: Incident resolved. Traffic shift back to the rebuilt database hosts completed.

Cause Analysis

Altering a database table caused IDs to be applied in a different order on some replica databases that were running a different database server version from the rest.
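As an illustration only, version skew of this kind can be surfaced before a schema change by comparing the version each replica reports against the primary's. The sketch below is not the tooling used during this incident: the function name, host names, and version strings are hypothetical, and how versions are actually collected depends on the database in use.

```python
def find_version_skew(primary_version, replica_versions):
    """Return the replicas whose server version differs from the primary's.

    replica_versions maps a (hypothetical) host name to the version
    string that host reports. The result is sorted for stable output.
    """
    return sorted(
        host
        for host, version in replica_versions.items()
        if version != primary_version
    )


# Hypothetical inventory: one replica runs a different server version.
reported = {
    "replica-a": "v2",
    "replica-b": "v1",
    "replica-c": "v2",
}
skewed = find_version_skew("v2", reported)
```

A non-empty result would be a reason to pause the table alteration, or to test it against each version first.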

Resolution Steps

Our engineers resolved the issue by shifting traffic to locations where data replication was in sync with the master database. A backup from an unaffected replica database was then restored to the affected replica databases. Once those replicas were in sync, traffic was shifted back to them.
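The traffic shift described above amounts to routing reads only to replicas that are keeping up with the master database. A minimal sketch of that selection follows. It assumes each replica reports a lag in seconds and treats a stalled replica as having no lag value; the `Replica` type, host names, and 5-second threshold are hypothetical and not taken from the incident.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    host: str
    # Seconds behind the master; None means replication is stalled
    # or the lag could not be measured.
    lag_seconds: Optional[float]


def in_sync_hosts(replicas: List[Replica], max_lag_seconds: float = 5.0) -> List[str]:
    """Return hosts whose measured lag is within the allowed threshold."""
    return [
        r.host
        for r in replicas
        if r.lag_seconds is not None and r.lag_seconds <= max_lag_seconds
    ]


# Hypothetical fleet: one healthy, one stalled, one far behind.
fleet = [
    Replica("replica-a", 0.5),
    Replica("replica-b", None),
    Replica("replica-c", 120.0),
]
eligible = in_sync_hosts(fleet)
```

Treating an unknown lag as out of sync is the conservative choice here: a replica whose replication has stalled may not report a meaningful lag value at all.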

Next Steps

  • Build a testing and QA environment that more accurately mirrors the production environment, including its mix of database server versions.
Posted Nov 06, 2019 - 22:20 UTC

Resolved

This issue has been resolved but we continue to operate at a slightly reduced capacity and expect to return to full capacity within 24 hours. We apologize for the inconvenience this issue may have caused and thank you for your patience.

Posted Oct 15, 2019 - 00:46 UTC
Monitoring

We have patched the issue but are operating at a slightly reduced capacity, which might cause slower responses. However, ad serving should not be affected. We apologize for the inconvenience this issue may have caused and thank you for your patience.

Posted Oct 14, 2019 - 02:15 UTC
Identified

We have identified the following issue:

  • Component(s): Console API, Console UI
  • Impact(s):
    • The UI and API may be unable to access created objects, which may result in unexpected UI and API errors
  • Severity: Minor Outage
  • Datacenter(s): Global

Our engineers are actively working towards a resolution, and we will provide an update as soon as possible. Thank you for your patience.

Posted Oct 13, 2019 - 22:42 UTC