Database lag limiting access to created objects
Incident Report for Xandr
Postmortem

Incident Summary
From approximately 14:21 UTC to 15:00 UTC on Wednesday, October 9th, 2019, users may have been unable to access objects they had created because of delays in database replication.

Scope of Impact
During the incident window, users may have had trouble accessing objects they created through the UI or API, and may have seen unexpected UI or API errors.

Timeline (UTC)
2019-10-09 14:21: Incident started: High replication lag noticed by engineering.
2019-10-09 14:43: Issue escalated to an incident; investigation continued.
2019-10-09 14:45: Issue reported on status.xandr.com.
2019-10-09 14:47: Rate limits applied to specific users of the domain-list service.
2019-10-09 15:00: Incident resolved: replication lag recovered for all database instances.

Cause Analysis
A large spike in DELETE traffic to the domain-list service slowed down MySQL replication, which delayed access to newly created objects.

Resolution Steps
Our engineers resolved the issue by lowering the rate limit for the user making the large number of DELETE requests.
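
As an illustration of the kind of per-user limit described above, the following is a minimal sketch of a token-bucket rate limiter with per-user overrides. It is not the domain-list service's actual implementation; the class name, parameters, and user IDs are all hypothetical.

```python
import time


class PerUserRateLimiter:
    """Token-bucket limiter keyed by user ID (hypothetical sketch)."""

    def __init__(self, default_rate, burst, clock=time.monotonic):
        self.default_rate = default_rate  # requests refilled per second
        self.burst = burst                # maximum tokens a bucket can hold
        self.clock = clock                # injectable clock, useful for testing
        self.overrides = {}               # user_id -> per-user refill rate
        self.buckets = {}                 # user_id -> (tokens, last_refill_time)

    def set_override(self, user_id, rate):
        """Apply a different (typically lower) refill rate to one user."""
        self.overrides[user_id] = rate

    def allow(self, user_id):
        """Return True if the user's request may proceed right now."""
        now = self.clock()
        rate = self.overrides.get(user_id, self.default_rate)
        tokens, last = self.buckets.get(user_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * rate)
        if tokens >= 1:
            self.buckets[user_id] = (tokens - 1, now)
            return True
        self.buckets[user_id] = (tokens, now)
        return False
```

In this sketch, calling set_override for a single user lowers that user's request rate without changing the limits for anyone else.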

Next Steps
* Re-architect the domain-list service so that it can delete large amounts of data at a more regulated pace.
* Add a limit on the number of URLs that a domain list can contain.
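
One common way to delete data at a regulated pace is to remove rows in small batches and pause between batches, which gives replicas time to catch up. The sketch below shows that pattern under stated assumptions: the table name domain_list_urls and its columns are hypothetical, and SQLite stands in for MySQL so that the example is self-contained. It is not the planned design of the domain-list service.

```python
import sqlite3
import time


def delete_in_batches(conn, list_id, batch_size=1000, pause_seconds=0.1,
                      sleep=time.sleep):
    """Delete all rows for one list in small batches, pausing between them.

    Table and column names are hypothetical. Returns the number of rows deleted.
    """
    total = 0
    while True:
        # Delete at most batch_size rows per statement so each transaction
        # stays small and replicates quickly.
        cur = conn.execute(
            "DELETE FROM domain_list_urls WHERE id IN ("
            " SELECT id FROM domain_list_urls WHERE list_id = ? LIMIT ?)",
            (list_id, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        # Pause before the next batch to spread the write load over time.
        sleep(pause_seconds)
    return total
```

The batch size and pause length are tuning knobs: smaller batches and longer pauses reduce replication pressure at the cost of a slower overall delete.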

Posted Oct 14, 2019 - 18:43 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Oct 09, 2019 - 16:45 UTC

Monitoring

We have identified and patched the following issue:

  • Component(s): Console API, Console UI
  • Impact(s):
    • The UI and API may be unable to access created objects, which may result in unexpected UI and API errors
  • Severity: Minor Outage
  • Datacenter(s): Global

We are monitoring our systems closely, and will provide an update as soon as the issue has been fully resolved.

Posted Oct 09, 2019 - 15:23 UTC