Log in not working

Incident Report for Xandr

Postmortem

Incident Summary

From approximately 08:49 UTC to 09:19 UTC on Friday, November 1st, and again from 07:10 UTC to 08:00 UTC on Monday, November 4th, temporary database connection issues prevented many users in Europe from logging in to our products.

Scope of Impact

During the two incident windows, users nearest our Amsterdam (AMS1) data center were unable to log in.

Timeline (UTC)

2019-11-01 08:49: Incident started
2019-11-01 09:11: Issue escalated to engineering team
2019-11-01 09:19: Incident resolved without intervention
2019-11-04 07:10: Incident recurred
2019-11-04 08:00: Incident resolved without intervention
2019-11-04 11:14: DNS resolution service moved to separate hardware

Cause Analysis

The DNS resolution service for our Kubernetes-hosted infrastructure (kube-dns) was running its workloads on the Kubernetes master node, where it came under severe load. This caused its containers to restart in rapid succession and, ultimately, connections to fail.

Resolution Steps

The DNS resolution service was moved off the master node to a separate worker node, so that it no longer competes for resources with other system-critical workloads.
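
For illustration only, one common way to keep a workload off the master node is a node-affinity rule on its Deployment. The report does not state which mechanism was actually used, so the sketch below is an assumption: the label key `node-role.kubernetes.io/master` and the helper name `build_affinity_patch` are hypothetical and would need to match the real cluster.

```python
import json

# Assumption: the master node carries this conventional label.
# Check the labels on your own nodes before relying on it.
MASTER_LABEL = "node-role.kubernetes.io/master"


def build_affinity_patch(excluded_label: str = MASTER_LABEL) -> dict:
    """Build a Deployment patch that keeps pods off nodes carrying excluded_label."""
    return {
        "spec": {"template": {"spec": {"affinity": {"nodeAffinity": {
            # Hard requirement: only schedule onto nodes WITHOUT the label.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [
                        {"key": excluded_label, "operator": "DoesNotExist"}
                    ]
                }]
            }
        }}}}}
    }


if __name__ == "__main__":
    # Print the patch as JSON so it can be reviewed before it is applied.
    print(json.dumps(build_affinity_patch(), indent=2))
```

The printed JSON could then be reviewed and applied to the DNS Deployment with a tool such as `kubectl patch`. A required (rather than preferred) rule is used here so the scheduler never falls back to the master node.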

Next Steps

* Improve logging for similar issues
* Add monitoring and alerting for DNS resolution health
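
As a rough sketch of what a DNS resolution health check might look like, the following probe resolves a hostname, records success and latency, and decides whether a batch of results warrants an alert. It is not the monitoring that was actually deployed: the names `probe_dns` and `should_alert` and the threshold values are assumptions chosen for illustration.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProbeResult:
    hostname: str
    ok: bool
    latency_ms: float
    error: Optional[str] = None


def probe_dns(hostname: str, resolve: Callable = socket.getaddrinfo) -> ProbeResult:
    """Resolve one hostname and record whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        resolve(hostname, None)
        ok, error = True, None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        ok, error = False, str(exc)
    latency_ms = (time.monotonic() - start) * 1000.0
    return ProbeResult(hostname, ok, latency_ms, error)


def should_alert(
    results: List[ProbeResult],
    max_bad_ratio: float = 0.2,      # hypothetical threshold
    max_latency_ms: float = 500.0,   # hypothetical threshold
) -> bool:
    """Alert when too many probes in the batch failed or were too slow."""
    if not results:
        return False
    bad = sum(1 for r in results if not r.ok or r.latency_ms > max_latency_ms)
    return bad / len(results) > max_bad_ratio


if __name__ == "__main__":
    # Assumption: run from inside the cluster against a name that should always resolve.
    batch = [probe_dns("kubernetes.default.svc.cluster.local") for _ in range(5)]
    print("ALERT" if should_alert(batch) else "OK")
```

Judging a batch of probes rather than a single lookup keeps one slow or failed query from paging anyone, while still catching the sustained failures that repeated container restarts would produce.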

Posted Nov 08, 2019 - 09:10 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Nov 01, 2019 - 13:13 UTC

Monitoring

We have identified and patched the following issue:

Component(s): Console UI
Impact(s):
- Some users unable to log in
Severity: Major Outage
Datacenter(s): AMS1

We are monitoring our systems closely, and will provide an update as soon as the issue has been fully resolved.

Posted Nov 01, 2019 - 09:48 UTC