Between approximately 08:49 and 09:19 UTC on Friday, November 1st, and again between 07:10 and 08:00 UTC on Monday, November 4th, temporary database connection issues prevented many users in Europe from authenticating with our products.
During the two incident windows, users nearest our Amsterdam (AMS1) data center were unable to log in.
2019-11-01 08:49: Incident started
2019-11-01 09:11: Issue escalated to engineering team
2019-11-01 09:19: Incident resolved itself
2019-11-04 07:10: Incident reoccurred
2019-11-04 08:00: Incident resolved itself
2019-11-04 11:14: DNS resolution service moved to separate hardware
DNS resolution for our Kubernetes-hosted infrastructure (kube-dns) came under severe load while running on the Kubernetes master node, where it competed with other system-critical workloads. This caused rapid container restarts and, ultimately, failed database connections.
DNS resolution was moved to a separate worker node, mitigating the problem of competing system-critical workloads.
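One way such a move might be expressed is a scheduling constraint that keeps the DNS pods off master nodes. The sketch below builds a Kubernetes strategic merge patch as a plain Python dictionary. It is illustrative only: the helper name `avoid_master_patch` is hypothetical, and the label key `node-role.kubernetes.io/master` is an assumption about how master nodes are labeled in the cluster, not a description of the change that was actually applied.

```python
import json

# Assumption: master nodes carry this label (a common convention, not confirmed here).
MASTER_LABEL = "node-role.kubernetes.io/master"


def avoid_master_patch(label_key: str = MASTER_LABEL) -> dict:
    """Build a Deployment patch that requires scheduling on nodes lacking `label_key`."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "affinity": {
                        "nodeAffinity": {
                            # Hard requirement, evaluated when a pod is scheduled.
                            "requiredDuringSchedulingIgnoredDuringExecution": {
                                "nodeSelectorTerms": [
                                    {
                                        "matchExpressions": [
                                            {
                                                "key": label_key,
                                                "operator": "DoesNotExist",
                                            }
                                        ]
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }


def patch_as_json(label_key: str = MASTER_LABEL) -> str:
    """Serialize the patch so it can be handed to a patching tool."""
    return json.dumps(avoid_master_patch(label_key))
```

The JSON from `patch_as_json()` could then be supplied to whatever tooling applies patches to the DNS Deployment, for example the `-p` argument of `kubectl patch`. A hard requirement is used here rather than a preference because the goal is to guarantee separation from the master's workloads.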
* Improve logging for similar issues
* Add monitoring and alerting for DNS resolution health
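
A DNS health check of the kind listed above could look roughly like the following. This is a minimal sketch under stated assumptions, not our production monitoring: the names `ProbeResult`, `probe_dns`, and `should_alert` are hypothetical, and the thresholds (three consecutive unhealthy probes, one second of latency) are placeholder values.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class ProbeResult:
    hostname: str
    ok: bool
    latency_s: float
    error: Optional[str] = None


def probe_dns(hostname: str, resolver: Callable = socket.getaddrinfo) -> ProbeResult:
    """Resolve `hostname` once and record whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        resolver(hostname, None)
        return ProbeResult(hostname, True, time.monotonic() - start)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return ProbeResult(hostname, False, time.monotonic() - start, str(exc))


def should_alert(
    results: Iterable[ProbeResult],
    max_consecutive_failures: int = 3,  # placeholder threshold
    max_latency_s: float = 1.0,         # placeholder threshold
) -> bool:
    """Return True once enough consecutive probes have failed or been too slow."""
    streak = 0
    for result in results:
        unhealthy = (not result.ok) or result.latency_s > max_latency_s
        streak = streak + 1 if unhealthy else 0
        if streak >= max_consecutive_failures:
            return True
    return False
```

Requiring several consecutive unhealthy probes, rather than alerting on the first failure, is a deliberate choice in this sketch to avoid paging on a single transient lookup error. The `resolver` parameter exists so the probe can be exercised without real network access.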
The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.
We have identified and patched the following issue:
We are monitoring our systems closely, and will provide an update as soon as the issue has been fully resolved.