Console login failures and API errors
Incident Report for Xandr
Postmortem

Incident Summary

On Friday, 7 February 2020, between 06:29 and 06:33 UTC, during network maintenance in LAX1, the LAX1 Layer 2 network was isolated for approximately 4 minutes. As a result, Anycast DNS from LAX1 was withdrawn from the Internet and various connection issues were reported.

Scope of Impact

Console login and the API were impacted from 06:30 to 07:00 UTC.

Timeline (UTC)

2020-02-07 06:18 Net engineer switched traffic
2020-02-07 06:29 Core Switch rebooted
2020-02-07 06:30 Incident started
2020-02-07 06:32 Sysops Team received alerts
2020-02-07 06:37 Core Switch came up
2020-02-07 06:44 DBA team informed the Net engineer team that the database was down
2020-02-07 06:44 Sysops Team confirmed the alerts had started to clear
2020-02-07 07:00 Incident resolved
2020-02-07 07:12 Incident ticket created
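
For illustration only, the intervals implied by the timeline above can be derived with a short Python sketch. The event names (such as `incident_start`) are hypothetical labels chosen for this example and are not identifiers from any Xandr system; the timestamps are copied from the timeline.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline; the keys are hypothetical labels.
FMT = "%Y-%m-%d %H:%M"
events = {
    "switch_reboot": "2020-02-07 06:29",
    "incident_start": "2020-02-07 06:30",
    "first_alert": "2020-02-07 06:32",
    "switch_up": "2020-02-07 06:37",
    "incident_resolved": "2020-02-07 07:00",
}
times = {name: datetime.strptime(ts, FMT) for name, ts in events.items()}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two named events."""
    return int((times[end] - times[start]).total_seconds() // 60)


# Time from incident start until the Sysops Team received alerts.
time_to_detect = minutes_between("incident_start", "first_alert")
# Time from the Core Switch reboot until the switch came back up.
reboot_duration = minutes_between("switch_reboot", "switch_up")
# Time from incident start until the incident was resolved.
impact_duration = minutes_between("incident_start", "incident_resolved")

print(f"Time to detect:  {time_to_detect} min")
print(f"Reboot duration: {reboot_duration} min")
print(f"Impact duration: {impact_duration} min")
```

Under these assumptions the sketch gives a 2-minute detection time, an 8-minute switch reboot and a 30-minute impact window, the last of which matches the stated 06:30 to 07:00 UTC scope of impact.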

Cause Analysis

While the Core Switch was rebooting and its links were down, the Core Switch routing engine became overloaded.
This overload caused the connection issue.
The LAX1 Layer 2 network was isolated from the Layer 3 network, the Internet and other data centres for approximately 4 minutes while the switch was booting.

Resolution Steps

The connection issue recovered on its own.

Next Steps

Prevention
With the current architecture, this connection issue is not preventable for the Layer 2 network, which has a longer convergence time than Layer 3.
However, we could evaluate shifting application traffic to other data centres, where possible, to reduce customer impact.

Posted Feb 21, 2020 - 02:13 UTC

Resolved

The following incident has been fully resolved, and we will post a post-mortem as soon as we have completed one:

  • Component(s): Console API, Console UI
  • Impact(s):
    • Page load failures and errors in user interface
    • Latency, timeouts and errors in API
    • Some users unable to log in
  • Severity: Minor Outage
  • Datacenter(s): SIN1, LAX1

We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Feb 07, 2020 - 07:54 UTC