Networking issue in our NYM data center

Incident Report for Xandr

Postmortem

Incident Summary
From approximately 20:30 UTC on January 24, 2020 to 22:34 UTC on January 24, 2020, there was a networking issue with our core switch that prevented some instances from obtaining IP addresses dynamically. This caused some instances to go offline.

Scope of Impact
During the incident window, offline instances caused some services to be unavailable. This included:
* Inability to send out reports
* Inability to preview creatives

Timeline (UTC)
Friday 24 January 2020 20:34 UTC First report of down instances
Friday 24 January 2020 21:41 UTC Engineering confirmed the networking issue
Friday 24 January 2020 21:42 UTC The team started working to manually bring up the affected instances
Friday 24 January 2020 22:34 UTC Root cause identified on the core switch
Friday 24 January 2020 22:36 UTC Confirmed all affected instances are back up

Cause Analysis
The incident was caused by a failure of the DHCP service on the core switch. Many instances went offline when they could not renew their IP addresses, and the offline instances in turn took down the services that depended on them.
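
To illustrate why a DHCP outage takes an instance offline only once its lease runs out, here is a minimal Python sketch of the default lease timers described in RFC 2131 (renewal attempts begin at 50% of the lease, rebinding at 87.5%). The function names, lease length, and timestamps are hypothetical; the actual lease settings in the affected data center are not stated in this report.

```python
from datetime import datetime, timedelta, timezone


def lease_milestones(lease_start, lease_seconds):
    """Return the default RFC 2131 renewal (T1), rebinding (T2), and expiry times."""
    lease = timedelta(seconds=lease_seconds)
    return {
        "renew_t1": lease_start + lease * 0.5,     # client starts asking its server to renew
        "rebind_t2": lease_start + lease * 0.875,  # client broadcasts to any DHCP server
        "expiry": lease_start + lease,             # client must stop using the address
    }


def is_offline(lease_start, lease_seconds, now):
    """True once the lease has expired without a successful renewal."""
    return now >= lease_milestones(lease_start, lease_seconds)["expiry"]


# Hypothetical example: a 1-hour lease obtained at 00:00 UTC, with DHCP down throughout.
start = datetime(2020, 1, 1, 0, 0, tzinfo=timezone.utc)
milestones = lease_milestones(start, 3600)
for name, when in milestones.items():
    print(f"{name}: {when.isoformat()}")
```

Under these assumed values the instance would keep its address until the expiry time and drop off the network only after that, which is why instances that depend on DHCP are exposed for as long as the service stays down.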

Resolution Steps
Once the DHCP service on the core switch was brought back online, all instances rejoined the network and service was restored.

Next Steps
1. The issue has been raised with the hardware vendor to identify the root cause
2. We are looking into converting core and critical instances so that they do not rely on DHCP to stay online

Posted Jan 30, 2020 - 17:02 UTC

Resolved

The following incident has been fully resolved, and we will post a post-mortem as soon as we have completed one:

Component(s): Creative Console pages, Console API, Log Level Data, Analytics reports, Creative preview
Impact(s):
- Stale reporting data
- Latency, timeouts and errors for affected services
- Slower report retrieval
- Page load failures and errors in user interface
- Unable to preview creatives
- Latency, timeouts and errors in API
Severity: Minor Outage
Datacenter(s): NYM2

We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Jan 25, 2020 - 17:39 UTC