Incident Summary
From approximately 20:30 UTC to 22:34 UTC on January 24, 2020, a networking issue on our core switch prevented some instances from obtaining IP addresses dynamically, which took those instances offline.
Scope of Impact
During the incident window, services hosted on the offline instances were unavailable. Affected functionality included:
* Sending out reports
* Previewing creatives
Timeline (UTC)
Friday 24 January 2020 20:34 UTC First report of down instances
Friday 24 January 2020 21:41 UTC Engineering confirmed the networking issue
Friday 24 January 2020 21:42 UTC The team started manually bringing up the affected instances
Friday 24 January 2020 22:34 UTC Root cause identified on the core switch
Friday 24 January 2020 22:36 UTC Confirmed all affected instances are back up
Cause Analysis
The incident was caused by a failure of the DHCP service on the core switch. Instances that could not renew their IP addresses went offline, which in turn took down the services hosted on them.
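To illustrate why only some instances dropped off the network, the following is a minimal sketch of standard DHCP lease timing, not a reconstruction of our actual configuration. It assumes the default renewal timers from RFC 2131 (T1 at 50% of the lease, T2 at 87.5%). The function names, lease durations, and renewal times are hypothetical; only the approximate outage window comes from this report.

```python
from datetime import datetime, timedelta, timezone


def lease_timers(last_renewal, lease_seconds):
    """Return (T1, T2, expiry) using the RFC 2131 default timer fractions."""
    lease = timedelta(seconds=lease_seconds)
    t1 = last_renewal + 0.5 * lease      # client first tries to renew
    t2 = last_renewal + 0.875 * lease    # client falls back to rebinding
    expiry = last_renewal + lease        # client must stop using the address
    return t1, t2, expiry


def loses_address(last_renewal, lease_seconds, outage_start, outage_end):
    """True if the lease expires before DHCP is back and could not be renewed first."""
    t1, _t2, expiry = lease_timers(last_renewal, lease_seconds)
    if t1 < outage_start:
        # The first renewal attempt happened while DHCP was still healthy.
        return False
    return expiry < outage_end


# Approximate outage window taken from this report.
OUTAGE_START = datetime(2020, 1, 24, 20, 30, tzinfo=timezone.utc)
OUTAGE_END = datetime(2020, 1, 24, 22, 34, tzinfo=timezone.utc)

# Hypothetical instance with a 1-hour lease, last renewed at 20:10 UTC.
short_lease = loses_address(
    datetime(2020, 1, 24, 20, 10, tzinfo=timezone.utc), 3600, OUTAGE_START, OUTAGE_END
)

# Hypothetical instance with a 24-hour lease, last renewed at 12:00 UTC.
long_lease = loses_address(
    datetime(2020, 1, 24, 12, 0, tzinfo=timezone.utc), 86400, OUTAGE_START, OUTAGE_END
)
```

Under these assumed values, the short-lease instance loses its address during the window while the long-lease instance rides through it, which is consistent with only a subset of instances going offline.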
Resolution Steps
Once the DHCP service on the core switch was brought back online, all instances rejoined the network and service was restored.
Next Steps
1. We have raised the issue with the hardware vendor to identify the root cause of the DHCP service failure
2. We are looking into converting core and critical instances so that they do not rely on DHCP to stay online
This incident has been fully resolved, and we will post a post-mortem as soon as we have completed one.
We apologize for the inconvenience this issue may have caused, and thank you for your continued support.