Prebid Cache in LAX Down
Incident Report for Xandr
Postmortem

Incident Summary
POP (Programmatic OTT with Prebid) was down in the LAX datacenter from approximately 20:03 UTC to 21:04 UTC on Tuesday, May 19, 2020. AMP and Video traffic on Prebid Server was also affected during this time.

Scope of Impact
For approximately one hour (20:03 UTC to 21:04 UTC), POP (Programmatic OTT with Prebid) was down in the LAX datacenter. AMP and Video traffic on Prebid Server was also affected.

Timeline (UTC)

2020-05-19 20:03: Incident Started: Resources were removed from a load-balancing group

2020-05-19 20:08: Incident Investigated: Engineers were alerted to the issue by alerts for excessive pod restarts and unavailable replicas for Prebid Cache in LAX

2020-05-19 20:35: Incident Ticket created

2020-05-19 21:04: Incident Resolved: The removed resource was added back to the load-balancing group, resolving the issue

Cause Analysis
This incident was caused by resources being erroneously removed from a load-balancing group, causing affected systems to crash.

Resolution Steps
A host was added back to the load-balancing group, resolving the issue.

Next Steps

  • Set up additional alerts to notify if there are 0 resources up in a load-balancing group
  • Roll out additional training for the database management team
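
The first next step above, alerting when a load-balancing group has 0 resources up, could be sketched roughly as follows. This is an illustrative sketch only, not the alerting actually deployed: the `Member` type, the `empty_group_alerts` function, and the group and host names are all hypothetical, and a real check would read member health from the load balancer or monitoring system rather than from an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Member:
    """One resource (for example, a host) in a load-balancing group (hypothetical type)."""
    name: str
    healthy: bool


def empty_group_alerts(groups: Dict[str, List[Member]]) -> List[str]:
    """Return one alert message for each group with zero healthy members.

    A group with no members at all is also reported, since it likewise
    has 0 resources up.
    """
    alerts = []
    for group_name, members in groups.items():
        healthy_count = sum(1 for member in members if member.healthy)
        if healthy_count == 0:
            alerts.append(
                f"ALERT: load-balancing group '{group_name}' has 0 resources up"
            )
    return alerts


if __name__ == "__main__":
    # Hypothetical snapshot: one group has lost all of its members.
    snapshot = {
        "example-group-a": [],
        "example-group-b": [Member("host-1", True), Member("host-2", False)],
    }
    for message in empty_group_alerts(snapshot):
        print(message)
```

Treating an empty group the same as a group whose members are all unhealthy is a deliberate choice in this sketch, since a removal like the one in this incident would leave the group with no members rather than with failing ones.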
Posted Jun 08, 2020 - 16:08 UTC

Resolved

The following incident has been fully resolved, and we will post a post-mortem as soon as we have completed one:

  • Component(s): Prebid
  • Impact(s):
    • Programmatic OTT with Prebid down, AMP and video Prebid Server demand affected for one hour
  • Severity: Minor Outage
  • Datacenter(s): LAX1

We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted May 20, 2020 - 19:33 UTC