RevolutionParts apologize for the inconvenience caused by the outage on Jan 24th, 2023. This post-mortem explains the timeframe, the systems impacted, and the steps we have taken to ensure we have addressed the root cause.
At 11:38 PM MST on January 23rd, 2023, our monitoring systems alerted that the checkout and manage (supplier back-office) applications were unreachable, preventing end customers/parts buyers from placing orders. Our site reliability engineering team immediately started investigating and working to restore the affected systems. The incident was declared over at 5:00 AM MST on January 24th, 2023. The outage was during a period of the lowest sales (at night), and the impact on lost or delayed sales were low. At no point during this incident was data at risk of loss.
Our applications run in the cloud through containerization technology. That technology is managed by a control plane. A configuration change to the control plane reverted a version of a networking driver that caused the control plane to become unstable/unreachable. The site reliability engineering team rebuilt the control plane with the correct version to recover the checkout and manage applications. The correct networking driver is now fixed to avoid the incompatibility we experienced.