System Outage
Incident Report for RevolutionParts, Inc
Postmortem

RevolutionParts apologizes for the inconvenience caused by the outage on Jan 24th, 2023. This post-mortem explains the timeframe, the systems impacted, and the steps we have taken to address the root cause.

What Happened

At 11:38 PM MST on January 23rd, 2023, our monitoring systems alerted that the checkout and manage (supplier back-office) applications were unreachable, preventing end customers/parts buyers from placing orders. Our site reliability engineering team immediately started investigating and working to restore the affected systems. The incident was declared over at 5:00 AM MST on January 24th, 2023. The outage occurred overnight, during the period of lowest sales, so the impact in lost or delayed sales was low. At no point during this incident was data at risk of loss.

Root Cause & Remediation

Our applications run in the cloud using containerization technology, which is managed by a control plane. A configuration change to the control plane reverted a networking driver to an earlier version, which caused the control plane to become unstable and unreachable. The site reliability engineering team rebuilt the control plane with the correct driver version to recover the checkout and manage applications. The networking driver is now pinned to the correct version to avoid the incompatibility we experienced.

Posted Jan 30, 2023 - 11:23 MST

Resolved
RevolutionParts experienced a system outage affecting online storefronts, checkout, and order management on Jan 24th, 2023, starting at approximately 12:00 AM MST and lasting through approximately 5:30 AM MST. This was a technical issue in our infrastructure that has now been identified and resolved. All systems were secure during this incident and no data was lost. In the coming days, we will post a post-mortem with additional details. We apologize for the inconvenience this caused.
Posted Jan 24, 2023 - 00:30 MST