Quick guide for Root Cause Analysis

Mar 20, 2022

As Wikipedia states: Root Cause Analysis (RCA) is a method of problem-solving used for identifying the root causes of faults or problems.

It can be decomposed into four steps:

Identify and describe the problem clearly
Establish a timeline from normal situation up to the time the problem occurred
Distinguish between the root cause and other causal factors
Establish a causal graph between the root cause and the problem

From an analytics point of view, I would add a fifth step: document the findings for future reference.

Let's take an example.

The area chart for this scenario shows a dip in orders on the 24th. Because this is irregular behavior, we want to study it and share the findings with business users, so that the required steps can be taken to rectify it.

Broadly speaking, there are two ways of dealing with faults, errors, or problems (Wikipedia):

Reactive Management: Reacting quickly after the problem occurs. In our case, "quickly" means as soon as the data is available in our reports. Because reports lag behind the moment the fault occurred, very sensitive cases need systems that detect faults quickly and send out alerts. For ease of demonstration, let's allow a bigger window to solve the problem here.
Proactive Management: Preventing the problem from occurring. In a real-world scenario (especially with data pipelines) this is difficult to practice because data is so dynamic. But based on our findings from reactive management and resolution, we can tweak our solution so that we are prepared when a similar issue occurs.

We can now define the problem statement: why did the number of orders decrease on the 24th?

STEP 1: Is the data correct?

Before delving deep into what caused the dip, we should make sure that the data represented in the report is correct. A few of the situations that can cause the report to show inaccurate numbers include:

Change in the metric definition: If the business logic for defining the KPI changes, we might see changes in the report.
Reporting tool update: Many tools work from extracts, and the extract might not have been refreshed for the latest date.
Data source changes: Are the views at the data source up to date? Did any ETL pipeline fail in a way that might have reduced the order volume on the 24th?

We can create a checklist for all these tests. For any irregularities that we find in the future, it’s best to check these before moving on to the next steps.
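
Such a checklist can be kept as a small script. The following is a minimal sketch under stated assumptions: the check names, the 1% row-count tolerance, the date, and all input values are hypothetical, and in practice each input would come from your own reporting tool and pipeline metadata.

```python
from datetime import date


def extract_is_fresh(last_refreshed, report_date):
    """The reporting extract should be at least as recent as the date under review."""
    return last_refreshed >= report_date


def row_counts_match(source_rows, report_rows, tolerance=0.01):
    """Source and report row counts should agree within a relative tolerance."""
    if source_rows == 0:
        return report_rows == 0
    return abs(source_rows - report_rows) / source_rows <= tolerance


def failed_checks(results):
    """Return the names of the checks that did not pass."""
    return [name for name, passed in results.items() if not passed]


# Hypothetical date: the article only says "the 24th".
report_date = date(2022, 1, 24)

results = {
    # Assumed to be confirmed manually with the owner of the KPI.
    "metric_definition_unchanged": True,
    "extract_is_fresh": extract_is_fresh(date(2022, 1, 25), report_date),
    "etl_row_counts_match": row_counts_match(source_rows=310, report_rows=310),
}

print(failed_checks(results))
```

An empty list means the data checks passed and the dip deserves further investigation.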

Let’s say we did all the checks and found no issues with the data. The decline in orders is an actual event. We can now move on to the next step.

STEP 2: Checking Platform Changes

Are there any recent changes to your website or app? Is the decline observed only for mobile users, for a particular mobile OS, or is it spread throughout? Were any new features added to the platform? These questions help us pinpoint whether platform changes made it difficult for customers to place orders. For example, let’s say you had an issue with credit card payments because of recent changes to the payment gateway settings. When you split orders by payment type, you would be able to see this clearly.
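
The payment-type split can be sketched as a comparison of each segment against a normal day. The segment names and counts below are invented for illustration, and change_by_segment is a hypothetical helper, not part of any reporting tool.

```python
def change_by_segment(baseline, dip_day):
    """Percentage change in orders per segment, relative to the baseline day.
    Segments with a zero baseline are reported as None."""
    changes = {}
    for segment, base in baseline.items():
        current = dip_day.get(segment, 0)
        changes[segment] = (current - base) / base * 100 if base else None
    return changes


# Hypothetical order counts by payment type.
baseline = {"credit_card": 300, "debit_card": 120, "wallet": 80}
dip_day = {"credit_card": 115, "debit_card": 118, "wallet": 77}

changes = change_by_segment(baseline, dip_day)
comparable = {seg: pct for seg, pct in changes.items() if pct is not None}
most_affected = min(comparable, key=comparable.get)

for segment, pct in changes.items():
    print(segment, "n/a" if pct is None else f"{pct:.1f}%")
print("Most affected:", most_affected)
```

If one segment carries almost all of the decline while the others are flat, that segment is where to look first.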

So far we have had complete control over which tests to run. The next steps require a wider range of tests before we reach a solution, and the difficulty increases as we move to outer causal elements.

STEP 3: Causes

Is the decline seasonal? For example, on Christmas and national holidays people order less from my website, as I can confirm from the previous year’s data. If we are maintaining history (which we should), this is an easy check. It’s good practice to document these seasonal trends and the calendar dates on which orders decline.
Drill down by categories: We can break down the decline in orders by attributes like geography, product type, and user type to see whether a particular segment is affected or the decline is concentrated in one attribute. For example, the decline might be caused mostly by decreased orders from India. We can then focus our analysis in that direction. It’s the same exercise we did for payment types when reviewing platform changes.
Price change: Have there been any recent changes to prices, or a new pricing strategy?
A new competitor: Is there a new competitor in the market, or are any promotions or sales running there, that might have drawn your customers to the competition?
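
The seasonality check can be sketched as a year-over-year comparison of the relative drop on the same calendar date. The numbers, the 10% tolerance, and the function names are assumptions made for illustration only.

```python
def relative_drop(orders_on_day, typical_daily_orders):
    """Fraction by which the day fell short of a typical day."""
    return (typical_daily_orders - orders_on_day) / typical_daily_orders


def looks_seasonal(this_year, last_year, tolerance=0.10):
    """Each argument is a pair (orders_on_day, typical_daily_orders).
    The dip looks seasonal if last year also dropped noticeably on the
    same date and the two relative drops are of similar size."""
    drop_now = relative_drop(*this_year)
    drop_then = relative_drop(*last_year)
    return drop_then > tolerance and abs(drop_now - drop_then) <= tolerance


# Hypothetical figures: last year the same date was an ordinary day.
print(looks_seasonal(this_year=(310, 500), last_year=(450, 460)))

# Hypothetical figures: last year the same date dipped by a similar share.
print(looks_seasonal(this_year=(310, 500), last_year=(290, 460)))
```

Comparing relative drops rather than raw counts keeps the check meaningful when overall order volume has grown or shrunk between the two years.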

Figure: Putting it all together (original image)

Once we have identified the root cause, we should document it alongside previous Root Cause Analysis (RCA) exercises, so that it can be used as a reference later on. Platform and data checks are easier to streamline and standardize, while external changes, including changed user behavior, are difficult to measure and analyze.

Written by Ankit Madhukar