Quick guide for Root Cause Analysis

Ankit Madhukar
Mar 20, 2022
Photo by UX Indonesia on Unsplash

As Wikipedia states: Root Cause Analysis (RCA) is a method of problem-solving used for identifying the root causes of faults or problems.

It can be decomposed into four steps:

  1. Identify and describe the problem clearly
  2. Establish a timeline from the normal situation up to the time the problem occurred
  3. Distinguish between the root cause and other causal factors
  4. Establish a causal graph between the root cause and the problem

From an analytics point of view, I would add one more step: document the findings for future reference.

Let's take this example:

[Figure: area chart of daily orders | Original Image]

The area chart above shows a dip in orders on the 24th. Since this is irregular behavior, we want to study it and share the findings with business users, so that the required steps can be taken to rectify it.

Broadly speaking, there are two ways of dealing with faults, errors, or problems (Wikipedia):

  1. Reactive Management: Reacting quickly after the problem occurs. In our case, ‘quickly’ means as soon as the data is available in our reports. Because there is a delay between a fault occurring and it showing up in reports, very sensitive cases call for systems that detect faults quickly and send out alerts. For ease of demonstration, let’s take a bigger window to solve the problem here.
  2. Proactive Management: Preventing the problem from occurring. In a real-world scenario (especially with data pipelines) this is difficult to practice, because data is so dynamic. But based on our findings from reactive management and resolution, we can tweak our solution to be proactive and ready when a similar issue occurs.
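
As a rough illustration of the alerting idea in the reactive case, the sketch below flags any day whose order count falls well below the average of the preceding days. The function name `flag_dips`, the 7-day window, the threshold of 3 standard deviations, and the sample numbers are all assumptions made for this example, not values from a real system.

```python
from statistics import mean, stdev

def flag_dips(daily_orders, window=7, threshold=3.0):
    """Return the days whose order count is more than `threshold` standard
    deviations below the mean of the preceding `window` days."""
    days = sorted(daily_orders)
    flagged = []
    for i in range(window, len(days)):
        history = [daily_orders[d] for d in days[i - window:i]]
        baseline, spread = mean(history), stdev(history)
        if spread == 0:
            # A perfectly flat history gives no scale to compare against.
            continue
        z_score = (daily_orders[days[i]] - baseline) / spread
        if z_score < -threshold:
            flagged.append(days[i])
    return flagged

# Hypothetical data: day of month -> number of orders.
orders = {17: 100, 18: 104, 19: 98, 20: 101, 21: 103,
          22: 99, 23: 102, 24: 60, 25: 100}
print(flag_dips(orders))  # prints [24]
```

A trailing window is used so that the check only needs data that already exists on the day it runs. In practice the window and threshold would need tuning against your own order history.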

We can now define the problem statement: why did the number of orders decrease on the 24th?

STEP 1: Is the data correct?

Before delving deep into what caused the dip, we should make sure that the data represented in the report is correct. A few of the situations that can cause the report to show inaccurate numbers include:

  1. Change in the metric definition: If the business logic for defining the KPI changes, we might see changes in the report.
  2. Reporting Tool update: In various tools, we have extracts that might not have been updated for the latest date.
  3. Data source changes: Are the views at the data source up to date? Did any ETL pipelines fail, which might have reduced the order volume on the 24th?

We can create a checklist for all these tests. For any irregularities that we find in the future, it’s best to check these before moving on to the next steps.
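
One possible shape for such a checklist is sketched below. It covers only the checks that are easy to automate (extract freshness, pipeline status, and a report-versus-source comparison); a change in metric definition usually still needs a manual review. The function name, its inputs, the pipeline names, and the dates are hypothetical placeholders.

```python
from datetime import date

def run_data_checks(report_date, extract_refreshed_on, etl_statuses,
                    report_total, source_total):
    """Return a list of issues found; an empty list means all checks passed."""
    issues = []
    # Reporting tool: was the extract refreshed on or after the date in question?
    if extract_refreshed_on < report_date:
        issues.append("extract not refreshed for the report date")
    # Data source: did every pipeline finish successfully?
    failed = sorted(name for name, status in etl_statuses.items()
                    if status != "success")
    if failed:
        issues.append("failed ETL pipelines: " + ", ".join(failed))
    # Report vs source: do both show the same number of orders?
    if report_total != source_total:
        issues.append(f"report shows {report_total} orders, "
                      f"source has {source_total}")
    return issues

# Hypothetical inputs for the day of the dip.
issues = run_data_checks(
    report_date=date(2022, 3, 24),
    extract_refreshed_on=date(2022, 3, 23),
    etl_statuses={"orders_load": "failed", "users_load": "success"},
    report_total=60,
    source_total=100,
)
for issue in issues:
    print(issue)
```

Returning a list of issues, and not stopping at the first failure, keeps the whole checklist visible in one run.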

Let’s say we did all the checks and found no issues with the data. The decline in orders is an actual event. We can now move on to the next step.

STEP 2: Checking Platform Changes

Were there any recent changes to your website or app? Is the decline observed only for mobile users, only for a particular mobile OS, or is it spread throughout? Were any new features added to the platform? These questions help us pinpoint whether platform changes made it difficult for customers to place orders. For example, let’s say you had an issue with credit card payments because of recent changes to payment gateway settings. When you split orders by payment type, you would be able to see this clearly.
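
A minimal sketch of that split, assuming each order is a record with a `day` and a `payment_type` field (both field names and all the numbers are invented for illustration):

```python
from collections import Counter

def change_by_segment(orders, key, baseline_day, dip_day):
    """Compare order counts per segment between two days.

    Returns {segment: (baseline_count, dip_count, change)},
    ordered with the largest drop first."""
    baseline = Counter(o[key] for o in orders if o["day"] == baseline_day)
    dip = Counter(o[key] for o in orders if o["day"] == dip_day)
    rows = {segment: (baseline[segment], dip[segment],
                      dip[segment] - baseline[segment])
            for segment in set(baseline) | set(dip)}
    # Sort by change, then by name so that ties come out in a stable order.
    return dict(sorted(rows.items(), key=lambda kv: (kv[1][2], kv[0])))

# Hypothetical orders on a normal day (23rd) and on the day of the dip (24th).
orders = (
    [{"day": 23, "payment_type": "credit_card"}] * 50
    + [{"day": 23, "payment_type": "wallet"}] * 30
    + [{"day": 23, "payment_type": "cash"}] * 20
    + [{"day": 24, "payment_type": "credit_card"}] * 8
    + [{"day": 24, "payment_type": "wallet"}] * 31
    + [{"day": 24, "payment_type": "cash"}] * 21
)

result = change_by_segment(orders, "payment_type", 23, 24)
for segment, (before, after, change) in result.items():
    print(segment, before, after, change)
```

With this made-up data, credit card orders fall from 50 to 8 while the other payment types stay roughly flat, which is the pattern the payment gateway example describes. Passing a different `key` (device type, mobile OS) gives the same view for other platform attributes.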

Up to this point we have had complete control over which checks to run. The next steps require a wider range of tests before we reach an answer, and the difficulty increases as we move to outer causal elements.

STEP 3: Causes

  1. Is the decline seasonal? For example, on Christmas and national holidays people order less from my website, as I can confirm from the previous year’s data. If we are maintaining history (which we should), this would be an easy check. It’s good practice to maintain documentation for these seasonal trends and to flag the calendar dates on which orders decline.
  2. Drill down by categories: We can break down the decline in orders by different attributes like geography, product type, and user type to see whether a particular segment is affected or whether the decline is concentrated in one attribute. For example, we can have a case where the decline was mostly caused by decreased orders from India. We can then focus our analysis in that direction. It’s similar to the exercise we did for payment types when reviewing platform changes.
  3. Price change: Are there any recent changes to prices, or a new pricing strategy?
  4. A new competitor: Is there a new competitor in the market, or are there any promotions or sales running there, that might have caused your customers to move to the competition?
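
The seasonality check in item 1 can be sketched roughly as follows: compute how far orders fell relative to the preceding week, then see whether the same calendar date dipped by a similar proportion last year. The function names, the 7-day window, the 0.15 tolerance, and the sample history are assumptions for illustration only.

```python
from datetime import date, timedelta
from statistics import mean

def dip_ratio(daily_orders, day, window=7):
    """Orders on `day` divided by the mean of the preceding `window` days."""
    history = [daily_orders[day - timedelta(days=k)]
               for k in range(1, window + 1)]
    return daily_orders[day] / mean(history)

def looks_seasonal(daily_orders, day, window=7, tolerance=0.15):
    """True if the same calendar date last year dipped by a similar proportion.

    Note: `replace` raises ValueError for 29 February, so leap days
    would need separate handling."""
    last_year = day.replace(year=day.year - 1)
    difference = abs(dip_ratio(daily_orders, day, window)
                     - dip_ratio(daily_orders, last_year, window))
    return difference <= tolerance

# Hypothetical history: a steady 100 orders a day, with a drop on Christmas Day.
history = {}
for year in (2020, 2021):
    for d in range(18, 25):
        history[date(year, 12, d)] = 100
history[date(2020, 12, 25)] = 60
history[date(2021, 12, 25)] = 62

print(looks_seasonal(history, date(2021, 12, 25)))  # prints True
```

Comparing the same calendar date ignores weekday effects and holidays that move from year to year, so a documented calendar of trigger dates is still the more reliable reference.
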
Putting it all together | Original Image

Once we have identified the root cause, we should document it alongside previous Root Cause Analysis (RCA) findings, so that it can be used as a reference later on. Platform and data checks are easier to streamline and standardize, while external changes, including changed user behavior, are difficult to measure and analyze.
