Find the Retention Rates – Salesforce SQL Interview Question

How to find the retention rates


Retention rates are one of the key business metrics. We’ll show you how to calculate them by explaining in detail how to solve the Salesforce data science interview question.

The retention rate is one of the important business metrics, especially in marketing, investing, and product management.

It refers to the percentage of customers continuing to do business with a company. This usually means extending your subscription or in any other way continuing to use the company’s products and services, such as software, application, maintenance, etc.

The retention rate is calculated by dividing the number of retained customers by the number of customers at the beginning of the period. The number of retained customers shouldn’t include customers acquired during the monitored period. In other words, the formula is:

PECNCPSC\frac {PEC - NC}{PSC}

PEC – Period End Number of Customers
NC – New Customers in the Period
PSC – Period Start Number of Customers

Now, we’ll have a look at the interview question and try to find the retention rates using SQL.

Retention Rate - A Data Science Interview Question by Salesforce

Here’s what this question asks you:


Table: sf_events

Link to the question: https://platform.stratascratch.com/coding/2053-retention-rate

Dataset to Work With

To solve this problem, Salesforce gives you only one table: sf_events.

It has three columns:

datedatetime
account_idvarchar
user_idvarchar

To get an idea about the data it contains, here are the first few rows from the table:

Table: sf_events
dateaccount_iduser_id
2021-01-01A1U1
2021-01-01A1U2
2021-01-06A1U3
2021-01-02A1U1
2020-12-24A1U2

Solution Approach

Since this table is not a list of all users, but the list of users’ activity in each month, we don’t need to calculate the number of new users each month. In other words, we want to see how many users active in December 2020 were also active in January 2021 or any other future month. We also need to have a look at all the users active in January 2021 and see were they active in February 2021 or any other future month. This is also the assumption stated in the question.

With this assumption in mind, the retention rate is calculated by finding the users active in future months and dividing this number by the number of users in December 2020 or January 2021, depending on which retention rate you’re calculating.

For example, if the user were active in December 2020, it would appear in a table with a December 2020 timestamp. If there’s any future activity (in January 2021 or on), this user would be considered as retained for December 2020. If the user were active in December 2020 but didn’t appear in any of the coming months, it would be considered not retained.

Assumptions

Our solution will be based on the following assumptions:

  • If a user is listed in the table, this represents the user’s activity for the date in the record.
  • We consider only retention rates for Dec 2020 and January 2021.
  • The table does not represent the list of all users but only the active users.

Solution Breakdown

The steps you have to build into your code are:

  1. Find all active users in December 2020 by using the date field. Do the same for January 2021. That way, you’re getting denominators for the Dec and Jan retention rates.
  2. Find the maximum date of the user’s activity to see if the user has the activity in the future months. To do that, create a table with the user_id and max date.
  3. Join all the active users in the month with the list of users with future activity. That way, you’ll get the list of December 2020 users and their latest activity date. Then count the number of users with activity after Dec and divide it by the number of users in Dec to get the Dec retention rate. Apply the same principle to calculate the January retention rate. Mind the fact that for Jan retention rate, the future activity begins with February 2021.
  4. Consolidate by account_id. Use either Jan or Dec accounts list because it’s assumed that both months contain the complete list of account_ids.

Solution

The first thing is to find users active in December 2020.

WITH dec_2020 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 12
     AND EXTRACT(YEAR
                 FROM date) = 2020 ),

To do that, we’re using the CTE. We’re interested in the distinct accounts and users, and to get the users active in December 2020, we’re using the EXTRACT() function in the WHERE clause.

The second CTE does the same thing for the users active in January 2021.

  jan_2021 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 1
     AND EXTRACT(YEAR
                 FROM date) = 2021 ),

Next we want to find the latest active date for each user.

max_date AS
  (SELECT user_id,
          MAX(Date) AS max_date
   FROM sf_events
   GROUP BY user_id),

As you could probably say from the solution breakdown, here we’ll use the MAX() function to find the latest active date.

Now comes the step where we calculate the retention rate. First the December 2020 retention rate.

 retention_dec_2020 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date > '2020-12-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_dec
   FROM dec_2020
   JOIN max_date ON dec_2020.user_id = max_date.user_id
   GROUP BY account_id),

Here, we joined the two CTEs together to match active users in Dec with users that have had future activity. It’s possible that the future activity is in December. Because of that, we’ll only count the users that had activity after Dec.

We used the CASE WHEN statements to allocate values of 1 to all users that had activity after December 2020. Sum these values, divide them by the total number of users in December 2020, and you get the Dec retention rate.

Then we do the same for January 2021 retention rate.

retention_jan_2021 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date > '2021-01-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_jan
   FROM jan_2021
   JOIN max_date ON jan_2021.user_id = max_date.user_id
   GROUP BY account_id)

Now that we have the retention rate for Dec and Jan active users, we only need to group by account_id and divide the retentions.

SELECT retention_jan_2021.account_id,
       retention_jan / retention_dec AS retention
FROM retention_jan_2021
INNER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id

The complete answer to this question is:

WITH dec_2020 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 12
     AND EXTRACT(YEAR
                 FROM date) = 2020 ),

 jan_2021 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 1
     AND EXTRACT(YEAR
                 FROM date) = 2021 ),

max_date AS
  (SELECT user_id,
          MAX(Date) AS max_date
   FROM sf_events
   GROUP BY user_id),

 retention_dec_2020 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date > '2020-12-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_dec
   FROM dec_2020
   JOIN max_date ON dec_2020.user_id = max_date.user_id
   GROUP BY account_id),

retention_jan_2021 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date > '2021-01-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_jan
   FROM jan_2021
   JOIN max_date ON jan_2021.user_id = max_date.user_id
   GROUP BY account_id)

SELECT retention_jan_2021.account_id,
       retention_jan / retention_dec AS retention
FROM retention_jan_2021
INNER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id

Edge Case Consideration

As an edge case, we’ll consider the possibility that not all accounts were present each month.

To compensate for that and to include all accounts, you can use two workarounds.

FULL OUTER JOIN

The first workaround is to use the FULL OUTER JOIN instead of INNER JOIN in the SELECT statement referencing the CTEs.

SELECT 
    COALESCE(retention_jan_2021.account_id, retention_dec_2020.account_id) AS account_id,
    COALESCE(retention_jan, NULL) / COALESCE(retention_dec, NULL) AS retention
FROM retention_jan_2021
FULL OUTER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id

Use the COALESCE function to get the January accounts and the December accounts not appearing in January. Then use the same function to divide the two retention rates, with NULL when there’s no retention rate for that account.The CTEs calculating the retention rates are joined using the FULL OUTER JOIN. If you don’t feel at home with all these different JOINs and what they do, don’t worry! Here’s an article “How to Join 3 or More Tables in SQL” that explains everything about the JOINs you need to know.

The issue with this edge case solution is that it’s computationally intensive.

UNION

There’s another way. You can get a complete list of all accounts by using UNION, like this:

all_accounts AS
  (SELECT account_id
   FROM retention_jan_2021
   UNION 
   SELECT account_id
   FROM retention_dec_2020)

SELECT a.account_id,
       COALESCE(retention_jan, NULL) / COALESCE(retention_dec, NULL) AS retention
FROM all_accounts a
LEFT JOIN retention_jan_2021 j ON a.account_id = j.account_id
LEFT JOIN retention_dec_2020 d ON a.account_id = d.account_id

Both these workarounds have a downside, which is they only capture the accounts that are in December and January. This means they don’t consider all months in the dataset.

If you want all months, you can simply create a table with all the distinct account IDs found in the table. This would, however, mean listing all the accounts for all time, so you may get a lot of accounts with retention being zero because they don’t have any users.

Conclusion

This Salesforce data science interview question is not easy. But if you hung on in there until the end, you have gotten really valuable knowledge. That is calculating the retention rates.

Knowing that will not only get you a bigger chance of success at the job interview. It will also make you a valuable asset to a company, because you’ve shown that you possess a high level of business, as well as technical knowledge. If you want to practice more questions from Salesforce, check out our previous post “Salesforce Data Scientist Coding Interview Questions” or you can also find questions from other top companies here “SQL Interview Questions You Must Prepare: The Ultimate Guide”.

How to find the retention rates


Become a data expert. Subscribe to our newsletter.