Find the Retention Rates – Salesforce SQL Interview Question
Retention rates are one of the key business metrics. We’ll show you how to calculate them by explaining in detail how to solve the Salesforce data science interview question.
The retention rate is one of the important business metrics, especially in marketing, investing, and product management.
It refers to the percentage of customers continuing to do business with a company. This usually means extending your subscription or in any other way continuing to use the company’s products and services, such as software, application, maintenance, etc.
The retention rate is calculated by dividing the number of retained customers by the number of customers at the beginning of the period. The number of retained customers shouldn’t include customers acquired during the monitored period. In other words, the formula is:
PEC – Period End Number of Customers
NC – New Customers in the Period
PSC – Period Start Number of Customers
Now, we’ll have a look at the interview question and try to find the retention rates using SQL.
Retention Rate - A Data Science Interview Question by Salesforce
Here’s what this question asks you:
Link to the question: https://platform.stratascratch.com/coding/2053-retention-rate
Dataset to Work With
To solve this problem, Salesforce gives you only one table: sf_events.
It has three columns:
To get an idea about the data it contains, here are the first few rows from the table:
Since this table is not a list of all users, but the list of users’ activity in each month, we don’t need to calculate the number of new users each month. In other words, we want to see how many users active in December 2020 were also active in January 2021 or any other future month. We also need to have a look at all the users active in January 2021 and see were they active in February 2021 or any other future month. This is also the assumption stated in the question.
With this assumption in mind, the retention rate is calculated by finding the users active in future months and dividing this number by the number of users in December 2020 or January 2021, depending on which retention rate you’re calculating.
For example, if the user were active in December 2020, it would appear in a table with a December 2020 timestamp. If there’s any future activity (in January 2021 or on), this user would be considered as retained for December 2020. If the user were active in December 2020 but didn’t appear in any of the coming months, it would be considered not retained.
Our solution will be based on the following assumptions:
- If a user is listed in the table, this represents the user’s activity for the date in the record.
- We consider only retention rates for Dec 2020 and January 2021.
- The table does not represent the list of all users but only the active users.
The steps you have to build into your code are:
- Find all active users in December 2020 by using the date field. Do the same for January 2021. That way, you’re getting denominators for the Dec and Jan retention rates.
- Find the maximum date of the user’s activity to see if the user has the activity in the future months. To do that, create a table with the user_id and max date.
- Join all the active users in the month with the list of users with future activity. That way, you’ll get the list of December 2020 users and their latest activity date. Then count the number of users with activity after Dec and divide it by the number of users in Dec to get the Dec retention rate. Apply the same principle to calculate the January retention rate. Mind the fact that for Jan retention rate, the future activity begins with February 2021.
- Consolidate by account_id. Use either Jan or Dec accounts list because it’s assumed that both months contain the complete list of account_ids.
The first thing is to find users active in December 2020.
WITH dec_2020 AS (SELECT DISTINCT account_id, user_id FROM sf_events WHERE EXTRACT(MONTH FROM date) = 12 AND EXTRACT(YEAR FROM date) = 2020 ),
To do that, we’re using the CTE. We’re interested in the distinct accounts and users, and to get the users active in December 2020, we’re using the EXTRACT() function in the WHERE clause.
The second CTE does the same thing for the users active in January 2021.
jan_2021 AS (SELECT DISTINCT account_id, user_id FROM sf_events WHERE EXTRACT(MONTH FROM date) = 1 AND EXTRACT(YEAR FROM date) = 2021 ),
Next we want to find the latest active date for each user.
max_date AS (SELECT user_id, MAX(Date) AS max_date FROM sf_events GROUP BY user_id),
As you could probably say from the solution breakdown, here we’ll use the MAX() function to find the latest active date.
Now comes the step where we calculate the retention rate. First the December 2020 retention rate.
retention_dec_2020 AS (SELECT account_id, SUM(CASE WHEN max_date > '2020-12-31' THEN 1.0 ELSE 0 END) / COUNT(*) * 100.0 AS retention_dec FROM dec_2020 JOIN max_date ON dec_2020.user_id = max_date.user_id GROUP BY account_id),
Here, we joined the two CTEs together to match active users in Dec with users that have had future activity. It’s possible that the future activity is in December. Because of that, we’ll only count the users that had activity after Dec.
We used the CASE WHEN statements to allocate values of 1 to all users that had activity after December 2020. Sum these values, divide them by the total number of users in December 2020, and you get the Dec retention rate.
Then we do the same for January 2021 retention rate.
retention_jan_2021 AS (SELECT account_id, SUM(CASE WHEN max_date > '2021-01-31' THEN 1.0 ELSE 0 END) / COUNT(*) * 100.0 AS retention_jan FROM jan_2021 JOIN max_date ON jan_2021.user_id = max_date.user_id GROUP BY account_id)
Now that we have the retention rate for Dec and Jan active users, we only need to group by account_id and divide the retentions.
SELECT retention_jan_2021.account_id, retention_jan / retention_dec AS retention FROM retention_jan_2021 INNER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id
The complete answer to this question is:
WITH dec_2020 AS (SELECT DISTINCT account_id, user_id FROM sf_events WHERE EXTRACT(MONTH FROM date) = 12 AND EXTRACT(YEAR FROM date) = 2020 ), jan_2021 AS (SELECT DISTINCT account_id, user_id FROM sf_events WHERE EXTRACT(MONTH FROM date) = 1 AND EXTRACT(YEAR FROM date) = 2021 ), max_date AS (SELECT user_id, MAX(Date) AS max_date FROM sf_events GROUP BY user_id), retention_dec_2020 AS (SELECT account_id, SUM(CASE WHEN max_date > '2020-12-31' THEN 1.0 ELSE 0 END) / COUNT(*) * 100.0 AS retention_dec FROM dec_2020 JOIN max_date ON dec_2020.user_id = max_date.user_id GROUP BY account_id), retention_jan_2021 AS (SELECT account_id, SUM(CASE WHEN max_date > '2021-01-31' THEN 1.0 ELSE 0 END) / COUNT(*) * 100.0 AS retention_jan FROM jan_2021 JOIN max_date ON jan_2021.user_id = max_date.user_id GROUP BY account_id) SELECT retention_jan_2021.account_id, retention_jan / retention_dec AS retention FROM retention_jan_2021 INNER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id
Edge Case Consideration
As an edge case, we’ll consider the possibility that not all accounts were present each month.
To compensate for that and to include all accounts, you can use two workarounds.
FULL OUTER JOIN
The first workaround is to use the FULL OUTER JOIN instead of INNER JOIN in the SELECT statement referencing the CTEs.
SELECT COALESCE(retention_jan_2021.account_id, retention_dec_2020.account_id) AS account_id, COALESCE(retention_jan, NULL) / COALESCE(retention_dec, NULL) AS retention FROM retention_jan_2021 FULL OUTER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id
Use the COALESCE function to get the January accounts and the December accounts not appearing in January. Then use the same function to divide the two retention rates, with NULL when there’s no retention rate for that account.The CTEs calculating the retention rates are joined using the FULL OUTER JOIN. If you don’t feel at home with all these different JOINs and what they do, don’t worry! Here’s an article “How to Join 3 or More Tables in SQL” that explains everything about the JOINs you need to know.
The issue with this edge case solution is that it’s computationally intensive.
There’s another way. You can get a complete list of all accounts by using UNION, like this:
all_accounts AS (SELECT account_id FROM retention_jan_2021 UNION SELECT account_id FROM retention_dec_2020) SELECT a.account_id, COALESCE(retention_jan, NULL) / COALESCE(retention_dec, NULL) AS retention FROM all_accounts a LEFT JOIN retention_jan_2021 j ON a.account_id = j.account_id LEFT JOIN retention_dec_2020 d ON a.account_id = d.account_id
Both these workarounds have a downside, which is they only capture the accounts that are in December and January. This means they don’t consider all months in the dataset.
If you want all months, you can simply create a table with all the distinct account IDs found in the table. This would, however, mean listing all the accounts for all time, so you may get a lot of accounts with retention being zero because they don’t have any users.
This Salesforce data science interview question is not easy. But if you hung on in there until the end, you have gotten really valuable knowledge. That is calculating the retention rates.
Knowing that will not only get you a bigger chance of success at the job interview. It will also make you a valuable asset to a company, because you’ve shown that you possess a high level of business, as well as technical knowledge. If you want to practice more questions from Salesforce, check out our previous post “Salesforce Data Scientist Coding Interview Questions” or you can also find questions from other top companies here “SQL Interview Questions You Must Prepare: The Ultimate Guide”.