Salesforce Data Scientist Coding Interview Questions
In this article, we will cover the most important Data Scientist coding interview questions you may expect to be asked at Salesforce. If you are preparing to interview for a data scientist position, you must know these questions.
Salesforce is a software company and the provider of a customer relationship management (CRM) platform. It offers a wide pallet of cloud-based solutions to improve and streamline the cooperation between companies and their customers. Salesforce is a global company with headquarters in San Francisco and offices in 29 countries worldwide. In 2021, based on surveys completed by its employees, Salesforce ranked second on Fortune magazine’s list ‘100 Best Companies to Work For’.
The Data Scientist Role at Salesforce
Salesforce employs many data scientists and data engineers spread across many locations and teams, each being usually responsible for an individual product. While the specifics of each position vary slightly depending on the team, many data specialists work on problems like forecasting business metrics, modeling client base and engagement, and creating meaningful customer experiences by performing classification and clustering on structured and unstructured data. Additionally, they may develop end-to-end data products, review and update existing data or build and maintain machine learning pipelines.
The usual stack expected from a data scientist or data engineer at Salesforce includes SQL (through PostgreSQL), Python (with Pandas and Sklearn) and Spark. Additionally, a candidate should have knowledge in machine learning, project lifecycle, performance monitoring and software testing. Salesforce also extensively uses technologies such as Docker, AWS or Airflow.
Technical Concepts Tested in Salesforce Data Scientist Coding Interview Questions
This article discusses some of the most recent Salesforce Data Scientist interview questions that candidates have been asked in SQL. Traditionally, the SQL interview questions at Salesforce were rather diverse and covered topics such as comparing individual data points with averages (AVG), counting occurences per timeframe (COUNT, EXTRACT), finding data based on criteria (HAVING, MAX/MIN, greater/lower than etc.), sorting (ORDER BY), finding duplicates (DISTINCT) or connecting tables (JOIN).
However, most recently the majority of candidates face a variety of interview questions based on the same simple table. The Salesforce data scientist interview questions relate to analysing changes of the user engagement in time. In this article, we will analyse this recent dataset and discuss the solutions to the questions you may expect to get asked.
Salesforce Data Scientist Interview Question #1: User Growth Rate
We are being asked to find the growth rate of active users for December 2020 to January 2021 for each account. We are informed that the growth rate is defined as the number of users in January 2021 divided by the number of users in December 2020. We should write an SQL query to output only the account_id and growth rate.
You can see this Salesforce data scientist interview question and practice solving it yourself here: https://platform.stratascratch.com/coding/2052-user-growth-rate?python=
To answer the question, we are provided with a simple data table ‘sf_events’.
Here is a sample of what the data look like:
As we can see, there are only 3 columns. The column ‘date’ is of type ‘datetime’ and it denotes when a certain user is active. The other columns, ‘account_id’ and ‘user_id’ have type ‘varchar’ and allow to uniquely identify a user of a certain account. In this example, there are 3 accounts (A1, A2 and A3) and each may have several users.
To answer this Salesforce data scientist interview question we need to manipulate the data in this table with the end goal of calculating the growth rate for each account. The solution has 3 major steps:
- For each month, create a new table with the number of users from each account that were active in the given month;
- Combine the tables of all the months using INNER JOIN;
- For each account, divide the number of users from January by the number of December (growth rate).
In this particular case, we are interested in 2 months: December 2020 and January 2021. Let’s start solving the problem by looking only at user engagement from December 2020. We want to return the number of users who were active in this month for each account. Therefore, we can start by writing:
SELECT account_id, count(DISTINCT user_id) AS dec_count FROM sf_events GROUP BY account_id
This will return two columns: ‘account_id’ and the number of users from that account who were active, which we call ‘dec_count’. Because of the GROUP BY statement, each row will contain a distinct ‘account_id’. Additionally, it’s crucial to use the DISTINCT statement inside the COUNT statement because we are only interested in the number of users who were active, not the number of times they were active. However, the query from above will return the total number of users from all the months, we are, on the other hand, only interested in December 2020. Thus, it is necessary to add a constraint using WHERE and exploiting the fact that column ‘date’ has the type ‘datetime’:
WHERE EXTRACT(MONTH FROM date) = 12 AND EXTRACT(YEAR FROM date) = 2020
Including the code from above will successfully filter the user engagements by date and will return the values relevant for December 2020. To be able to reuse this query later, we can treat it as a sub-query and call it ‘dec_2020’. Furthermore, we can write a similar code for counting the number of users from January 2021 and call it ‘jan_2021’. The key difference is in the WHERE statement - the numbers representing the month and year need to be replaced.
dec_2020 AS (SELECT account_id, count(DISTINCT user_id) AS dec_count FROM sf_events WHERE EXTRACT(MONTH FROM date) = 12 AND EXTRACT(YEAR FROM date) = 2020 GROUP BY account_id), jan_2021 AS (SELECT account_id, count(DISTINCT user_id) AS jan_count FROM sf_events WHERE EXTRACT(MONTH FROM date) = 1 AND EXTRACT(YEAR FROM date) = 2021 GROUP BY account_id)
Having this, we are able to calculate the growth rate. To achieve this, we first need to combine the tables resulting from the sub-queries ‘dec_2020’ and ‘jan_2021’. We will use INNER JOIN to only include the accounts that had any user activity in both months, otherwise there is no growth rate to calculate.
It’s also important to look at the interview question again to see in what format should the results be returned, in this case: ‘Output the account_id and growth rate’. Hence, we will be selecting ‘account_id’ and then calculating growth rate as follows: jan_count / dec_count. Because of using INNER JOIN, we can select column ‘account_id’ from any of the 2 sub-queries. As to the growth rate, let’s make sure to return it as a float. It is also a good habit to give a name to this new column using the AS statement.
SELECT jan_2021.account_id, jan_count / dec_count::float AS growth_rate FROM jan_2021 INNER JOIN dec_2020 ON jan_2021.account_id = dec_2020.account_id
We have now covered all the steps required to construct a solution to this data science coding interview question from Salesforce. These steps still need to be combined to create the final query that produces the expected output. A simple way to combine the two sub-queries ‘dec_2020’ and ‘jan_2021’ with the query from above is using the WITH statement. The final solution looks as follows:
WITH dec_2020 AS (SELECT account_id, count(DISTINCT user_id) AS dec_count FROM sf_events WHERE EXTRACT(MONTH FROM date) = 12 AND EXTRACT(YEAR FROM date) = 2020 GROUP BY account_id), jan_2021 AS (SELECT account_id, count(DISTINCT user_id) AS jan_count FROM sf_events WHERE EXTRACT(MONTH FROM date) = 1 AND EXTRACT(YEAR FROM date) = 2021 GROUP BY account_id) SELECT jan_2021.account_id, jan_count / dec_count::float AS growth_rate FROM jan_2021 INNER JOIN dec_2020 ON jan_2021.account_id = dec_2020.account_id
Note that there are multiple ways to write this query. You can now think of your own code and submit it at our platform to receive feedback from other users. Make sure to also check the solutions submitted by others for further inspiration.
Salesforce Data Scientist Interview Question #2: Consecutive days
Let’s take a look at another Salesforce data scientist interview question that used the same dataset but posed a different challenge. In this case, we are being asked to find all the users who were active for 3 consecutive days or more. We are given the exact same data table as in question #1.
Link to this Salesforce data scientist interview question: https://platform.stratascratch.com/coding/2054-consecutive-days?python=
Even though this interview question is really short, the task is, in fact, more tricky than in the previous case. And while it is possible to solve it using a rather short query, the hard part is figuring out how to extract the consecutive dates from a sorted table. This task alone requires the knowledge of the more advanced SQL functions and statements. The solution should follow these general steps:
- Query only ‘user_id’ and ‘date’ columns and sort both in the ascending order;
- Rank the data by date for each user separately using the ROW_NUMBER() function
- Filter the results by only leaving the consecutive dates by subtracting the rank from the event date;
- Select only these distinct user IDs that appear in the filtered results at least 3 times.
Salesforce Data Scientist Interview Question #3: Retention rate
This is another question where we are given the same dataset as in question #1 and question #2. However, this time we are being asked to find the month to month retention rate of users for each account separately for Dec 2020 to Jan 2021. We are being informed that the retention rate is the percentage of users a business retains over a given period of time. We can assume the user is retained if he stays with the app in any future months. Furthermore, we are given an example: if the user was active in Dec 2020 and also in the following month (Jan 2021) or any other month in the future, he is retained for Dec 2020. Finally, we are informed that the output should have account_id and the Jan 2021 retention rate divided by Dec 2020 retention rate.
Link to this Salesforce data scientist interview question: https://platform.stratascratch.com/coding/2053-retention-rate?python=
The solution to this question does not require a very high level of SQL, however, it requires the knowledge of using the CASE statement. The final query will also be relatively long in this case because there are multiple sub-queries that need to be performed before outputting the final results. Here are the steps that the solution should follow:
- For each month, query the users and their account that were active in a given month; This is very similar to the first step from question #1;
- For each user, check what is the last date when they were active;
- Calculate retention rate for each month: count the users of each account that were active in the given month and for which the last date when they were active occurs after this month;
- Express the monthly retention rate as percentage by dividing the number of retained users by all users in an account and multiplying by 100;
- Adjust the output as outlined in the question: for each account divide the January 2021 retention rate by December 2020 retention rate.
Salesforce Data Scientist Interview Question #4: Total Sales in Different Currencies
Even though the majority of the most recent data scientist interview questions at Salesforce focus on the simple data table that we saw in the previous questions, we have received reports of one more dataset used to construct interview tasks.
Let’s consider this question, it assumes that you work for a multinational company that wants to calculate total sales across all the countries they do business in. There are 2 tables, one is a record of sales for all countries and currencies the company deals with, and the other holds currency exchange rate information. The task is to calculate the total sales, per quarter, for the first 2 quarters in 2020, and report the sales in USD currency.
Link to this question: https://platform.stratascratch.com/coding/2041-total-sales-in-different-currencies?python=
As mentioned in the question, there are two tables with data this time. The one holding currency exchange rate information is called ‘sf_echange_rate’ and has 4 columns, while the table with records of sales has a name ‘sf_sales_amount’ and 3 columns. See the data types and the samples of data:
While the solution to this question is not particularly difficult, it is crucial to correctly join the two tables in such a way that the exchange rate date and the sale date match. To achieve this, and to report the values per quarter, it is useful to know the DATE_PART() function that can easily extract the important information from a datetime type. The solution should then follow these steps:
- Join the two tables such that ‘source_currency’ matches ‘currency’ but also the month and year in ‘date’ match the month and year in ‘sales_date’;
- Group the dates by quarter and, for each one, sum the ‘sales_amount’ multiplied by the correct exchange rate;
- Filter the results by leaving only the first 2 quarters of 2020.
We hope that this article, by focusing on the most recent data scientist interview questions used at Salesforce, has been helpful for preparing for the interview. If you struggled with solving the tasks, feel free to browse all the SQL questions from Salesforce interviews on our platform. Use the ‘Approach Hints’ functionality to display the general steps for solving the problems and discuss your code with other users to get more insights and feedback. You can also find the complete queries for all the questions described in this article. After practicing using the real SQL questions from Salesforce, you should be properly equipped to face the technical questions at your interview!