Noom SQL Interview Questions for Data Science
A walkthrough of one of the SQL interview questions for data science from Noom.
This is one of the medium level SQL interview questions for data science taken from Noom that tests your ability to filter out data based on date conditionals. This question as a whole will test your knowledge in data wrangling (fetching data from different tables), aggregation, grouping, and sorting.
Let's walk through solving this question as we do in the interview.
Noom SQL Interview Question for Data Science
Transactions By Billing Method and Signup ID
Get list of signups which have a transaction start date earlier than 10 months ago from today. For all of those users get the average transaction value and group it by the billing cycle.
Your output should include the billing cycle, signup_id of the user, and average transaction amount. Sort your results by billing cycle in reverse alphabetical order and signup_id in ascending order.
Link to this Noom data science SQL interview question: https://platform.stratascratch.com/coding-question?id=2031&python=
3-Step Framework to Solve such SQL Interview Questions for Data Science
There is always a systematic way to approach and solve a data science question, whether on the job or in an interview. Let's split this framework into steps with the explanation.
- Understand Your Data:
- List your assumptions about the data columns so you know which columns to use
- If you still don’t feel confident you understand your data enough, view first couple of rows of your data (single or multiple tables). Or if on an interview, ask for some example values so you understand the actual data, not just the columns. It’ll help you identify edge cases and limit your solution to the bounds of your assumption.
- Formulate Your Approach:
- Write down the logical steps you are supposed to program/code.
- Identify the main functions you would use/implement to perform the logic.
- Interviewers will be watching you; they will intervene when needed, make sure you ask them to clarify any ambiguity, they will also specify if you can use ready-made functions, or you should write code from scratch.
- Code Execution:
- Build up your code, do not oversimplify, do not overcomplicate it either.
- I like to build it in steps based on the steps I’ve outlined with the interviewer. That means that the code is probably not going to be efficient. That’s fine. You can talk about optimization at the end with the interviewer.
- The most important point is not to overcomplicate your code with multiple logical statements and rules in each code block. A block of code can be defined as a CTE or a subquery because it’s self-contained and separate from the rest of the code.
- Speak up and talk as you’re laying down code as the interviewer will be evaluating your problem-solving skills.
- signup_id is the column to join signups & transactions tables.
- plan_id from signups tables & id from plans table to join them both.
- From signups table we will use:
- signup_id column to join
- signup_id is all we need to identify users
- plan_id column to join
- From transactions table we will use:
- transaction_start_date indicates when the user initiated a transaction so we’ll use it to filter on dates
- amt to calculate our average transaction
- signup_id column to join
- From plans table we will use:
- id column to join
- billing_cycle so that we can group by the cycle
- Users will have multiple entries in plans tables as they can switch from one plan to another, or have their plans renewed each billing cycle.
- Users will have multiple entries in transactions tables as they can purchase any time.
We’ll write out the approach before starting to code.
- Join the 3 tables together based on signup_id, plan_id.
- Filter the data on transactions that were made only 10 months ago from today using the transaction_start_date column.
- Subtract 10 months from today’s date using now() -10 * interval '1 month'
- Calculate average amount of all transactions per user per billing cycle.
- Sort data as requested.
1. Join the 3 tables together:
We will use an inner join, the trick here is the table you will use to start the joining process with, since we are concerned mainly about transactions, then we only want to get records for users who have logs in the transaction table. We use an inner join because we only care about users with transactions, signups, and plans.
SELECT * FROM transactions t JOIN signups s ON t.signup_id = s.signup_id JOIN plans p ON s.plan_id = p.id
2. Filter the grouped data on transactions that were made only 10 months ago from today:
SELECT * FROM transactions t JOIN signups s ON t.signup_id = s.signup_id JOIN plans p ON s.plan_id = p.id WHERE transaction_start_date < now() - 10 * interval '1 month'
3. Calculate average amount of all transactions per user per billing cycle:
SELECT billing_cycle, s.signup_id, avg(amt) as avg_amt FROM transactions t JOIN signups s ON t.signup_id = s.signup_id JOIN plans p ON s.plan_id = p.id WHERE transaction_start_date < now() - 10 * interval '1 month' GROUP BY billing_cycle, s.signup_id
We'll specify only the columns we want to show in our output, then add aggregate function.
All aggregate functions demand that we group by all other columns to be shown in our output (signup_id & billing_cycle).
To calculate the average, we directly use a built-in function AVG().
4. Sort data as requested:
SELECT billing_cycle, s.signup_id, avg(amt) as avg_amt FROM transactions t JOIN signups s ON t.signup_id = s.signup_id JOIN plans p ON s.plan_id = p.id WHERE transaction_start_date < now() - 10 * interval '1 month' GROUP BY billing_cycle, s.signup_id ORDER BY billing_cycle DESC, s.signup_id ASC
A question that often gets asked is if there’s a way to optimize the code. The interviewer is testing your theory on SQL so you should say something even if there’s no way to optimize the code.
Taking a look at our solution of this Noom SQL interview question for data science, there’s no way to optimize this code any further.
Sometimes you can remove a JOIN by using a case statement but it won’t work in our approach because we need to identify the date difference across the entire dataset.
You could separate out the date logic into its own subquery and CTE before performing the JOIN but that doesn’t speed things up in this case. At least not considerably.
Even if there’s no way to optimize the code but you still dropped some knowledge on SQL, it would pass the interviewer’s assessment of whether or not you know SQL theory.
Find more Noom interview questions for the data scientist position here!
This was one of the medium level SQL interview questions for data science, not due to the fact that you’re trying to find data split across 3 tables but mainly because you’re using an advanced SQL function to extract date components like the month from a date field, which make the question more complicated and using the now() function to create a dynamic solution based on today’s date. But it’s necessary to learn how to manipulate dates in data science since most analyses have a date component.
The trick to this Noom SQL interview question for data science is using a framework to organize your thoughts as there are multiple steps. Once you layout your assumptions, like understanding your data well, filtration criteria, the question becomes much easier to solve. All you need to do is organize your approach in logical and concise steps, and code it up.
Practice the framework and practice organizing your thoughts before coding, and answering complicated questions will become much easier.