Amazon Data Analyst Interview Questions

Amazon Data Analyst Interview Questions


One of the interesting but difficult interview questions that Amazon recently asked for the data analyst position.

This is one of the very interesting but quite difficult data analyst interview questions. It deals with finding streaks in sport results. We’re going to solve the question using SQL and also provide you with some tips on how to approach it and how to effectively communicate with your interviewer.

Amazon Data Analyst Interview Question

Player with Longest Streak


Table: players_results

Link to the question: https://platform.stratascratch.com/coding/2059-player-with-longest-streak

This question has been repeatedly asked at Amazon interviews for data scientists and data analysts. The title of the question is ‘player with the longest streak’ and that’s exactly what we need to find. We are given a table of tennis players and their matches that they could either win (W) or lose (L). The task is to find the longest streak of wins.

This Amazon data analyst interview question also explains that a streak is a set of consecutive won matches of one player and that the streak ends once a player loses their next match. We are also being asked to output the ID of the player or players and the length of the streak.

This is one of the hard Amazon data analyst interview questions because the task of identifying patterns, especially in time, is difficult in SQL. Something that can easily be solved in other programming languages using loops in SQL requires an elaborate and clever approach. Because of that, while it’s not common to use SQL in practice to detect streaks, this makes a really good interview question that companies such as Amazon like to ask.

Framework to Solve this Amazon Data Analyst Interview Question

Before devising the solution, let us share the 3-step general framework to solve any data analyst interview questions with you. It’s especially useful for solving interview tasks as, given the stress, it’s lifesaving to have a systematic way to approach any interview question, but it also comes useful in general e.g. in your career as a data analyst.

  1. Understand your data:
    1. The first step of this framework is to list your assumptions about the data columns. It helps you to know which columns to use.
    2. If you still don’t feel confident that you understand the data enough, view the first couple of rows of the data (single or multiple tables). Or if in an interview, you can ask for some example values to understand the actual data, not just the columns. It helps you identify the edge cases and limit your solution to the bounds of your assumption.
  2. Formulate your approach:
    1. Next is to write down the logical steps you are going to program or code.
    2. Identify the main functions you would use or implement to perform the logic.
    3. Interviewers will be watching you; they can intervene when needed, so make sure you ask them to clarify any ambiguity, they will also specify if you can use ready-made functions, or you should write code from scratch.
  3. Code Execution:
    1. While building up your code, do not oversimplify, do not overcomplicate it either.
    2. We may build it in steps based on the steps we’ve outlined with the interviewer. That means that the code will probably not be efficient. That’s fine. You can talk about optimization at the end with your interviewer.
    3. The most important point is not to overcomplicate your code with multiple logical statements and rules in each code block. A block of code can be defined as a CTE or a subquery because it’s self-contained and separate from the rest of the code.
    4. Speak up and talk when you’re laying down the code as the interviewer will always be evaluating your problem-solving skills.

Understand Data

Understanding Data of Amazon Data Analyst Interview Questions

Let’s start by looking at the data we have available to solve this Amazon data analyst interview question. Usually, at a data analytics interview, you won’t be given any actual records, instead, you’ll see what tables or data frames are there and what are the columns and data types in these tables. However, in this case, the question already provides us with a lot of information about how the data looks like. It says that we are given a table of tennis players and their matches while the result of a match, either a win or loss, is stored as a letter ‘W’ or ‘L’.

players_results

Table: players_results
player_idmatch_datematch_result
4012021-05-04W
4012021-05-09L
4012021-05-16L
4012021-05-18W
4012021-05-22L

And looking at the dataset that we’re given, it all checks out. There is only one table called ‘player_results’ and it only has 3 columns. There is a numeric ID of a player, the date of a match and the result. From this, we can deduce that each row represents a result of a certain player that they obtained in a match that took place on a certain date.

Formulate Approach

The next task is to formulate the approach or set out a couple of general steps that we will follow in order to solve this Amazon data analyst interview question. Choosing the first step is not that obvious but probably the best way to start is by finding a way to differentiate between different streaks. In other way, how to say that a certain won match is a part of one streak of wins and not the other? One way to do it is by assigning to each result the number of times a player had a different result earlier.

-- for each match result, count the number of times a player had a different 
   result earlier

What does it mean exactly and how is it helpful? Consider this example, let’s have a player who first won two matches, then lost one match and finally won another two matches. So we have 5 matches with two streaks of length 2 divided by a single loss. Initially, there is no easy way to tell that the matches number 2 and 4 are from different streaks, they’re both marked with a letter ‘W’ in the dataset. But now, let’s assign this number of previous different results to each match. For the first two matches, the player hadn’t lost before, so they will both get a number 0. Then for the third match that was lost, we see that this player previously won two matches, so this loss gets a number 2. Finally, before the final two games, the player lost 1 time in total, so they both get the number 1. Now, the entire first streak is where the number is equal to 0 and the second streak is where the number is 1. And it doesn’t really matter what these numbers really are, what matters is that matches from the same streak will always have the same numbers and matches from different streaks will always have different numbers.

Once we manage to find a way to represent streaks in a way that is easy to interpret in SQL, the rest is relatively simple. The next step can be to use these numbers that we assigned and use the COUNT() function while aggregating the results by streak using these numbers. In this way, we will obtain the number of games in each streak, or in other words, the length of each streak.

-- aggregate by streak and count number of rows

At this point, we will know the length of each streak of either wins or losses for each player. But let’s remember that in the context of this question, a streak is actually defined as a set of consecutive won matches. So in the third step, let’s filter the results only leaving the streaks of won matches.

-- filter the data to only output winning streaks

Once we have this list of all winning streaks and their lengths, it is quite a straightforward task to find players who have the longest streak. One approach could be to order the results by the streak length in the descending order and use a LIMIT function to display the top row. But what if there is more than one longest streak? Our solution should account for ties in the streak lengths. There are many better ways to select a maximum from the dataset, in this problem let’s use the RANK function. Then the fourth step will be to assign ranks to the streaks based on their length.

-- rank the streak from longest to shortest

And the final thing to do will be to output the ID of the player or players who have the longest streak, in other words, the rows where the rank is equal to 1.

- output the players with longest streaks

Code Execution

Now, having defined the general steps for solving this Amazon data analyst interview question, we can try to translate them into code to create the actual solution using SQL. As you may expect, turning the first step into the code will be the most difficult but once we get there, the rest will be pretty straightforward. So let’s jump right into it and start with this first step.

We want to, for each match result, count the number of times a player had a different result earlier. We will start by writing a very generic query, simply selecting all columns from the players_results table. We will also assign an alias to the table, let’s say ‘curr_result’ for the current result. That’s because we will use rows from this table and, in a moment, compare them with the previous results.

SELECT player_id,
       match_date,
       match_result
FROM players_results curr_results

This is a very simple query that only returns the original table. But now we can use it to add something to each row. And this something will be the number of other results from before. To add this number, we will use a subquery located in the SELECT clause of the main query. This construction, for now without any conditions, will look like this:

SELECT player_id,
       match_date,
       match_result,
  (SELECT COUNT(*)
   FROM players_results prev_results)
FROM players_results curr_results

Look how we assigned the alias ‘prev_results’ to the table in the inner query. This will be the column in which we will be counting the rows, nothing else. And this is of course the same table as in the main query, but because of the two different aliases, we will be able to use it as two separate tables.

As you can see, this subquery added a new column with a number. For now the number is always the same and is equal to the overall number of rows in the table. To actually make it into the number of other previous results, we will need to add some conditions. Firstly, we can say that for each result, it should count the number of results of the same player. So I’ll add a WHERE clause saying that player_id in curr_results should be equal to player_id in prev_results.

SELECT player_id,
       match_date,
       match_result,
  (SELECT COUNT(*)
   FROM players_results prev_results
   WHERE curr_results.player_id = prev_results.player_id)
FROM players_results curr_results

The numbers have changed, now it’s either 10 or 9 because each player apparently has either 10 or 9 games in this dataset. This query counts all the games of each player but remember that we only want to consider the previous games, from the perspective of each game. So let’s add another condition, this time saying that match_date of the game in prev_results should be earlier than or the same as the match_date of the game in curr_results.

SELECT player_id,
       match_date,
       match_result,
  (SELECT COUNT(*)
   FROM players_results prev_results
   WHERE curr_results.player_id = prev_results.player_id
     AND prev_results.match_date <= curr_results.match_date)
FROM players_results curr_results

But why are we using the ‘smaller or equal’ operator and not just ‘smaller than’? We’ll explain it in a moment, but first, let’s add the third and final condition. Right now this query simply counts, for each game, which game in a row is it for a certain player. The number therefore goes from 1 to either 10 or 9. But remember that we wanted to include one more condition. We wanted to count all games that happened previously and where the result was different than in the current game. So let’s code it by saying that match_result in prev_results should be different from match_result in curr_results.

SELECT player_id,
       match_date,
       match_result,
  (SELECT COUNT(*)
   FROM players_results prev_results
   WHERE curr_results.player_id = prev_results.player_id
     AND prev_results.match_date <= curr_results.match_date
     AND prev_results.match_result <> curr_results.match_result)
FROM players_results curr_results

Now the numbers look exactly like we envisioned them. We’re counting the number of times each player had a different result before each game, so it always starts at 0 and then continues based on how many streaks there are. But what’s most important, our goal of separating the streaks is achieved. Both for the winning and losing streaks, see how games that are in the same streaks have the same number and games from different streaks have different numbers. And yes, sometimes winning and losing streaks have the same number, but since the result is different, it’s still easy to differentiate between them and consider them different streaks.

And why did we use the smaller than or equal to sign when comparing dates? Let’s say a player has only three games, all on the same day, one is won, the second one is lost and the third one is lost again. If we only counted the games that happened on previous days, then all games would receive number 0 and we wouldn’t detect the two different winning streaks. By allowing the date to be the same but at the same time adding the condition that the results must be different, the numbers will be correct.

Now, we will do two small things that were not explicitly mentioned in the general framework. We will name this new column that we have added as ‘streak_id’ so that we can easily recognize it in the future. And another thing is that we still need to do a number of things using these results. So in order not to have multiple levels of subqueries, we will store this whole query as a Common Table Expression, or a CTE so that in the next queries, we can only use a keyword ‘cte’ to get access to the results of this query.

WITH cte AS
  (SELECT player_id,
          match_date,
          match_result,
     (SELECT COUNT(*)
      FROM players_results prev_results
      WHERE curr_results.player_id = prev_results.player_id
        AND prev_results.match_date <= curr_results.match_date
        AND prev_results.match_result <> curr_results.match_result)AS streak_id
   FROM players_results curr_results)
SELECT *
FROM cte

The second step is to aggregate by streak and count the number of rows. We can use the new column ‘streak_id’ to aggregate the results and, in this way, count the number of games in each streak. But remember that ‘streak_id’ is not unique, there may be streaks with the same number assigned to them but with different results or from different players. So it’s important that we aggregate the result both by the player_id, match_result and streak_id.

WITH cte AS
  (SELECT player_id,
          match_date,
          match_result,
     (SELECT COUNT(*)
      FROM players_results prev_results
      WHERE curr_results.player_id = prev_results.player_id
        AND prev_results.match_date <= curr_results.match_date
        AND prev_results.match_result <> curr_results.match_result) AS streak_id
   FROM players_results curr_results)
SELECT player_id,
       match_result,
       streak_id,
       count(*) AS streak_length
FROM cte
GROUP BY player_id,
         match_result,
         streak_id
ORDER BY player_id,
         streak_id

Now we have the length for each streak, but the results include both the winning and losing streaks. If we selected the longest streak now, it would belong to the plate with ID 405, but see that this is, in fact, a losing streak. That’s why we have the third step which is about filtering the data to only output winning streaks. Let’s add this filter to our code by adding a WHERE clause in the main query.

WITH cte AS
  (SELECT player_id,
          match_date,
          match_result,
     (SELECT COUNT(*)
      FROM players_results prev_results
      WHERE curr_results.player_id = prev_results.player_id
        AND prev_results.match_date <= curr_results.match_date
        AND prev_results.match_result <> curr_results.match_result) AS streak_id
   FROM players_results curr_results)
SELECT player_id,
       match_result,
       streak_id,
       count(*) AS streak_length
FROM cte
WHERE match_result = 'W'
GROUP BY player_id,
         match_result,
         streak_id
ORDER BY player_id,
         streak_id

It is much better, we have only the winning streaks left, exactly as this Amazon data analyst interview question specifies. Now, we can finally move on to selecting the player with the longest streak. You can probably already see which player, or should we say players, will it be. The next step is to rank the streak from longest to shortest so we will use the SQL window function RANK() to add a new column with this ranking.

WITH cte AS
  (SELECT player_id,
          match_date,
          match_result,
     (SELECT COUNT(*)
      FROM players_results prev_results
      WHERE curr_results.player_id = prev_results.player_id
        AND prev_results.match_date <= curr_results.match_date
        AND prev_results.match_result <> curr_results.match_result) AS streak_id
   FROM players_results curr_results)
SELECT player_id,
       match_result,
       streak_id,
       count(*) AS streak_length,
       rank() OVER (
                    ORDER BY count(*) DESC) AS rnk
FROM cte
WHERE match_result = 'W'
GROUP BY player_id,
         match_result,
         streak_id
ORDER BY player_id,
         streak_id

We can verify that this new column has been added and that the longest streaks have been assigned the rank 1 and the shortest streaks, the ones with length 1, have all been assigned the last rank, in this case 7.

The final step is to output the players with the longest streaks. To do this, we need to select player IDs and the streak lengths but only from the rows where the rank is equal to 1. And since the column with the ranking is defined in this main query, we can’t make this selection in the same query. Instead, we will need to consider this current query as a subquery and write a new main query in which we select the player_id and streak_length columns and specify the condition that the rank must be equal to 1.

WITH cte AS
  (SELECT player_id,
          match_date,
          match_result,
     (SELECT COUNT(*)
      FROM players_results prev_results
      WHERE curr_results.player_id = prev_results.player_id
        AND prev_results.match_date <= curr_results.match_date
        AND prev_results.match_result <> curr_results.match_result) AS streak_id
   FROM players_results curr_results)

SELECT DISTINCT player_id,
                streak_length
FROM
  (SELECT player_id,
          match_result,
          streak_id,
          count(*) AS streak_length,
          rank() OVER (
                       ORDER BY count(*) DESC) AS rnk
   FROM cte
   WHERE match_result = 'W'
   GROUP BY player_id,
            match_result,
            streak_id
   ORDER BY player_id,
            streak_id) subquery
WHERE rnk = 1

What’s more, we added a keyword DISTINCT in the SELECT clause because with this approach, if one user would have two streaks of the same length, and both would happen to be the longest streaks in the dataset, then this plate’s ID would appear twice in the results. Adding a keyword DISTINCT would prevent it. This is also not something that happens in this particular case, so there won’t be any difference, but our solution should cover all possible edge cases.

Conclusion

There we have it, the correct solution to this Amazon data analyst interview question. It’s quite a long and convoluted solution but after careful analysis, it’s possible to understand what is done in each step. It’s also not a particularly difficult solution when it comes to syntax because no complicated functions or structures are used, but it still requires some clever idea to differentiate between different streaks. You can find advanced Amazon data scientist interview questions here.

We hope you enjoyed this explanation and learning how to deal with streaks in SQL. We showed you one possible solution to this problem, but there are also other approaches and solutions. Our advice is to practice answering the interview questions by constructing solutions to them but always try to think of other ways to solve them, maybe you’ll come up with a more efficient or more elaborate approach. But always remember to think about and cover all the possible edge cases that may appear in the data!

And lastly, don’t forget to stick to the above framework for solving any data analyst interview questions, it will make your interviews and your life much easier. Check out our previous post on Amazon SQL Interview Questions where we used the same framework to solve one of the interesting Amazon SQL interview questions. Once again, you start by examining and understanding the data, you then formulate your approach - the few general steps required to solve the problem. Finally and based on these two, you can write your code step by step.

Amazon Data Analyst Interview Questions


Become a data expert. Subscribe to our newsletter.