A Comprehensive Guide to CASE WHEN statements in SQL

CASE WHEN statements in SQL
Categories


Let’s explore the basic and advanced syntax of CASE WHEN statements in SQL and show how to use CASE statements to answer difficult interview questions in SQL.

Humans use the word ‘if’ to describe the desired outcome for a hypothetical scenario. This is a very useful logical concept because, in life, there are many variables.

Programming languages like SQL have adopted this logic because the query can also have variables. In SQL, we use the CASE expression to define if/then logic.

CASE expressions can have many uses in SQL. They are often used with GROUP BY and ORDER statements and sometimes paired with WHERE and HAVING to filter the table conditionally. CASE is very useful for conditionally aggregating values as well. You could say that CASE expression makes these SQL features even more useful.

What is CASE WHEN statement in SQL?

In SQL, the CASE expression defines multiple conditions and the value to be returned if each condition is met. Simply put, CASE expression allows you to write if/then logic in SQL.

Data isn’t always as organized or clear-cut as we’d like. We can use the CASE expression to handle unpredictable scenarios and conditionally determine the output of the query. It even lets you set the ELSE clause to specify the outcome if none of the conditions are met.

CASE can help you solve complex data science tasks elegantly. It is useful for conditionally selecting, aggregating, filtering, and ordering data.

Because of its many use cases, it’s a good idea to master writing robust CASE expressions before you go into an interview.

Let’s look at the basic syntax for writing CASE expressions in SQL.

Syntax to Write CASE expressions in SQL

Syntax to Write CASE WHEN statements in SQL

Let’s explore the basic structure of CASE expressions in SQL.

CASE and END keywords denote the beginning and end of the CASE expression. Between them, there must be at least one pair of WHEN and THEN blocks, which specify the condition and desired outcome if the condition is met.

Let’s look at an example:

CASE
WHEN condition THEN output
END

If the condition is found to be true, the CASE expression will return the corresponding output.

You can make conditions as simple or as complex as you’d like.

For example, use AND and OR logical operators to chain multiple conditions together.

CASE
WHEN condition_one AND condition_two THEN output
END

In this case, the CASE expression will return the output only if both conditions are true.

In the example, we have only one pair of WHEN/THEN blocks. Typically CASE expressions have multiple sets of WHEN/THEN that specify the condition and its outcome.

CASE
WHEN condition_one THEN output_one
WHEN condition_two THEN output_two
END

Final output of the CASE expression will be the output value of whichever condition is satisfied first.

You can add an optional (but very useful) ELSE clause to specify the return value if none of the conditions are met.

CASE
WHEN condition_one THEN output_one
WHEN condition_two THEN output_two
ELSE generic output
END

CASE expression will return ‘generic output’ (value specified by the ELSE clause in the example) if none of the two conditions are met.

If there is no ELSE, and none of the conditions are met, the final output will be NULL.

Use cases for the CASE expression

Use cases for the CASE expression in SQL

CASE expression can return values (text, number, date) as well as the reference to a column or even a condition.

CASE is most commonly used with SELECT, but it can be used to conditionally set parameters for ORDER BY, GROUP BY, HAVING, WHERE, and other SQL features. Or only aggregating values that meet the condition.

Knowing these use cases can help you easily solve complex questions during an interview. If you’re preparing for an interview, check out a long list of SQL Interview Questions on the StrataScratch blog. Also, take a look at SQL Query Interview Questions for an opportunity to practice writing SQL code.

Let’s see some use cases for the CASE expression, starting with most obvious.

CASE WHEN statement with SELECT

SELECT and CASE allow you to assess table data to create new values.

Let’s look at the example where we use CASE expression to create values for the new age_group column.

The new column will contain one of three strings: 'senior’, ‘middle age’, or ‘young’ depending on the existing value of the age column.

SELECT
*,
    CASE
        WHEN age < 30 THEN 'young'
	  WHEN 30 >= age  < 60 THEN 'middle age'
        ELSE 'senior'
    END AS age_group
FROM customers

Records where age value is under 30 will have the new age_group value of ‘young’. If the age value is between 30 and 60, CASE will return ‘middle age’, and if none of these conditions are met, the ELSE clause will return ‘senior’.

CASE WHEN statement with ORDER BY

CASE can also be paired with ORDER BY to specify the column for ordering rows in the table.

For example, if you had a database of athletes from many different sports, you might want to order athletes by height, if the athlete is a basketballer, and by weight, if the athlete is a wrestler.

SELECT 
       full_name,
       sport,
       height,
       weight
FROM athletes
ORDER BY 
CASE
  WHEN sport = 'basketball' THEN height
  WHEN sport = 'wrestling' THEN weight
END DESC

If you do use the DESC keyword to specify order direction, it must come after the CASE expression.

CASE WHEN statement with WHERE and HAVING

You can use CASE with filtering clauses to conditionally determine conditions for filtering.

Let’s imagine we have product data grouped by product category. We can use HAVING and CASE together to set up custom filters for each group.

SELECT
       SUM(sales),
       category
FROM sales
GROUP BY category
HAVING CASE
           WHEN category = 'furniture' THEN SUM(sales) > 100
           WHEN category = 'office' THEN SUM(sales) > 30
           ELSE SUM(sales) > 150
       END

The ‘furniture' group must have a minimum aggregate sales of 100, whereas ‘office’ products have a lower minimum of 50. All other product categories must have more than 150 sales to pass the condition of the HAVING clause.

CASE WHEN statement with GROUP BY

Normally, the GROUP BY statement specifies a column that contains values to be grouped. You can use the CASE expression to specify the column conditionally.

Let’s imagine we have international sales data and want to aggregate it. To keep things simple, we want to create a manageable number of groups.

If there are less than 100 rows,  group records by country column.  This way, we’ll have more specific groups. Otherwise, group them by values in the region column, which will result in more general groups.

Let’s see how we can use the CASE expression to do this:

SELECT
       country,
       region
       count(*)
FROM sales
GROUP BY 
     CASE
           WHEN count(*) < 100 THEN country
           ELSE region
     END

CASE WHEN statement with aggregate functions

CASE can be particularly useful when paired with aggregate functions.

As you may know, COUNT(column) counts values that are not NULL. On the other hand, CASE returns a value if the condition is satisfied and NULL otherwise. Therefore, we can pass CASE as an argument to COUNT() and calculate how many values in a column satisfy certain criteria.

SELECT
      price, 
	count(CASE WHEN price > 20 THEN product)
FROM products
GROUP BY price

In this example, COUNT() finds the number of products with a price value higher than 20.

To sum up, you can use CASE to make sure aggregate functions are only applied to records that satisfy criteria.

5 SQL Interview Questions on CASE Expressions

SQL Interview Questions on CASE WHEN statement

Let’s look at interview questions that require CASE expressions to find the answer.

Question 1: Olympic Medals By Chinese Athletes

First question comes from ESPN. Answering it will demonstrate how to couple CASE and SUM() aggregate functions to a great effect.


Table: olympics_athletes_events

Link to the question: https://platform.stratascratch.com/coding/9959-olympic-medals-by-chinese-athletes

Understand the question

We have a list of all athletes from many different nationalities. We have to get the number of each type of medal (silver, bronze, gold) won by Chinese athletes in the Olympics from 2000 to 2016.

The initial table is very general. We’ll have to filter, group, and aggregate values to get a specific answer - the number of medals won by Chinese athletes in particular years.

Analyze data

Candidates are given a table that contains information about all Olympic athletes going back to the 20th century.

Each athlete is described in detail:

  • Integer values in the id column uniquely identify athletes.
  • name, sex, age, height, weight columns describe athlete’s full name, sex, age, height and weight, respectively.
  • The team column refers to their national team.
  • The noc column specifies the national olympic committee.
  • games values refer to the official name of the Olympic event.
  • Values in year and city columns refer to time and place where the event took place.
  • sport, event and medal values refer to the sport practiced by the athlete, event where they participated and category of the medal they won (if any).

Let’s look at the available data to better understand it:

Table: olympics_athletes_events
idnamesexageheightweightteamnocgamesyearseasoncitysporteventmedal
3520Guillermo J. AmparanMMexicoMEX1924 Summer1924SummerParisAthleticsAthletics Men's 800 metres
35394Henry John FinchettMGreat BritainGBR1924 Summer1924SummerParisGymnasticsGymnastics Men's Rings
21918Georg Frederik Ahrensborg ClausenM28DenmarkDEN1924 Summer1924SummerParisCyclingCycling Men's Road Race Individual
110345Marinus Cornelis Dick SigmondM26NetherlandsNED1924 Summer1924SummerParisFootballFootball Men's Football
54193Thodore Tho JeitzM26LuxembourgLUX1924 Summer1924SummerParisGymnasticsGymnastics Men's Individual All-Around

Everything looks normal. Except the fact that some records have empty age, height, and weight fields.

Another important detail is that the medal field is empty if the athlete did not win any medals.

Plan your approach

We need to filter records so that there are only Chinese athletes that participated in 2000, 2004, 2008, 2012 and 2016 olympics.

Once we have filtered results, we need to create groups for each unique combination of year and medal category. Gold medals won in 2016 and silver medals won in 2016 would be two different groups.

The next step is to find the total number of medals for each year. We can use CASE with SUM() to only aggregate values that meet a criteria (medals won in a specific year). We can set up SUM() aggregate functions for every year from 2000 to 2016.

Finally, we will need to group records by quality and order them in a descending order, from highest medal count to the lowest.

Write the code

1. Filter the table

To solve this question, we only need records of Chinese athletes that participated in the 2000, 2004, 2008, 2012 and 2016 Olympic games. All other nationalities and all other competitions need to be filtered out.

We can use the OR logical operator to provide a list of multiple competition years that are ‘acceptable’. A much easier solution is to use the IN operator.

We also need to filter out athletes (regardless of their nationality) that did not win any medals. In the olympic_athletes_events table, medal value for athletes that did not win is NULL.

We use IS NOT NULL to identify athletes who won a medal, and therefore have a medal value that is not NULL.

SELECT
      year,
      medal,
      count(*) AS n_medals
FROM
      olympics_athletes_events
WHERE 
      team = 'China' AND 
      year IN (2000, 2004, 2008, 2012, 2016) AND 
      medal IS NOT null
GROUP BY 
      year, 
      medal

We create groups for every unique combination of year and medal values and use COUNT(*) to get the number of records in each group.


We still need to work with data from this query, so in the following steps, we are going to use it as a subquery.


2. Aggregate medals for each year

Our subquery outputs groups of year and medal (bronze, silver, gold) values, and corresponding number of medals for each group.

Next, we need to find the number of medals won in each year (2000, 2004, 2008, 2012 and 2016) and total number of medals won in all these years.

We can use SUM() aggregate function and CASE expression to only add up values that meet the criteria. In this case, we’ll write five different SUM() aggregate functions to add up medals won in a specific year, from 2000 to 2016.

Our final SUM() function should unconditionally add up medals in all years.

SELECT
    base.medal,
    sum(CASE WHEN year = 2000 THEN n_medals ELSE 0 END) AS medals_2000,
    sum(CASE WHEN year = 2004 THEN n_medals ELSE 0 END) AS medals_2004,
    sum(CASE WHEN year = 2008 THEN n_medals ELSE 0 END) AS medals_2008,
    sum(CASE WHEN year = 2012 THEN n_medals ELSE 0 END) AS medals_2012,
    sum(CASE WHEN year = 2016 THEN n_medals ELSE 0 END) AS medals_2016,
    sum(n_medals) AS total_medals
FROM
   (subquery)

3. Combine same medals into one group and order them

The question asks us to find the aggregate number of each type of medals won in specified years, so we need to combine records for bronze, silver and gold medals into one group.

SELECT
    base.medal,
    sum(CASE WHEN year = 2000 THEN n_medals ELSE 0 END) AS medals_2000,
    sum(CASE WHEN year = 2004 THEN n_medals ELSE 0 END) AS medals_2004,
    sum(CASE WHEN year = 2008 THEN n_medals ELSE 0 END) AS medals_2008,
    sum(CASE WHEN year = 2012 THEN n_medals ELSE 0 END) AS medals_2012,
    sum(CASE WHEN year = 2016 THEN n_medals ELSE 0 END) AS medals_2016,
    sum(n_medals) AS total_medals
FROM
   (SELECT
      year,
      medal,
      count(*) AS n_medals
    FROM
      olympics_athletes_events
    WHERE 
      team = 'China' AND 
      year IN (2000, 2004, 2008, 2012, 2016) AND 
      medal IS NOT null
    GROUP BY 
      year, 
      medal) AS base
GROUP BY
    base.medal
ORDER BY
    total_medals DESC

And finally, order groups by the total number of medals, from highest to lowest.


Output

The question should have the following output:

All required columns and the first 5 rows of the solution are shown

medalmedals_2000medals_2004medals_2008medals_2012medals_2016total_medals
Bronze000022
Gold000011

Question 2: Rows With Missing Values

In this question, we’ll use CASE expressions to do conditional filtering.


Table: user_flags

Link to the question: https://platform.stratascratch.com/coding/2106-rows-with-missing-values

Understand the question

To solve this question, we have to write a SQL query that returns records with more than one empty column. In other words, records that have more than one NULL value.

Analyze data

Table: user_flags
user_firstnameuser_lastnamevideo_idflag_id
RichardHassony6120QOlsfU0cazx3
MarkMayCt6BUPvE2sM1cn76u
GinaKormandQw4w9WgXcQ1i43zk
MarkMayCt6BUPvE2sM1n0vef
MarkMayjNQXAC9IVRw1sv6ib

We need to find the number of NULL values for each record. We are not interested in values, only whether or not they are NULL.

Plan your approach

In this question, we have to return rows that satisfy criteria - have more than one NULL value.

To return these values, we’ll need to use the SELECT statement with the FROM and WHERE clauses to only return rows that meet the condition - have more than one NULL value.

Setting up the condition is the most challenging part of this question. You can use CASE to count the number of NULL values for each record and return rows where that count is more than 1.

Write the code

1. SELECT data from the table

First, let’s select all rows from the user_flags table.

SELECT *
FROM user_flags

2. Filter the table

To filter tables in SQL, we need to use the WHERE clause. We need to make sure the number of NULL columns is more than one.

We’ll need four CASE expressions to check if the value in each column is NULL and return 1 if it is.

Finally, we’ll add the results of four CASE expressions and get the total number of NULL values.

SELECT *
FROM user_flags
WHERE (CASE
           WHEN user_firstname IS NULL THEN 1
           ELSE 0
       END) + (CASE
                   WHEN user_lastname IS NULL THEN 1
                   ELSE 0
               END) + (CASE
                           WHEN video_id IS NULL THEN 1
                           ELSE 0
                       END) + (CASE
                                   WHEN flag_id IS NULL THEN 1
                                   ELSE 0
                               END) > 1


For instance, if two of the four CASE expressions return 1, four CASE expressions will add up to 2.

Output

Our final answer needs to include all rows that have two or more missing values.

All required columns and the first 5 rows of the solution are shown

user_firstnameuser_lastnamevideo_idflag_id
CourtneydQw4w9WgXcQ
GinaKorman
Greg5qap5aO4i9A
Ct6BUPvE2sM

Question 3: Churn rate of Lyft drivers

This question came up during an interview at the ride-sharing service Lyft. Let’s explore how CASE expressions can help us find the answer.


Table: lyft_drivers

Link to the question: https://platform.stratascratch.com/coding/10016-churn-rate-of-lyft-drivers

Understand the question

Understanding churn rate is the main challenge of this question.

The churn rate essentially measures the share of users that left the platform compared to all users.

To calculate it, you need to find the number of users who left and divide that number by the number of all users.

Now you can write the formula and use values from the table to do the actual calculation.

Analyze data

To solve this question, we have to work with data in the lyft_drivers table, which has four columns.

  • index integer value, which is probably a unique identifier for drivers.
  • start_date datetime values most likely represent the date when driver first signed up for Lyft
  • end_date values specify the date when the driver stopped working for Lyft.
  • yearly_salary integer values stand for the driver’s yearly earnings in dollars.

Now, let’s see a preview of the table filled with data:

Table: lyft_drivers
indexstart_dateend_dateyearly_salary
02018-04-0248303
12018-05-3067973
22015-04-0556685
32015-01-0851320
42017-03-0967507

Previewing the table reveals one crucial detail - some driver records do not have an end_date value, but others do.

It’s safe to assume that drivers with an end_date value have stopped working for Lyft, and those without this value are still driving.

Plan your approach

To calculate the churn rate, we need to divide the number of users who left (stopped driving) by the number of all users.

You can use the COUNT(*) aggregate function to get the number of all drivers. To find the number of drivers who left the platform, you’ll have to combine CASE with the COUNT() aggregate function.

Remember that records that have an end_date value represent drivers who stopped driving. Drivers without the end_date value are still driving.

Write the code

Try your hand here:


Output

The query should output just one value - the churn rate of Lyft drivers. It needs to be a decimal number between 0 and 1.

All required columns and the first 5 rows of the solution are shown

global_churn_rate
0.495

Question 4: Extremely Late Delivery

This question was asked during an interview at the food delivery service DoorDash.


Table: delivery_orders

Link to the question: https://platform.stratascratch.com/coding/2113-extremely-late-delivery

Understand the question

In this question, you have to work with delivery order data and find the number of extremely late deliveries.

The question clearly states that an order can be considered ‘extremely late’ if it arrives 20 minutes later than expected.

The question tells you to output the percentage of extremely late orders in every month. It also provides instructions about the output format.

Analyze data

Let’s look at values in delivery_orders, the only table available for this question.

  • delivery_id contains integer values to identify each order
  • order_placed_time contains the exact date and time when the order was placed.
  • predicted_delivery_time values specify when the order was supposed to arrive. It’s safe to assume that we will look at values in this column to determine which orders are extremely late.
  • Datetime values in the actual_delivery_time column specify when the order was actually delivered.
  • Integer values in the delivery_rating column probably represent the customer’s feedback
  • dasher_id values most likely identify the courier.
  • restaurant_id identifies the restaurant that prepared the food
  • consumer_id identifies the person receiving the order

We can preview the delivery_orders table to verify our assumptions:

Table: delivery_orders
delivery_idorder_placed_timepredicted_delivery_timeactual_delivery_timedelivery_ratingdasher_idrestaurant_idconsumer_id
O21322021-11-17 04:45:332021-11-17 05:37:332021-11-17 05:58:334D239R633C1001
O21522021-12-09 19:09:432021-12-09 19:41:432021-12-09 19:41:433D238R635C1010
O21582022-01-04 02:31:192022-01-04 02:56:192022-01-04 03:21:194D239R634C1010
O21732022-02-09 00:45:222022-02-09 01:19:222022-02-09 01:33:220D239R633C1038
O21452021-12-04 17:20:272021-12-04 18:04:272021-12-04 18:31:271D239R634C1042

Based on our assessment, we’ll need to compare values in predicted_delivery_time and actual_delivery_time columns to find extremely late deliveries.

Plan your approach

Use CASE to set up a condition for identifying extremely late delivery. If the difference between predicted_delivery_time and actual_delivery_time values is more than 20, CASE should return 1; otherwise 0.

Pass it as an argument to the SUM() aggregate function, which will add up every value returned by the CASE expression.

To find the percentage, we need to divide the number of late orders by the number of all orders and multiply the ratio by 100.

SUM() and COUNT() aggregate functions should apply to each month, not the entire table. So we need to group records by month.

The output should show months in a specific format.

Write the code

Try your hand here:

Output

Final answer should show the percentage of late deliveries for each month. Note that the question has specific instructions on how the month needs to be formatted.

All required columns and the first 5 rows of the solution are shown

year_monthperc_extremely_delayed
2021-1130.769
2021-1212.5
2022-0210
2022-0136.364

Question 5: Find the percentage of shippable orders

In our last question, candidates have to calculate the percentage of all orders that are shippable.


Tables: orders, customers

Link to the question: https://platform.stratascratch.com/coding/10090-find-the-percentage-of-shipable-orders

Understand the question

The task is clear - find what percentage of all orders are shippable. We need to start by finding the number of all shippable orders.

The question description states that an order is shippable if it has an address.

Analyze data

Let’s look at values in two tables available for this question: orders and customers

  • id column contains integer values to identify each order
  • cust_id values identify the customer who placed the order
  • order_date specifies the year, month, and day when the order was placed.
  • Values in the order_details column describe the type of product ordered.
  • total_order_cost column refers to order size in dollars.

Let’s preview the orders table:

Table: orders
idcust_idorder_dateorder_detailstotal_order_cost
132019-03-04Coat100
232019-03-01Shoes80
332019-03-07Skirt30
472019-02-01Coat25
572019-03-10Shoes80

The data in this table looks standard.

Let’s examine values in the customers table.

  • id column identifies the customer
  • first_name contains the customer’s first name
  • last_name contains the customer’s last name
  • city specifies the city the user is from
  • address is likely the customer’s default shipping address
  • phone_number column contains the customer’s phone number

Now, the preview of the customers table:

Table: customers
idfirst_namelast_namecityaddressphone_number
8JohnJosephSan Francisco928-386-8164
7JillMichaelAustin813-297-0692
4WilliamDanielDenver813-368-1200
5HenryJacksonMiami808-601-7513
13EmmaIsaacMiami808-690-5201

Everything looks okay, except some customers don’t have an address value.

Plan your approach

To answer this question, we need to check if the person who placed the order has provided a shipping address. Records in the orders table give us information about orders, including the identity of the user responsible for the order. However, there is no information about whether or not the user has provided a shipping address.

Information about users’ addresses is stored in the customers table. Both tables have values that reference customer identities. This is a shared dimension we can use to combine data from two tables.

We can use CASE to check the address value. If it is NULL, then CASE should return a boolean value ‘false’, because orders placed by these customers are not shippable. If it’s a non-NULL value, the expression should return a boolean value ‘true’, because orders are shippable.

Save the result of the CASE expression in a new column, which will indicate whether or not the order is shippable.

As a final step, find the total number of orders that are shippable. Then find the percentage of orders that are shippable.

Write the code

Try your hand here:

Output

Final answer for this question is just one value - the percentage of shippable orders.

All required columns and the first 5 rows of the solution are shown

percent_shipable
28

Summary

In this “CASE WHEN statements in SQL” guide, we discussed syntax and many use cases of the CASE expression. You won’t use it as frequently as WHERE, GROUP BY, ORDER BY, and other major SQL features, but CASE is still very important and worth your time. When applicable, it can help you find simple answers to difficult SQL questions.

To demonstrate, we selected five interview questions with answers that revolved around CASE expression. Looking at our solutions and trying to answer the questions yourself will definitely help you master CASE expressions and their use.

You can maximize your chances of landing a job by signing up on the StrataScratch platform, which has hundreds of questions involving CASE and other SQL features.

The StrataScratch platform is the perfect place to practice writing queries. You can answer questions and see if your query would be accepted during an actual interview. If you get stuck, the platform has hints to point you in the right direction. You can see various solutions to the same problem and train your mind to find the most efficient solutions to each question.

CASE WHEN statements in SQL
Categories


Become a data expert. Subscribe to our newsletter.