SQL Aggregate Functions - Interview Question by the City of San Francisco

SQL aggregate functions interview questions

Learn the crucial SQL aggregate function concepts by solving this interview question.

The SQL aggregate functions are one of the most important concepts when it comes to using SQL in data science. One shouldn’t be surprised this is also one of the most tested topics at the coding job interviews.

The good thing is there are 600 coding interview questions on StrataScratch. While not all are dedicated to the aggregate functions, a vast majority require knowing aggregate functions to one extent or another.

There are several important aspects of using the SQL aggregate functions, and the question by the City of San Francisco seems ideal for demonstration.

The Interview Question

SQL aggregate functions interview question

Link to the question: https://platform.stratascratch.com/coding/9930-find-libraries-with-the-highest-number-of-total-renewals

What Do You Need to Know to Solve It?

Solving SQL aggregate functions interview question

The question tests using the SQL aggregate functions. Shortly, they are functions that perform calculations on multiple rows and return the result as one value, collapsing the individual rows in the process.

For more detailed explanations, use our SQL cheat sheet or a guide to the SQL aggregate functions.

Not only does this question ask for the SQL aggregate functions knowledge, but it also tests using these functions with the GROUP BY clause.

Available Data

The data we have at our disposal consists of only one table: library_usage.

patron_type_codeint
patron_type_definitionvarchar
total_checkoutsint
total_renewalsint
age_rangevarchar
home_library_codevarchar
home_library_definitionvarchar
circulation_active_monthvarchar
circulation_active_yearfloat
notice_preference_codevarchar
notice_preference_definitionvarchar
provided_email_addressbool
year_patron_registeredint
outside_of_countybool
supervisor_districtfloat

Here’s partial data from it.

Data for SQL aggregate functions interview question

The question asks to output the home library definition and its number of total renewals. To get this, we’ll need to use only two columns from the table: total_renewals and home_library_definition.

Solution Approach

The solution code can be broken down into three distinct steps:

  1. Finding libraries
  2. Summing total renewals per library using the SUM() aggregate function and GROUP BY
  3. Ordering the result from the highest to the lowest number of renewals

Finding Libraries

To find the libraries, select the home_library_definition column from the table

SELECT home_library_definition
FROM library_usage;

Now we got the (partial) list of libraries.

Solution output for SQL aggregate functions interview question

Next, we need to find the number of total renewals for every library above.

SUM() and GROUP BY

SUM() is an aggregate function used for summing values. To get the sum of renewals, use the SUM() function on the column total_renewals.

SELECT home_library_definition, 
    	 SUM(total_renewals) AS total_lib_renewals
FROM library_usage;

The sum will appear in the new column named total_lib_renewals. Now, if you run this code, it will return an error that says the column home_library_definition must appear in the GROUP BY clause.

Solution output error in the interview question

This is important when you have at least one column from the table beside the aggregate function column.

Every column that appears in the SELECT statement, except the aggregate function, must appear in the GROUP BY clause. Otherwise, it will return an error. This is logical because we want to see the total number of renewals by the library, which means we are grouping data according to the library.

It’s also important that you know the other way round works: not all columns that appear in the GROUP BY clause have to be in the SELECT statement. In other words, you can group data by a certain column, but you don’t need to show it in the output.

Also, if the interviewer asks you why you didn’t, then put the aggregate function in the GROUP BY clause, too: the aggregate functions are not allowed in GROUP BY.

This code

SELECT home_library_definition, 
       SUM(total_renewals) AS total_lib_renewals
FROM library_usage
GROUP BY home_library_definition;

will fetch the total number of renewals per library.

Solution output for SQL aggregate functions interview question

This is, data-wise, the correct output. Except, it’s not sorted the way the question asks.

Sorting Data in Descending Order

The output has to be sorted according to the sum of total renewals from the highest to the lowest sum. To do that, use the ORDER BY clause and order data in descending order.

SELECT home_library_definition, 
       SUM(total_renewals) AS total_lib_renewals
FROM library_usage
GROUP BY home_library_definition
ORDER BY total_lib_renewals DESC;

And that’s the solution to the question.

Solution output for SQL aggregate functions interview question

However, it doesn’t cover the edge case.

The Edge Case

What if the question asked to output only the library with the highest number of total renewals, not just the names of libraries with the corresponding number of renewals?

And there are two or more libraries with the same (highest) number of total renewals?

What Do You Need to Know to Solve It?

On top of the aggregate functions and the GROUP BY clause, the edge case requires

  1. Filtering data using the HAVING clause
  2. Writing a subquery

Filter Data - HAVING Clause

The first part of the edge case solution is almost the same as the original one without before the ORDER BY.

SELECT home_library_definition,
	 SUM(total_renewals) AS total_lib_renewals
FROM library_usage 
GROUP BY home_library_definition;
Solution output for SQL aggregate functions interview question

These are the total renewals by the library. The data should be filtered to show only the libraries with the highest number of renewals.

WHERE vs. HAVING

When data filtering is mentioned, many people will immediately think about using the WHERE clause. While, yes, WHERE is used for filtering data, it wouldn’t work here.

Why? The above part of the code has already aggregated data. To get the solution, we need to compare the total number of renewals with the highest number of renewals. This means we first need to aggregate data, then filter unnecessary data out.

This is where the difference between WHERE and HAVING comes.

The WHERE clause filters data before aggregation, and it doesn’t accept the aggregate functions.

The HAVING clause is used for filtering data after aggregation. And that’s exactly what we need here!

Also, be aware that WHERE is always written before the GROUP BY clause, and HAVING is written after—this corresponds with the logic of each clause.

The filtering criteria should compare the total_lib_renewals with the highest number of renewals, whichever library has it.

Start with putting SUM(total_renewals) in the HAVING clause. Why not simply write total_lib_renewals? Because naming this column occurs after the HAVING clause.

Then write a subquery that will return the maximum number of total renewals—one number, nothing more.  Maybe counterintuitively, you need to use the SUM() aggregate function for a maximum number, not MAX(). By limiting the output to one row, you’ll get the maximum number of renewals, irrespective of how many libraries have this maximum number of renewals.

SELECT home_library_definition,
       SUM(total_renewals) AS total_lib_renewals
FROM library_usage
GROUP BY home_library_definition
HAVING SUM(total_renewals) =
  (SELECT SUM(total_renewals) AS total_lib_renewals
   FROM library_usage
   GROUP BY home_library_definition
   ORDER BY total_lib_renewals DESC
   LIMIT 1);

If you run only the subquery part, it will return one number.

Solution output for SQL aggregate functions interview question

In other words, the HAVING clause is looking for all the libraries that have 2,421 renewals.

Run the complete solution to get the desired output.

Solution output for SQL aggregate functions interview question

As you can see, there are two libraries with the highest number of renewals.

Conclusion

Two important aggregate functions you learned by solving this question are

  1. Using the aggregate function with the GROUP BY clause
  2. Filtering data using the HAVING clause and a subquery.

These are all concepts that are a must for anyone aspiring to do well at the SQL job interview. You can find many more such SQL interview questions and helpful articles on our platform.

SQL aggregate functions interview questions

Become a data expert. Subscribe to our newsletter.