### SQL Aggregate Functions - Interview Question by the City of San Francisco

##### Categories

*Learn the crucial SQL aggregate function concepts by solving this interview question.*

The SQL aggregate functions are one of the most important concepts when it comes to using SQL in data science. One shouldn’t be surprised this is also one of the most tested topics at the coding job interviews.

The good thing is there are 600 coding interview questions on StrataScratch. While not all are dedicated to the aggregate functions, a vast majority require knowing aggregate functions to one extent or another.

There are several important aspects of using the SQL aggregate functions, and the question by the City of San Francisco seems ideal for demonstration.

## The Interview Question

Link to the question: https://platform.stratascratch.com/coding/9930-find-libraries-with-the-highest-number-of-total-renewals

## What Do You Need to Know to Solve It?

The question tests using the SQL aggregate functions. Shortly, they are functions that perform calculations on multiple rows and return the result as one value, collapsing the individual rows in the process.

For more detailed explanations, use our SQL cheat sheet or a guide to the SQL aggregate functions.

Not only does this question ask for the SQL aggregate functions knowledge, but it also tests using these functions with the GROUP BY clause.

### Available Data

The data we have at our disposal consists of only one table: library_usage.

patron_type_code | int |

patron_type_definition | varchar |

total_checkouts | int |

total_renewals | int |

age_range | varchar |

home_library_code | varchar |

home_library_definition | varchar |

circulation_active_month | varchar |

circulation_active_year | float |

notice_preference_code | varchar |

notice_preference_definition | varchar |

provided_email_address | bool |

year_patron_registered | int |

outside_of_county | bool |

supervisor_district | float |

Here’s partial data from it.

The question asks to output the home library definition and its number of total renewals. To get this, we’ll need to use only two columns from the table: total_renewals and home_library_definition.

### Solution Approach

The solution code can be broken down into three distinct steps:

- Finding libraries
- Summing total renewals per library using the SUM() aggregate function and GROUP BY
- Ordering the result from the highest to the lowest number of renewals

#### Finding Libraries

To find the libraries, select the home_library_definition column from the table

```
SELECT home_library_definition
FROM library_usage;
```

Now we got the (partial) list of libraries.

Next, we need to find the number of total renewals for every library above.

#### SUM() and GROUP BY

SUM() is an aggregate function used for summing values. To get the sum of renewals, use the SUM() function on the column total_renewals.

```
SELECT home_library_definition,
SUM(total_renewals) AS total_lib_renewals
FROM library_usage;
```

The sum will appear in the new column named total_lib_renewals. Now, if you run this code, it will return an error that says the column home_library_definition must appear in the GROUP BY clause.

This is important when you have at least one column from the table beside the aggregate function column.

Every column that appears in the SELECT statement, except the aggregate function, must appear in the GROUP BY clause. Otherwise, it will return an error. This is logical because we want to see the total number of renewals by the library, which means we are grouping data according to the library.

It’s also important that you know the other way round works: not all columns that appear in the GROUP BY clause have to be in the SELECT statement. In other words, you can group data by a certain column, but you don’t need to show it in the output.

Also, if the interviewer asks you why you didn’t, then put the aggregate function in the GROUP BY clause, too: the aggregate functions are not allowed in GROUP BY.

This code

```
SELECT home_library_definition,
SUM(total_renewals) AS total_lib_renewals
FROM library_usage
GROUP BY home_library_definition;
```

will fetch the total number of renewals per library.

This is, data-wise, the correct output. Except, it’s not sorted the way the question asks.

#### Sorting Data in Descending Order

The output has to be sorted according to the sum of total renewals from the highest to the lowest sum. To do that, use the ORDER BY clause and order data in descending order.

```
SELECT home_library_definition,
SUM(total_renewals) AS total_lib_renewals
FROM library_usage
GROUP BY home_library_definition
ORDER BY total_lib_renewals DESC;
```

And that’s the solution to the question.

However, it doesn’t cover the edge case.

### The Edge Case

What if the question asked to output only the library with the highest number of total renewals, not just the names of libraries with the corresponding number of renewals?

And there are two or more libraries with the same (highest) number of total renewals?

### What Do You Need to Know to Solve It?

On top of the aggregate functions and the GROUP BY clause, the edge case requires

- Filtering data using the HAVING clause
- Writing a subquery

#### Filter Data - HAVING Clause

The first part of the edge case solution is almost the same as the original one without before the ORDER BY.

```
SELECT home_library_definition,
SUM(total_renewals) AS total_lib_renewals
FROM library_usage
GROUP BY home_library_definition;
```

These are the total renewals by the library. The data should be filtered to show only the libraries with the highest number of renewals.

### WHERE vs. HAVING

When data filtering is mentioned, many people will immediately think about using the WHERE clause. While, yes, WHERE is used for filtering data, it wouldn’t work here.

Why? The above part of the code has already aggregated data. To get the solution, we need to compare the total number of renewals with the highest number of renewals. This means we first need to aggregate data, then filter unnecessary data out.

This is where the difference between WHERE and HAVING comes.

The WHERE clause filters data ** before** aggregation, and it doesn’t accept the aggregate functions.

The HAVING clause is used for filtering data ** after** aggregation. And that’s exactly what we need here!

Also, be aware that WHERE is always written before the GROUP BY clause, and HAVING is written after—this corresponds with the logic of each clause.

The filtering criteria should compare the total_lib_renewals with the highest number of renewals, whichever library has it.

Start with putting SUM(total_renewals) in the HAVING clause. Why not simply write total_lib_renewals? Because naming this column occurs after the HAVING clause.

Then write a subquery that will return the maximum number of total renewals—one number, nothing more. Maybe counterintuitively, you need to use the SUM() aggregate function for a maximum number, not MAX(). By limiting the output to one row, you’ll get the maximum number of renewals, irrespective of how many libraries have this maximum number of renewals.

```
SELECT home_library_definition,
SUM(total_renewals) AS total_lib_renewals
FROM library_usage
GROUP BY home_library_definition
HAVING SUM(total_renewals) =
(SELECT SUM(total_renewals) AS total_lib_renewals
FROM library_usage
GROUP BY home_library_definition
ORDER BY total_lib_renewals DESC
LIMIT 1);
```

If you run only the subquery part, it will return one number.

In other words, the HAVING clause is looking for all the libraries that have 2,421 renewals.

Run the complete solution to get the desired output.

As you can see, there are two libraries with the highest number of renewals.

### Conclusion

Two important aggregate functions you learned by solving this question are

- Using the aggregate function with the GROUP BY clause
- Filtering data using the HAVING clause and a subquery.

These are all concepts that are a must for anyone aspiring to do well at the SQL job interview. You can find many more such SQL interview questions and helpful articles on our platform.