String and Array Functions in SQL for Data Science

String and Array Functions in SQL for Data Science


Commonly used string and array functions in SQL Data Science Interviews.

In a previous article "SQL Scenario Based Interview Questions", we had touched upon the various date and time functions in SQL. In this article, we will look at another important favorite topic in Data Science Interviews – string and array manipulation. With increasingly diverse and unstructured data sources becoming commonplace, string and array manipulation has become a integral part of Data Analysis and Data Science functions. The key ideas discussed in this article include

  • Cleaning Strings
  • String Matching
  • String Splitting
  • Creating Arrays
  • Splitting Arrays to rows
  • Aggregating text fields

You might also want to look at our Pandas article on string manipulation in DataFrame as we use quite a few similar concepts here as well.

String Matching

Let us start with a simple string-matching problem. This is from a past City of San Fransisco Data Science Interview Question.

Find the number of violations that each school had

Determine the number of violations for each school. Any inspection that does not have risk category as null is considered a violation. Print the school’s name along with the number of violations. Order the output in the descending order of the number of violations.

String and Array Functions Data Science Interview Question

You can solve this problem here: https://platform.stratascratch.com/coding/9727-find-the-number-of-violations-that-each-school-had

The problem uses the sf_restaurant_health_violations dataset with the following fields.

sf_restaurant_health_violations

business_idint
business_namevarchar
business_addressvarchar
business_cityvarchar
business_statevarchar
business_postal_codefloat
business_latitudefloat
business_longitudefloat
business_locationvarchar
business_phone_numberfloat
inspection_idvarchar
inspection_datedatetime
inspection_scorefloat
inspection_typevarchar
violation_idvarchar
violation_descriptionvarchar
risk_categoryvarchar

The relevant data in the table looks like this.

Table: sf_restaurant_health_violations
business_idbusiness_namebusiness_addressbusiness_citybusiness_statebusiness_postal_codebusiness_latitudebusiness_longitudebusiness_locationbusiness_phone_numberinspection_idinspection_dateinspection_scoreinspection_typeviolation_idviolation_descriptionrisk_category
5800John Chin Elementary School350 Broadway StSan FranciscoCA9413337.798-122.403{'longitude': '-122.403154', 'needs_recoding': False, 'latitude': '37.798358', 'human_address': '{"address":"","city":"","state":"","zip":""}'}5800_201710172017-10-1798Routine - Unscheduled5800_20171017_103149Wiping cloths not clean or properly stored or inadequate sanitizerLow Risk
64236Sutter Pub and Restaurant700 Sutter StSan FranciscoCA9410237.789-122.412{'longitude': '-122.41188', 'needs_recoding': False, 'latitude': '37.78881', 'human_address': '{"address":"","city":"","state":"","zip":""}'}64236_201707252017-07-2588Routine - Unscheduled64236_20170725_103133Foods not protected from contaminationModerate Risk
1991SRI THAI CUISINE4621 LINCOLN WaySan FranciscoCA9412237.764-122.508{'longitude': '-122.507779', 'needs_recoding': False, 'latitude': '37.764073', 'human_address': '{"address":"","city":"","state":"","zip":""}'}1991_201711292017-11-2986Routine - Unscheduled1991_20171129_103139Improper food storageLow Risk
3816Washington Bakery & Restaurant733 Washington StSan FranciscoCA9410837.795-122.406{'longitude': '-122.405845', 'needs_recoding': False, 'latitude': '37.795174', 'human_address': '{"address":"","city":"","state":"","zip":""}'}3816_201607282016-07-2867Routine - Unscheduled3816_20160728_103108Contaminated or adulterated foodHigh Risk
39119Brothers Restaurant4128 GEARY BlvdSan FranciscoCA9411837.781-122.464{'longitude': '-122.463762', 'needs_recoding': False, 'latitude': '37.781148', 'human_address': '{"address":"","city":"","state":"","zip":""}'}39119_201607182016-07-1879Routine - Unscheduled39119_20160718_103133Foods not protected from contaminationModerate Risk

The relevant columns are business_name and risk_category

These columns are populated thus.

business name and risk category

Approach and Solution

This is a relatively straightforward problem. We need

  • Identify “Schools” from the business category
  • Count the violations excluding the rows where the risk_category is NULL

The simplest string-matching function in SQL is the LIKE function that searches for a substring inside a larger string. However, one needs to use wildcards to ensure the correct match is found. Since we do not know for sure that Schools end with the word School, we use the % wildcard before and after the string to ensure that the word “SCHOOL” is searched for. Further, we use the ILIKE function to make a case-insensitive search. The solution is now very simple.

SELECT business_name,
       COUNT(*) AS num_violations
FROM sf_restaurant_health_violations
WHERE business_name ILIKE '%SCHOOL%'
  AND risk_category IS NOT NULL
GROUP BY 1 ;

If your SQL flavor does not have the ILIKE statement, we can convert the string to upper or lower case and then use the LIKE statement.

Splitting a Delimited String

Now that we have warmed up with string search, let us try another common string manipulation technique: splitting. There are numerous use cases for splitting a string. Splitting a string requires a delimiter (a separator). To illustrate this let us look at a problem from another City of San Francisco Data Science Interview problem.

Business Density Per Street

Find the highest and the average number of businesses in all the streets. Only consider those streets that have five or more businesses. Assume that the second word in the Address field represents the street name for the business. Note there might be multiple entries for the same business in the dataset. Count each business only once.

SQL String Functions Data Science Interview Question

You can solve the problem on the StrataScratch Platform here: https://platform.stratascratch.com/coding/9735-business-density-per-street

This problem uses the same sf_restaurant_health_violations used in the previous problem. The fields of interest for this problem are business_id and business_address which are populated thus.

Dataset for String Functions Data Science Interview Question

Approach and Solution

We need to extract the second word from the address. That represents the street name. To do this we split the string using space as a delimiter (separator) and extracting the second word. We can do this using the SPLIT_PART function. This is similar to the split() method in Python. Since Postgres is case sensitive, we convert the output to upper case.

SELECT UPPER(split_part(business_address, ' ', 2)) AS streetname,
       business_address
FROM sf_restaurant_health_violations ;

We get the following output.

Output for SQL String Functions Interview Question

Now the problem becomes relatively easy to solve. We find the number of distinct business entries on each street. Since we need only those businesses with five or more entries, we use the HAVING clause to subset the output.

SELECT UPPER(split_part(business_address, ' ', 2)) AS streetname,
       COUNT (DISTINCT business_id) AS density
FROM sf_restaurant_health_violations
GROUP BY 1
HAVING COUNT (DISTINCT business_id) >= 5 ;

We get the following output.

Output 2 for SQL String Functions Interview Question

Now we can aggregate this table using a subquery, CTE or a temp table. We have used a CTE in this case and get the final output.

WITH rel_businesses AS
  (SELECT UPPER(split_part(business_address, ' ', 2)) AS streetname,
          COUNT (DISTINCT business_id) AS density
   FROM sf_restaurant_health_violations
   GROUP BY 1
   HAVING COUNT (DISTINCT business_id) >= 5)
SELECT AVG(density),
       MAX(density)
FROM rel_businesses ;

Arrays

SQL String and Array Functions for Data Science

Most modern SQL flavors allow creation and manipulation of arrays. Let us look at working with string arrays. One can manipulate integer and floating-point arrays in a similar manner. To illustrate this let us take an SQL Data Science Interview problem for an AirBnB interview.

City With Most Amenities

Find the city with most amenities in the given dataset. Each row in the dataset represents a unique host. Output the name of the city with the most amenities.

SQL Array Functions Interview Question

You can solve the problem here: https://platform.stratascratch.com/coding/9633-city-with-most-amenities

The problem uses the airbnb_search_details dataset with the following fields.

airbnb_search_details

idint
pricefloat
property_typevarchar
room_typevarchar
amenitiesvarchar
accommodatesint
bathroomsint
bed_typevarchar
cancellation_policyvarchar
cleaning_feebool
cityvarchar
host_identity_verifiedvarchar
host_response_ratevarchar
host_sincedatetime
neighbourhoodvarchar
number_of_reviewsint
review_scores_ratingfloat
zipcodeint
bedroomsint
bedsint

The main fields of interest here are city and amenities that are populated thus.

Data for SQL Array Functions Interview Question

Approach and solution

To solve this let us break this problem into parts.

  • We need to find the number of amenities for a given property
  • Aggregate the amenities at city level
  • Find the city with the highest number of amenities.

The amenities are represented in form of a string separated by commas. However, SQL right now recognizes this field as string. So, we need to convert this string into individual amenities by splitting them using the comma delimiter. To do this we use the STRING_TO_ARRAY() function and specify comma as the delimiter.

SELECT city,
       STRING_TO_ARRAY(amenities, ',') AS num_amenities
FROM airbnb_search_details ;

We get the following output.

Output for SQL Array Functions Interview Question

Note for this problem, opening and closing braces are considered a part of the first and last word in the string. If we want to eliminate to clean the string, we can use the BTRIM function. BTRIM function will remove all the leading and trailing characters specified. We can modify our query in the following manner.

SELECT city,
       STRING_TO_ARRAY(BTRIM(amenities, '{}'), ',') AS num_amenities
FROM airbnb_search_details ;

This gives us the following output. As one can see we have successfully removed the leading and trailing braces.

Output 2 for SQL Array Functions Interview Question

To find the number of amenities, we need to count the number of elements in the amenities array. We can do this by using the ARRAY_LENGTH() function. The function requires us to specify the array dimension whose length is to be specified. This is useful for multi-dimensional arrays. Since our array is 1-dimensional, we simply specify 1.

SELECT city,
       ARRAY_LENGTH(STRING_TO_ARRAY(BTRIM(amenities, '{}') , ',') , 1) AS num_amenities
FROM airbnb_search_details ;

Our output looks like this

Output 3 for SQL Array Functions Interview Question

We now proceed to aggregate the number of amenities at city level.

SELECT city,
       SUM(ARRAY_LENGTH(STRING_TO_ARRAY(BTRIM(amenities, '{}'), ','), 1)) AS num_amenities
FROM airbnb_search_details
GROUP BY 1 ;

Our output now looks like this.

Output 4 for SQL Array Functions Interview Question

We can now find the city with the highest number of amenities by sorting in descending order and using LIMIT 1 or more reliably, by ranking them.

SELECT CITY
FROM
  (SELECT city,
          DENSE_RANK() OVER (
                             ORDER BY num_amenities DESC) AS rank
   FROM
     (SELECT city ,
             SUM(ARRAY_LENGTH(STRING_TO_ARRAY(BTRIM(amenities, '{}'), ','), 1)) AS num_amenities
      FROM airbnb_search_details
      GROUP BY 1) Q1) Q2
WHERE rank = 1 ;

Splitting an Array

The above problem could have also been solved by exploding the array into individual rows and then aggregating the number of amenities for each city. Let us use this method in another SQL data science question from Meta (Facebook) interview.

Views Per Keyword

Find the number of views for each keyword. Report the keyword and the total views in the decreasing order of the views.

SQL Array Functions Interview Question from Facebook

You can solve the problem here: https://platform.stratascratch.com/coding/9791-views-per-keyword

The problem uses the facebook_posts and facebook_post_views datasets. The fields present in the facebook_posts dataset are

facebook_posts

post_idint
posterint
post_textvarchar
post_keywordsvarchar
post_datedatetime

The data is presented in the following manner

Data for SQL Array Functions Interview Question from Facebook

The facebook_post_views has the following fields

facebook_post_views

post_idint
viewer_idint

And this is how the data in looks

Data 2 for SQL Array Functions Interview Question from Facebook

Approach and Solution

Let us break this problem into individual parts.

  • We start off by merging the two datasets on the post_id field. We need to aggregate the number of views for each post
SELECT fp.post_id,
       fp.post_keywords,
       COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views
FROM facebook_posts fp
LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id
GROUP BY 1,
         2 ;

We get the following output.

Output for SQL Array Functions Interview Question from Facebook
  • We need to assign the views to each keyword. For example for post_id = 3, the keyword sphagetti and the keyword food should each get 3 views. For post_id = 4, the spam keyword should get 3 views and so on. To accomplish this, we first clean the string stripping the brackets and the # symbol.
SELECT fp.post_id,
       STRING_TO_ARRAY(BTRIM(fp.post_keywords, '[]#'), ',') AS keyword,
       COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views
FROM facebook_posts fp
LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id
GROUP BY 1,
         2 ;

We get the following output

Output 2 for SQL Array Functions Interview Question from Facebook
  • Now we separate (explode) the array into individual records using the UNNEST function.
SELECT fp.post_id,
       UNNEST(STRING_TO_ARRAY(BTRIM(fp.post_keywords, '[]#'), ',')) AS keyword,
       COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views
FROM facebook_posts fp
LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id
GROUP BY 1,
         2 ;
Output 3 for SQL Array Functions Interview Question from Facebook
  • We can now easily aggregate the number of views per keyword and sort them in descending order.
WITH exp_keywords AS
  (SELECT fp.post_id ,
          UNNEST(STRING_TO_ARRAY(BTRIM(fp.post_keywords, '[]#'), ',')) AS keyword ,
          COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views
   FROM facebook_posts fp
   LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id
   GROUP BY 1,
            2)
SELECT keyword,
       sum(num_views) AS total_views
FROM exp_keywords
GROUP BY 1
ORDER BY 2 DESC ;

Aggregating Text Fields

Let us finish things off by doing the converse. Aggregating rows back into a string. We illustrate this with a SQL Data Science Interview question from Google.

File Contents Shuffle

Rearrange the words of the filename final.txt to make a new file named wacky.txt. Sort all the words in alphabetical order, output the words in column and the filename wacky.txt in another.

SQL Array Functions Interview Question from Google

You can solve the problem here: https://platform.stratascratch.com/coding/9818-file-contents-shuffle

The problem uses the google_file_store dataset with the following columns.

google_file_store

filenamevarchar
contentsvarchar

The contents of the dataset look like this.

Data for SQL Array Functions Interview Question from Google

Approach and Solution

Let us solve this problem in a step-wise manner.

  • We first keep only the contents of the filename final.txt, split the contents using space as a delimiter, explode the resulting array into individual rows and sort in alphabetical order.
SELECT UNNEST(STRING_TO_ARRAY(CONTENTS, ' ')) AS words
FROM google_file_store
WHERE filename ILIKE '%FINAL%'
ORDER BY 1 ;

We get the following output

Output for SQL Array Functions Interview Question from Google
  • We now need to combine the individual words back into a string. To do this we use the STRING_AGG() function and specify space as the delimiter. This function is similar to the join() method in Python. We also add a filename for the new string and output.
WITH exploded_arr AS
  (SELECT UNNEST(STRING_TO_ARRAY(CONTENTS, ' ')) AS words
   FROM google_file_store
   WHERE filename ILIKE '%FINAL%'
   ORDER BY 1)
SELECT 'wacky.txt' AS filename,
       STRING_AGG(words, ' ') AS CONTENTS
FROM exploded_arr ;

Conclusion

In this article we looked at the text and array manipulation abilities of SQL. This is specifically useful in ETL process upstream as well as Analysis downstream. As with other Data Science areas, only patience, persistence and practice can make you proficient. On StrataScratch, we have over 700 coding and non-coding problems that are relevant to Data Science Interviews. These problems appeared in actual Data Science interviews at top companies like Google, Amazon, Microsoft, Netflix, et al. For e.g., check out our posts "40+ Data Science Interview Questions From Top Companies" and "The Ultimate Guide to SQL Interview Questions" to practice such interview questions and prepare for the most in-demand jobs at big tech firms and start-ups across the world.

String and Array Functions in SQL for Data Science


Become a data expert. Subscribe to our newsletter.