Data Engineer vs Data Scientist: Similarities and Differences

Data Engineer vs Data Scientist: Explaining what data scientists and data engineers have in common. And what they don’t.

If the similarities and differences between data scientists and data engineers haven’t been clear to you, you’re in the right place. We’re about to fix that.

This 'Data Engineer vs Data Scientist' article is here to help you understand both job positions and choose the one that’s right for you. We’ll break down both jobs in terms of education, salary, job description, and technical skills required.

Here’s a table showing an overview of the similarities and differences between data scientists and data engineers. These are just bullet points, so you get a feel for what we’re going to talk about. But don’t worry, we’ll go into more detail very soon.

[Table: Data Engineer vs. Data Scientist overview]

What is a Data Scientist?

The broadest definition of data science says it’s a discipline that combines computer programming, mathematics, and statistics to gain insights from data. Data scientists work closely with data: they organize it, clean it, and analyze it. The data is then used to build machine learning (ML) models. These models help find trends and patterns in data so that future business decisions have the intended effect, which is usually improved company performance in terms of sales, costs, and profit.

What is a Data Engineer?

A data scientist is the most general and broad job title there is in the data science field. It requires most of the skills you get if you’re coming from a data science background. All other jobs are, more or less, derived from a data scientist and focused on one particular job of a data scientist.

The same is true for a data engineer. They are mostly focused on data infrastructure, which involves getting data to other users. In other words, they maintain, optimize, and develop data infrastructure. They work with raw data and its quality, availability, and readability. They analyze it and transform it into formats suitable for others to use. They maintain the data pipeline, which means they are concerned with extracting, transforming, and loading (ETL) data.
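To make the extract, transform, and load steps concrete, here is a minimal sketch using only the Python standard library. The CSV content, the `users` table, and the function names are made up for illustration; a real pipeline would read from actual source systems and write to a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file, API, or database.
RAW_CSV = """user_id,signup_date,country
1,2023-01-05,us
2,,DE
3,2023-02-11,
"""

def extract(raw_text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: drop rows without a signup date and normalize country codes."""
    cleaned = []
    for row in rows:
        if not row["signup_date"]:
            continue  # unusable without a date
        country = row["country"].strip().upper() or "UNKNOWN"
        cleaned.append((int(row["user_id"]), row["signup_date"], country))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT * FROM users ORDER BY user_id").fetchall()
print(result)
```

Keeping the three steps as separate functions is one common way to make each stage testable on its own.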

Data Engineer vs Data Scientist: Formal Education

To be clear from the start: there are no strict rules you must follow to become a data scientist or a data engineer, and that includes formal education. There are data scientists and data engineers with very different educational backgrounds, work experience, etc. You don’t even have to have a college degree. However, it helps.

A BS helps, for example, especially if it’s a degree in computer science, statistics, mathematics, engineering, IT, economics, or any other relevant quantitative field. While there are no rules, most data scientists and data engineers do come from these fields of study.

If you have an MS or even a Ph.D. in any of these fields, even better. This could significantly boost your chances of getting a job and the salary you can negotiate.

Data Engineer vs Data Scientist: Courses and Certifications

It doesn’t matter if you have formal education or not; finishing some courses in your free time and getting the certifications will be very beneficial to your career. Not only because of the knowledge this usually guarantees, but also because it shows your willingness to improve your knowledge and work on yourself continually.

Data Scientist

Some of the additional courses and certifications you can take are:

  • IBM Data Science Professional Certificate
  • Microsoft Certified: Azure Data Scientist Associate
  • SAS Certified Data Scientist
  • SAS Certified AI & Machine Learning Professional
  • Dell EMC Data Science Track (EMCDS)

Data Engineer

You can use the above suggestions if you are, or want to become, a data engineer, too. After all, data engineering is part of data science. However, some certifications are more oriented to data engineering specifically, such as:

  • Google Professional Data Engineer
  • Cloudera Certified Professional (CCP): Data Engineer
  • IBM Certified Data Engineer – Big Data
  • SAS Certified Big Data Professional
  • Data Science Council of America (DASCA) Associate Big Data Engineer

Data Engineer vs Data Scientist: Technical Requirements

Data Scientist

Technical Skills

The technical skills you’ll need as a data scientist are in line with the job description. First, you have to be able to retrieve data, manipulate it, analyze it, and then visualize it. To do that, you need sound programming skills. You’ll also often have to extract data from APIs, so you’ll have to know how to do that. The purpose of all this data handling is to build, test, and deploy a machine learning model. For that, you’ll again need programming skills, but also skills in statistics, mathematics, and AI. Finally, you’ll need cloud computing skills because you need to deploy the model somehow and put it into practice.

Programming Languages

While data scientists are not programmers exclusively, they need to have a very strong knowledge of some programming languages. The languages that are usually used in data science are:

  • SQL
  • R
  • Python
  • Java/JavaScript
  • C/C++/C#

The first three are the most popular.

Tools

Data scientists usually work with large amounts of data. They work a lot with relational databases (e.g., MS SQL Server, PostgreSQL, MySQL), NoSQL databases (e.g., MongoDB, Cassandra, Couchbase), and data warehouses such as the cloud-based Snowflake or Apache Hive. Speaking of the cloud, you’ll probably work with cloud databases, since a growing number of companies are moving their data to the cloud. Major cloud platforms that host such databases include Amazon Web Services, Microsoft Azure, and Google Cloud, but they’re not the only ones. Finally, data scientists use data science and machine learning tools such as Jupyter Notebooks, MATLAB, KNIME, Azure Machine Learning Studio, IBM Watson Machine Learning, etc.

Data Engineer

Technical Skills

The data engineer’s focus is working with data infrastructure and raw data in general. This means data cleaning, preparation, manipulation, and analysis, so again you have to be good at programming and extracting data from APIs. Even though you’ll need mathematics and statistics skills for data analysis, you won’t use them as extensively as a data scientist does. The main reason is that you won’t be building ML models as a data engineer.

What you’ll do is ensure data scientists (and other colleagues working with data) have data they can build ML models on. To be good at that, you’ll need knowledge of databases, data warehousing, and ETL/ELT processes.

Programming Languages

You’ll be using all the programming languages commonly used in data science:

  • SQL
  • R
  • Python
  • Java/JavaScript
  • C/C++/C#

There are also two additional languages that are fairly common in data engineering:

  • Scala
  • Go

Both languages are primarily used in handling big data.

Tools

You’ll use most of the tools the data scientist uses. As you work with data, you’ll certainly need to work with some of the most popular RDBMSs and NoSQL databases. You’ll also use cloud databases and data warehouses. Generally, you won’t be using data science and machine learning tools. What you will use are ETL tools, such as Microsoft SSIS, XPlenty, Talend, Cognos Data Manager, etc. With the ELT approach (load the raw data first, then transform it inside the warehouse) becoming ever more popular in data engineering, you’ll probably use ELT tools too. Some examples are Talend, Hevo, Kafka, etc.
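The difference between ETL and ELT is mostly about where the transformation happens. As a rough sketch of the ELT order, the example below loads raw rows into a staging table first and only then cleans them with SQL inside the database. SQLite stands in for a warehouse here, and the table names and sample rows are invented for illustration.

```python
import sqlite3

# Hypothetical raw events, loaded exactly as they arrive (no cleaning yet).
raw_events = [
    ("1", " click ", "2023-03-01"),
    ("2", "VIEW", "2023-03-01"),
    ("2", "click", None),
]

conn = sqlite3.connect(":memory:")

# Load: raw data goes into a staging table first.
conn.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, day TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform: cleaning happens afterwards, inside the database, using SQL.
conn.execute("""
    CREATE TABLE clean_events AS
    SELECT CAST(user_id AS INTEGER) AS user_id,
           LOWER(TRIM(event))       AS event,
           day
    FROM raw_events
    WHERE day IS NOT NULL
""")

clean = conn.execute("SELECT * FROM clean_events ORDER BY user_id").fetchall()
print(clean)
```

In an ETL pipeline, the same cleaning would run in application code before anything is written to the target table.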

Data Engineer vs Data Scientist: Job Opportunities

Data Scientist

At the time of writing, there were more than 5,260 data scientist jobs on Glassdoor in the USA.

The places where you can work as a data scientist are diverse: scientific institutions and academies, financial institutions, pharmaceutical companies, consulting firms, engineering and tech companies, and anything in between. Basically, every company with a dose of self-respect uses data extensively, which means your expertise applies to (almost) any industry. What you sometimes need is some specific business and industry knowledge, which usually comes with experience.

Here we provide a very specific and practical guide on how to get a Data Science Job.

Data Engineer

For data engineers, there were about 3,215 jobs advertised on Glassdoor in the USA at the time of writing.

Regarding the diversity of the options, it’s similar to the data scientists. A really wide range of companies offers data engineering positions of various seniorities, such as Microsoft, Cisco Systems, Spotify, Netflix, MasterCard, Amazon Web Services, CLS Bank International, Adobe, University of Arizona, Apple, Tesla, Intel, Procter & Gamble, The New York Times, and so on.

Data Engineer vs Data Scientist: Salary

Data Scientist

On average, data scientists earn around $164k, according to Glassdoor. The total pay ranges between $141k and $192k, depending on the seniority and the company.

The estimated base pay is $145k/yr and the estimated additional pay is $19k/yr.

Our article How Much Do Data Scientists Make can help you find out about current salaries and how they are influenced by several factors.

Data Engineer

As a data engineer, you’ll earn less on average than a data scientist. According to Glassdoor, this means an annual salary of around $115k. The lowest reported salary is $93k, while the highest is around $144k. Again, this heavily depends on your experience, education, position seniority, and the company you work for.

Data Engineer vs Data Scientist: Job Interview Questions

To start your career as a data scientist or a data engineer, you need to get hired somewhere. That means going to job interviews and answering questions designed to test your knowledge. As with any other job, some of the questions are non-technical, but we won’t focus on those. We’ll go through the questions that test specific technical skills required for these two positions.

Data Scientist

The technical interview questions you’ll most likely get at the interview can be divided into the following categories:

  • Coding
  • Probability & statistics
  • Modeling
  • Technical
  • Product

We’ll go through every category, with an example data science interview question for each.

Coding

The data science coding interview questions are there to test your programming language skills. Here’s an example from Google that tests your SQL coding skills:

Correlation Between E-mails And Activity Time

“There are two tables with user activities. The `google_gmail_emails` table contains information about emails being sent to users. Each row in the table represents a message with a unique identifier in the `id` field. The `google_fit_location` table contains user activity logs from the Google Fit app.
Here you'll find the correlation between the number of emails received and the total exercise per day. The total exercise per day is calculated by counting the number of user sessions per day.”

Answer:

SELECT corr(COALESCE(n_emails :: NUMERIC, 0), COALESCE(total_exercise :: NUMERIC, 0))
FROM
  (SELECT to_user,
          DAY,
          COUNT(*) AS n_emails
   FROM google_gmail_emails
   GROUP BY to_user,
            DAY) mail_base
FULL OUTER JOIN
  (SELECT user_id,
          DAY,
          COUNT(DISTINCT session_id) AS total_exercise
   FROM google_fit_location
   GROUP BY user_id,
            DAY) loc_base ON mail_base.to_user = loc_base.user_id
AND mail_base.DAY = loc_base.DAY
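To see what the query does step by step, the same logic can be mirrored in plain Python. This is only an illustrative sketch: the sample rows are invented, and the `pearson` helper is a hand-written stand-in for the database's `corr()` aggregate.

```python
from collections import Counter
from math import sqrt

# Hypothetical sample rows; column names follow the query above.
emails = [  # (to_user, day)
    ("u1", 1), ("u1", 1), ("u2", 1), ("u3", 2),
]
sessions = [  # (user_id, day, session_id)
    ("u1", 1, "s1"), ("u1", 1, "s2"), ("u2", 1, "s3"), ("u4", 2, "s4"),
]

# Emails received per (user, day), like COUNT(*) ... GROUP BY to_user, day.
n_emails = Counter(emails)

# Distinct sessions per (user, day), like COUNT(DISTINCT session_id).
total_exercise = Counter((user, day) for user, day, _ in set(sessions))

# FULL OUTER JOIN: keep every (user, day) seen in either table.
# Missing counts become 0, which is what COALESCE(..., 0) does.
keys = sorted(set(n_emails) | set(total_exercise))
xs = [n_emails.get(k, 0) for k in keys]
ys = [total_exercise.get(k, 0) for k in keys]

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

correlation = pearson(xs, ys)
print(round(correlation, 3))
```

The full outer join matters: a user who received emails but never exercised (or the other way round) still contributes a data point with a zero on one side.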

Probability & Statistics

Along with coding, you’ll need statistics knowledge to do your main job, which is building ML models. Expect questions similar to this one from DE Shaw & Co:

Expectation Of A Gaussian

“You are given two Gaussian variables: X_1 and X_2 with means m_1, m_2 and variance v_1, v_2.
Suppose you know the sum X_1 + X_2 is equal to n. What is the expected value of X_2?”

Answer:

One possible answer:

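The question asks for a conditional expectation, E[X_2 | X_1 + X_2 = n]. Knowing the sum tells us something about X_2, so the plain mean m_2 is not the answer. We assume that X_1 and X_2 are independent (the question does not say so explicitly).

For a Gaussian variable, the expected value equals its mean:

E[X_1] = m_1, E[X_2] = m_2

Let S = X_1 + X_2. Then S is Gaussian with mean m_1 + m_2 and variance v_1 + v_2, and Cov(X_2, S) = v_2. Since X_2 and S are jointly Gaussian, the conditional expectation is linear in the observed value of S:

E[X_2 | S = n] = m_2 + Cov(X_2, S) / Var(S) · (n − E[S])
E[X_2 | S = n] = m_2 + v_2 / (v_1 + v_2) · (n − m_1 − m_2)

As a check, the same argument gives E[X_1 | S = n] = m_1 + v_1 / (v_1 + v_2) · (n − m_1 − m_2), and the two conditional expectations add up to n, as they must.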
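If X_1 and X_2 are assumed to be independent (the question does not state this), the conditional expectation has the closed form E[X_2 | X_1 + X_2 = n] = m_2 + v_2 / (v_1 + v_2) · (n − m_1 − m_2). The sketch below computes it and sanity-checks it with a crude simulation that keeps only draws whose sum lands close to n. The function names, the tolerance, and the example parameters are illustrative choices, not part of the original question.

```python
import random

def conditional_mean_x2(m1, m2, v1, v2, n):
    """E[X_2 | X_1 + X_2 = n], assuming X_1 and X_2 are independent Gaussians."""
    return m2 + v2 / (v1 + v2) * (n - m1 - m2)

def simulate(m1, m2, v1, v2, n, tol=0.05, draws=200_000, seed=0):
    """Average X_2 over simulated pairs whose sum falls within tol of n."""
    rng = random.Random(seed)
    kept = []
    for _ in range(draws):
        x1 = rng.gauss(m1, v1 ** 0.5)  # gauss() takes a standard deviation
        x2 = rng.gauss(m2, v2 ** 0.5)
        if abs(x1 + x2 - n) < tol:
            kept.append(x2)
    return sum(kept) / len(kept)

# Example values: m_1 = 1, m_2 = 2, v_1 = 1, v_2 = 3, observed sum n = 5.
exact = conditional_mean_x2(1, 2, 1, 3, 5)
approx = simulate(1, 2, 1, 3, 5)
print(exact, approx)
```

The simulated average should land close to the closed-form value; the variable with the larger variance absorbs the larger share of the gap between n and m_1 + m_2.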

Modeling

The modeling questions test your skills in building models. Usually, this also means testing your statistics knowledge, but on a less theoretical level. One such example is this question from Via Transportation (New York):

Changing the Scale of Distance

“If we want to build a logistic regression model with the distance (between the rider's current location and the pickup location) as the feature and the rider's acceptance as output, what would be the meaning of the coefficient of the feature? What will happen to the model if we change the scale of the distance (from the mile to km, or from km to m)?”
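One way to reason about this question: in a logistic regression, the coefficient is the change in the log-odds of acceptance per one unit of distance, so exp(coefficient) is the odds ratio per unit. Changing the unit only rescales the coefficient and leaves the predictions unchanged, at least for an unregularized model. The sketch below illustrates this with made-up parameter values; nothing here is fitted to real data.

```python
from math import exp

MILES_TO_KM = 1.609344

def acceptance_probability(intercept, coef, distance):
    """Logistic model: P(accept) = 1 / (1 + exp(-(intercept + coef * distance)))."""
    return 1.0 / (1.0 + exp(-(intercept + coef * distance)))

# Hypothetical fitted parameters with distance measured in miles.
intercept, coef_per_mile = 2.0, -0.8

# The same model with distance in km: the coefficient shrinks by the unit factor.
coef_per_km = coef_per_mile / MILES_TO_KM

distance_miles = 3.0
p_miles = acceptance_probability(intercept, coef_per_mile, distance_miles)
p_km = acceptance_probability(intercept, coef_per_km, distance_miles * MILES_TO_KM)

# exp(coef) is the odds ratio for one extra unit of distance.
odds_ratio_per_mile = exp(coef_per_mile)
print(p_miles, p_km, odds_ratio_per_mile)
```

With regularization (for example an L1 or L2 penalty), the penalty depends on the coefficient's size, so rescaling a feature can change the fitted model. This is one reason features are often standardized first.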

Technical

The technical questions usually test your knowledge of some programming language. Unlike the coding questions, the technical questions don’t require you to write code, but rather to answer in a descriptive way. Here’s an example of such a question, this one from Walmart, testing your Python knowledge:

Data Structures in Python

“What are the data structures in Python?”

Answer:

Commonly used data structures in Python:

List: a List is mutable, can contain duplicate records, and can contain different types of objects, whether it's a string, int, float, etc.

my_list = ['banana', 8, 3.14, 'banana']

Set: a Set is an unordered collection of unique objects, so duplicates are dropped automatically.

fruit = {'grapes', 'banana', 'apple', 'banana'}
print(fruit)  # prints {'grapes', 'banana', 'apple'} (element order may vary)

Tuple: a Tuple is similar to a List. The key difference is that a Tuple is immutable, i.e. once a Tuple is created, its elements can't be changed. They can only be read.

my_tuple = 'banana', 8, 3.14
print(my_tuple)  # prints ('banana', 8, 3.14)
my_tuple[0] = 'apple'  # raises TypeError: 'tuple' object does not support item assignment

Dictionary: a Dictionary stores key-value pairs, where each key is unique.

employee = {'1001': 'David', '2002': 'Jack'}
print(employee['1001'])  # prints David

Product

These questions test your technical skills, but also your knowledge of the company’s products and understanding of their business. For example, eBay asks this question in job interviews:

Identify Ebay Objects

“Ebay has to identify the cameras from the other objects like tripods, cables and batteries. What would be your approach? Data include ads title, description of the product, price, images etc.”

Data Engineer

The job interview questions for data engineers don’t differ much from those for data scientists. The question categories you should expect are:

  • Coding
  • Probability & statistics
  • Technical
  • Product
  • System design

The main difference is that you won’t be asked the modeling questions due to the nature of the job. And the probability & statistics questions will be rarer and probably easier. However, you’ll get one question category that the data scientists don’t get: system design questions.

The question by Facebook is one such example:

Comparing Performance of Engines

“How would you compare the relative performance of two different backend engines for automated generation of Facebook "Friend" suggestions?”

Another example would be the one asked by General Assembly:

Python Dictionary to Store Data

“When would I use a Python dictionary to store data, instead of another data structure?”
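A typical answer is that a dictionary fits when you need to look values up by a unique key rather than by position. The sketch below contrasts a linear scan over a list of pairs with a dictionary lookup; the records and the helper name are made up for illustration.

```python
# Hypothetical employee records keyed by ID.
records = [("1001", "David"), ("2002", "Jack"), ("3003", "Maria")]

def find_in_list(pairs, key):
    """Linear scan: checks pairs one by one, O(n) in the worst case."""
    for k, value in pairs:
        if k == key:
            return value
    return None

# Building the dict once makes each later lookup O(1) on average.
by_id = dict(records)

# Both approaches return the same value; only the lookup cost differs.
assert find_in_list(records, "2002") == by_id.get("2002") == "Jack"

# A default can be supplied for keys that are absent.
print(by_id.get("9999", "not found"))
```

For a handful of records the difference is negligible; it starts to matter when the same collection is searched many times or grows large.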

The questions in other categories are more or less the same as those you could expect at a data scientist interview. Bear in mind that they could be more oriented toward raw data and data architecture.

For example, the technical question could be something like the one by Airbnb:

Impute Missing Information

“How would you impute missing information?”

Answer:

“The methods to deal with missing values depend on the types of data that we have: either it is numerical data or categorical data.

If we have numerical data:

  • Use the mean value to fill the missing data. This method works well when the proportion of missing values is small. However, we need to make sure that there are no outliers in our data. If there are outliers, filling missing values with the mean would introduce bias in our data.
  • Use the median value to fill the missing data. This method also works well when the proportion of missing values is small. The median is more robust to outliers than the mean, so if there are outliers in our data, the median would be a better choice.
  • Use forward fill to fill the missing data. If we're dealing with ordered data that follows a pattern, such as a time series, forward fill would be a good choice. Forward fill replaces each missing value with the previous observed value.
  • Use backward fill to fill the missing data. As with forward fill, this method would be a good choice if we're dealing with ordered data that follows a pattern. Backward fill replaces each missing value with the next observed value.
  • Use machine learning algorithms like linear regression to predict the missing values. We can build a simple linear regression model that predicts the missing values from the other features.

If we have categorical data:

Use mode to fill the missing values. Mode will fill the missing values with the most frequent categorical value in the data.”
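The imputation strategies listed above can be sketched with the Python standard library alone. The helper functions and sample columns are hypothetical, with `None` marking a missing entry; in practice a library such as pandas would usually handle this.

```python
from statistics import mean, median, mode

def fill_with(values, fill_value):
    """Replace every None with a single precomputed value."""
    return [fill_value if v is None else v for v in values]

def forward_fill(values):
    """Replace each None with the most recent non-missing value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def backward_fill(values):
    """Replace each None with the next non-missing value after it."""
    return forward_fill(values[::-1])[::-1]

# Hypothetical columns; 100 is a deliberate outlier.
ages = [20, None, 22, 100, None]
colors = ["red", None, "blue", "red"]

observed = [v for v in ages if v is not None]
mean_filled = fill_with(ages, mean(observed))      # pulled up by the outlier
median_filled = fill_with(ages, median(observed))  # robust to the outlier
ffilled = forward_fill(ages)
bfilled = backward_fill(ages)  # a trailing gap stays None: nothing follows it
mode_filled = fill_with(colors, mode(c for c in colors if c is not None))

print(mean_filled, median_filled, ffilled, bfilled, mode_filled, sep="\n")
```

Forward and backward fill cannot fill a gap at the very start or very end, respectively, so a fallback strategy may still be needed there.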

Data Engineer vs Data Scientist: Conclusion

The above 'Data Engineer vs Data Scientist' comparison showed you there are more similarities than differences between data scientists and data engineers.

Data scientist is the most general job title encompassing all the knowledge and skills you need to have if coming from a data science background.

Data engineers are data scientists focused mainly on one particular aspect of data science: handling the raw data and data infrastructure.

While data engineers are generally paid less, both jobs are highly paid and require specific skills. The job interview questions usually test similar technical aspects of the job. The main difference is that data scientists will have to answer more statistics and modeling questions. On the other hand, data engineers will have to show more understanding of system design. This can also be reflected in other question categories, which will focus on the system design aspect of data science.

Now you can more easily choose between those two career paths, deciding on the one that is closer to your interests and skills.

