How to Start Learning Data Science from Scratch
If you don't know how to start learning data science from scratch, you're in the right place to get ready for industry.
Data science has become a dream job for many of us, but it's hard to know how and where to start learning it. Many of you think you need an undergraduate degree in engineering or a background in stats/math. That will undoubtedly help, but let me tell you that it is not necessary. Broadly, there are three ways to start learning data science:
- Get a bachelor's or master's degree in data science
- Get into a bootcamp program
- Learn it by yourself
Everyone’s journey and background are different, so I’d like to toss my experience into the mix. In this article, I want to focus on resources you can use to improve your technical skills, and on more specific resources to help you get your first job in data science, because getting that first job is the hardest part. Once you have it, you’ll learn the skills you need so fast that you won’t need people like me giving you advice.
Technical Concepts that You Need to Learn for a Career in Data Science
It’s really hard to learn data science and actually be good at it. I never went to school specifically for data science, but I do have a technical background. Even then, it took me a while to gain all the skills I needed to be competent. And that’s because there’s a long laundry list of things you do need to learn to be a data scientist. The major topics are:
- Data analysis (LeetCode, StrataScratch)
- Machine learning (projects)
- Traditional statistics
- Theory of your machine learning models (double-dip)
- Product sense/business cases
These items fall under three broader topics: programming, statistics, and product sense. Taken together, they represent three different professions, right? It's like being a software engineer, a mathematician, and a business person with an MBA all at once.
How do you start learning data science from scratch to be proficient enough to actually land a job? Let’s take a deeper dive into these topics.
Resources to Start Learning Data Science from Scratch
Let’s get into these three topics by compiling the top resources to learn data science that can help you improve your skillset.
1. Programming
In data science, programming is probably the hardest and most time-consuming skill to learn. What’s hard about programming isn’t learning the syntax of, say, SQL and Python. It’s learning how to approach solutions and implement them.
Within programming, you have data analysis and machine learning.
Data analysis is all about being able to pull and manipulate data, and generate insights and recommendations. You’ll need to know both SQL and another scripting language, usually Python or R.
Everyone’s going to tell you to do projects to get better, and I agree with them, but let me give you another piece of advice: try doing interview questions. What better way to succeed in an interview than by doing a ton of interview questions to get better at data analysis? The main benefit is that you’re solving problems that are relevant to the industries and companies that hire data scientists. So when you’re interviewing, you’ll be ready to answer most questions easily, because you’ll have mastered the technical skills companies want you to have before you work for them.
There are so many platforms out there that can help with interview coding practice. The most popular is LeetCode, which you probably know already. But LeetCode is tailored to software engineers, so take that with a grain of salt when you’re doing the problems. There’s also StrataScratch, which I built specifically for data scientists. That can help as well.
For data analysis, I’d suggest learning and mastering both SQL and either Python or R, and doing as many relevant interview questions as possible, both to understand what companies are looking for in candidates and to sharpen your technical coding skills.
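As a rough illustration of the kind of interview-style question described above, here is a minimal sketch that combines SQL and Python using only the standard library. The `orders` table, its columns, the sample rows, and the `top_customers` helper are all made up for this example and are not taken from any real platform.

```python
import sqlite3

# Hypothetical sample data: (order_id, customer, amount).
ORDERS = [
    (1, "alice", 120.0),
    (2, "bob", 80.0),
    (3, "alice", 45.5),
    (4, "carol", 200.0),
    (5, "bob", 95.0),
]


def top_customers(rows, limit=2):
    """Return the `limit` customers with the highest total spend."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute(
            "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        # Pull and aggregate in SQL: total spend per customer, largest first.
        query = """
            SELECT customer, SUM(amount) AS total_spend
            FROM orders
            GROUP BY customer
            ORDER BY total_spend DESC
            LIMIT ?
        """
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()


result = top_customers(ORDERS)
print(result)
```

The split is the usual one: SQL does the pulling and aggregating, and Python handles whatever comes next, such as formatting the result or turning it into a recommendation.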
Machine learning, specifically implementing machine learning models, is another programming skill you need to learn in data science.
You’ll usually need to know Python or R well, and understand the data science workflow, to build and implement these models. This is where I’d recommend doing projects. There are so many places where you can find them, and one of the most popular is Kaggle. Find a project there, grab the dataset, install Jupyter Notebook, and do the project. Then talk to people to see what you can do to improve.
Another resource is confetti.ai which has a bunch of ML type questions to help you get better at implementing machine learning models. They have a ton of practical examples that require coding as well as theoretical questions to help you understand what the model’s actually doing.
Learning how to implement machine learning models is probably where I’d spend most of my time in learning data science, to be honest. And it’s not because you’ll be implementing ML models every day as a data scientist, it’s actually to learn the data science workflow in terms of pulling data, manipulating data, feature engineering, model implementation, model optimization, and recommendation. Being good at that workflow, understanding why you’re making certain decisions, and why you’re making a recommendation is something you’d do every day on the job and you need to be good at it. This topic takes a long time to get good at.
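To make that workflow concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data is synthetic (made-up house sizes and prices), the model is a one-feature least-squares line, and the model optimization step is left out to keep it short. A real project would typically use a library such as pandas or scikit-learn instead.

```python
import random
from statistics import mean


def pull_data(n=200, seed=0):
    """Stand-in for pulling data: synthetic rows with made-up sqft and price."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        sqft = rng.uniform(500, 3000)
        price = 50_000 + 150 * sqft + rng.gauss(0, 20_000)
        rows.append({"sqft": sqft, "price": price})
    return rows


def fit_line(xs, ys):
    """Ordinary least squares with one feature; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


rows = pull_data()                                # 1. pull data
clean = [r for r in rows if r["price"] > 0]       # 2. manipulate: drop bad rows
xs = [r["sqft"] / 1000 for r in clean]            # 3. feature: size in 1,000 sqft
ys = [r["price"] for r in clean]

split = int(0.8 * len(xs))                        # hold out the last 20%
slope, intercept = fit_line(xs[:split], ys[:split])          # 4. fit the model
score = r_squared(xs[split:], ys[split:], slope, intercept)  # 5. evaluate

# 6. recommendation: say what the model means in plain terms
print(f"Each extra 1,000 sqft adds roughly ${slope:,.0f} (held-out R^2 = {score:.2f})")
```

The model itself is only a few lines. Most of the code, and most of the judgment, sits in the steps around it, which is the part of the workflow you would repeat every day on the job.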
2. Statistics / Probability
The second technical topic to learn in data science is statistics and probability. Data science is statistics in a nutshell. If you’re not implementing an ML model or regression, or designing experiments, then you’re an analyst, not a data scientist.
Everything is statistics, so let’s break it up into how to better understand statistics for data science.
I just talked about implementing ML models, right? So, what are ML models? They’re just statistical models. And as someone who builds them, you’d want to learn how they work. As I was doing projects and building out my models, both ML and regression models, I was reading about the theory and math behind them. That gave me a better grasp of each model’s underlying assumptions, which helped me clean my data and design my features, which in turn helped me develop more accurate models. Interviewers are almost certainly going to ask you about ML and regression theory, because if you don’t know why you’re doing what you’re doing, no one can trust your results and recommendations.
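Here is a small sketch of why a model's assumptions matter, using made-up data. A straight-line regression assumes the relationship is linear. If you fit a line to data that actually follows a curve (here y = x squared, chosen purely for illustration), the residuals show a systematic pattern instead of looking like random noise.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one feature; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up data that violates the linearity assumption: y = x^2.
xs = list(range(-5, 6))
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# If the linear model were appropriate, the signs would look random.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(signs)  # prints ++-------++
```

The fitted line is flat and the residuals are positive at both ends and negative in the middle. That U-shaped pattern suggests the data needs a different feature (such as x squared) or a different model. Knowing the theory is what tells you to look for it.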
Resources to Learn Machine Learning and Regression Theory for Data Science
So where do you go to learn about ML and regression theory? The best resources I’ve found come from Google searches that take you to Medium, Wikipedia, or some other authoritative site. Read a bunch of articles on a model and you’ll come away with a better understanding of the underlying theory.
Resources to Practice Traditional Statistics & Probability
One site I used a lot, specifically for interview practice, is Brilliant.org. It’s good because its questions are similar to the ones you might get in a data science interview. Just like you’d use LeetCode or StrataScratch to get better at programming, you can use Brilliant.org to get better at statistics and probability.
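As an illustration of the style of probability question you might practice (this one is a classic, not taken from any particular site): what is the probability of rolling at least one six in four rolls of a fair die? The sketch below, with hypothetical function names, solves it exactly and then checks the answer by simulation, which is a handy habit when you are unsure of your math.

```python
import random


def exact_at_least_one_six(n_rolls):
    """P(at least one six) = 1 - P(no six in any roll) = 1 - (5/6)^n."""
    return 1 - (5 / 6) ** n_rolls


def simulate_at_least_one_six(n_rolls, trials=100_000, seed=42):
    """Estimate the same probability by repeating the experiment many times."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.randint(1, 6) == 6 for _ in range(n_rolls))
        for _ in range(trials)
    )
    return hits / trials


exact = exact_at_least_one_six(4)
estimate = simulate_at_least_one_six(4)
print(f"exact={exact:.4f}, simulated={estimate:.4f}")
```

The exact answer is 671/1296, or about 0.518, and the simulated estimate should land close to it.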
In summary, learning statistics and probability is a matter of:
- Learning the theory behind ML models and regression. You can do this through projects where you’re implementing models and reading about the underlying theory of each model.
- Getting good at interview questions. You can get better at this through platforms that specialize in stats, like Brilliant.org.
3. Product Sense / Business Cases
The third topic to learn in data science is Product Sense. This is a non-technical concept that you’ll need to learn to be a data scientist.
What is product sense?
Product sense is similar to product management, though not quite the same. It means looking at a problem and making decisions through a business lens.
It deals with questions like:
- How would you measure the success of different parts of the product?
- How would you tell if a product is performing well or not?
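One way to make questions like these concrete is to pick a simple success metric and compute it per product area. The sketch below is only an assumption about what that could look like: the event log, the feature names, and the "completion rate" metric are all hypothetical, and the right metric always depends on the product.

```python
from collections import defaultdict

# Hypothetical event log: (user, feature, event), where event is
# "start" (the user began the flow) or "complete" (the user finished it).
EVENTS = [
    ("u1", "onboarding", "start"), ("u1", "onboarding", "complete"),
    ("u2", "onboarding", "start"),
    ("u3", "onboarding", "start"),
    ("u4", "onboarding", "start"), ("u4", "onboarding", "complete"),
    ("u1", "checkout", "start"), ("u1", "checkout", "complete"),
    ("u2", "checkout", "start"), ("u2", "checkout", "complete"),
    ("u4", "checkout", "start"), ("u4", "checkout", "complete"),
    ("u5", "checkout", "start"),
]


def completion_rate_by_feature(events):
    """Share of users who started each feature's flow and also completed it."""
    started = defaultdict(set)
    completed = defaultdict(set)
    for user, feature, event in events:
        if event == "start":
            started[feature].add(user)
        elif event == "complete":
            completed[feature].add(user)
    return {
        feature: len(completed[feature] & users) / len(users)
        for feature, users in started.items()
    }


rates = completion_rate_by_feature(EVENTS)
for feature, rate in rates.items():
    print(f"{feature}: {rate:.0%} of users who started also completed")
```

With this made-up data, onboarding completes for half of the users who start it and checkout for three quarters. The number alone is not the answer, though. Product sense is deciding which metric matters and what you would do about it.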
Why do you need to know this information as a data scientist?
Because it helps you figure out how to approach and analyze a problem, in order to make a recommendation to solve the problem. If you’re not optimizing for the business/product, you’re optimizing for your model, and you don’t need to make your model 100% accurate to drive business impact.
How do you get better at product sense?
For me, it was reading product management case studies to understand how PMs think and make decisions. Besides written case studies, there are also videos and platforms. YouTube has a ton of PM videos, and there’s a popular channel and platform called Exponent where you can learn a lot about PM. PM skills translate very well to data science product sense, so reading case studies and watching PM videos helps you build your skills here.
Another option is reading questions on Glassdoor and seeing other people’s responses. It wasn’t the best option for me because the quality of the responses varies, but it’s free.
There’s a lot to unpack here to understand how to start learning data science from scratch. To summarize, I learned data science by breaking it down into three topics:
- Programming
- Statistics and probability
- Product sense
Within both programming and stats, you should have an understanding of machine learning and regression: how to implement the models and the theory behind them. You can say there are 3-4 different topics that a data scientist should know. It’s hard, and it takes a while to get good at all of them.
Take a look at the resources I recommended. I’ve used all of them in the past and found them valuable in my journey to becoming a successful data scientist.