HOWTO: Get Started in Data Science
Let’s unpack the data science umbrella and see how to get started.
The What
The exact responsibilities of a Data Scientist will depend on the company, but I like the following categorization by Cassie Kozyrkov:
- Data Analytics
- Machine Learning
- Statistics
Below is a brief overview of the role behind each area. Ideally, each position is filled by a dedicated specialist, but in practice the responsibilities often overlap.
Data Analyst
Explores the data for actionable insights. A good analyst does so quickly, does not jump to conclusions, and communicates the findings effectively up the chain of command.
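To make "exploring the data" a little more concrete, here is a minimal sketch of a typical first pass: group some records and compare averages. The order records, the region names, and the `average_by_group` helper are all made up for illustration, and a real analyst would more likely reach for a dedicated data analysis library than for plain Python.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical order records: (region, order value). Invented for illustration.
orders = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 100.0),
    ("south", 60.0),
    ("west", 200.0),
]


def average_by_group(rows):
    """Return the mean value per group as a dict."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {group: mean(values) for group, values in groups.items()}


averages = average_by_group(orders)

# Rank the regions from highest to lowest average order value.
ranking = sorted(averages, key=averages.get, reverse=True)
print(ranking)
```

The ranking is only a starting point. Part of not jumping to conclusions is noticing that the top region here rests on a single order.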
Machine Learning Engineer
Builds statistical and/or algorithmic data analysis models. These models typically automate or simplify human work.
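As a rough illustration of what "building a model" means, the sketch below fits a straight line to a handful of points with ordinary least squares and uses it to predict a new value. The study-hours data and the `fit_line` helper are assumptions made up for this example. Production models are built with dedicated libraries and far more data, but the idea of learning parameters from examples is the same.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0]
scores = [52.0, 54.0, 56.0, 58.0]

slope, intercept = fit_line(hours, scores)

# Use the fitted line to predict the score after 5 hours of study.
prediction = intercept + slope * 5.0
print(slope, intercept, prediction)
```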
Statistician
Conducts a rigorous analysis to back important decisions. This typically includes checking technical assumptions, hypothesis testing, examining the data collection process, and so forth.
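For a taste of hypothesis testing, here is a minimal sketch of an exact permutation test, one of several possible tests and chosen here only because it needs no formulas beyond the mean. It asks: if the group labels were meaningless, how often would a random relabeling produce a difference in means at least as large as the observed one? The two groups of measurements and the `permutation_p_value` helper are made up for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = group_a + group_b
    total = sum(pooled)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    count = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point round-off.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical measurements for a treatment group and a control group.
treatment = [12, 14, 15, 16]
control = [8, 9, 10, 11]

p_value = permutation_p_value(treatment, control)
print(p_value)
```

Enumerating every split is only feasible for tiny samples like this one. With more data, the usual approach is to sample random relabelings instead.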
The How
I assume you are starting from scratch: you have just graduated from high school and know nothing about programming or statistics.
There are two main approaches to learning anything: starting from the bottom or starting from the top.
Bottom-Up
The traditional way: start from the fundamentals and low-level concepts. This approach lets you build a strong foundation, but it can be daunting because you lose the big picture and cannot see how everything fits together.
Top-Down
Start from the top and get a quick overview of the field. Then try things out and learn to use the essential tools of the trade. Only then dive deep into the technical details of how and why things work.
Stirred, not shaken
Unless you are in the habit of going down every rabbit hole you encounter, the top-down approach is both more fun and more effective.
Take your time, or iterate by going deeper with each pass, but always learn the fundamentals properly. As the deliberate misquote of double-oh-seven in the heading puts it, your fundamentals should be stirred, not shaken, by fun.
- Start from the top, get an overview.
- Do fun things.
- Learn the fundamentals and do the problem sets.
The Where
Here is a list of specific recommendations to get you started. The list is not exhaustive by any means. In particular, I have resisted the urge to include more computer science and software engineering material (these probably deserve their own posts). It also puts more emphasis on the Machine Learning and Statistics tracks because of my personal experience; read more about Data Analytics here.
Introduction to Programming
Introduction to Data Science
- MIT-6.0002: Introduction To Computational Thinking And Data Science
- Introduction to Statistical Learning with R (highly recommended, now also with Python)
Tools & Practice
Fundamentals
- MIT-18.01SC: Single Variable Calculus
- MIT-18.02SC: Multivariable Calculus
- MIT-18.06SC: Linear Algebra (mixed feelings)
- All of Statistics (highly recommended)
Conclusion
There is an abundance of learning resources online. This is great, but it can also be a curse that puts you in a perpetual state of indecision. Start somewhere, start anywhere.