Data Science for College Students - Courses

2017-10-07

Giving advice has to be one of the hardest things somebody can do, and yet people ask for it at an alarming rate. Perhaps the best way to start, then, is with a reminder that life is hard and it’s okay to not know what you’re doing. Your peers will pretend that they have everything figured out, and they will say as much, but know that you’re not seeing their own struggles.

The other caveat I want to set out up front is that advice is contextual. This advice is for undergrads attending a top US university; that’s not a prerequisite for doing data science (at all!) but it’s what I’m familiar with.

Okay, so data science. In general, optimize for learning.

At the undergraduate level, learning for data science usually means taking as many statistics courses as you can. Math is good, as is computer science, but they are not as important as people make them out to be. For the most part, you don’t need to derive things, so math beyond linear algebra is nice but not required. In addition, learning to write statistical code and data pipelines is more about patience than theory. A minor in computer science is probably enough for you to be comfortable writing code and to have a general idea of programming paradigms.

Learning also means taking good classes, even if you expect a GPA hit. Tech companies don’t care that much about your GPA, and are in general accustomed to lower GPAs from engineering students. Prerequisites are also rarely hard prerequisites: if you think you can hack the course and are willing to learn what you don’t know as you go, try to take the course. Professors notice if you’re a sophomore in an upper-level course, and it’s a good way to stand out.

A lot of schools are starting “business analytics” or “data science” majors, but it’s hard to evaluate them as a group. The good implementations follow my advice above and focus on statistics courses with a few CS courses. A really good major would include a “capstone” project-based or research-based course. I think a lot of these dual majors, though, enable students to take easy classes - every major has easy electives, and when you allow courses from two majors to count, you double the easy electives you can take. This is a bad design from an organizational standpoint, because if you allow some students to graduate with nonsense courses then the major’s reputation will be diluted. This is especially important in the infancy of these majors.

I think an ideal curriculum would look something like:

At least one writing class
- The most important part of data science is being able to communicate your thoughts effectively. You should take the class, and take it seriously.
- Similarly, take a rhetoric class if you can. Data scientists often have to advocate for product decisions, and understanding how to frame an argument is invaluable.
A solid mathematical foundation in calculus and linear algebra
- Linal is more important than calculus, but if memory serves, every college makes you take single variable calculus -> multi-variable -> differential equations -> linear algebra
- I don’t remember what a good calculus book is
- Linear Algebra, Strang
A math-heavy introduction to statistics and probability
- A First Course in Probability, Ross
- An Introduction to Probability Theory and Its Applications, Feller
A solid course on statistical inference
- Ideally you would work with Python or R in this course.
- Statistics and Data Analysis, Tamhane & Dunlop
- All of Statistics, Wasserman
Course in data analysis
- ISLR is the most important book here; if I was only going to read one, that’d be it.
- Introduction to Statistical Learning, James, Witten, Hastie and Tibshirani
“Just enough programming”
- Take the core CS courses through algorithms
- Introduction to Algorithms, CLRS
- I guess you could take databases? My strong suspicion is that there are more efficient uses of your time but I haven’t found a good substitute other than ‘working in industry.’
Machine learning
- Some schools (e.g. Berkeley, CMU) are lucky enough to have ML electives; take those if you can.
- Given the choice between an applied ML course and a mathematical ML course, take the math one.
- There’s a ton of books but I don’t know how good they are; I learned from my professor’s (somewhat confusing) slides. This is one area where it’s important to follow a single source: there’s a lot of confusing terminology, and people absolutely will mix and match terms.
- When in doubt, read ESL.
- The Elements of Statistical Learning, Hastie, Tibshirani, Friedman
Applications of data analysis
- Every industry has different types of data, e.g. retail has purchase data, pharma has outcomes data, and tech has user event logging. How can we take our techniques and apply them to these types of data?
- Really dependent on your school and what professors are researching
- My favorite at Wharton was STAT476: Applied Probability Models in Marketing.

It would be good to shore up the places in which your curriculum is lacking with self-study.

Did I forget something? Do you disagree? Let me know at hua.christopher+hl[@]gmail.com