Data Science Interview Questions: The stupidly easy, the pretty damn hard, and the lessons in between
Data science is an explosive field with high starting salaries. It is now overflowing with individuals changing their careers to reap the benefits and ride the wave of the “AI” future. A few years ago I was just one of those people.
I came out of a data science bootcamp and hopped right into the interview scene. With a master’s degree, some data analyst experience and a well-formatted LinkedIn profile, I got plenty of bites. Some interviews went really well, and some didn’t. I’d like to share some of the questions, my experiences and some of the lessons I took from them.
The Easy questions
These are the questions guaranteed to come up, and you should be able to answer them. I got asked the following questions multiple times at multiple companies. Those newer to coding can focus so intently on more difficult, machine-learning-oriented questions that they forget to brush up on the basics. Here are some examples:
- What are the basic data types in Python?
This is one of those questions so easy, and so inapplicable to most daily coding, that you could easily forget the answer. If you’re going to screw up, don’t screw up the freebies!
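As a quick refresher, here is a minimal sketch of the built-in types that usually get counted as "basic". Which types an interviewer actually has in mind is an assumption on my part, and the dictionary name and sample values are purely illustrative.

```python
# Illustrative values only; `examples` is a hypothetical name for this sketch.
examples = {
    "int": 42,
    "float": 3.14,
    "complex": 2 + 3j,
    "bool": True,
    "str": "data",
    "NoneType": None,
    "list": [1, 2, 3],
    "tuple": (1, 2, 3),
    "set": {1, 2, 3},
    "dict": {"a": 1},
}

# type(value).__name__ gives the name of each value's built-in type.
type_names = {label: type(value).__name__ for label, value in examples.items()}

for label, name in type_names.items():
    print(f"{label:>8} -> {name}")

# A classic follow-up question: bool is a subclass of int.
bool_is_int = isinstance(True, int)
```

Being able to say which of these are mutable (list, set, dict) and which are not is usually the natural next step in the conversation.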
- What’s the difference between a list and a tuple in Python?
Another fundamental question about data structures, intended to separate those comfortable in computer science from those cramming too hard. The answer is simple (a list is mutable and a tuple is immutable), but why these questions about data structures matter eludes newer computer scientists.
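To make the mutable/immutable distinction concrete, here is a small sketch. The variable names and the coordinates are made up for illustration; the point is what each operation does.

```python
numbers_list = [1, 2, 3]
numbers_tuple = (1, 2, 3)

# Lists are mutable: in-place changes work.
numbers_list.append(4)
numbers_list[0] = 10

# Tuples are immutable: item assignment raises TypeError.
try:
    numbers_tuple[0] = 10
    tuple_mutated = True
except TypeError:
    tuple_mutated = False

# A tuple of hashable items is itself hashable, so it can be a dict key.
coords = {(40.7, -74.0): "New York"}

# A list is unhashable, so using one as a key raises TypeError.
try:
    {[40.7, -74.0]: "New York"}
    list_hashable = True
except TypeError:
    list_hashable = False
```

The hashability part is one reason the question matters in practice: tuples can serve as dictionary keys or set members, and lists cannot.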
- Anything to do with linear regression
The 4 assumptions of multiple linear regression. The error you’re optimizing for. Just the general concept. These should all be intuitive to you. Linear regression will probably never be used in your job, but as y = mx + b is one of the most basic conceptual items in math, demonstrating that you don’t just know but understand these concepts is key.
- Write a function that takes two lists as input and returns the numbers that occur in both lists
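For the "error you’re optimizing for" part, a minimal sketch of ordinary least squares for a single feature may help. It fits y = mx + b by minimising the sum of squared residuals, using the closed-form solution. The function names and the toy data are my own assumptions, not something an interviewer handed me.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def sum_squared_error(xs, ys, m, b):
    """The quantity least squares minimises: the sum of squared residuals."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Made-up toy data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, b = fit_line(xs, ys)
best_error = sum_squared_error(xs, ys, m, b)
print(f"slope={m:.2f}, intercept={b:.2f}, SSE={best_error:.3f}")
```

On this toy data the slope comes out at about 1.99 and the intercept at about 0.05, and nudging either value in any direction makes the sum of squared residuals larger.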
This is an excellent question to weed out the unprepared and I ask it in all my data science interviews.
Why? Because it has tiered responses.
It’s a simple question with simple answers; however, the solution you choose showcases your knowledge of data structures and efficient code. Yeah, you can definitely just loop through both lists, but that nested loop is O(n·m) and you don’t want to deal with that. You can sort both lists first and your answer gets a little better, around O(n log n). Of course, the best answer is to put all the values from one list into a dictionary or set (a hash table) and then check which values from the second list it contains, which is O(n + m) on average.
Know the answer to this question. Better yet, understand it. It is one of the most common questions and your answer can quickly showcase what kind of programmer you are.
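The three tiers can be sketched roughly as follows. This assumes each common number should be returned once, even if it repeats in the inputs; an interviewer might want duplicates handled differently, so it is worth asking. The function names are mine.

```python
def common_nested(a, b):
    """Tier 1: compare every pair, roughly O(len(a) * len(b))."""
    result = []
    for x in a:
        for y in b:
            if x == y and x not in result:
                result.append(x)
    return result


def common_sorted(a, b):
    """Tier 2: sort both lists, then walk them with two pointers."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            # Equal values: record once, skipping repeats of the same number.
            if not result or result[-1] != a[i]:
                result.append(a[i])
            i += 1
            j += 1
    return result


def common_hashed(a, b):
    """Tier 3: hash one list, look up the other, O(len(a) + len(b)) on average."""
    seen = set(a)
    emitted = set()
    result = []
    for y in b:
        if y in seen and y not in emitted:
            emitted.add(y)
            result.append(y)
    return result


first = [1, 2, 2, 3, 5]
second = [2, 3, 3, 4]
print(common_nested(first, second), common_sorted(first, second), common_hashed(first, second))
```

If order does not matter, `set(a) & set(b)` is the one-line version of the third tier, but being able to write it out and explain the lookup cost is what the question is really probing.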
What did I learn from these? First off, I learned I had focused a bit too much on higher level topics as opposed to spending some time just being comfortable in Python and with machine learning. I didn’t get them wrong per se, but just memorising them and spitting them back isn’t the same thing as getting them right. As with any field, it can become sort of obvious who is regurgitating information and who actually understands the information.
Try to understand why you don’t want multicollinearity between your features in multiple linear regression. Try to understand why you should know the difference between a set and a list. These understandings bring you to a higher order of data science.
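One way to build intuition for the multicollinearity point is the variance inflation factor (VIF). For the special case of exactly two features, each feature’s VIF is 1 / (1 - r²), where r is the Pearson correlation between them. The sketch below covers only that two-feature case, and the height and "unrelated" numbers are invented for illustration.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def vif_two_features(x1, x2):
    """VIF when the model has exactly two features: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical features: the same height in two units, plus an unrelated column.
height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]
unrelated = [3, 1, 4, 1, 5]

vif_redundant = vif_two_features(height_cm, height_in)
vif_unrelated = vif_two_features(height_cm, unrelated)
print(f"redundant pair VIF: {vif_redundant:.0f}")
print(f"unrelated pair VIF: {vif_unrelated:.2f}")
```

The redundant pair produces an enormous VIF while the unrelated pair stays close to 1. A large VIF means the coefficient estimates have inflated variance, so the model cannot reliably say how much each feature contributes on its own.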
The Difficult questions
Now for some of the less fun moments from my technical interviews. I’ve definitely been asked some questions that brought me to a screeching halt in my early days. Of course, difficulty is a matter of perspective: right now I could code you an NLP pipeline without googling a thing, but if you asked me a SQL question, it’s honestly been so long that I’d be stumped. Still, here are some examples of the more difficult things I was asked (as a beginner) in technical data science interviews, and some of the things I was just unprepared for.
- I got asked to basically solve a startup’s data problem
If you’re like me, you have a process. You sit down with your data, google some stuff, and get settled in your domain before diving in. Data science penetrates nearly every industry, so it’s tough to get a grasp of the various nuances across fields without time. It’s especially tough if you’re on the spot in an interview and a company asks you to tackle the same thing their entire data team has been tackling for years and is quite well-versed in.
While I’m assuming they weren’t looking for an absolute answer, the more difficult part as a beginner was visualizing a data science problem in its entirety. By entirety I don’t mean just getting the data, visualizing it, and modeling. I mean the ETL, data warehousing, production services and other technical considerations, plus all of the business and general problem-solving considerations as well.
You can study information as long as you like, but sometimes there’s no real substitute for experience and problem-solving ability. They probably weren’t looking for me to say what model I would use, or just to say “machine learning”. ML isn’t always the answer; in fact, there are many cases where it is not. Sometimes you just need to solve a problem by thinking creatively!
- Doing logistic regression by hand
For some, this is probably a piece of cake. However, I am not a mathematician. Other than the statistics from my biology degree and some machine learning videos, the last math I had taken was Calculus II in my freshman year of college. Needless to say, I did not perform well: I couldn’t even write down the formula. While the interviewers at IBM backed off after that, it was pretty clear I wasn’t the exact fit they had been looking for.
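For anyone who wants to avoid my fate, here is roughly what "by hand" boils down to, written as a plain-Python sketch. I don’t know exactly what the interviewers wanted to see, so treat this as one reasonable reading: the sigmoid formula, the log-loss, and gradient descent on a single feature. The function names, the learning rate, the epoch count and the study-hours data are all assumptions for illustration.

```python
import math


def sigmoid(z):
    """The logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """P(y = 1 | x) = sigmoid(w * x + b)."""
    return sigmoid(w * x + b)


def log_loss(xs, ys, w, b):
    """Average negative log-likelihood (binary cross-entropy)."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict_proba(x, w, b), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def fit(xs, ys, lr=0.1, epochs=5000):
    """Batch gradient descent on the log-loss."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [predict_proba(x, w, b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Invented data: hours studied versus pass (1) or fail (0), not perfectly separable.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit(hours, passed)
print(f"w={w:.2f}, b={b:.2f}, loss={log_loss(hours, passed, w, b):.3f}")
```

The piece worth memorising is the gradient: for each example it is just (predicted probability minus label) times the feature, which is why the update loop is so short.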
- Code an entire pipeline in Python to process various file inputs into a database and handle errors on the spot. Without any real data
Kind of unreasonable for an entry-level data science position, right? Easy if it were a take-home coding challenge and there were some actual files to test it out on. But on the spot, it can be really tough to visualise every exception you might run into.
I did my best and coded out a little pipeline that would read and parse some files, handle various errors, etc. But again, this is a situation where experience is key. I could’ve studied Python, machine learning and data structures all I wanted, but unless I had worked on a specific problem requiring this kind of script, I would have had no idea what I might need to consider or run into.
For me as a newbie, this made one thing painfully clear. There is not just one type of data science position. The modern data scientist has yet to be well defined. Some data science interviews felt like interviews for analyst positions focused on SQL. Some felt like interviews for straight-up Python developers. And some felt as though I was interviewing for a straight-up machine learning researcher position.
While with my current experience this wouldn’t be an issue, when I was starting out it was disheartening and difficult, to say the least. I’d learn some things from one interview, thinking I’d be good for the next. But then I would get an entirely different set of questions and topics!
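For a sense of scale, here is a minimal sketch of what such a pipeline could look like using only the standard library. It is not what I wrote in the interview. The supported formats (CSV and JSON), the `name`/`value` record layout, the `measurements` table and the function names are all assumptions made up for this example.

```python
import csv
import json
import sqlite3
from pathlib import Path


def parse_file(path):
    """Return a list of (name, value) rows from a .csv or .json file."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            records = list(csv.DictReader(handle))
    elif suffix == ".json":
        with path.open(encoding="utf-8") as handle:
            records = json.load(handle)
    else:
        raise ValueError(f"unsupported file type: {suffix!r}")
    # A missing key or a non-numeric value raises here and fails the whole file.
    return [(str(rec["name"]), float(rec["value"])) for rec in records]


def run_pipeline(paths, connection):
    """Load each readable file into `measurements`; return (row count, failures)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS measurements (source TEXT, name TEXT, value REAL)"
    )
    loaded, failed = 0, {}
    for path in map(Path, paths):
        try:
            rows = parse_file(path)
            with connection:  # commits on success, rolls back on error
                connection.executemany(
                    "INSERT INTO measurements VALUES (?, ?, ?)",
                    [(path.name, name, value) for name, value in rows],
                )
            loaded += len(rows)
        except (OSError, ValueError, KeyError, TypeError, csv.Error, sqlite3.Error) as exc:
            # Record the problem and keep going with the remaining files.
            failed[path.name] = f"{type(exc).__name__}: {exc}"
    return loaded, failed
```

Calling `run_pipeline(list_of_paths, sqlite3.connect("example.db"))` returns how many rows were loaded plus a dictionary of files that failed and why. The design choice worth talking through in an interview is that one bad file is logged and skipped rather than aborting the whole run.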
So…what do you do?
You aren’t going to get everything right every time. Especially as a beginner, you’ll almost definitely have some mishaps. Even as a more intermediate data scientist, I’m sure there are plenty of data science interviews I would still fail (embarrassingly enough, I probably still couldn’t do logistic regression by hand).
All the same, I learned a lot of technical and social skills from being put on the spot over 20 times within a short period. Even if it seems difficult, practice constantly. There is no substitute in a technical data science interview for experience and understanding. No matter how many facts you memorise, being able to jump into Python without feeling lost or confused will always take you further in a technical data science interview.
Knowing the theory and information is important. However, newer Data Scientists focus way too much on the smaller content: the flash card answers. The person who stands out from the other candidates shows true command of the material, not just command of the textbook answers. I don’t want you as a Data Scientist on my team because you can regurgitate the answers to easy technical questions. I want you because you can actually use and apply that information to solve problems at my company.
Unfortunately, there isn’t really a shortcut. You can’t memorise the answers to the handful of questions I’ve shared here and expect to ace your next data science interview. I got asked a much larger variety of things than I could write down in this article. Since speaking vaguely is of help to no one, here are some tangible actions you can take:
- Practice Python/SQL/R every day
- Constantly read data science blogs to see what others are up to
- Do a data science project on your own time
- Actually read the job description to get an idea of what you’ll be asked
- Most importantly, make sure you love what you do. You’ll need to love putting in the effort to get back the reward!