Interviewing a Data Scientist at Kaggle, Rachael Tatman
Disclaimer: Rachael does not speak for their employer, Kaggle. These thoughts and opinions are solely their own.
We interviewed Rachael, an enthusiastic data scientist at Kaggle. Rachael answered our questions, and we hope that you will enjoy reading and learn more about the data science field. To save you time, we are not introducing Rachael’s background in detail here and will leave you with the interview. Here is their LinkedIn in case you’re curious.
Despite being a hot topic for several years, there is still no single definition of data science. What is data science from your point of view?
Data science is a very young field that I would probably actually consider three separate subfields. One is data analysis, looking at data to understand and extrapolate underlying patterns and communicate what you’ve learned. The second is data engineering, which is building the infrastructure to move, store and transform data so that it’s in the specific format needed. I’d personally fit most data cleaning into the “data engineering” bucket. And the final subfield is machine learning engineering, building systems to automate some task by training models and putting them into production. At the moment, most data science roles generally require some amount of work in multiple subfields. For example, you might be given a dataset that needs a lot of restructuring and transforming, or data engineering, in order to get it ready for data analysis. Then, once you’ve done enough analysis to figure out what a good metric to target is and what features are informative, you might use that knowledge to build a machine learning system.
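As an editorial illustration of how the three subfields described above can chain together, here is a minimal sketch in Python. The dataset, the column names (`hours`, `score`) and the function names are all hypothetical and chosen only for this example; a real project would use far more data and dedicated libraries.

```python
from statistics import mean

# Hypothetical raw records: messy strings, as they might arrive from an export.
raw_rows = [
    {"hours": " 1.0", "score": "52"},
    {"hours": "2.0 ", "score": "54"},
    {"hours": "n/a", "score": "61"},  # unparseable value, dropped during cleaning
    {"hours": "3.0", "score": "56"},
    {"hours": "4.0", "score": "58"},
]


def clean(rows):
    """Data engineering: parse strings into floats and drop rows that fail."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((float(row["hours"]), float(row["score"])))
        except ValueError:
            continue  # skip rows that cannot be parsed
    return cleaned


def summarize(pairs):
    """Data analysis: compute the mean of each column."""
    xs, ys = zip(*pairs)
    return mean(xs), mean(ys)


def fit_line(pairs):
    """Machine learning: least-squares fit of score = slope * hours + intercept."""
    mean_x, mean_y = summarize(pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


pairs = clean(raw_rows)
slope, intercept = fit_line(pairs)


def predict(hours):
    """Use the fitted line to predict a score for a new input."""
    return slope * hours + intercept
```

Each function stands in for one subfield: `clean` restructures the data, `summarize` looks at it, and `fit_line` turns what was learned into a small predictive model.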
A few years ago, almost every job opening required a Ph.D. Granted, a Ph.D. probably helps a lot, but do you think that every data scientist needs one as of 2019?
Nope! A Ph.D. can help you learn how to ask questions and work independently, which can be helpful in a data science role. It will also give you a very rich depth of knowledge in an extremely narrow topic, which isn’t likely to be helpful in a data science role. At least in the US, having a Ph.D. can help you get past the very first round of resume culling but won’t help you much beyond that. I do enjoy being able to pick the “Dr.” title on forms, though!
Which subjects (math, statistics) do you think are the most important for data science?
I suppose it depends. In terms of math, neural networks rely heavily on linear algebra so it can be helpful to have some background there. I also think almost anyone working with data will benefit from a good understanding of statistical inference, including both frequentist and Bayesian approaches. That said, I personally also think some depth of knowledge in a field outside of math is very helpful when working with data. Working with qualitative data in particular can help you learn how to ask clear, answerable questions. Anything from history to literature will inform how you approach new problems and help you learn new ways of identifying and describing patterns.
Coming from a linguistics background, how did you learn mathematics and statistics? Do you have any unconventional suggestions?
Honestly, I learned a lot of math and statistics as part of my linguistics training. Semantics includes formal logic and lambda calculus, phonology includes set theory (although with different notation), phonetics includes acoustics and signal processing. I got very comfortable with rigor and formal modelling during my core linguistics training, and I also got statistics training in designing experiments and analyzing the data I collected.
I think the main takeaway for me is that it’s much easier to learn and understand formal models if you can tie them to something real that you’re passionate about. Whether it’s phonological variation, chocolate, cricket scores or customers at your grandparents’ restaurant, I’d recommend starting with data you really care about.
Which programming language do you prefer to use in your daily job? If you had a chance to learn a different language, which one would it be?
It depends on the task. For tabular data, I generally prefer R and the Tidyverse collection of packages. There are just so many handy time-saving functions. For just general day-to-day programming, though, probably Python.
As for learning a new language, maybe Perl? There’s a lot of NLP legacy code in Perl and for a lot of tasks it’s still much faster than Python. That said, I find it genuinely awful to try to read Perl code and I don’t imagine I’d find writing it that much easier, so I’m not really in a rush.
I know you have solid experience now, but there are many great folks who are getting ready to step into the data science field. Let’s assume that you could go back in time and mentor yourself. What would you advise yourself to do in order to learn data science?
Don’t feel like you have to learn everything right away. Once you’ve done some introductory courses, come up with a small project on a topic that interests you and will keep you excited. Having a specific question I’m trying to answer or a particular thing I’m trying to build helps guide my learning. Plus, if I have an application in mind I tend to remember things I read much better!
Do you have any suggestions for those who already know the field and are preparing for data science interviews?
Be prepared for anything. There isn’t really a set format for a data science interview yet. I’ve had interviews that were all software engineering type questions (no math or statistics) and interviews where I was asked to derive gradient descent. Try to read widely when you’re studying for interviews. It’s also perfectly reasonable to ask whoever’s scheduling the interview what sort of questions you can expect in order to focus your studying. And be nice to yourself! I try to schedule something fun and distracting, like a trip to a park or reading a new book, right after an interview so I can get my mind off of it and help reset.
What does a typical successful Kaggle profile look like? What are the advantages of having a successful Kaggle profile when applying for a new position? What about the overachievers: do they really receive job offers without any effort?
There are many ways to have a successful profile. Some people focus on one type of competition, others write in-depth kernels and still others are helpful on the forums and jump in to help answer people’s questions. I personally enjoy going through the profiles of people who write kernels that tell a story and have some clear supporting text to help me follow the code they’ve written. It’s also nice to see people who work on datasets that don’t already have a ton of kernels on them; I’m much more intrigued by someone who writes a time series analysis of highway traffic in Uruguay using a new dataset someone uploaded than by someone who writes kernels on just the most popular datasets. That’s personal taste, though.
I don’t think anyone receives a job offer based on Kaggle work without effort. It’s pretty rare for a recruiter to reach out directly to someone; most of the people I talk to who got jobs through Kaggle got them by meeting someone through the forums or by going to a meetup or something like that.
Data science is a broad field. Some people may join a project that they don’t like and eventually want to move to another area of data science. Let’s assume you are the hiring manager and such a candidate applies to your position. Do you think that the candidate has an advantage over other candidates who hold the title of software engineer or statistician?
I personally think titles aren’t generally as important as relevant experience and a willingness to grow and learn new things. If I were hiring someone for a data science position, I’d put more weight on projects they’d worked on previously, whether professionally or as a hobby. That might be a pretty US-centric answer, though; I’m not really sure how the data science job market works outside the US.