There was a post on KDNuggets yesterday entitled 20 Questions to Detect Fake Data Scientists by Andrew Fogg, and after reading the questions, I had to wonder what “real” data science is. All 20 questions in the article focused on statistics/machine learning or data visualization, and even the stats questions seemed to concentrate on particular areas of emphasis. I would argue that the post is an excellent example of Mirroring Bias, or in other words: I am a data scientist, and these are the fundamental skills I deem important; therefore, for me to deem you worthy of the title Data Scientist, you must have these skills.
Here are the questions:
- Explain what regularization is and why it is useful.
- Which data scientists do you admire most? Which startups?
- How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
- Explain what precision and recall are. How do they relate to the ROC curve?
- How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything?
- What is root cause analysis?
- Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
- What is statistical power?
- Explain what resampling methods are and why they are useful. Also explain their limitations.
- Is it better to have too many false positives, or too many false negatives? Explain.
- What is selection bias, why is it important and how can you avoid it?
- Give an example of how you would use experimental design to answer a question about user behavior.
- What is the difference between “long” and “wide” format data?
- What method do you use to determine whether the statistics published in an article (e.g. newspaper) are either wrong or presented to support the author’s point of view, rather than correct, comprehensive factual information on a specific subject?
- Explain Edward Tufte’s concept of “chart junk.”
- How would you screen for outliers and what should you do if you find one?
- How would you use either extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
- What is a recommendation engine? How does it work?
- Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
- Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimensions in a chart (or in a video)?
(From 20 Questions to Detect Fake Data Scientists by Andrew Fogg)
The trouble is that data science is interdisciplinary: a mixture of domain expertise, computer science and applied mathematics. Therefore, “true” data scientists have expertise in all three disciplines that make up data science. However, these questions almost completely ignore the domain and computer science disciplines, to say nothing of big data, unstructured data, etc. For example, if someone were to ask these questions of an expert in computer vision, that candidate might do poorly because their skills, which are certainly in the realm of data science, do not fall neatly on this list. The same goes for someone who is an expert in streaming text analytics.
If I were going to construct such a list, I might take about five of these questions and add questions like:
- What are the advantages of NoSQL systems compared with traditional databases?
- Which tools do you use to manipulate data? Why?
- How do you determine the efficiency of an algorithm?
- Explain some common methods for analyzing free text.
However, I would not construct such a list in the first place. The bottom line is that virtually any data scientist could come up with a list that would mislabel other data experts as fakes simply by asking questions about their weak areas.
Data Science is about Solving Problems
Data Science is about extracting useful and actionable information from data. As such, when I interview people, the most important thing for me is the candidate’s problem-solving and analytical abilities. I pick people to interview whose background leads me to believe that they have experience in the data science realms, and then ask them to solve open-ended problems. My real interest is not whether they arrive at a solution, but rather how they think about problems. A good (or “real”) data scientist will be able to identify the problem and use the skills listed above to solve it, so there is no need to pepper the candidate with a pop quiz of stats questions.
There is a great talk from Daniel Tunkelang in which he discusses his process for hiring data scientists and comes to a similar conclusion.
Let’s use a more positive, less exclusive term
Since data science covers so many areas, I think we can take it as a given that virtually nobody can truly be a master of everything. Therefore, instead of using the label “Fake Data Scientist”, I would use the label “Novice Data Scientist” or “Junior Data Scientist”. I believe that everyone has the capacity to grow and learn new skills. We weren’t all born with an understanding of deep learning or Markov chains, and if someone lacks certain requisite knowledge, it is better to view that candidate as a beginner who needs additional skills, and to suggest how to acquire them, than to label them a phony.