There was a post on KDNuggets yesterday entitled 20 Questions to Detect Fake Data Scientists by Andrew Fogg, and after reading the questions, I had to wonder what “real” data science is. All 20 questions in the article focused on statistics/machine learning or data visualization, and even the stats questions seemed narrowly focused on particular areas of emphasis. I would argue that the post was an excellent example of Mirroring Bias, or in other words: I am a data scientist, and these are the fundamental skills which I deem important; therefore, in order for me to deem you worthy of the title Data Scientist, you must have these skills.
Here are the questions:
- Explain what regularization is and why it is useful.
- Which data scientists do you admire most? which startups?
- How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression.
- Explain what precision and recall are. How do they relate to the ROC curve?
- How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything?
- What is root cause analysis?
- Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
- What is statistical power?
- Explain what resampling methods are and why they are useful. Also explain their limitations.
- Is it better to have too many false positives, or too many false negatives? Explain
- What is selection bias, why is it important and how can you avoid it?
- Give an example of how you would use experimental design to answer a question about user behavior.
- What is the difference between “long” and “wide” format data?
- What method do you use to determine whether the statistics published in an article (e.g. newspaper) are either wrong or presented to support the author’s point of view, rather than correct, comprehensive factual information on a specific subject?
- Explain Edward Tufte’s concept of “chart junk.”
- How would you screen for outliers and what should you do if you find one?
- How would you use either the extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
- What is a recommendation engine? How does it work?
- Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
- Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimension in a chart (or in a video)?
(From 20 Questions to Detect Fake Data Scientists by Andrew Fogg)
The trouble is that data science is interdisciplinary: a mixture of domain expertise, computer science and applied mathematics. Therefore, “true” data scientists have expertise in all three disciplines that make up data science. However, these questions virtually ignore the domain and computer science disciplines, to say nothing of big data, unstructured data, etc. For example, if someone were to ask these questions of an expert in computer vision, that candidate might do poorly because their skills, which are certainly in the realm of data science, do not fall neatly on this list. The same goes for someone who is an expert in streaming text analytics.
If I were going to construct such a list, I might take about five of these questions and add questions like:
- What are the advantages of NoSQL systems compared with traditional databases?
- Which tools do you use to manipulate data? Why?
- How do you determine the efficiency of an algorithm?
- Explain some common methods for analyzing free text.
However, I would not construct such a list in the first place. The bottom line is that virtually any data scientist could come up with a list that would mislabel other data experts as fakes simply by asking questions about their weak areas.
Data Science is about Solving Problems
Data Science is about extracting useful and actionable information from data. As such, when I interview people, the most important thing for me is the candidate’s problem-solving and analytical abilities. I’ll pick people to interview whose background leads me to believe that they have experience in the data science realm, and then ask them to solve open-ended problems. My real interest is not whether they arrive at a solution, but rather how they think about problems. A good (or “real”) data scientist will be able to identify the problem and draw on skills like those covered by the questions above to solve it, so there is no need to pepper the candidate with a pop quiz of stats questions.
There is a great talk from Daniel Tunkelang in which he discusses his process for hiring data scientists and comes to a similar conclusion.
Let’s use a more positive, less exclusive term
Since data science covers so many areas, I think we can take it as a given that virtually nobody can truly be a master of everything. Therefore, instead of using the label “Fake Data Scientist”, I would use the label “Novice Data Scientist” or “Junior Data Scientist”. I believe that everyone has the capacity to grow and learn new skills. We weren’t all born with an understanding of deep learning or Markov chains, and if someone lacks certain requisite knowledge, it is better to view that candidate as a beginner who needs additional skills, and to suggest how to acquire them, than to label them a phony.
The problem is that too many people call themselves data scientists without being aware of their underperformance in certain areas.
They are unconsciously incompetent: https://en.m.wikipedia.org/wiki/Four_stages_of_competence
This article is crap and meant to soothe people with undergrad and/or masters degrees titled as “Data-Scientists”.
The term “Scientist” is awarded to a chosen few, and it means something. A scientist (whether a “data-scientist” or any other) cares about answering “Why” before “How”. A data-scientist is more into understanding the very intrinsic nature of data. His/her analytical mind constantly tries to find underlying patterns in data.
Ph.D. means Doctor of Philosophy. It is awarded to those who have demonstrated abilities in a particular field that are accepted by the established scientific community. These abilities are demonstrated not just once, and not just in one way…they are proven over and over in different ways. Some examples of how these abilities are proven:
(1) First getting into a reputable Ph.D. program at a competitive university. Just getting there is a big challenge. You need a track record of a good undergraduate and/or masters degree, references, etc.
(2) Winning funding from competitive sources, such as, NSF, NASA, DoD, DoE, etc.
(3) Ability to teach undergraduates by performing TA duties.
(4) Being a research fellow/assistant for someone who has a lot more knowledge and experience than you.
(5) Win funding to attend conferences and to present your research findings before the scientific community.
(6) Win funding for scientific workshops, camps, etc.
(7) Pass advanced level courses in your disciplines.
(8) Pass qualifying/comprehensive exams in your areas of research.
(9) Be known to the current scientific community AND know the current scientific community.
(10) Develop some meaningful scientific methods and discoveries.
(11) Publish your findings in reputable conferences and/or journals (not some crappy, low level conferences/journals).
(12) Convince scientific and/or federal organizations to give you a research grant (which is a very big achievement, as you are evaluated anonymously by your peers and they trust your abilities by providing you funding that comes from taxpayers’ money).
(13) Graduating with a Ph.D. It is estimated that only 64% of Ph.D. students in the USA actually graduate with a Ph.D. degree. Most of the rest drop out because they are unable to complete the steps above. They complete only a few, but not all, so they fail. Also, only 1% of people in the USA are PhDs. So earning a Ph.D. from a reputable institution under the supervision of a real scientist is a big achievement. It changes your title from Mr./Ms. to Dr. Does that mean something to undergraduates/Masters?
I can continue defining characteristics of a “Real Scientist”. Giving someone the title of “Data Scientist” without him/her having any of the solid track record listed above is a joke and an abuse of the scientific community. The above-mentioned scales are standard in most universities. This means that not everyone has the ability to sustain the pressure of carrying out research and prove himself/herself. This scale filters out those who do not deserve to be scientists. It could be due to personal problems, or perhaps nature did not give them the ability to cross the line which separates a “Real scientist” from a “Non-real scientist”.
It is said that when you cannot give someone a monetary promotion, you give them a title (recognition) award so that they calm down and feel better about themselves. Most IT companies nowadays award you a “Data-Scientist” position even if you only have an undergraduate/masters degree.
Being able to perform some basic statistical analysis or write a regression/classification/clustering model in R, Python, etc. does not make you a data-scientist. These are just tools. Every day new tools appear on the market. There is nothing special about them. You can for sure call yourself a “Data-Analyst”, but trust me, if you meet an actual scientist, s/he will be amused to hear you calling yourself a “Data-Scientist” with an undergraduate/masters degree.
Finally, a real data-scientist does not give a damn about business/profit, etc. For a data-scientist, data is just a mixture of numbers, characters, sentences, etc. A data-scientist is interested only in finding hidden patterns in data. It is similar to what a patient is to a medical doctor: just a subject who has some known/unknown symptoms. The doctor’s job is to treat those symptoms whether that patient is the president or a criminal.