
The Two Most Important Skills for a Data Scientist

A little while ago I saw an article on LinkedIn (which, at the time of writing, I cannot find) whose basic premise was that problem solving is the most important skill for data scientists to be effective at their jobs.  (If anyone can find the article, please send me a PM, as I’d like to credit the author.)  The article stood out to me because most articles titled “Top x Skills for a Data Scientist” feature a list like this one:

  1. Education
  2. SAS or R
  3. Machine Learning
  4. Advanced Statistics
  5. Python
  6. Hadoop
  7. SQL
  8. Unstructured Data
  9. Intellectual Curiosity
  10. Business Acumen
  11. Communication Skills

Many people like to focus on the technical aspects of data science, such as data engineering or machine learning. At the end of the day, though, if you are unable to apply these skills in a practical way to real-life problems, you will not be an effective data scientist.  In my view, effective data scientists must be masters of both problem solving and critical thinking.  The two are clearly related, but they are not the same thing: not every problem requires critical thinking, and not every exercise in critical thinking solves a problem.

Data Science is the Union of Critical Thinking and Problem Solving

To illustrate this union, there is a famous example from WWII.  In an effort to minimize the number of bombers shot down over Europe, the Center for Naval Analyses (CNA) conducted a study of the damage done to aircraft that had returned from missions, and recommended that armor be added to the areas that showed the most damage.  This would seem like a relatively easy optimization problem: put more armor where there are a lot of bullet holes.  Indeed, this was the conclusion at which the CNA arrived.

However, a mathematician named Dr. Abraham Wald was given the same data, which showed how bullet holes were distributed across aircraft that had returned from sorties, and realized that there was a problem with these conclusions.  The study considered only aircraft that had returned, not those that had been shot down. The correct conclusion, therefore, was to put additional armor where there was little damage.  In effect, the bullet holes showed where an aircraft could take damage and still fly, whereas the areas with few bullet holes were the ones where a hit tended to bring the aircraft down.  The Allied forces adopted this conclusion and significantly improved their survival rates.
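
Wald's argument can be illustrated with a small simulation. What follows is only a minimal sketch in Python, not a reconstruction of his actual analysis: the section names, the per-hit lethality values, the number of hits per plane, and the `simulate` function are all invented for illustration. It assumes every section of the aircraft is equally likely to be hit, and then compares the bullet holes on all aircraft with the holes on the aircraft that made it home.

```python
import random
from collections import Counter

# Hypothetical probability that a single hit in a section downs the plane.
# These values are made up for illustration; they are not historical data.
LETHALITY = {"engine": 0.60, "cockpit": 0.40, "fuselage": 0.05, "wings": 0.05}


def simulate(n_planes=10_000, hits_per_plane=4, seed=0):
    """Return (holes on all planes, holes on returning planes) by section."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sections = list(LETHALITY)
    all_hits, survivor_hits = Counter(), Counter()
    for _ in range(n_planes):
        # Assumption: each hit lands on a uniformly random section.
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every one of its hits.
        survived = all(rng.random() >= LETHALITY[s] for s in hits)
        all_hits.update(hits)
        if survived:
            survivor_hits.update(hits)
    return all_hits, survivor_hits


all_hits, survivor_hits = simulate()
for section in LETHALITY:
    print(f"{section:9s} all planes: {all_hits[section]:6d}   "
          f"returned: {survivor_hits[section]:6d}")
```

Under these assumptions, the hits are spread evenly across all aircraft, yet the returning aircraft should show the fewest holes in the engine and cockpit, which are exactly the sections where a hit is most likely to be fatal. An analyst who sees only the "returned" column would armor the wrong places.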

To me, this demonstrates the perfect blend of problem solving and critical thinking.  The original analysis was simple problem solving.  Someone was tasked with figuring out where to put more armor.  They did some analysis, came up with an answer and, in their eyes at least, solved the problem.  Unfortunately, the conclusion was completely incorrect.  The critical thinking came in when Dr. Wald asked, “What is missing from this data?” (or, most likely, “Was fehlt in diesen Daten?”), a question that ultimately led him to the correct conclusion.

Applying Critical Thinking to Data Science

My first real introduction to critical thinking came not at university, but during my first few months working as an intelligence analyst at the Central Intelligence Agency.  At the time, after about four months on the job, new analysts would be pulled out of their offices and enrolled in a four-month-long analytic training program.  I can honestly say that I learned more about analysis in this program than I did in my entire college career.  (Unfortunately, this says a lot about the state of higher education, but we’ll leave that for another time.)  I cannot get into the details, but some of the most interesting lessons centered on examining intelligence failures, and on how the results are usually not good when critical thinking is replaced by “group-think” or other cognitive biases.  The course forced you to look at situations with a critical eye, examining every assumption and putting hypotheses through an extremely high degree of rigor.  I believe that the thought processes and critical thinking skills I learned at the Agency are what enabled me to approach data problems with a unique perspective and, as a result, come up with effective solutions.

Many people ask me how they can become a data scientist. While I can recommend various courses that will help you develop your technical skills, the application of these skills, and with it your success as a data scientist, ultimately depends on your critical thinking and problem solving abilities.  Can you look at an analysis and find the logical flaws?  Can you think critically about problems?  In the classes I teach, I always try to infuse critical thinking into the exercises so that students must think about their answers and ask questions like “Does this answer make sense?”  (Naturally, I’ll put in a plug for my upcoming classes at BlackHat in Las Vegas in August: Crash Course in Data Science for Hackers, and the Crash Course in Machine Learning.)

Learning Critical Thinking Skills

Hopefully, by now I’ve convinced you of the need for critical thinking to be a part of a data scientist’s toolkit. You might be asking yourself, how do I hone this skill?  Well… stay tuned, as that sounds like the subject of my next post!


3 Comments

  1. George Barckley

    Nailed it. The tech stuff can be googled…problem solving and critical thinking can only be learned through experience.

  2. Jerome

    Thank you for the insightful article. According to Kirk Borne (Chief Data Scientist at Booz Allen), two of the 7 Cs of a data scientist are a courageous problem solver and a critical thinker.

  3. Very important principles elegantly stated! I recognize the problems this article addresses in clinical research articles all the time. Independent variables are more likely to be the “usual suspects” rather than relevant to the particular problem addressed, while dependent variables are often a shot in the dark. Very often, unnecessary data reduction is performed to allow categorical analysis (probably because it’s easier to crank through.) Perhaps the biggest single error in judgment I see is consideration of relative risk without looking at absolute risk, which may be causing a lot of patients to get aggressive treatment for a minor or nonexistent problem.

    Another more obscure, but important, problem is overlooking easy-to-gather diagnostic information because the disease it could point to is “too” rare statistically. For example, a local man who had been subjected to low-grade arsenic poisoning over a long period of time was improperly diagnosed in the emergency room, perhaps because his fingernails weren’t examined, so he had to die before they could nail his murderer.

    I really hope that this eager dash into “evidence-based” medicine won’t result in fixing what ain’t broke.
