February 2016 – The Dataist

There are a lot of bytes floating about the internet making the case for who is a data scientist, including a whole subset of posts about who are “real” data scientists. I have observed that this philosophy manifests itself in data science training programs as well. Right now, if you want to become a data scientist, it’s an all-or-nothing proposition. I would place the available programs in three categories:

Online training from companies such as Coursera, Udacity, and others
12-14 week bootcamp-style data science courses
University programs

Many of these programs look amazing, and if I had the time I would love to take one (or all) of them. That said, all of them assume that their students will emerge as data scientists, and as a result they teach a comprehensive set of skills.

A Different Approach

There are certain professions in which you either are a member or you are not, and non-members practicing the profession can have disastrous consequences. Medicine, for instance: you either are a medical doctor or you are not, and when individuals who are not qualified to practice medicine dispense medical advice, the results are often disastrous, as best illustrated by the anti-vaccine movement, which at this point is being pushed by many celebrities rather than legitimate medical professionals. Engineering would be another example. You wouldn’t want to drive over a bridge that I designed because, well… I have no idea how to do it.

However, these professions all have professional associations and government regulators who determine who has the requisite skills to call themselves members of that profession. This does not exist for data science, for good reason. Data science, as practiced today, is a mixture of several other disciplines and can be applied to nearly any other discipline. The breadth of skills that fall under the data science umbrella is enormous.

In 2013, O’Reilly published a short booklet entitled Analyzing the Analyzers, in which the authors identified four main groupings of individuals who considered themselves data scientists:

Data Creative
Data Developer
Data Researcher
Data Businessperson

You can see from the graphic on the left that the skill breakdown is far from homogeneous. Data science differs from other professions in this respect: in professions such as medicine, practitioners MUST have a core set of knowledge in order to be a member of that profession. As the chart illustrates, that is not really the case for data science. One could be a machine learning master yet have no experience with big data, and still legitimately be said to be doing data science work. Therefore, I’d like to propose that data science be viewed as a spectrum of skills rather than a binary condition.

In short, I believe that data scientists spill too much ink trying to label individuals as “real” (or as “fake”) data scientists. Instead, data scientists should spend that time determining what skills actually make up data science.

Teaching Data Science Skills Rather than Data Science

This slight twist on the approach has implications for training. If data science is no longer an exclusive club but rather a collection of skills, then those skills, or portions thereof, can be taught to individuals who have no interest in becoming data scientists. In other words, data science training could be about teaching data science skills to anyone who does analytic work, so that they can incorporate those skills into their workflow. The goal of this kind of training would not be to mint full data scientists, but rather to teach individuals the data science skills relevant to their profession. This kind of training could be a lot shorter and less comprehensive, but in the end I believe it would be more practical for the thousands of individuals out there who want to incorporate data science into their work but have neither the time for a full program nor the desire to become data scientists. Mind you, I am not proposing dumbing data science down. I am suggesting that, in addition to what is currently offered, data science training can be effective if it is tailored to specific audiences and focuses on the techniques that are directly relevant to those audiences.

TL;DR

The data science world today views data science as a binary condition: either you are a data scientist or you are not. However, data science should be viewed as a collection of skills that virtually any professional can incorporate into their workflow. If data science training were viewed in this context, organizations could increase their use of data science by training their current staff in the data science skills that are relevant to their work.

I spend most of my time now teaching others about data science, so I do a lot of research into what is going on in data science education. As part of that research, I decided to take an online machine learning course, and it led me to a serious question: why don’t we use pseudo-code to teach math concepts?

Consider the following:

This is the formula for the Residual Sum of Squares (RSS), which, if you aren’t familiar with it, is a metric used to measure how well a regression model fits the data.
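For reference, one common textbook way of writing RSS for a simple linear regression looks like the following. The symbols here are assumptions and may differ from the notation in the original figure: n is the number of observations, y_i is an actual value, ŷ_i is the model's prediction for it, and α and β are the fitted intercept and slope.

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad \hat{y}_i = \alpha + \beta x_i
```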

Now consider the following pseudo-code:
residuals_squared = (actual_values - predictions) ^ 2
RSS = sum( residuals_squared )
This example expresses the exact same concept, and while it takes up more space on the page, it is, in my mind at least, much easier to understand. I don’t have any empirical data to back this up, but I suspect that many of you would agree.
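For anyone who wants to run it, the pseudo-code maps almost line for line onto Python. What follows is only a minimal sketch in plain Python: the function name `residual_sum_of_squares` and the sample numbers are made up for illustration, and the one real change is that Python spells "to the power of" as `**` (in Python, `^` means bitwise XOR).

```python
def residual_sum_of_squares(actual_values, predictions):
    """Return the sum of squared differences between actual and predicted values."""
    if len(actual_values) != len(predictions):
        raise ValueError("actual_values and predictions must have the same length")
    # Square each residual (actual minus predicted) ...
    residuals_squared = [
        (actual - predicted) ** 2
        for actual, predicted in zip(actual_values, predictions)
    ]
    # ... then add the squares up.
    return sum(residuals_squared)


# Illustrative numbers only: three observations and a model's three predictions.
actual_values = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
rss = residual_sum_of_squares(actual_values, predictions)
print(rss)  # residuals are 1, 0, -2, so the squares 1 + 0 + 4 give 5.0
```

The variable names are kept identical to the pseudo-code on purpose, so the two can be read side by side.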

Greek Letters are Jargon

Another thing I’ve realized is that part of the reason math becomes so difficult for people is that it is taught entirely in jargon, shorthand, and shorthand for shorthand. The Greek letter sigma represents a sum, but if you don’t know that, it represents confusion. If you aren’t familiar with this formula, the other Greek letters could be meaningless as well. Yet if we used pseudo-code, any part of the formula could be rewritten using English words (or words from any other language) and thus be easily understood by anyone.

I’m working on developing a short course in Machine Learning called Crash Course in Machine Learning which I will be teaching at the BlackHat conference in August. I’m curious as to what people think about presenting algorithms using pseudo-code instead of math jargon. I suspect it will make it easier for people to understand without diluting the rigor.

On Data Science is Not A Binary Condition

Teaching Data Science in English (not in Math)
