Something is Rotten in the State of Data

I’m writing this blog post in the departure lounge at Heathrow, on my way back from Strata + Hadoop World, London. While I was at Strata, speakers kept coming back to the idea that an ever-growing number of large businesses are not really happy with the investments they have made in analytics and data science. One of the speakers quoted a 2016 Forrester survey which claimed that:

29% of firms are good at translating analytics results into measurable business outcomes
-20% change in satisfaction with analytics initiatives between 2014 and 2015
50% of firms expecting to see stagnation or a decrease in big data/data lake investments in 2016.

These are very disappointing numbers, though not entirely surprising. Using the “5 Whys” technique, an intra-management dialogue might go something like this:

Our company has not been able to generate value from our data analytics group.

Why?

The data analytics group does not really seem to understand the business and what problems we face.

Why?

Well… when we were building our data analytics group, we did what everyone else was trying to do and found really smart data scientists with Ph.Ds in physics or advanced math and gave them complete autonomy over what they were working on.

Why did you do that?

We were under the impression that you had to hire these so-called unicorns or people who had experience in all aspects of data science in order to build a good team. However, even though we paid them very well, several key members left the company after a few months.

Why did they do that?

Our existing staff really doesn’t understand data or know how to use these latest tools, so we wanted to try to get the most qualified people to work our problems.

So let’s examine some of the issues exposed here. The first issue which screams out to me is the staffing. In this instance the notional company hired individuals who had some of the requisite skills that people believe data scientists must have: math and scripting. To make sure they got good people, they hired people with Ph.Ds in Physics. The problem with this approach is that an advanced degree in physics or math does not necessarily prepare you for the application of these skills to a business problem.

Now I know I’m going to get some flak for that statement, but in my defense, I would offer the following. First, engineering firms do not hire people with math or physics degrees. Even though the skill sets are similar and often employ the same techniques, someone with advanced engineering training will have a better understanding of how to apply these techniques. The same is true in other disciplines. You cannot practice medicine unless you have a medical degree, no matter how much biology or anatomy you have taken. Yet, for data science, there is this perception that a Physics or Math Ph.D qualifies you to be a data scientist. To be fair, there are many individuals with that background who have been successful. However, this scenario leads me to the second major problem in data analytics, which is that this notional company is ignoring the third circle of the data science Venn diagram: domain expertise.

Don’t Underestimate the Importance of the Third Circle

I’m going to use Drew Conway’s famous (or perhaps infamous) Venn diagram to illustrate the problem. What the Physics/Math Ph.D is lacking is the bottom circle: substantive or domain expertise. I have personally witnessed this phenomenon in our own client engagements, where individuals calling themselves data scientists with strong math/physics backgrounds were unable to translate that knowledge into real value for the client. Most often, this is because they lack a deep understanding of the problems that the client is facing. (As an aside, I’ve also heard stories of colleagues being turned off by data scientists’ arrogance in believing that they knew more than the domain experts.)

An alternative approach is to build a data analytics/science capability from extant staff. Using this approach, you can start with individuals who have extensive domain knowledge, and most probably some limited math skills, and teach them the relevant data science skills needed to perform advanced analytics.

Over the last few years, I have been involved in developing training programs for this scenario: domain-specific data science training for working professionals. It has been interesting to see how graduates from our training have gone on to use their newly acquired skills to create real value for their clients. Now these individuals certainly do not have the math/stats background of someone with a Ph.D, but once they have the coding skills and some math/stats, their intimate knowledge of their domain enables them to do meaningful work. What’s more, learning these new skills made them enjoy their jobs more, as they were suddenly able to accomplish much more than they could before.

The bottom line here is that many companies overlook an obvious starting point for a data science capability: their own staff. I would argue that it can be just as effective, if not more so, to train existing staff in data science techniques because they know the problems that the business is facing.

Mind the Gap Between Hype and Reality

The final problem in our notional company isn’t quite as obvious: the company’s management doesn’t understand data. This unfortunate problem is exacerbated when data-illiterate people buy into the hype and start buying stuff with little planning or strategy as to what they hope to accomplish. The problem as I see it is not with the technologies nor with the salespeople who are trying to promote data products. The problem lies with managers who are data illiterate. Unfortunately, there are very few training courses in data science and big data that are designed to teach managers data literacy. Oh, there’s plenty of fluff and BS available, but very little in the way of real education for those in management positions seeking to build data-driven organizations.

What can be done?

In the end, the reality is that data science can contribute enormous value to companies that understand it. The solution to this is two-fold:

Managers need to become data literate and develop a much greater understanding of the power which they are trying to unleash. They will need to understand how to build, equip and maintain their data science teams so that they don’t lose their investment.

On the other side of the equation, the data science community needs to start placing a greater value on domain expertise. We should be building and delivering training targeted towards working professionals to teach them data science skills (such as my Crash Course in Data Science for Hackers) as well as longer programs to teach someone how to become a data scientist from scratch.

We should also not forget that data science is much more than machine learning and ETL. Data science is about solving business problems using data, and as such data scientists absolutely must also have the visualization and communication skills necessary to make managers understand their data products. If we fail to do this, the data revolution will wither on the vine, but if we succeed, I’m sure that more and more industries will realize the amazing value that data can bring to their businesses.
