November 2016 – The Dataist

I originally drafted a version of this article in August, but decided not to post it because I didn’t want my blog to become political commentary. However, given the shocking election results and the epic failure of the political polling and predictive analytics industries, I couldn’t resist sharing my thoughts on the matter. As a data scientist, I have been watching this election with a lot of anticipation and curiosity. My original draft was entitled “What happens to Data Science if Trump wins?” and in it I wrote some thoughts about what the impact on the data science world would be if Trump won. The main premise was that a Trump victory would disrupt how political campaigns are run and, most importantly, would call into question the analytics used to measure political campaigns. I also thought that the value of the super-creepy targeted advertising that Facebook and other social media sites use might come under scrutiny. But more on that later… Lastly, I’m attempting to write this article without infusing my own political opinions into the central arguments. If I am successful, the reader will have no idea what my political views are.

Ultimately, there are two questions which need answers:

How is it that nearly every reputable news source and polling agency incorrectly predicted the election results?
How can data science be used to avoid repeated errors of this scale?

The first point really bears some fleshing out. It wasn’t just that everyone predicted a Clinton victory; it was that nearly every source, including the vaunted Nate Silver, predicted a massive Clinton victory.

For this discussion, I will presuppose, perhaps naively, that the pollsters and other political analytics professionals are not themselves biased, and that they are in fact trying to give as accurate a prediction as possible without allowing their own opinions about the candidates to influence their analysis. With that said, I hypothesize that this election was a perfect storm of polling biases, groupthink, and poor use of data that in the end resulted in the massive failures that occurred on election day.

Imagine you have some process in your organization’s workflow that consumes 50% to 90% of your staff’s time and contributes no value to the end result. If you work in data science or data analytics, you don’t have to imagine it, because I’ve just described what is, in my view, the biggest problem in advanced analytics today: the Extract/Transform/Load (ETL) process. This range doesn’t come from thin air. Studies from a few years ago from various sources concluded that data scientists were spending between 50% and 90% of their time preparing their data for analysis (examples: Forbes, DatascienceCentral, New York Times). Furthermore, 76% of data scientists consider data preparation the least enjoyable part of their job, according to Forbes.

If you go to any trade show and walk the expo halls, you’ll see the latest whiz-bang tools to “leverage big data analytics in the cloud”, and you’ll be awed by some amazing visualization tools. Many of these new products can do amazing things… once they have data. That brings us back to our original problem: to use the latest whiz-bang tool, you still have to invest a considerable amount of time in data prep. The tools seem to skip that part and focus on the final 10% of the process.

A Data Scientist’s Perspective on the Election and What Went Wrong

The Biggest Problem in Data Science and How to Fix It