I originally drafted a version of this article in August, but decided not to post it because I didn’t want my blog to become political commentary. However, given the shocking election results and the epic failure of the political polling and predictive analytics industries, I couldn’t resist sharing my thoughts on the matter. As a data scientist, I have been watching this election with a lot of anticipation and curiosity. Back in August, my original draft was entitled “What happens to Data Science if Trump wins?” and in it I wrote some thoughts about what the impact on the data science world would be if Trump won. The main premise was that a Trump victory would disrupt how political campaigns are run and, most importantly, would call into question the analytics used to measure them. I also thought that the value of the super-creepy targeted advertising that Facebook and other social media sites use might come under scrutiny. But more on that later… Lastly, I’m attempting to write this article without infusing my own political opinions into the central arguments. If I am successful, the reader will have no idea what my political views are.
Ultimately, there are two questions which need answers:
- How is it that nearly every reputable news source and polling agency incorrectly predicted the election results?
- How can data science be used to avoid repeated errors of this scale?
The first point really bears some fleshing out. It wasn’t just that everyone predicted a Clinton victory; it was that nearly every source–including the vaunted Nate Silver–predicted a massive Clinton victory.
For this discussion, I will presuppose–perhaps naively–that the pollsters and other political analytics professionals are not themselves biased and are in fact trying to give as accurate a prediction as possible, without allowing their own opinions about the candidates to influence their analysis. With that said, I hypothesize that this election was a perfect storm of polling biases, groupthink, and poor use of data that in the end resulted in the massive failures that occurred on election day.