A Data Scientist’s Perspective on the Election and What Went Wrong

I originally drafted a version of this article in August, but decided not to post it because I didn’t want my blog to be political commentary. However, given the shocking election results and the epic failure of the political polling and predictive analytics industries, I couldn’t resist sharing my thoughts on the matter. As a data scientist, I have been watching this election with a lot of anticipation and curiosity. Back in August, my original draft was entitled “What happens to Data Science if Trump wins?” and in it, I wrote some thoughts about what the impact on the data science world would be if Trump won. The main premise was that a Trump victory would disrupt how political campaigns are run and, most importantly, that the analytics used to measure political campaigns would be called into question. I also thought that the value of the super-creepy targeted advertising that Facebook and other social media sites are using might get called into question. But more on that later… Lastly, I’m attempting to write this article without infusing my own political opinions into the central arguments. If I am successful, the reader will have no idea what my political views are.

Ultimately, there are two questions which need answers:

How is it that nearly every reputable news source and polling agency incorrectly predicted the election results?
How can data science be used to avoid repeated errors of this scale?

The first point really bears some fleshing out. It wasn’t just that everyone predicted a Clinton victory; it was that nearly every source–including the vaunted Nate Silver–predicted a massive Clinton victory.

For this discussion, I will presuppose–perhaps naively–that the pollsters and other political analytic professionals are not themselves biased, are in fact trying to give as accurate a prediction as possible, and are not allowing their own opinions about the candidates to influence their analysis. With that said, I hypothesize that this election was a perfect storm of polling biases, groupthink, and poor use of data that in the end resulted in the massive failures that occurred on election day.

Biased Data Leads to Inaccurate Results

If you consider the way that political polls are conducted, pollsters will contact individuals via phone or email, using either robo-calls or live operators, and ask a series of questions intended to gather data about how a political candidate is perceived in a given area or within a particular demographic group. While there is a lot of statistical rigor employed in this process to ensure a sample is representative of the population, in many ways, choosing a representative sample can be very subjective. Additionally, objective pollsters who are not trying to influence their results take great care in the wording of the questions, as well as the order in which they are asked, all to avoid skewing the results.

Social desirability bias skewed the results…

In our culture today, certain political views have become much more fashionable than others. A glaring example might be the current debate over a transgender individual’s right to use the washroom of their choice. Until recently, this topic was not in the national consciousness; however, in the last year it has been thrust into the limelight, and many individuals and companies have come out in support of this issue. On the other hand, a large segment of the American population disagrees with this position, but they may be uncomfortable expressing that opinion lest they be labeled a bigot, homophobe, etc. This example may seem imperfect, but there are countless examples of CEOs of major companies telling their employees that political opinions contrary to the prevailing view are not welcome. Most recently, the CEO of GrubHub told his employees that anyone who does not agree with his anti-Trump views should resign immediately. (Full GrubHub press release here) In another instance, Mark Zuckerberg reprimanded employees for writing “All Lives Matter” on a Black Lives Matter posting inside the Facebook headquarters. Why does this matter? Let’s say that you are a Christian conservative who works at one of these companies and does not share these opinions. Would you feel comfortable expressing an opinion that the CEO of your firm has stated is unacceptable? Or would you simply keep your mouth shut when asked?

The reluctance to express opinions perceived as unpopular carries over into political polling, where it is known as social desirability bias. The essential concept is that when people are asked about a subject that is not socially desirable, their answers will be skewed in one way or another. For instance, when asked about income, people with low incomes will often overstate their incomes and people with high incomes will understate them. The Wikipedia article cited here explains how this bias manifests itself but, interestingly enough, does not mention unpopular political views as a category. Ultimately, the end result is that when asked about unfashionable political views, people will downplay their opinions to pollsters. Good pollsters can and do adjust their findings, but this adjustment is somewhat subjective.
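To make the effect concrete, here is a minimal sketch of how concealed opinions could shift a poll, and why the correction is subjective. It is a deliberately simplified model, not a description of how any pollster actually adjusts results: the function names, the single "concealment rate" parameter, and all of the numbers are assumptions made up for illustration.

```python
def observed_support(true_support: float, concealment_rate: float) -> float:
    """Share of respondents who admit to holding a view.

    true_support: actual fraction of the population holding the view.
    concealment_rate: hypothetical fraction of those holders who hide it
        from the pollster (for example by claiming to be undecided).
    """
    return true_support * (1.0 - concealment_rate)


def corrected_support(observed: float, assumed_concealment_rate: float) -> float:
    """Invert the model above using the pollster's guess at the concealment rate."""
    return observed / (1.0 - assumed_concealment_rate)


# Illustrative numbers only, not real polling data.
true_share = 0.50
polled = observed_support(true_share, concealment_rate=0.10)
print(f"true support: {true_share:.3f}, polled support: {polled:.3f}")

# The pollster does not know the real concealment rate and has to guess it.
for guess in (0.05, 0.10, 0.15):
    estimate = corrected_support(polled, guess)
    print(f"assumed concealment {guess:.2f} -> corrected estimate {estimate:.3f}")
```

The corrected estimate only matches the true figure when the guessed concealment rate happens to equal the real one, which is the subjective part of the adjustment.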

This phenomenon is not limited to the US. This year, a similar situation played out in Great Britain in the so-called Brexit referendum. For those who did not follow the issue, Brexit was a referendum in which the British public was asked to choose whether to leave or remain in the European Union. In a drama which was eerily similar to the recent US elections, supporters of the “leave” campaign were portrayed by the media as xenophobic, Islamophobic, racist, bigoted ignoramuses. On the day of the referendum, some polling had the “remain” campaign up by as much as 10 percentage points. However, it was not to be. The “leave” campaign won 51.9% to 48.1% with massive voter turnout. Nate Silver has posted a series of mea culpas about why his organization missed major elections, including the 2015 UK elections and the 2015 Israeli elections. What is important to note is that in all these instances, it is the conservative position which has been underestimated in pre-election polling, leading me to suspect that social desirability bias may be playing a role in the polling results. While some pollsters dismiss this possibility, others do acknowledge it.

… and GroupThink compounds the error…

Before working as a data scientist, I worked as an intelligence analyst at the CIA. During our analytic training, we studied many different intelligence failures, and one of the main causes is a phenomenon known as GroupThink. You can read the Wikipedia page about GroupThink here, but in an analytic context, it means that members of a group will arrive at irrational or erroneous conclusions in order to avoid disturbing the group. A famous example of an intelligence failure that resulted from GroupThink was the Yom Kippur war in 1973. For those not familiar, the Egyptians and Syrians launched a major offensive against Israel on the holiest day of the Jewish year–Yom Kippur. After the war, the Israeli government investigation concluded that Israeli Intelligence had ample information to conclude that an attack was imminent; however, the political leaders chose to disregard clear evidence due to commonly held assumptions which were never challenged. Though the term was not yet in common use at the time, it was GroupThink.

Translating this to today’s situation, it would have taken a very courageous pollster to come out with a poll predicting a Trump victory when every single other poll predicted the opposite, even if the data supported that conclusion. From an outsider’s perspective, it would seem to me as if the public polls took a very conservative approach in their analysis of the data, and the result is that everyone misread the data.

…along with incomplete tradecraft.

Speaking here as a former intelligence analyst, one of the fundamental tenets of intelligence analysis is to make use of all available data. I believe that one of the fatal flaws of the pollsters is that they seem to base their predictions upon polls alone. It’s interesting to compare this to financial analysis. Financial analysts are always looking for new data sources to help them make more accurate assessments. From the outside, it does not appear as if political analysts do the same, and perhaps an increased use of more diverse data sets might lead to more accurate predictions.

As a non-professional observer of the political process, one of the things I noticed this cycle was how the different candidates made use of social media–specifically Twitter. Mr. Trump was very active on Twitter and had an extremely large following. He tweeted often about all kinds of things, and the tweets actually sounded like he was the one composing them. In contrast, Mrs. Clinton’s posts often featured her own quotes (as in, she quoted herself), but in general they read as if they were written by a member of her media relations team. They were often written in the third person.

Mr. Trump also has 4 million more followers than Mrs. Clinton and out-tweets her by a factor of four. Looking at the tweets, there are some other metrics which jumped out at me. On election day, Mr. Trump’s tweets were retweeted in the 10k range and received on average around 40k likes. In contrast, Mrs. Clinton’s tweets on election day received about 1/4 the engagement, with about 2-3k retweets and about 10k likes.

This kind of engagement is mirrored on Facebook. Mr. Trump’s page has 14.5 million likes and 880,000 followers, whereas in contrast Mrs. Clinton has 9.4 million likes, but only 26,000 followers. For engagement, the numbers are all over the place, but a cursory glance reveals that Mr. Trump’s audience is FAR more engaged with his campaign on Facebook than Mrs. Clinton’s audience. I’m certain that with the complete data, one could derive a metric to reflect the level of engagement on social media.
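As a rough illustration of what such a metric could look like, here is a minimal sketch that combines retweets and likes into a single per-post score. The `PostStats` class, the `engagement_score` function, and the equal weights are all assumptions invented for this example; the inputs are only the approximate election-day Twitter figures quoted above (with 2.5k standing in for "2-3k" retweets), not a complete data set.

```python
from dataclasses import dataclass


@dataclass
class PostStats:
    """Approximate average per-post figures for one account."""
    retweets: float
    likes: float


def engagement_score(stats: PostStats,
                     retweet_weight: float = 1.0,
                     like_weight: float = 1.0) -> float:
    """Weighted sum of interactions; the weights are an arbitrary choice."""
    return retweet_weight * stats.retweets + like_weight * stats.likes


# Rough election-day figures from the article, not measured data.
trump = PostStats(retweets=10_000, likes=40_000)
clinton = PostStats(retweets=2_500, likes=10_000)  # midpoint of "2-3k" retweets

ratio = engagement_score(clinton) / engagement_score(trump)
print(f"Clinton / Trump engagement per post: {ratio:.2f}")
```

With equal weights the ratio comes out at 0.25, which matches the "about 1/4 the engagement" estimate. A real metric would presumably also need to normalize by follower count and posting frequency, which this sketch does not attempt.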

Another example of social media playing a significant role was in the 2014 gubernatorial election in Maryland. Democrat Anthony Brown–hand-picked successor of previous Governor Martin O’Malley–was running against Republican candidate Mr. Larry Hogan and was ahead in the polls by a considerable margin. Yet Mr. Hogan ran an effective campaign and, much like Mr. Trump, made extremely effective use of social media. In a stunning upset in perhaps the 2nd most Democratic state in the Union, Mr. Hogan ended up defeating Mr. Brown by roughly 51% to 47%. You can read FiveThirtyEight’s mea culpa here on that election. Mr. Hogan remains extremely popular, with approval ratings upwards of 70%.

When uncertain, quote Heisenberg

In physics there is a principle called the observer effect, which refers to the fact that by simply observing a system, you affect its results. In the case of using political metrics for predictive analytics, it might be more accurate to use metrics in which the subjects are not aware they are being measured. Social media metrics might be one such measurement. Social media allows us to gather real data about engagement without the subjects knowing their activity is being measured, or at least not caring. Social media really has only played a role in the last few elections, but I strongly suspect that a model favoring features such as social media engagement might be more predictive and less biased than phone polls. Other measurements might include things like event attendance as a predictor of political victory. Accurate numbers are certainly more difficult to obtain, but it seemed clear that Mr. Trump was able to draw much larger crowds at events than Mrs. Clinton was–particularly in battleground states.

All of these metrics mentioned here point to greater engagement favoring Mr. Trump. I would hypothesize that quantifying engagement might be more predictive and more accurate in the long term than the current techniques of opinion polls.

Over-reliance on National Polls

My final question really leaves me baffled. Prior to the elections, news outlets constantly would produce the latest national poll indicating that Clinton was ahead. Interestingly enough, these polls were technically correct… Mrs. Clinton did in fact win the popular vote and hence won the only national poll that matters. However, we do not elect our President on this basis, and hence I question the value of this information.

It reminds me of an old joke in which a man in a hot-air balloon is lost. He sees someone on the ground and asks where he is. The person on the ground replies, “You’re in a balloon.” The man in the balloon instantly knew where he was at that point and was able to get home safely. When asked how he did it, he replied, “Well, when I was told that I was in a balloon, I knew that I was by Microsoft Help Desk Headquarters. They gave me an answer that, while correct, was totally useless. But from there, I was able to figure out where I was and how to get home.” I suspect that the same is true for the national polls. They are what might be called “vanity metrics”. While in most cases the popular vote winner is in fact the actual winner, there have been several times in history when the popular vote winner did not win the electoral college and did not become president. The reality is that presidential elections are fought in key battleground states. Polling these states is probably more predictive and more useful than national polls.
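A toy example shows how a candidate can lead the national count and still lose under a state-by-state, winner-take-all system. The three "states", their elector counts, and the vote totals below are entirely made up for illustration, and the winner-take-all rule is a simplification of how electors are actually allocated.

```python
from collections import Counter

# Hypothetical results: state -> (electors, {candidate: votes}).
# Every number here is invented for illustration.
RESULTS = {
    "Big":  (10, {"A": 900_000, "B": 400_000}),
    "Mid1": (8,  {"A": 480_000, "B": 520_000}),
    "Mid2": (8,  {"A": 490_000, "B": 510_000}),
}


def popular_totals(results):
    """Sum raw votes per candidate across all states."""
    totals = Counter()
    for _electors, votes in results.values():
        totals.update(votes)
    return totals


def electoral_totals(results):
    """Give each state's electors to that state's plurality winner."""
    totals = Counter()
    for electors, votes in results.values():
        winner = max(votes, key=votes.get)
        totals[winner] += electors
    return totals


popular = popular_totals(RESULTS)
electoral = electoral_totals(RESULTS)
print("popular vote:  ", dict(popular))
print("electoral vote:", dict(electoral))
```

Candidate A wins the combined vote by running up a large margin in one state, while candidate B wins two close states and with them the majority of electors. A national poll measures the first quantity; only state-level polling speaks to the second.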

Changing the Paradigm: What do we do now?

An article in the Wall Street Journal by L. Gordon Crovitz, entitled Trump’s Big Data Gamble, about the use of data science in the elections caught my eye. The WSJ article highlights how the Clinton campaign and the Democratic Party in general have made very extensive use of data science to precisely target their message, whereas the Trump campaign has not, preferring a strategy of blasting out its message to whoever will listen. Mr. Trump and his staff are clearly expert users of social media, and Crovitz quotes an estimate that Mr. Trump has received over $2 billion in advertising from his social media presence.

The WSJ article ends with the statement: “Mr. Trump may not like it, but data from past presidential elections finds an undeniable correlation: The candidate with the best data is the winner.” I don’t know how you draw a conclusion from two, maybe three elections where big data could have played a role, but logically, from this statement, one would expect that Mrs. Clinton, with her extensive use of data, would be a shoo-in for the White House. However, as we see now, that prediction was also incorrect. Unless things changed since Crovitz wrote that article, the candidate with the best data lost.

Money matters, or does it?

Money has also been used as a predictor of election results. In past elections, money was used to purchase advertising, and hence the candidate with the most money could hire the best publicists and purchase the best advertising. I believe that we are seeing a shift in political campaigns, in that campaigns are able to overcome a lack of financial resources by investing a lot of time in creating a strong social media presence for their candidate. Mr. Trump obviously did this, as did Mr. Hogan in Maryland. I’m sure there are other examples. In future elections, if candidates are able to engage with prospective voters through social media, which does not cost as much, the importance of fundraising, and its value as a predictor of election results, might be somewhat diminished. It is also important to note that Mr. Trump won largely without paid advertising on social media. Most of his activity was simply his own postings.

Since Mr. Trump has won without the assistance of creepy hyper-targeted social media advertising, what does that say about the efficacy and value of this advertising? As you may or may not be aware, Facebook has an entire business unit dedicated to political advertising and offers services for building a supporter base, matching voter files, scaling the “get out the vote” effort and more. I would be curious as to Facebook’s track record in helping its clients win elections.

Data Science Has Limits

Bringing this back to the implications for data science, the first thing that will have to happen is that data scientists and the data analytics industry will have to examine all the hyperbolic claims which are being made about data science. The Trump victory should remind the industry that data science cannot take a poor product (or candidate) and through some wizardry convince the world to accept the product. In short, data is powerful when used correctly, but also has limits. Data science can help a company allocate resources most efficiently and can certainly give a company a competitive advantage, but it cannot convert a Yugo into a Rolls Royce. Both candidates in this election were deeply flawed as political candidates, but Mr. Trump managed to convince the voters that he was a better candidate than Mrs. Clinton.

Trump’s victory might also have somewhat of a dampening effect on ubiquitous data collection. The current mindset of data collectors is to collect as much data about everything as possible–often at the expense of individual privacy. In today’s world, data collection platforms such as Facebook and Google often lull their customers into a sense of complacency by offering free services in exchange for their customers’ most personal data–such as for whom they intend to vote. In the wake of Trump’s victory, I doubt that the volume of data collection will change drastically; rather, the types of data collected might change. Data scientists and pollsters will have to reexamine the data to see what metrics were present that could be predictive of a campaign’s victory or defeat.

TL;DR

In conclusion, I hypothesize that the epic failure to predict Trump’s victory was caused by a perfect storm of biased polling data (not pollsters), the rising importance of social media in political campaigns, and the failure of the polling industry to notice that fact. I hope that in future campaigns, political analysts will make greater use of metrics such as social media engagement and event attendance to form a more accurate model of public opinion.
