It’s been an interesting year both career-wise and generally. Over the last year, I’ve had the amazing opportunity to speak at numerous conferences and to teach classes in data science and Apache Drill all over the world. I’ve also learned a lot about the internals of Drill and even contributed to the codebase. With that said, one can never rest on one’s laurels, and as such I have a lot in store for the coming year.
Category: General Thoughts
I originally drafted a version of this article in August, but decided not to post it because I didn’t want my blog to be political commentary. However, given the shocking election results and the epic failure of the political polling/predictive analytics industries, I couldn’t resist sharing my thoughts on the matter. As a data scientist, I have been watching this election with a lot of anticipation and curiosity. Back in August, my original draft was entitled “What happens to Data Science if Trump wins?” and in it, I wrote some thoughts about what the impact on the data science world would be if Trump won. The main premise was that a Trump victory would disrupt how political campaigns are run and, most importantly, would call into question the analytics used to measure political campaigns. I also thought that the value of the super-creepy targeted advertising that Facebook and other social media sites are using might get called into question. But more on that later… Lastly, I’m attempting to write this article without infusing my own political opinions into the central arguments. If I am successful, the reader will have no idea what my political views are.
Ultimately, there are two questions which need answers:
- How is it that nearly every reputable news source and polling agency incorrectly predicted the election results?
- How can data science be used to avoid repeated errors of this scale?
The first point really bears some fleshing out. It wasn’t just that everyone predicted a Clinton victory; nearly every source–including the vaunted Nate Silver–predicted a massive Clinton victory.
For this discussion, I will presuppose–perhaps naively–that the pollsters and other political analytics professionals are not themselves biased, and are in fact trying to give as accurate a prediction as possible without allowing their own opinions about the candidates to influence their analysis. With that said, I hypothesize that this election was a perfect storm of polling biases, groupthink, and poor use of data that in the end resulted in the massive failures that occurred on election day.
Imagine you have some process in your organization’s workflow that consumes 50%-90% of your staff’s time and contributes no value to the end result. If you work in the data science or data analytics fields, you don’t have to imagine that, because I’ve just described what is, in my view, the biggest problem in advanced analytics today: the Extract/Transform/Load (ETL) process. This range doesn’t come from thin air. Studies from various sources a few years ago concluded that data scientists were spending between 50% and 90% of their time preparing their data for analysis. (Examples from Forbes, DatascienceCentral, New York Times) Furthermore, 76% of data scientists consider data preparation the least enjoyable part of their job, according to Forbes.
If you go to any trade show and walk the expo halls, you’ll see the latest whiz-bang tools to “leverage big data analytics in the cloud”, and you’ll be awed by some amazing visualization tools. Many of these new products can do amazing things… once they have data, which brings us back to our original problem…that in order to use the latest whiz-bang tool you still have to invest considerable amounts of time in data prep. The tools seem to skip that part, and focus on the final 10% of the process.
I recently read Taming the Big Data Tidal Wave by Bill Franks of Teradata, and in the book (which is going on my recommended reading list) he has a section about the ideal analyst. While I am admittedly very biased on this one, Mr. Franks makes a very good point that in many instances the best analysts have a musical or other creative ability in addition to math and computer science skills. In my experience, the best data scientists that I’ve worked with have all had some creative side to them–be it music, art or whatever. Thus, here is my case for why playing an instrument is perhaps some of the best preparation for thinking like an analyst.
Musicians are trained in ETL
This may seem out of place, but consider what happens when a musician receives a piece of music to play for the first time. Most musicians will read through the sheet music and either sing through it via solfege or otherwise work through it in their heads. Every musician has their own method, but basically, they’ll transform the notes on the page into a mental version of the music.