The Dataist Posts

A Data Scientist’s Perspective on the Election and What Went Wrong

I originally drafted a version of this article in August, but decided not to post it because I didn’t want my blog to become political commentary.  However, given the shocking election results and the epic failure of the political polling and predictive analytics industries, I couldn’t resist sharing my thoughts on the matter.  As a data scientist, I have been watching this election with a lot of anticipation and curiosity.  Back in August, my original draft was entitled “What happens to Data Science if Trump wins?” and in it, I wrote some thoughts about what the impact on the data science world would be if Trump won.  The main premise was that a Trump victory would disrupt how political campaigns are run and, most importantly, that the analytics used to measure political campaigns would be called into question.  I also thought that the value of the super-creepy targeted advertising that Facebook and other social media sites are using might get called into question.  But more on that later…  Lastly, I’m attempting to write this article without infusing my own political opinions into the central arguments.  If I am successful, the reader will have no idea what my political views are.

Ultimately, there are two questions which need answers:

  1. How is it that nearly every reputable news source and polling agency incorrectly predicted the election results?
  2. How can data science be used to avoid repeated errors of this scale?

The first point really bears some fleshing out.  It wasn’t just that everyone predicted a Clinton victory, it was that nearly every source, including the vaunted Nate Silver, predicted a massive Clinton victory.

For this discussion, I will presuppose, perhaps naively, that the pollsters and other political analytics professionals are not themselves biased, that they are in fact trying to give as accurate a prediction as possible, and that they are not allowing their own opinions about the candidates to influence their analysis.  With that said, I hypothesize that this election was a perfect storm of polling biases, groupthink, and poor use of data that in the end resulted in the massive failures that occurred on election day.

The Biggest Problem in Data Science and How to Fix It

Imagine you have some process in your organization’s workflow that consumes 50%-90% of your staff’s time and contributes no value to the end result.  If you work in the data science or data analytics fields, you don’t have to imagine that, because I’ve just described what is, in my view, the biggest problem in advanced analytics today: the Extract/Transform/Load (ETL) process.  This range doesn’t come from thin air.  Studies from a few years ago from various sources concluded that data scientists were spending between 50% and 90% of their time preparing their data for analysis (examples from Forbes, DatascienceCentral, and the New York Times).  Furthermore, 76% of data scientists consider data preparation the least enjoyable part of their job, according to Forbes.

If you go to any trade show and walk the expo halls, you’ll see the latest whiz-bang tools to “leverage big data analytics in the cloud”, and you’ll be awed by some amazing visualization tools.  Many of these new products can do amazing things… once they have data, which brings us back to our original problem: in order to use the latest whiz-bang tool, you still have to invest considerable amounts of time in data prep.  The tools seem to skip that part and focus on the final 10% of the process.

Tips for Debugging Code without F-Bombs – Part 2

This post is a continuation of my previous tutorial about debugging code in which I discuss how preventing bugs is really the best way of debugging.  In this tutorial, we’re going to cover more debugging techniques and how to avoid bugs.

Types of Errors:

Ok, you’re testing frequently, and using good coding practices, but you’ve STILL got bugs.  What next??  Let’s talk about what kind of error you are encountering, because that will determine the response.  Errors can be reduced to three basic categories: syntax errors, runtime errors, and the most insidious of all, intent errors.  Let’s look at syntax errors first.
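
To make the three categories concrete, here is a minimal sketch in Python. The snippets and the `mean` helper are invented for illustration and are not taken from the tutorial itself; they only show how differently each kind of error announces itself (or doesn't).

```python
# 1. Syntax error: the code cannot even be parsed.
#    compile() is used here so the broken snippet can be shown safely.
try:
    compile("total = (1 + 2", "<example>", "exec")
except SyntaxError as err:
    syntax_error = type(err).__name__

# 2. Runtime error: the code parses fine, but fails while running.
try:
    average = sum([]) / len([])
except ZeroDivisionError as err:
    runtime_error = type(err).__name__

# 3. Intent error: the code runs without complaint, but does the wrong thing.
def mean(values):
    # Bug: integer division silently drops the fractional part.
    return sum(values) // len(values)

intent_result = mean([1, 2])  # the author meant 1.5

print(syntax_error, runtime_error, intent_result)
```

The first two categories at least raise an exception that points you somewhere. The third raises nothing at all, which is what makes it the hardest to find.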

Why Musicians Make Good Analysts

I recently read Taming the Big Data Tidal Wave by Bill Franks of Teradata and in the book (which is going on my recommended reading list) he has a section about the ideal analyst.  While I am admittedly very biased on this one, Mr. Franks makes a very good point that in many instances the best analysts have a musical or other creative ability in addition to math and computer science skills.  In my experience, the best data scientists that I’ve worked with have all had some creative side to them, be it music, art or whatever.  Thus, here is my case for why playing an instrument is perhaps some of the best preparation for thinking like an analyst.

Musicians are trained in ETL

This may seem out of place, but consider what happens when a musician receives a piece of music to play for the first time.  Most musicians will read through the sheet music and either sing through it via solfege or use some other method of their own, but basically, they’ll transform the written notes on the page into a mental version of the music.

Drill UDF List

I’ve been working on developing some custom functions for Drill, or User Defined Functions (UDFs), and I realized that there really should be a repository for Drill UDFs.  I’ve decided to create a page with links to all the UDFs that I know of.  I’ll keep this updated, so if you have Drill UDFs that you want to share, please email me a link and I’ll put it up.

Fixing STEM Education

To both of my loyal readers, I apologize for not writing anything in a while, but I have been absolutely slammed with classes and conference presentations.  Anyway, I’ve been doing a lot of thinking about my earlier post about Teaching Data Science in English.   The post provoked a decent response, mostly positive.

One reader sent me the following comment about my post which I’ve decided to quote (with permission) in its entirety because I think it accurately reflects why people get so frustrated when they try to learn mathematical concepts. What interested me was that this individual took action and “translated from mathspeak to English” and all of a sudden she was able to understand the underlying concepts.

Awhile ago I read a piece you had written on LinkedIn about making ‘mathspeak’ and ‘techspeak’ (i.e. coding) more accessible to regular people, by decreasing mathematical notation usage and increasing the use of real words in explanations of formulas and concepts. It was something that stayed with me because I’ve always understood broader mathematical concepts but have always had trouble with the mechanics, and I think a lot of that has had to do with the amount of notation used…math seems like a foreign language sometimes, and there are 2 levels of understanding: the first is merely deciphering the ‘foreign language’, which already puts me out of my comfort zone (think reading Spanish or French if you are a native English speaker) and then understand the underlying concepts, which becomes harder due to the fact that it’s written in a ‘non-native’ language. Recently I’ve started taking an online course in machine learning on XXX. Already in the second lesson, he dove straight into notation-filled formulas, and I was starting to get that overwhelmed feeling that I’m familiar with from previous years of math. But I had what you wrote in my mind, and I thought I’d give it a shot and manually ‘translate’ the formulas and equations into English, and stick with that. Well, I did that, and it worked so well. I feel that I am able to follow along with the underlying theory of the class and by extension, the formulas and algorithms he presented in ‘mathese’ whereas before I would have shut-down and assumed it was beyond my grasp. Thanks so much for highlighting this aspect of the math/English understanding divide. It is continuously helpful for me. (Emphasis mine)

I’d like to share another related story.  One of my first paying jobs was working as the web developer for KUAT public television (www.kuat.org), and I wanted to do some things that required automating a data flow from an archaic DOS-based database.  I was teamed up with a programmer who helped me build the process and in doing so, I learned how to write regular expressions.  I got so into it, I nearly automated myself out of a job.

Fast forward a year or so, when I was nearly done with my CS degree, I had to take an upper-level CS course about Automata, Grammars and Languages, which included regular expressions in the course description.  I was pretty excited because by this point, I had become a master at regular expressions and was looking forward to a class where I already knew some of the material going in.  Boy was I in for a shock.  When we got to the regular expressions section, it degenerated into a plethora of Greek letters and assorted jargon to the point where I truly loathed going to class.

Theory Should Not Be Taught at the Expense of Application

What I also realized in that CS class was that while most of my fellow students may have passed the tests, they did not have any clue how to use regular expressions in real life, or why you would want to use them in the first place.  While we were spending time writing expressions that match ‘aaaaaaabababaaaaa‘ and drawing the automata that “implement” them, the knowledge of how to apply this to a real-life problem, such as extracting data artifacts from raw data, was completely lost on the class.

What if the instructor had started the class by showing us this:

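import re

text = 'Contact: jane.doe@example.com'  # sample input, made up for illustration
pattern = r'([a-zA-Z0-9_.]+)@([a-zA-Z0-9_.]+\.\w{2,3})'
matchObj = re.search(pattern, text)  # search() scans the whole string; match() only checks the start
if matchObj:
    email = matchObj.group(0)    # the full address
    account = matchObj.group(1)  # the part before the @
    domain = matchObj.group(2)   # the part after the @

If you’re not familiar, this brief Python example demonstrates how to match an email address in text and extract the full address, the account, and the domain.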
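
To push the same idea a little closer to “extracting data artifacts from raw data”, here is a minimal sketch that pulls every address out of a block of text. The log lines, the `EMAIL` constant, and the `extract_emails` helper are all made up for illustration. The pattern is the same one as above, with named groups added for readability, and it is deliberately simple rather than a complete email validator.

```python
import re

# Hypothetical raw log lines, invented for illustration.
raw_data = """\
2016-05-01 login ok user=alice@example.com ip=10.0.0.1
2016-05-01 login failed user=bob.smith@mail.example.org ip=10.0.0.2
2016-05-02 no address on this line
"""

# Same pattern as above, with named groups for the account and the domain.
EMAIL = re.compile(
    r'(?P<account>[a-zA-Z0-9_.]+)@(?P<domain>[a-zA-Z0-9_.]+\.\w{2,3})'
)

def extract_emails(text):
    """Return an (account, domain) tuple for every address found in text."""
    return [(m.group('account'), m.group('domain'))
            for m in EMAIL.finditer(text)]

emails = extract_emails(raw_data)
for account, domain in emails:
    print(account, domain)
```

Using `finditer` instead of a single `search` is what turns the one-off match into an extraction step: it walks the whole input and yields one match object per address, skipping lines that contain none.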

I don’t think I’m saying anything new here, but too many technical classes, both in academia and out, spend a disproportionate amount of time on the underlying theory, whilst simultaneously ignoring or downplaying the actual application of the concepts being taught.  The result is that many students walk away frustrated, not understanding the actual use of what they are learning, and while professors and instructors may pat themselves on the back for preserving the “purity” of their curricula, I would argue that they have utterly failed in their task of educating their students.

The bottom line here is that some people are really interested in theory; however, for knowledge to be translated into something useful, students should be exposed early and often to a theory’s application.  If you are designing some STEM training or a class at a university, don’t forget the importance of demonstrating how to apply the concepts you are teaching.

Conference Reflections Part 1: Open Data Science Conference East

My employer is amazing and in the last two months, they’ve allowed me to attend a lot of data science conferences and I thought I’d share some general reflections on my experiences.

Open Data Science Conference: A Great Value for Newcomers

I gave two presentations this year at Open Data Science Conference (ODSC) East and I just wanted to put it out there that if you are new to data science or are just interested in learning more about data science, then ODSC is a really great venue to meet incredibly talented individuals as well as attend high-quality technical talks.

Something is Rotten in the State of Data

I’m writing this blog post in the departure lounge at Heathrow, on my way back from Strata + Hadoop World, London.  Whilst at Strata, speakers kept coming back to the idea that an ever growing number of large businesses are not really happy with the investments they have made in analytics and data science.  One of the speakers quoted a Forrester 2016 survey which claimed that:

  • 29% of firms are good at translating analytics results into measurable business outcomes
  • -20% change in satisfaction with analytics initiatives between 2014 and 2015
  • 50% of firms expect stagnation or a decrease in big data/data lake investments in 2016

These are very disappointing numbers, but not entirely surprising.  Using the “5 Whys” technique, an intra-management dialogue might go something like this:

A Social Contract for Data Collection

I just returned from Strata + Hadoop World in San Jose, where I gave a talk entitled “Kosher Collection: Best Practices in Data Handling“.  I really had an amazing time at Strata this year and major kudos to the organizers for putting on a great show.

The central premise of my talk is that in today’s world, there is a social contract between data collectors and consumers.  Essentially the agreement is that consumers give their personal data to a data collector in exchange for mutual benefit.  The problem is that consumers, in general, lack an understanding of the technology as well as data collection and as a result, are unable to provide informed consent.  Furthermore, this issue is likely to be exacerbated in the future as the opportunity to opt-out of mass data collection is disappearing.
