A big interest of mine is how to impart what little I know of the tools and techniques of data science to others. When I was at Booz Allen, I taught numerous classes, both for internal staff and for various clients. I’ve also taught for Metis, O’Reilly Publishing and, for the last three years, at BlackHat, so I do have some experience in the matter. I’ve looked at MANY data science programs to see if what they are teaching lines up with what I’m teaching, and I’d like to share some things I’ve noticed that will hopefully help you build a better data science program. My goal here is to share my mistakes and experiences over the years so that, if you are building a data science training program, you can learn from what I learned the hard way. I make no claims to be the perfect data science instructor, and I’ve made plenty of mistakes along the way.
While I’m at it, I’ll put in a plug for an upcoming data science class which I am teaching with Jay Jacobs of BitSight Security at the O’Reilly Security Conference in NYC, October 29-30th.
Really, data science instruction is an optimization problem: as an instructor, your goal is to minimize confusion while maximizing understanding. To do this, you must clear the students’ path of as many obstacles as possible that create dissonance. This may seem silly, but I have observed that small errata in your code, or code that doesn’t work on a student’s machine, even due to something they did, significantly detracts from their learning experience and from their opinion of you as an instructor. Therefore, removing all these obstacles to understanding is vital to your success as an instructor.
Define Clear Goals
Most of the courses I’ve taught are short classes targeted at working professionals, but regardless of how long the class is, you can’t teach the students every data science technique under the sun. This statement should be obvious, yet I have seen many classes which attempt to cram the entire class full of algorithms without putting them in context. When I design a course, I try to make sure that every technique and concept has a purpose which relates to the purpose of the class. Most of my recent courses are targeted towards security professionals. I have observed that many security problems can be cast as either clustering or classification problems, so those are the techniques I place the most emphasis on in my classes. That doesn’t mean I wouldn’t teach regression at all, but if time is limited, I am going to prioritize classification techniques.
To me, it is really important that students understand how to apply the techniques they are learning. So when I teach, I will always include case studies demonstrating how others have used the technique being taught, as well as conceptual problems to get students thinking about framing a data science problem. One example of this might be dividing the class into groups and having them explain (no code) how they would design a system to detect fraudulent credit card transactions. I like to encourage the students to think about what kinds of data they might need to accomplish this task, as well as other issues (privacy, latency, compliance, etc.). I try to give students diverse problems from different industries just to get them thinking about the various issues they will likely encounter.
Good Analysis Comes from Good Process
Throughout my career doing analytic work, I have come to appreciate the idea that good analysis comes from good process. Since data science is inherently an analytic activity, it is vital that students understand the thought and technical processes behind data science, from end to end. This means teaching the idea of forming an analytic question which your analysis seeks to answer, followed by a hypothesis which answers that question. (Sound familiar?) This may seem obvious, but I’ve found that students tend to forget that all the code and math you are teaching has a purpose.
I’ve written about this in past postings, but I believe that much can be learnt from failures. When I was at the CIA, we spent considerable amounts of time studying past intelligence failures to see where the analysis went wrong. This same approach can and should be applied to data science, and it is something which I try to teach to my students. Generally I will try to bring in data science or other analytic failures and have the students discuss what went wrong and how that can be prevented. Many failures can be traced back to bias or faulty analytic processes that should have caught the biases.
Don’t Forget About the Data!
A big criticism that I have of many data science programs is that they do not place enough emphasis on the process of getting and cleaning data–instead choosing to use the Iris data set, or the Boston Housing data set, or any of the other commonly used data sets which are available online. The issue I have with using these datasets comes down to context. Most of my students aren’t trying to classify flowers, or predict housing prices in Boston. They are trying to find anomalies in their network, or find malware and I’ve found that it can be difficult for students to apply what they’ve learned to their real world issues. I actually had a student once refer to one of these datasets as a “pile of numbers”. What she was really trying to say was that she didn’t have any real connection to the dataset that we were working with and had no context.
Getting back to the original point, I usually like to spend more time than most really getting into the weeds with Pandas, Drill and other data gathering and cleaning tools. In my opinion, the stronger a data scientist is at getting and cleaning data, the less time they will have to spend doing this and therefore the more time they will spend doing the really hard task of actually analyzing the data.
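To make that concrete, here is a minimal sketch of the kind of cleaning pass I have in mind with pandas. The column names and data here are invented for illustration, and the handling choices (what counts as missing, what gets dropped) are exactly the judgment calls students need to practice:

```python
import io

import pandas as pd

# Hypothetical raw log data with the usual real-world problems:
# inconsistent missing-value markers and strings where numbers should be.
raw = io.StringIO(
    "src_ip,bytes_sent,duration\n"
    "10.0.0.1,1024,3.2\n"
    "10.0.0.2,,1.1\n"
    "10.0.0.3,N/A,0.9\n"
)

# Treat "N/A" as missing on the way in.
df = pd.read_csv(raw, na_values=["N/A"])

# Coerce types and decide explicitly how to handle missing values.
df["bytes_sent"] = pd.to_numeric(df["bytes_sent"], errors="coerce").fillna(0)

# Drop any rows still missing critical fields.
df = df.dropna(subset=["src_ip", "duration"])

print(len(df))                     # 3 rows survive
print(df["bytes_sent"].tolist())   # [1024.0, 0.0, 0.0]
```

The point of walking through something like this in class is not the specific calls, but forcing students to articulate *why* each decision (fill with zero? drop the row?) is appropriate for their data.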
Use Real World Data for Real World Problems
One of my personal favorite problems to use to teach machine learning–particularly classification–is DGA detection. If you aren’t familiar with DGAs, the acronym stands for Domain Generation Algorithm; these algorithms are used by botnets to generate large numbers of domains for communicating with a control server. These domains look like this: ceroyw4rytrdfvtewcsd.com or qwedcweiuterifsd.com. It is easy for a human to look at these domains and see that they are gibberish, but it is not so easy to get a computer to do that. Since botnets generate new domains daily, it is not possible to defeat them with whitelists or blacklists, and hence machine learning can be very helpful in identifying and blocking these kinds of attacks. There has been considerable academic research on this topic as well. I like this problem because it is a real world problem and it is easy to generate tons of data to use. Additionally, there are a lot of ways to approach the problem, from simple classifiers to deep learning. Most importantly, the students have to do the feature engineering, and in doing so they gain an understanding of the process and of the data. They also learn how to work with real data that isn’t perfect and requires work to get it to the point where it can be analyzed.
The URL data is quite versatile: it can also be used for clustering and multi-class classification problems, and it is easy to understand even if the students don’t have a security background.
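To give a feel for the feature engineering step, here is a minimal sketch of the kind of features students might derive from a domain name. The specific features shown–length, character entropy, vowel ratio–are illustrative choices commonly seen in the DGA literature, not a complete solution:

```python
import math
from collections import Counter

def shannon_entropy(s):
    # Character-level Shannon entropy: gibberish DGA domains tend to
    # score higher than dictionary-like legitimate domains.
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def extract_features(domain):
    # Strip the TLD and featurize the registered name only.
    name = domain.split(".")[0]
    vowels = sum(ch in "aeiou" for ch in name)
    return {
        "length": len(name),
        "entropy": shannon_entropy(name),
        "vowel_ratio": vowels / len(name),
    }

features = extract_features("ceroyw4rytrdfvtewcsd.com")
print(features["length"])       # 20
print(features["vowel_ratio"])  # 0.15 -- very few vowels, a DGA tell
```

From here, students can compute the same features over known-good domains, stack the results into a feature matrix, and feed it to whatever classifier the lesson is about.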
Clear Instructional Code Helps Students Understand the Process
This point seems like a no-brainer, and yet I’ve seen WAY too much tutorial and bootcamp code that completely ignores it. My rationale is that your students are learning difficult material, and some may not have a lot of coding experience. When you as an instructor provide code to the students, you should assume that it will be used (and perhaps reused) as a reference in the future. Therefore, when writing code for instructional purposes, it is vital that the students be able to clearly see the flow and process behind what you are trying to do.
Furthermore, the students have a finite amount of mental energy and time. The more mental energy the students have to expend deciphering your code, the less will be available for understanding the data science concepts. This is not meant to disparage the students in any way; it is simply a recognition that their time and energy are limited, and you want to maximize their understanding. Therefore, when writing code for classes, I deliberately avoid or minimize the use of:
- Array slicing
- List comprehensions
- Lambda functions
- Complex regular expressions
These are all solid Python concepts, and important to understand, but when I’m teaching data science to relative newcomers, I want the students focused on data science and not Python minutiae. As an example, let’s say that you have a pandas DataFrame and that you want to extract a few columns from it and one-hot-encode those columns. You could write code like this:
pd.concat( [df, pd.get_dummies( df[df.price > 150].iloc[:, 9:13] )], axis=1 )
Now, there is nothing wrong with this code. (Ok I didn’t actually test this, so there may be something wrong with this code, but let’s assume there isn’t for the time being.) For students, there is a lot going on in this one line and it is difficult to see the process. So, when I write code for instructional purposes, I try to break things out so that students can see and understand the logic and flow. Compare that with the following:
categorical_column_names = ['cat1', 'cat2', 'cat3']

# Filter data to only include rows where the price is greater than 150.
filtered_data = data[ data['price'] > 150 ]

# Remove all other columns besides the categorical columns
categorical_columns = filtered_data[ categorical_column_names ]

# Get dummy columns
dummy_columns = pd.get_dummies( categorical_columns )

# Merge the dummy columns with the other columns
final_data = pd.concat( [filtered_data, dummy_columns], axis=1 )
Now, in real life, you’d never write code that way. It takes up too much space, is too verbose, and I’d guess that the first version is probably more efficient and runs faster. But for a student who has never done this before, the second version is MUCH more understandable and most importantly they will be able to repeat that process using their own data.
Use the Latest Modules
The Python data science stack (Python, NumPy, pandas, Matplotlib, scikit-learn, SciPy) is a collection of well-designed modules. I’m not a regular R user, but I would assume the same is true for the R data science stack. As an instructor, you are doing your students a disservice if you are not regularly updating your lessons to include the latest features. For starters, this means USE PYTHON 3!! Even Guido van Rossum–the creator of Python–says that it’s time to switch! Support for Python 2.7 ends in 2020, and I can’t think of an important module that isn’t available for Python 3, so there really isn’t a reason to be using Python 2.7 anymore. Yes, I know that these are fighting words. Bring it on!
While we’re on the subject, existing modules introduce new functionality ALL THE TIME, and new modules that make doing data science easier are released ALL THE TIME. When I teach, I view it as part of my job to stay abreast of the latest developments so that I can pass this knowledge on to my students. So, if your lessons throw deprecation warnings or don’t work at all, you probably should spend some time updating your lessons. If your lessons aren’t taking advantage of the latest features of scikit-learn or pandas, then update your bloody lessons. If your lessons aren’t incorporating newer modules such as Seaborn or Yellowbrick, then update your bloody lessons.
I know the deprecation warnings don’t seem like a big deal, but to a student who has never done this before, and who is possibly new to coding, seeing a big red bar across the screen means they did something wrong, or that you are ill-prepared for your lesson. Ultimately, it distracts them from whatever concepts you are trying to teach and undermines your credibility as an instructor.
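One lightweight way to catch an out-of-date environment before the warnings start flying is a version check at the top of the first lesson notebook. This is just a sketch; the module names and minimum versions below are placeholders that you would tailor to whatever your lesson was actually tested against:

```python
import importlib

def check_versions(required):
    """Return a list of problems with the installed environment."""
    problems = []
    for module_name, minimum in required.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            problems.append(f"{module_name} is not installed")
            continue
        # Compare only the (major, minor) portion of the version string.
        version = tuple(int(part) for part in module.__version__.split(".")[:2])
        if version < minimum:
            problems.append(
                f"{module_name} {module.__version__} is older than required"
            )
    return problems

# Placeholder requirements -- a real class would list pandas,
# scikit-learn, etc. with the versions the lessons were tested against.
print(check_versions({"json": (1, 0)}))            # json ships with Python: []
print(check_versions({"not_a_real_module": (1, 0)}))
```

Running something like this in the first five minutes surfaces broken environments all at once, instead of one deprecation warning at a time throughout the day.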
Consistent Environment Helps Students
When I first started teaching, I used to send a “pre-work” sheet to the students with detailed instructions about how to set up their computers for the class. I also deliberately tried to keep the class requirements to a minimum so that students could get everything working. What I’ve found over the years is that if you take this approach, you will spend about 20% of your class time troubleshooting students’ environments. If you have complex requirements for your class, you’re looking at 30-50% of your time. I’ve had students take my classes on everything from gaming laptops to Microsoft Surfaces, using every operating system from Windows 7 to Linux, and trust me, you don’t want to take on the role of tech support.
My colleagues and I at GTKCyber came up with a solution to this problem which has served us very well: The Griffon Data Science Virtual Environment. Griffon is a virtual machine specially built for data science. We incorporated it into our instruction two years ago and the student experience has been night and day. Most recently, we taught a two-day data science class at BlackHat this last year and had over 70 students in the class. Of those students, we had precisely zero technical issues that we had to troubleshoot. ZERO. As a result we were able to spend the entire class time teaching data science techniques instead of trying to figure out why NumPy won’t install properly on someone’s machine. Again, maximize understanding, minimize confusion.
Data Science is More Than Machine Learning
That’s right… it is. Both Drew Conway’s Venn diagram and Stephen Kolassa’s updated Venn diagram list other things besides machine learning as components of data science, so I try to include other topics in my classes such as:
- Data visualization
- Data engineering
- Version control with GitHub
- Virtualization with Docker
- Good programming habits
- How to debug code
- Big data tools
- Scaling your project
These topics aren’t strictly data science, but are extremely useful to relative newcomers to the field. I’ve found that many students taking data science classes have backgrounds such as statistics, physics, economics, etc. but may never have really had to dig into computer science.
In conclusion, these are some things which I’ve learned over the years of teaching various data science classes. I hope this inspires some positive discussion about the art of teaching data science.