Let me start out by saying this is purely hypothetical as I’ve never attended a data science bootcamp, but I have taught at them and have reviewed MANY curricula. I’ve also mentored a decent number of bootcamp graduates. In general, what I see is that bootcamps tend to place a lot of emphasis on machine learning, but there’s a lot more to being a successful data scientist. The list below covers five areas that I believe would benefit any aspiring data scientist.
Let’s start with an easy one: SQL. Despite all the trashing it gets, SQL is an enormously useful language to know. For all the hype one hears about NoSQL and other non-relational datastores, SQL is still in widespread use and is not likely to go anywhere anytime soon. Let me tell you why you should learn SQL….
I ran into an issue whilst doing a machine learning project involving some categorical data and thought I would write a brief tutorial about what I learned. I was working on a model which had a considerable amount of categorical data and I ran into several issues which can briefly be summarized as:
Categories that were present in the training set were not always present in the testing data
Categories that were present in the testing set were not always present in the training data
Categories from “real world” data (i.e., data that is neither testing nor training) were not present in the training or testing data
Handling Categorical Data: A Brief Tutorial
In Python, one of the unfortunate things about the scikit-learn/pandas modules is that they don’t really deal with categorical data very well. In the last few years, the Pandas community has introduced a “categorical” datatype. Unfortunately, this datatype does not carry over to scikit-learn, so if you have categorical data, you still have to encode it. Now there are tons of tutorials on the interweb about how to do this, so in the interests of time, I’ll show you the main methods:
get_dummies in Pandas
The most conventional approach, and perhaps the easiest, is pandas’ get_dummies() function, which takes a given column or columns as input and returns a dummy column for each category value. (Full docs here). Thus you can do the following:
df = pd.get_dummies(df)
This replaces each categorical column with a set of dummy columns, one per category value.
As you can see, each category is encoded into a separate column named with the original column name followed by an underscore and the category value. If the row is a member of that category, the column has a value of 1; otherwise the value is zero, hence the name one-hot encoding.
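Since the before-and-after tables from the original post aren’t reproduced here, here is a minimal sketch with a hypothetical ‘color’ column showing what the transformation looks like; depending on your pandas version the dummy values show up as 0/1 or True/False:

import pandas as pd

# toy DataFrame with a single, made-up categorical column
df = pd.DataFrame({'color': ['A', 'B', 'C', 'A']})

# one-hot encode every object/categorical column
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())   # ['color_A', 'color_B', 'color_C']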
In general this works, but the pandas method has the problem of not working as part of a scikit-learn pipeline. For that reason, scikit-learn also has a OneHotEncoder which you can use to do basically the same thing.
Personally, I find scikit’s OneHotEncoder to be a bit more difficult to use, so I didn’t really use it much; however, in my recent project I realized that I actually had to, for a reason I’ll get to in a bit.
Scikit Learn’s OneHotEncoder
Scikit-Learn has the OneHotEncoder() (Docs here), which does more or less the same thing as the pandas version. It does have several limitations and quirks. The first is that the data types of your categories must be the same; i.e., if you have ints and strings, no go. Secondly, scikit’s encoder returns either a numpy array or a sparse matrix as a result. Personally, this was annoying because I wanted to see which categories were useful as features, and in order to do so, you have to reconstruct a DataFrame, which is a headache. In general, the code follows scikit’s usual fit()/transform() pattern. Here is example code of how to use scikit’s one hot encoder:
from sklearn.preprocessing import OneHotEncoder

# create the encoder; categories not seen during fit() are ignored at transform time
encoder = OneHotEncoder(handle_unknown='ignore')
encoded_data = encoder.fit_transform(df[<category_columns>])
There are two advantages that I see scikit’s method having over pandas. The first is that when you fit the scikit encoder, it “remembers” what categories it has seen, and you can set it to ignore unknown categories, whereas pandas has no such recall and will just convert whatever columns it is given into dummy variables. The second is that you can include the OneHotEncoder in a pipeline, which seemed advantageous as well. However, these advantages did not outweigh the difficulty of getting the data back into a DataFrame with column labels. Also, I kept getting errors relating to datatypes and got really frustrated.
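For what it’s worth, you can get labeled columns back out of scikit’s encoder, though it takes extra work. Here is a rough sketch, assuming the DataFrame df from above and a hypothetical list of categorical column names; older scikit-learn versions use get_feature_names() instead of get_feature_names_out():

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cat_cols = ['category']   # hypothetical categorical column names
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(df[cat_cols])

# rebuild a labeled DataFrame from the sparse matrix
encoded_df = pd.DataFrame(encoded.toarray(),
                          columns=encoder.get_feature_names_out(cat_cols),
                          index=df.index)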
The original problem I was having was that you couldn’t guarantee that all categories would be present in both the training and testing sets, so the solution I came up with was to write a function that switches the category value to “OTHER” if the category is not one of the top few. But I didn’t like this approach because it required me to maintain a list of categories, and what happens if that list changes over time? Surely there’s a better way…
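A function along those lines might look like this sketch (the column and category names are made up, and df is assumed from above); it works, but the hard-coded list is exactly the maintenance headache I mentioned:

def collapse_rare_categories(value, keep=('A', 'B', 'C')):
    """Map any category not in the keep list to OTHER."""
    return value if value in keep else 'OTHER'

# apply to a hypothetical categorical column before encoding
df['category'] = df['category'].apply(collapse_rare_categories)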
Feature-Engine: A Better Solution
So what if I told you there was a way to encode categorical data such that you could:
Handle missing categories in either testing, training or real world data
Export the data to a DataFrame for easy analysis of the newly created features
Automatically aggregate categories with few values into an “other” category
Well you can’t, so get over it. Ok, just kidding. I wouldn’t write a whole blog post to have it end like that… or would I? As it turns out, I stumbled upon a really useful module called feature-engine which contains some extremely useful tools for feature engineering that frankly should be included in scikit-learn. This module contains a collection of really useful stuff, but I’m just going to focus on the OneHotCategoricalEncoder. (Docs here)
Let’s say you wanted to encode the data above. Using the OneHotCategoricalEncoder(), you could create an encoder object as shown below:
from feature_engine import categorical_encoders as ce
import pandas as pd

# set up the encoder (the column name here is hypothetical)
encoder = ce.OneHotCategoricalEncoder(
    top_categories=None,        # encode every category; set an integer to keep only the most frequent
    variables=['category'])     # the categorical column(s) to encode

# fit the encoder
encoder.fit(df)
Now, once we have the encoder object, we can encode our data using the fit()/transform() or the fit_transform() methods as shown above. Our toy data set above only has 3 categories, but what if it had 300? Feature-Engine provides an option in the constructor, top_categories, which has the effect of collapsing the categories into a more manageable number. For example, you could set top_categories to 10, which would get you columns for the 10 most frequently occurring categories, with all others collapsed into an “other” column. That’s a nice feature! Well done!
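As a rough sketch of what that looks like, building on the setup above (the column name is still hypothetical):

from feature_engine import categorical_encoders as ce

# keep dummy columns only for the 10 most frequent categories
encoder = ce.OneHotCategoricalEncoder(top_categories=10,
                                      variables=['category'])
encoded_df = encoder.fit_transform(df)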
There’s more. In our previous example, we had three categories when we fit the data: ‘A’, ‘B’ and ‘C’. So what happens if we have another category in the data that did not appear in the training data? Good question, and one that is not explicitly addressed in the documentation. I tried this out, and if you have top_categories set, the encoder will ignore the unknown categories. Whether this is good design is debatable, but it does mean the encoder will work much better in real-world applications.
Since the OneHotCategoricalEncoder uses the fit()/fit_transform()/transform() pattern from scikit-learn, it can be used in a Pipeline object. Finally, and perhaps most importantly to me, the OneHotCategoricalEncoder returns a pandas DataFrame rather than numpy arrays or sparse matrices. The reason this mattered to me was that I wanted to see which categorical columns actually add value to the model and which do not. Doing this from a numpy array without column references is exceedingly difficult.
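Here is a rough sketch of what using it in a pipeline might look like; the model choice, column name, and training data names are just placeholders:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from feature_engine import categorical_encoders as ce

pipe = Pipeline([
    ('encode', ce.OneHotCategoricalEncoder(top_categories=10, variables=['category'])),
    ('model', LogisticRegression())
])

pipe.fit(training_df, training_labels)   # hypothetical training data
predictions = pipe.predict(test_df)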
In conclusion, both scikit-learn’s and pandas’ traditional ways of encoding categorical variables have significant disadvantages, so if you have categorical data in your model, I would strongly recommend taking a look at Feature-Engine’s OneHotCategoricalEncoder.
In the early days of data science, many data scientists came from a math background, and as a result I think the field took on some bad practices, at least from a computer science perspective. In this post, I’m going to introduce ten coding practices that will help you write better code.
You might say that “better” is a subjective term; however, I believe there are concrete measurements that define good vs. bad code:
Good code is easy to understand and thus will take less time to write and most importantly debug
Good code is easy to maintain by other people besides the author
Good code avoids hidden intent errors, i.e., errors where your code executes and appears to do what it’s supposed to do most of the time. Intent errors are the worst because your code will appear to work, but all of a sudden some edge case or something you didn’t think about will come along and your code breaks. These are the most insidious errors.
Good code is efficient.
Ultimately, taking on good coding practices will result in fewer errors, which directly translates to more work (value) being delivered and less effort being spent on fixing and maintaining code. Apparently this is a bigger issue than I realized. When I was writing this article, this other article got posted to my Twitter feed: https://insidebigdata.com/2019/08/13/help-my-data-scientists-cant-write-production-code/. I’ll try not to duplicate the points this author made, but in general, the biggest difference that I see between code most data scientists write and production code is that data scientists tend not to think about reusability.
Happy New Year everyone! I’ve been taking a bit of a blog break after completing Learning Apache Drill, teaching a few classes, and some personal travel, but I’m back now and have a lot planned for 2019! One of my long-standing projects is to get Apache Drill to work with various open source visualization and data flow tools. I was at the Strata conference in San Jose in 2016, where I attended Maxime Beauchemin’s talk (slides available here) in which he presented the tool then known as Caravel, and I was impressed, really really impressed. I knew that my mission after the conference would be to get this tool to work with Drill. A little over two years later, I can finally declare victory. Caravel went through a lot of evolution; it is now an Apache Incubating project and the name has changed to Apache (Incubating) Superset.
UPDATE: The changes to Superset have been merged, so you can just install Superset as described on their website.
Last Friday, the Apache Drill project released Drill version 1.14, which has a few significant features (plus a few that are really cool!) that will enable you to use Drill for analyzing security data. Drill 1.14 introduced:
A logRegex reader which enables Drill to read anything you can describe with a Regex
An image metadata reader, which enables you to query images
A suite of GIS functionality
A collection of phonetic and string distance functions which can be used for approximate string matching.
This suite of functionality really expands what is possible with Drill and makes analysis of many different types of data possible. This brief tutorial will walk you through how to configure Apache Drill to query log files, or really any file that can be described with a regex.
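To give you a sense of where this is headed, the logRegex reader is configured through a format plugin entry in Drill’s storage configuration. The sketch below is my recollection of its general shape from the Drill documentation; treat the field names and the sample regex as illustrative, not definitive:

"log": {
  "type": "logRegex",
  "regex": "(\\d{4})-(\\d{2})-(\\d{2})\\s+(.*)",
  "extension": "log",
  "maxErrors": 10,
  "schema": [
    {"fieldName": "year"},
    {"fieldName": "month"},
    {"fieldName": "day"},
    {"fieldName": "message"}
  ]
}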
One of the big issues I’ve encountered in my teaching is explaining how to evaluate the performance of machine learning models. Simply put, it is relatively trivial to generate the various performance metrics (accuracy, precision, recall, etc.), but if you wanted to visualize any of these metrics, there wasn’t really an easy way to do that. Until now….
Recently, I learned of a new Python library called YellowBrick, developed by Ben Bengfort at District Data Labs, that implements many different visualizations that are useful for building machine learning models and assessing their performance. Many visualization libraries require you to write a lot of “boilerplate” code, i.e., generic and repetitive code. What impressed me about YellowBrick is that it largely follows the scikit-learn API, so if you are a regular user of scikit-learn, you’ll have no problem incorporating YellowBrick into your workflow. YellowBrick appears to be relatively new, so there are definitely still some kinks to be worked out, but overall, this is a really impressive library.
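As a quick illustration of how closely it mirrors scikit-learn, a classification report visualizer looks roughly like the sketch below; the model and data names are placeholders, and older YellowBrick releases use poof() instead of show():

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ClassificationReport

train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

# wrap any scikit-learn estimator in a visualizer
visualizer = ClassificationReport(LogisticRegression())
visualizer.fit(train_features, train_labels)     # fit like a normal estimator
visualizer.score(test_features, test_labels)     # compute per-class precision/recall/f1
visualizer.show()                                # render the plot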
IP addresses can be one of the most useful data artifacts in any analysis, but over the years I’ve seen a lot of people miss out on key attributes of IP addresses to facilitate analysis.
What is an IP Address?
First of all, an IP address is a numerical label assigned to a network interface that uses the Internet Protocol for communications. Typically they are written in dotted decimal notation like this: 188.8.131.52. There are two versions of IP addresses in use today, IPv4 and IPv6. The address shown above is a v4 address, and I’m going to write the rest of this article about v4 addresses, but virtually everything applies to v6 addresses as well. The difference between v4 and v6 isn’t just the formatting. IP addresses have to be unique within a given network, and the reason v6 came into being was that we were rapidly running out of IP addresses! In networking protocols, IPv4 addresses are 32-bit unsigned integers with a maximum value of approximately 4 billion. IPv6 increased that from 32 bits to 128 bits, resulting in 2^128 possible IP addresses.
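A quick sketch with Python’s built-in ipaddress module shows the numeric side of this:

import ipaddress

# convert dotted decimal notation to the underlying unsigned integer and back
addr = ipaddress.ip_address('188.8.131.52')
as_int = int(addr)                     # 3154674484
back = ipaddress.ip_address(as_int)    # IPv4Address('188.8.131.52')

# the same module handles IPv6 addresses transparently
v6 = ipaddress.ip_address('2001:db8::1')
print(as_int, back, v6.version)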
What do you do with IP Addresses?
If you are doing cyber security analysis, you will likely be looking at log files or perhaps entries in a database containing the IP address in the dotted decimal notation. It is very common to count which IPs are querying a given server, and what these hosts are doing, etc.
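For example, a minimal sketch of tallying source IPs from a log file (the file name and log format are hypothetical):

import re
from collections import Counter

ip_pattern = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

ip_counts = Counter()
with open('access.log') as log_file:    # hypothetical log file
    for line in log_file:
        ip_counts.update(ip_pattern.findall(line))

# the ten most active source addresses
for ip, hits in ip_counts.most_common(10):
    print(ip, hits)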
This post is a continuation of my previous tutorial about debugging code in which I discuss how preventing bugs is really the best way of debugging. In this tutorial, we’re going to cover more debugging techniques and how to avoid bugs.
Types of Errors:
Ok, you’re testing frequently and using good coding practices, but you’ve STILL got bugs. What next? Let’s talk about what kind of error you are encountering, because that will determine the response. Errors can be reduced to three basic categories: syntax errors, runtime errors, and the most insidious of all, intent errors. Let’s look at syntax errors first.
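To make the three categories concrete, here is a small illustrative sketch:

# Syntax error: the interpreter refuses to run this at all
# print('hello'          # missing closing parenthesis

# Runtime error: legal syntax that blows up while executing
# result = 10 / 0        # ZeroDivisionError

# Intent error: runs without complaint but gives the wrong answer at the
# boundary -- an order of exactly 100 should get the discount but does not
def apply_discount(total):
    if total > 100:      # should be >=
        return total * 0.9
    return total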
Debugging code is a large part of actually writing code, yet unless you have a computer science background, you probably have never been exposed to a methodology for debugging code. In this tutorial, I’m going to show you my basic method for debugging your code so that you don’t want to tear your hair out.
In Programming Perl, Larry Wall, the author of the Perl programming language, said that the attributes of a great programmer are Laziness, Impatience and Hubris:
Laziness: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful, and document what you wrote so you don’t have to answer so many questions about it. Hence, the first great virtue of a programmer. (p.609)
Impatience: The anger you feel when the computer is being lazy. This makes you write programs that don’t just react to your needs, but actually anticipate them. Or at least pretend to. Hence, the second great virtue of a programmer. See also laziness and hubris. (p.608)
Hubris: Excessive pride, the sort of thing Zeus zaps you for. Also the quality that makes you write (and maintain) programs that other people won’t want to say bad things about. Hence, the third great virtue of a programmer. See also laziness and impatience. (p.607)
These attributes also apply to how to write good code so that you don’t have to spend hours and hours debugging code.
The Best Way to Avoid Errors is Not to Make Them
Ok… so that seems obvious, but really, I’m asking another question, and that is: “How can you write code that decreases your likelihood of making errors?” I do have an answer for that. The first thing to remember is that bugs are easy to find when they are small. To find bugs when they are small, write code in small chunks and test your code frequently. If you are writing a large program, write a few lines and test what you have written to make sure it is doing what you think it is supposed to do. Test often. If you are writing a script that is 100 lines, it is MUCH easier to find errors if you test your code every 10 lines rather than writing the whole thing and testing at the end. The better you get, the less frequently you will need to test, but still test your code frequently.
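A trivial, hypothetical illustration of the habit: write a small piece, then check it immediately before building anything on top of it.

def parse_price(price_text):
    """Turn a string like '$1,234.56' into a float."""
    return float(price_text.replace('$', '').replace(',', ''))

# test the chunk right away, while the bug surface is still tiny
assert parse_price('$1,234.56') == 1234.56
assert parse_price('99') == 99.0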
Good Coding Practices Will Help You Prevent Errors
This probably also seems obvious, but I’ve seen (and written) a lot of code that leaves a lot to be desired in the way of good practices. Good coding practices mean that your code should be readable, and that someone who has never seen your code before should be able to figure out what it is supposed to do. Now I know a lot of people have the attitude that since they are the only one working on a particular piece of code, they don’t need to put in comments. WRONG, WRONG, WRONG! In response, I would ask: if you haven’t worked on this code in six months, would you remember what it did? You don’t need to go overboard, but you should include enough comments so that you’ll remember the code’s purpose.
Here are some other suggestions:
Adopt a coding standard and stick to it: It doesn’t matter which one you use, but pick one and stick to it. That way, you will notice when things aren’t correct. Whatever you do, don’t mix conventions; i.e., don’t have column_total, columnTotal and ColumnTotal as variables in the same script.
Use descriptive variable names: One of my pet peeves about a lot of machine learning code is that it uses X and Y as variable names. Don’t do that. This isn’t calculus class. Use descriptive variable names such as test_data or target_sales, and please don’t use X, Y, or even worse, i, I, l and L as variable names. (See the short sketch after this list.)
Put comments in your code: Just do it.
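As a trivial before-and-after sketch of what these suggestions look like in practice (the data is hypothetical):

# Hard to follow: cryptic names, no comments
# x = d[d['q'] > 0]
# r = x['p'].mean()

# Easier to follow: descriptive names and a comment stating intent
completed_orders = orders[orders['quantity'] > 0]
# average price across completed orders only
average_price = completed_orders['price'].mean()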
Plan your program BEFORE you write it
I learned this lesson the hard way: if you want to spend many hours writing code that doesn’t work, then when faced with a tough problem, just dive right in and start coding. If you want to avoid that, get a piece of paper and a pen (or whatever system you like) and:
Break the problem down into the smallest, most atomic steps you can think of
Write pseudo-code that implements these steps.
Look for extant code that you can reuse
Once you’ve found reusable code, and you have a game plan of pseudo code, now you can begin writing your code. When you start writing, check every step against your pseudo code to make sure that your code is doing what you expect it to do.
Don’t Re-invent the Wheel
Another way to save yourself a lot of time and frustration is to reuse proven code to the greatest extent possible. For example, Python has a myriad of libraries available on PyPI and elsewhere which can really save you a lot of time. It is another huge pet peeve of mine to see people writing custom code for things which are publicly available. This means that before you start writing code, you should do some research as to what components are out there and available.
After all, if I were to ask you if you would rather:
Use prewritten, pretested and proven code to build your program OR
Write your own code that is unproven, untested and possibly buggy,
the logical thing to do would of course be to do the first.
Great programmers never sit down at the keyboard and just start banging out code without having a game plan and without understanding the problem they are trying to solve. Hopefully by now you see that the first step in writing good code that you won’t have to debug is to plan out what you are trying to do, reuse extant code, and test frequently. In the next installment, I will discuss the different types of errors and go through strategies for fixing them.
I hope everyone is enjoying Thanksgiving! This week, there were several new developments in terms of data science tools which I would like to highlight. I am a big believer in staying up to date on what new tools are being developed, because you can make yourself much more efficient by better using the available tools. Both tools highlighted here represent significant potential for getting data more efficiently and presenting data more effectively.
In my opinion, one of the things Drill did very poorly in previous versions was CSV parsing. In prior versions, when you used Drill to query a CSV file, Drill would store each row in an array called columns, and if you were querying a CSV file in Drill you had to index into the columns array to assign each column a name:
SELECT columns[0] AS firstName, columns[1] AS lastName
This clearly was a less than optimal solution and resulted in very convoluted queries. However, with the advent of version 1.3, Drill can now be configured to derive the column names from the original CSV file (a sample configuration is sketched below). You can still configure Drill to operate in the old manner, but I can’t imagine you’d want to, and you can write queries like this for CSV files:
SELECT firstName, lastName
Drill will still work with data that has no headers; it simply treats this kind of data as it did in the past.
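For reference, the key piece of the change is in the text (CSV) format configuration of the dfs storage plugin; roughly speaking, it looks like the sketch below, with extractHeader doing the work (the surrounding options may differ slightly in your install):

"csv": {
  "type": "text",
  "extensions": ["csv"],
  "delimiter": ",",
  "extractHeader": true
}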
The HTTPD log parser still hasn’t made it into a stable version, but I’m following the conversation between the developers closely and it looks like it will be included in version 1.4.
Plot.ly Now Open Source
Until now, if you wanted interactive visualizations, you had to pay a lot of money for BI tools such as Tableau or RShiny.
It is true that several easy-to-use libraries such as Bokeh, Seaborn, Vincent and a few others are getting a lot better. Also, Apache Zeppelin is a promising notebook-like tool which enables quick, interactive data visualization, but I digress…
What is Plot.ly and Why Should I Care?
In any event, just as a quick demonstration, the code below generates a very nice interactive stacked area chart. (The code is from a Plot.ly tutorial and available here.)
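That tutorial code isn’t reproduced here, but a minimal sketch of a stacked area chart with the plotly Python library (using made-up data rather than the tutorial’s) looks something like this:

import plotly.graph_objects as go

# made-up monthly values for three series
months = ['Jan', 'Feb', 'Mar', 'Apr']

fig = go.Figure()
fig.add_trace(go.Scatter(x=months, y=[10, 15, 13, 17], name='Product A', stackgroup='one'))
fig.add_trace(go.Scatter(x=months, y=[5, 9, 11, 14], name='Product B', stackgroup='one'))
fig.add_trace(go.Scatter(x=months, y=[2, 4, 6, 8], name='Product C', stackgroup='one'))
fig.show()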