
Category: Coding Tips

Tutorial: Visualizing Machine Learning Models

One of the big issues I’ve encountered in my teaching is explaining how to evaluate the performance of machine learning models.  Simply put, it is relatively trivial to generate the various performance metrics (accuracy, precision, recall, etc.), but if you wanted to visualize any of these metrics, there wasn’t really an easy way to do that.  Until now…

[Figure: decision boundary visualization]

Recently, I learned of a new Python library called YellowBrick, developed by Ben Bengfort at District Data Labs, that implements many different visualizations useful for building machine learning models and assessing their performance.  Many visualization libraries require you to write a lot of “boilerplate” code (i.e., generic and repetitive code).  What impressed me about YellowBrick is that it largely follows the scikit-learn API, so if you are a regular user of scikit-learn, you’ll have no problem incorporating YellowBrick into your workflow.  YellowBrick appears to be relatively new, so there are definitely some kinks to be worked out, but overall, this is a really impressive library.
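As a minimal sketch of what that looks like in practice (assuming a feature matrix X and labels y are already loaded; the estimator choice here is arbitrary), a YellowBrick visualizer wraps a scikit-learn estimator and adds a rendering step:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ClassificationReport

# Split the data exactly as you would with plain scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Wrap any scikit-learn classifier in a visualizer
visualizer = ClassificationReport(LogisticRegression())
visualizer.fit(X_train, y_train)   # fit() mirrors the scikit-learn API
visualizer.score(X_test, y_test)   # score() computes the metrics to draw
visualizer.poof()                  # render the plot (show() in newer versions)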


Doing More with IP Addresses

IP addresses can be one of the most useful data artifacts in any analysis, but over the years I’ve seen a lot of people miss out on key attributes of IP addresses that can facilitate analysis.

What is an IP Address?

First of all, an IP address is a numerical label assigned to a network interface that uses the Internet Protocol for communications.  Typically they are written in dotted decimal notation like this: 128.26.45.188.  There are two versions of IP addresses in use today, IPv4 and IPv6.  The address shown above is a v4 address, and I’m going to write the rest of this article about v4 addresses, but virtually everything applies to v6 addresses as well.  The difference between v4 and v6 isn’t just the formatting.  IP addresses have to be unique within a given network, and the reason v6 came into being was that we were rapidly running out of IP addresses!  In networking protocols, IPv4 addresses are 32-bit unsigned integers, which allows for approximately 4.3 billion unique addresses.  IPv6 increased that from 32 bits to 128 bits, resulting in 2^128 possible IP addresses.
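Because a v4 address is really just a 32-bit integer, converting between the two representations is trivial, and the integer form makes sorting and range checks easy.  A quick sketch using Python 3’s standard-library ipaddress module:

import ipaddress

addr = ipaddress.ip_address('128.26.45.188')

# Dotted decimal is just a human-friendly rendering of a 32-bit integer
as_int = int(addr)                    # 2149199292
back = ipaddress.ip_address(as_int)   # IPv4Address('128.26.45.188')

# The integer form also makes network membership checks simple
print(addr in ipaddress.ip_network('128.26.0.0/16'))   # True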

What do you do with IP Addresses?

If you are doing cyber security analysis, you will likely be looking at log files or perhaps entries in a database containing the IP address in the dotted decimal notation.  It is very common to count which IPs are querying a given server, and what these hosts are doing, etc.
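As a sketch of that kind of counting (access.log is a hypothetical file here, and I’m assuming the source IP is the first whitespace-separated field, as in common web-server log formats):

from collections import Counter

counts = Counter()
with open('access.log') as log:
    for line in log:
        src_ip = line.split()[0]   # first field = client IP in common log format
        counts[src_ip] += 1

# The ten most active source IPs
for ip, hits in counts.most_common(10):
    print(ip, hits)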


Tips for Debugging Code without F-Bombs – Part 2

This post is a continuation of my previous tutorial about debugging code in which I discuss how preventing bugs is really the best way of debugging.  In this tutorial, we’re going to cover more debugging techniques and how to avoid bugs.

Types of Errors

Ok, you’re testing frequently and using good coding practices, but you’ve STILL got bugs.  What next?  Let’s talk about what kind of error you are encountering, because that will determine the response.  Errors can be reduced to three basic categories: syntax errors, runtime errors, and, most insidious of all, intent errors.  Let’s look at syntax errors first.
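To make those categories concrete, here is a contrived Python sketch of all three:

# Syntax error: the interpreter refuses to run the file at all
# print('hello'          <-- missing closing parenthesis

# Runtime error: syntactically fine, but blows up when executed
totals = [1, 2, 3]
print(totals[10])        # IndexError: list index out of range

# Intent error: runs without complaint, but the answer is wrong
def average(values):
    return sum(values) / (len(values) + 1)   # off-by-one: should be len(values)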


Tips for Debugging Code without F-Bombs – Part 1

Debugging code is a large part of actually writing code, yet unless you have a computer science background, you have probably never been exposed to a methodology for debugging.  In this tutorial, I’m going to show you my basic method for debugging your code so that you won’t end up wanting to tear your hair out.

In Programming Perl, Larry Wall, the creator of the Perl programming language, said that the attributes of a great programmer are Laziness, Impatience and Hubris:

  • Laziness:  The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful, and document what you wrote so you don’t have to answer so many questions about it. Hence, the first great virtue of a programmer.  (p.609)
  • Impatience:  The anger you feel when the computer is being lazy. This makes you write programs that don’t just react to your needs, but actually anticipate them. Or at least pretend to. Hence, the second great virtue of a programmer. See also laziness and hubris. (p.608)
  • Hubris:  Excessive pride, the sort of thing Zeus zaps you for. Also the quality that makes you write (and maintain) programs that other people won’t want to say bad things about. Hence, the third great virtue of a programmer. See also laziness and impatience. (p.607)

These attributes also apply to how to write good code so that you don’t have to spend hours and hours debugging code.

The Best Way to Avoid Errors is Not to Make Them

Ok… so that seems obvious, but really, I’m asking another question: “How can you write code that decreases your likelihood of making errors?”  I do have an answer for that.  The first thing to remember is that bugs are easy to find when they are small.  To find bugs when they are small, write code in small chunks and test your code frequently.  If you are writing a large program, write a few lines, then test what you have written to make sure it is doing what you think it is supposed to do.  Test often.  If you are writing a script that is 100 lines, it is MUCH easier to find errors if you test your code every 10 lines rather than writing the whole thing and testing at the end.  The better you get, the less frequently you will need to test, but still test your code frequently.
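A cheap way to do that in Python is to drop assert statements on each small chunk as soon as you write it (parse_price is a made-up helper for illustration):

def parse_price(text):
    """Convert a string like '$1,234.50' to a float."""
    cleaned = text.strip().lstrip('$').replace(',', '')
    return float(cleaned)

# Test the chunk you just wrote before moving on
assert parse_price('$1,234.50') == 1234.50
assert parse_price('99') == 99.0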

Good Coding Practices Will Help You Prevent Errors

This probably also seems obvious, but I’ve seen (and written) a lot of code that leaves a lot to be desired in the way of good practices.  Good coding practices mean that your code should be readable, and that someone who has never seen your code before should be able to figure out what it is supposed to do.  Now, I know a lot of people have the attitude that since they are the only one working on a particular piece of code, they don’t need to put in comments.  WRONG WRONG WRONG.  In response, I would ask: if you haven’t worked on this code in six months, would you remember what it did?  You don’t need to go overboard, but you should include enough comments so that you’ll remember the code’s purpose.

Here are some other suggestions:

  1. Adopt a coding standard and stick to it:  It doesn’t matter which one you use, but pick one and stick to it.  That way, you will notice when things aren’t correct.  Whatever you do, don’t mix conventions, i.e., don’t have column_total, columnTotal and ColumnTotal as variables in the same script.
  2. Use descriptive variable names:  One of my pet peeves about a lot of machine learning code is the use of X and Y as variable names.  Don’t do that.  This isn’t calculus class.  Use descriptive variable names such as test_data or target_sales, and please don’t use X, Y, or even worse, i, I, l and L as variable names.
  3. Put comments in your code:  Just do it.
  4. Put one operation per line:  I know that especially in Python and JavaScript, it is fashionable (and “Pythonic”) to cram as many operations onto one line as possible via method chaining.  I personally think in series of steps, and it is easier to see the logic (and hence any mistakes) if you have one action per line.  (See the sketch after this list.)
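Here is a small before-and-after sketch that pulls those suggestions together (sales.csv and the column layout are hypothetical):

# Hard to follow: cryptic names, no comments, everything jammed onto one line
X = [r for r in open('d.csv')][1:]; l = sum(float(x.split(',')[2]) for x in X) / len(X)

# Easier to follow: descriptive names, one operation per line, commented
with open('sales.csv') as csv_file:
    data_rows = csv_file.readlines()[1:]   # skip the header row

# Average the sales column (third field) across all rows
sales_values = [float(row.split(',')[2]) for row in data_rows]
average_sales = sum(sales_values) / len(sales_values)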

Plan your program BEFORE you write it

I learned this lesson the hard way: if you want to spend many hours writing code that doesn’t work, then when faced with a tough problem, just dive right in and start coding.  If you want to avoid that, get a piece of paper and a pen (or whatever system you like) and:

  1. Break the problem down into the smallest, most atomic steps you can think of.
  2. Write pseudo-code that implements these steps.
  3. Look for extant code that you can reuse.

Once you’ve found reusable code, and you have a game plan of pseudo-code, now you can begin writing your code.  When you start writing, check every step against your pseudo-code to make sure that your code is doing what you expect it to do.
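In practice, the pseudo-code can live right in your file as comments that you fill in one step at a time.  A sketch with a made-up word-counting task:

from collections import Counter

def top_words(path, n):
    # 1. Read the file
    # 2. Split the text into words
    # 3. Count how often each word appears
    # 4. Return the n most common words
    with open(path) as f:
        words = f.read().lower().split()
    return Counter(words).most_common(n)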

Don’t Re-invent the Wheel

Another way to save yourself a lot of time and frustration is to reuse proven code to the greatest extent possible.  For example, Python has a myriad of libraries available on PyPI and elsewhere which can really save you a lot of time.  It is another huge pet peeve of mine to see people writing custom code for things which are publicly available.  This means that before you start writing code, you should do some research into what components are out there and available.

After all, if I were to ask you whether you would rather:

  1. Use prewritten, pretested and proven code to build your program, OR
  2. Write your own code that is unproven, untested and possibly buggy,

the logical thing to do would of course be the first.

In Conclusion

Great programmers never sit down at the keyboard and just start banging out code without having a game plan and without understanding the problem they are trying to solve.  Hopefully by now you see that the first step in writing good code that you won’t have to debug is to plan out what you are trying to do, reuse extant code, and test frequently.  In the next installment, I will discuss the different types of errors and go through strategies for fixing them.


A Few Exciting Tool Announcements!

I hope everyone is enjoying Thanksgiving!  This week, there were several new developments in terms of data science tools which I would like to highlight.  I am a big believer in staying up to date with the new tools being developed, because you can make yourself much more efficient by better using the available tools.  Both tools highlighted here represent significant potential in terms of being able to get data more efficiently and to present it more effectively.

Apache Drill Releases Version 1.3

On 23 November, the Drill team released Drill version 1.3.  The complete release notes are available here, but for me, the biggest improvement is the text file header parsing.

In my opinion, one of the things Drill did very poorly in previous versions was CSV parsing.  In prior versions, when you used Drill to query a CSV file, Drill would store each row in an array called columns, and you had to reference that array and assign each column a name yourself:

SELECT columns[0] AS firstName, columns[1] AS lastName
FROM cp.`somefile.csv`

This was clearly a less-than-optimal solution and resulted in very convoluted queries.  However, with the advent of version 1.3, Drill can now be configured to derive the column names from the original CSV file.  You can still configure Drill to operate in the old manner, but I can’t imagine you’d want to, and you can now write queries like this for CSV files:

SELECT firstName, lastName
FROM cp.`somefile.csv`
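If you’re wondering where that switch lives, header extraction is controlled in the text format section of the storage plugin configuration.  A rough sketch of the relevant fragment (the extractHeader option name is my understanding of the new setting; check the release notes for your version):

"formats": {
  "csv": {
    "type": "text",
    "extractHeader": true,
    "delimiter": ","
  }
}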

Drill will still work with data that has no headers; it simply treats that data as it did in the past.

The HTTPD log parser still hasn’t made it into a stable version, but I’m following the conversation between the developers closely and it looks like it will be included in version 1.4.

Plot.ly Now Open Source

If you are into data visualization (and what data scientist shouldn’t be?), you’ll be pleased to know that as of a few days ago, the JavaScript library Plot.ly is completely free and open source.  I teach a lot of data science classes, and data visualization is naturally a subject we feature in our training.  The unfortunate reality I have encountered is that if you want to create really nice visualizations quickly, you either:

  1. Have to pay a lot of money for BI tools such as Tableau or RShiny, OR
  2. Learn to code in JavaScript and create them using D3.

It is true that several easy-to-use libraries such as Bokeh, Seaborn, Vincent and a few others are getting a lot better.  Also, Apache Zeppelin is a promising notebook-like tool which enables quick, interactive data visualization, but I digress…

What is Plot.ly and Why Should I Care?

Plot.ly is a JavaScript framework for easily making beautiful interactive visualizations; however, you don’t actually have to know JavaScript to use it.  While Plot.ly is a JavaScript library, it also has APIs for Python/Jupyter Notebooks, R, Excel and a few others.  Most of this was already open source, but until last week, the JavaScript library that actually generated the visualizations was closed.  No longer!
In any event, just as a quick demonstration, the code below generates a very nice interactive stacked area chart.  (The code is from a Plot.ly tutorial and available here.)

 
import cufflinks as cf              # cufflinks binds Plot.ly's iplot() to pandas DataFrames
import pandas as pd
import pandas.io.data as web        # remote data access (part of pandas at the time of writing)
from datetime import datetime

# Pull 2008 opening prices for three tickers from Yahoo! Finance
start = datetime(2008, 1, 1)
end = datetime(2008, 11, 28)

df_gis = web.DataReader("GIS", 'yahoo', start, end)
df_fdo = web.DataReader("FDO", 'yahoo', start, end)
df_sp = web.DataReader("GSPC", 'yahoo', start, end)
df = pd.DataFrame({'General Mills': df_gis.Open, 'Family Dollar Stores': df_fdo.Open, 'S&P 500': df_sp.Open})

df.head()
# iplot() renders the DataFrame as an interactive Plot.ly chart
df.iplot(kind='line', fill=True,
    yTitle='Open Price', title='Top Recession Stocks',
    filename='cufflinks/stock data', world_readable=True)
 

Here is the output for that code:
[Figure: interactive stacked area chart of the three stocks’ 2008 opening prices]

There is a very thorough tutorial about Plot.ly available here.   Installing Plot.ly is very easy as well.  All you have to do is:

pip install plotly
pip install cufflinks    # also needed for the df.iplot() demo above

That’s it!  Enjoy!


Tutorial: Using Apache Zeppelin with MySQL

I’ve been playing with Apache Zeppelin for a little while now and have been really impressed.  If you aren’t familiar with Zeppelin, it is a tool for creating interactive notebooks to visualize data.  With the latest version, Zeppelin includes an interpreter for PostgreSQL, and I discovered that you can use this interpreter to connect Zeppelin to a MySQL server and quickly visualize your data.
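The trick, in short, is to point the PostgreSQL interpreter at MySQL via JDBC.  A rough sketch of the interpreter properties (the property names here are from the psql interpreter as I recall them, and the host, port, and credentials are placeholders; you’ll also need the MySQL JDBC driver jar on Zeppelin’s classpath):

postgresql.driver.name   com.mysql.jdbc.Driver
postgresql.url           jdbc:mysql://localhost:3306/your_database
postgresql.user          your_user
postgresql.password      your_password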
