
Ten Good Coding Practices for Data Scientists

In the early days of data science, many data scientists came with a math background and as a result I think the field took on some bad practices, at least from a computer science perspective. In this post, I’m going to introduce ten coding practices that will help you write better code.

You might say that better is a subjective term. However, I believe that there are concrete criteria for telling good code from bad code:

  1. Good code is easy to understand and thus takes less time to write and, most importantly, to debug.
  2. Good code is easy to maintain by people other than the author.
  3. Writing code well will avoid hidden intent errors, i.e. errors where your code executes and does what it’s supposed to do most of the time. Intent errors are the worst because your code will appear to work, but then some edge case or something you didn’t think about comes along and your code breaks. These are the most insidious errors.
  4. Good code is efficient.

Ultimately, taking on good coding practices will result in fewer errors, which directly translates to more work (value) being delivered and less effort being spent on fixing and maintaining code. Apparently this is a bigger issue than I realized. When I was writing this article, this other article got posted to my Twitter feed: https://insidebigdata.com/2019/08/13/help-my-data-scientists-cant-write-production-code/. I’ll try not to duplicate the points this author made, but in general, the biggest difference that I see between code most data scientists write and production code is that data scientists tend not to think about reusability.

In general (stereotyping here) data scientists do their experiments using R or Python and usually in some sort of notebook. While the notebooks are great for experimentation, they are not necessarily great for production-quality code writing.

Be Lazy, Write Good Code

In “Programming Perl”, 2nd Edition, Larry Wall, the author of the Perl programming language, said that there are three great virtues of a programmer: Laziness, Impatience and Hubris.

  1. Laziness: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful and document what you wrote so you don’t have to answer so many questions about it.
  2. Impatience: The anger you feel when the computer is being lazy. This makes you write programs that don’t just react to your needs, but actually anticipate them. Or at least pretend to.
  3. Hubris: The quality that makes you write (and maintain) programs that other people won’t want to say bad things about.

I fully agree with the intent of what Larry was getting at. Great programmers get the most value out of computers by reducing overall energy expenditure and ultimately maximizing the value generated. Hopefully, my ten coding practices will help you do that.

So without further ado, my good coding practices are:

  1. Use Descriptive Variable Names
  2. Make good use of functions to organize code
  3. Use pre-existing libraries whenever possible
  4. Avoid long method chains (One operation per line if possible)
  5. Avoid highly nuanced and complicated language constructs when simple ones will do the trick
  6. Use an IDE like PyCharm or RStudio for production code… (Sorry Jupyter, I love you, but not for production)
  7. Don’t forget to include doc strings and other language specific documentation
  8. Write unit tests as you go
  9. Adopt and adhere to a style standard like PEP8 or equivalent.
  10. Learn to use and love loggers

These rules are somewhat aspirational, and there are obviously circumstances where you won’t follow them. Still, I would suggest that if you do, you will write better code and spend less time doing so.

Use Descriptive Variable Names (not x or y)

One of the hallmarks of well written code is that someone who is not familiar with the code can easily follow the sequence of events. Consider the following code:

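from sklearn.tree import DecisionTreeClassifier

# `data` is assumed to be a pandas DataFrame that has already been loaded
feature_names = ['num_cylinders', 'length', 'weight']
target_name = ['mpg']

features = data[feature_names]
target = data[target_name]

clf = DecisionTreeClassifier()
clf.fit(features, target)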

This code uses clear and descriptive variable names so that someone who knows the language, even someone who isn’t familiar with machine learning, can follow the flow of what is happening. In computer science speak, this is known as self-documenting code: the code itself is written clearly enough that little or no additional documentation is needed.

Now, contrast this with the following code:

X = data[0:3]
y = data[:4]

clf = DecisionTreeClassifier()
clf.fit(X, y)

Hopefully, you can see how much more descriptive the first example is than the second. In the second, I have literally no idea what problem is being solved, whereas in the first the code is easy to follow. Which leads me to my next point: using functions to organize code.

Use Functions to Organize Your Code

Often for data science code, we’ll have scripts that basically do this:

# get data
df0 = pd.read_csv('somefile.csv')
df1 = pd.read_json('some_other_file.json')
df2 = pd.read_sql('SELECT...')
df = pd.merge([df0, df1, df2], ..)

# clean data

# split into train/test sets

# train model

# eval performance

In my last project, the data gathering and cleaning took quite a few lines of code. What if instead of doing that, you had code that looked like this:

df = getData()
clean_df = cleanData(df)

Isn’t that nicer? In all seriousness, organizing your code this way will make it easier to maintain and it also will enable you to see the steps in your logic. And while we’re on the subject of functions…
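To make that concrete, here is a minimal sketch of a script organized into functions. The function names (get_data, clean_data, split_data) and the sample records are hypothetical, and plain lists of dictionaries stand in for DataFrames so the example runs without any extra packages.

```python
# Hypothetical pipeline: each step is a small function with a single job.
def get_data():
    """Return raw records; a real version would read files or a database."""
    return [
        {"model": "a", "weight": "2500", "mpg": "30.1"},
        {"model": "b", "weight": "", "mpg": "22.4"},
        {"model": "c", "weight": "3100", "mpg": "18.0"},
    ]


def clean_data(records):
    """Drop rows with a missing weight and convert numeric fields to float."""
    cleaned = []
    for record in records:
        if record["weight"] == "":
            continue
        cleaned.append(
            {
                "model": record["model"],
                "weight": float(record["weight"]),
                "mpg": float(record["mpg"]),
            }
        )
    return cleaned


def split_data(records, test_fraction=0.5):
    """Split records into train and test lists, keeping their order."""
    test_size = int(len(records) * test_fraction)
    split_index = len(records) - test_size
    return records[:split_index], records[split_index:]


# The top level of the script now reads like the outline of the analysis.
raw_records = get_data()
clean_records = clean_data(raw_records)
train_records, test_records = split_data(clean_records)
```

Each function can now be tested and reused on its own, and the last three lines show the steps in the logic at a glance.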

Use Pre-existing Modules Whenever Practical and Possible

One of the best things about Python is the sheer number of modules that are available for it. Using these modules is a major time saver, not just in terms of code that you don’t have to write but also, and most importantly, in terms of code that you don’t have to debug.

I worked on a project a while ago which involved breaking up URLs into pieces. This is a seemingly simple task; however, there are a lot of edge cases. You could pretty easily write code that does this, but what you’ll inevitably find is that there are some cases which will break your code. (You would discover this with unit tests, but that’s another point…) In any event, it turns out that there is a module called tldextract that does this for you… The point is that you can spend your time debugging trivialities in your code OR you can reuse someone else’s module and spend your time on the actual problem you are trying to solve.
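As a small illustration of the same idea, the sketch below leans on Python’s standard library module urllib.parse rather than hand-written string splitting. This is a stand-in chosen so the example runs without installing anything; it does not separate the registered domain from the public suffix, which is the part tldextract handles. The URL is made up.

```python
from urllib.parse import parse_qs, urlsplit

# A made-up URL with a user, a port, a query string, and a fragment.
url = "https://user@www.example.co.uk:8443/path/to/page?q=data+science&page=2#top"

# urlsplit handles the edge cases (credentials, ports, fragments) for us.
parts = urlsplit(url)
scheme = parts.scheme
host = parts.hostname
port = parts.port
path = parts.path

# parse_qs decodes the query string into a dict of lists.
query = parse_qs(parts.query)
```

Writing this by hand with split() calls would work for simple URLs and then quietly fail on the first one with a port or embedded credentials.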

Avoid Long Method Chains

Now, this one is not a universal one, but many methods can be chained together as shown in the example below:

x = df.method1().method2().method3().method4().go()

You see a lot of this in JavaScript, where it is much more acceptable. I am not a fan of this because it can be difficult to debug and difficult to understand exactly what is happening. Consider three functions: foo1 accepts an integer and returns an integer, foo2 accepts an integer and returns a string, and foo3 accepts a string and returns a float. You could string these together as x.foo1().foo2().foo3() and that would theoretically work. However, if you mix up the order of these functions, you will get errors that can be difficult to understand and debug.

My philosophy is to try to put one operation per line. I can’t say I always do this, but in general, I believe this to be a good practice.
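Here is a rough sketch of the difference, using the foo1, foo2 and foo3 example from above. The function bodies are invented for illustration, and they are written as plain functions rather than methods to keep the example short.

```python
def foo1(number: int) -> int:
    """Hypothetical step: accept an integer and return an integer."""
    return number * 2


def foo2(number: int) -> str:
    """Hypothetical step: accept an integer and return a string."""
    return str(number)


def foo3(text: str) -> float:
    """Hypothetical step: accept a string and return a float."""
    return float(text)


# Everything on one line: compact, but there is nowhere to inspect
# the intermediate values when something goes wrong.
result_chained = foo3(foo2(foo1(21)))

# One operation per line: each intermediate value has a name that can
# be logged, checked in a debugger, or covered by a test.
doubled = foo1(21)
as_text = foo2(doubled)
as_float = foo3(as_text)
```

Both versions compute the same value, but in the second one a failure points at a specific line and a specific intermediate value.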

Avoid Highly Nuanced Trickery

This too may be a little controversial, but I am not a big believer in using a lot of Python trickery in code. The rationale for this is that it is difficult to understand. The specific things that I avoid are:

  • List comprehensions
  • Slicing
  • Method chaining
  • Decorators

I’m sure there are others, but I like to keep my code simple. Along these same lines, try to avoid using regex when simple string manipulation will do. A word about slicing: in general, I do like Python’s ability to quickly get sections of lists and data structures. I like it a lot less when used with NumPy arrays, where I find it difficult to determine at a glance what the code is actually doing. If you don’t know what the code is doing, the opportunity for bugs to creep in is greater.
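To illustrate the regex point, here is a small sketch that pulls the pieces out of a file name both ways. The file name and its naming scheme are made up for the example.

```python
import re

# Hypothetical file name following a name_year_month.csv scheme.
filename = "sales_2019_08.csv"

# Regex version: it works, but the reader has to decode the pattern first.
match = re.match(r"^([a-z]+)_(\d{4})_(\d{2})\.csv$", filename)
regex_parts = match.groups()

# Plain string methods: one small, readable step per line.
stem, extension = filename.rsplit(".", 1)
report_name, year, month = stem.split("_")
```

The regex does validate the format more strictly, so it still has its place. When all you need is to split on a known separator, the string methods are easier to read and to debug.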

Use an Integrated Development Environment (IDE)

Ok, this is a big one. When I started coding, back in the dark ages, I coded using a text editor and thought that this was the right way to do it. For years, I tried using Eclipse to do some Java development and/or PHP work, but I always found that it was too complicated or wasn’t worth the time… Then Drill happened.

I started developing some pretty basic stuff for Apache Drill, and someone told me that I need to be using an IDE. I was never a big fan of Eclipse, so I decided to try IntelliJ and was hooked. The IDE helps you write better code by autoformatting your code, syntax highlighting, automatically looking up functions, catching all kinds of errors, etc. Additionally, IDEs have a series of debugging tools which will help you catch your mistakes without littering your code with print statements.

The bottom line here is that using an IDE will help you write better code and make fewer mistakes in the process. Jupyter Notebooks are great for experimentation, but when it is time to write production code, I would strongly suggest using an IDE such as PyCharm, Spyder or RStudio.

Include Docstrings

You may have noticed that a central theme of my tips here is documentation. There are two reasons for this. The most obvious is that when you document your code, you enable others to understand what your code does, or at least what it is supposed to do. There is a second benefit: when you document code, it forces YOU to understand your code. This may seem obvious since you wrote it, but quite often, when writing complex code, even the author doesn’t fully understand the code.

So, back to the topic: docstrings are special comments that are included in classes, functions and methods. Tools use these comments to automatically generate documentation such as Javadocs or, in the case of Python, the text you see when you call help(). Including docstrings in your code is like writing yourself a little note for the future. It will be a major timesaver as well because many IDEs can look up docstrings on the fly. This will help you six months down the road when you can’t remember what input that function you wrote called getAndParseData() actually needs.

In case you aren’t familiar with docstrings, here are some examples:

For R: (https://cran.r-project.org/web/packages/docstring/vignettes/docstring_intro.html)

test <- function(){
    #' This is my title line
    #'
    #' All of this text goes
    #' in the Description section
    #'
    #' This part goes in the Details!
    return()
}

?test

This is my title line

Description:

     All of this text goes in the Description section

Usage:

     test()
     
Details:

     This part goes in the Details!

For Python: (https://www.datacamp.com/community/tutorials/docstrings-python)

class Vehicles(object):
    '''
    The Vehicles object contains lots of vehicles

    Parameters
    ----------
    arg : str
        The arg is used for ...
    *args
        The variable arguments are used for ...
    **kwargs
        The keyword arguments are used for ...

    Attributes
    ----------
    arg : str
        This is where we store arg,
    '''
    def __init__(self, arg, *args, **kwargs):
        self.arg = arg

    def cars(self, distance, destination):
        '''We can't travel distance in vehicles without fuels, so here is the fuels

        Parameters
        ----------
        distance : int
            The amount of distance traveled
        destination : bool
            Should the fuels refilled to cover the distance?

        Raises
        ------
        RuntimeError
            Out of fuel

        Returns
        -------
        cars
            A car mileage
        '''
        pass
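To show the lookup side of this, here is a minimal sketch of reading a docstring back at runtime. The function get_and_parse_data is a hypothetical stand-in for the getAndParseData() mentioned earlier, and its parameters are invented for the example. inspect.getdoc returns the same cleaned-up text that help() displays.

```python
import inspect


def get_and_parse_data(path, delimiter=","):
    """Read a delimited text file and return its rows.

    Parameters
    ----------
    path : str
        Location of the file to read.
    delimiter : str
        Character separating the fields on each line.

    Returns
    -------
    list of list of str
        One inner list of fields per line in the file.
    """
    rows = []
    with open(path) as handle:
        for line in handle:
            rows.append(line.rstrip("\n").split(delimiter))
    return rows


# The docstring travels with the function, so tools can read it back
# without opening the source file.
documentation = inspect.getdoc(get_and_parse_data)
summary_line = documentation.splitlines()[0]
```

In an interactive session, help(get_and_parse_data) would display this same text.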

Write Unit Tests

This also is a big concept and one that I wish someone had taught me a long time ago. (Big thank you to Paul Rogers for introducing me to the wonders of unit testing!!) If you aren’t familiar with unit testing, the basic idea is that as you write code (yes… AS you write code), you also write a series of automated tests which verify the function’s functionality. Most languages have automated frameworks for this. Python has unittest, Java has JUnit, R has a package called testthat, etc.

The short example below demonstrates how unit tests are supposed to work. The basic idea is that you write automated tests which check the function’s output under a variety of conditions. For example, if you are writing a function, you would want to test it against a variety of input types, including empty or null values, and verify that the function performs as you expect it to.

def doSomething(x):
    # Do something with x: here, double it so the checks below hold
    return x * 2

resultWithInt = doSomething(5)
assert resultWithInt == 10

resultWithFloat = doSomething(4.0)
assert resultWithFloat == 8.0

Writing unit tests may seem really unnecessary, until you realize that quite often your code is part of a larger system, and you need to know how it will perform in a variety of situations. Also, whenever you make changes to code, quite often you may introduce other bugs (AKA regressions) which you can catch if you wrote unit tests.

Writing unit tests will also make ops and other development teams happy because they will know that your code will behave in a predictable manner.
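For a slightly fuller picture, here is a sketch of the same idea using Python’s built-in unittest framework. The function double_value and its rule of rejecting None are assumptions made up for the example, not part of any real project.

```python
import unittest


def double_value(x):
    """Hypothetical function under test: return twice the input."""
    if x is None:
        raise ValueError("x must not be None")
    return x * 2


class DoubleValueTests(unittest.TestCase):
    """Each test method checks one condition the function should handle."""

    def test_integer_input(self):
        self.assertEqual(double_value(5), 10)

    def test_float_input(self):
        self.assertAlmostEqual(double_value(4.0), 8.0)

    def test_zero_input(self):
        self.assertEqual(double_value(0), 0)

    def test_none_input_raises(self):
        with self.assertRaises(ValueError):
            double_value(None)


# Run the tests programmatically so the example is self-contained.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DoubleValueTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the tests would usually live in their own file and be run with python -m unittest, so that they are re-run every time the code changes.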

Adopt a Consistent Coding Style

This is another one that most data scientists don’t seem to be aware of, but in Python at least, there are official style guides such as PEP 8 (https://www.python.org/dev/peps/pep-0008/). It does not matter much which style guide you use, but you and your team should pick one and adhere to it. What’s really nice is that if you are using an IDE (see tip 6), it can enforce adherence to particular naming conventions, so that you don’t have to memorize them.

Using a style guide will help you avoid preventable errors, and in general make your code more understandable. As an example, PEP 8 has naming conventions for variables. Consider what will happen if one developer on your team writes variable names using camel case and another uses underscores. You’ll have a disaster on your hands if their code has to be merged. Likewise, PEP 8 has standards for class names, tabs vs. spaces, the formatting of constants, etc.
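As a quick sketch of what the PEP 8 naming conventions look like in practice: constants in upper case with underscores, class names in CapWords, and functions and variables in lower case with underscores. All of the names below are made up for illustration.

```python
# Constant: upper case with underscores.
MAX_RETRIES = 3


# Class name: CapWords.
class DataLoader:
    def __init__(self, source_name):
        # Attribute and parameter names: lower case with underscores.
        self.source_name = source_name

    # Method name: lower case with underscores.
    def describe_source(self):
        return f"source={self.source_name}, retries={MAX_RETRIES}"


# Variable names: lower case with underscores.
default_loader = DataLoader("somefile.csv")
description = default_loader.describe_source()
```

When everyone on the team follows the same pattern, you can tell what kind of thing a name refers to just from how it is written.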

Use a Logging Module

Wow! You made it to tip number 10!! Good for you! The final coding tip I have is to use a logging module. One key difference between production and non-prod code is logging. Let’s consider the following situation: you have some piece of complex code that executes but doesn’t quite do what you are expecting. What do you do? From my observations, most data scientists aren’t that familiar with debuggers, so they will put print() statements all over the code to determine where it is breaking.

This isn’t necessarily a bad approach, but Python, Java and pretty much every other language have a better way of doing it, and that is with a logging module. I’m most familiar with Python’s, so I’ll demonstrate how that works in the snippet below.

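import logging

# clientip and user are custom fields in FORMAT, so every call supplies them via extra
FORMAT = '%(asctime)-15s %(clientip)s %(user)-8s %(message)s'
# Set the level to DEBUG so the info and debug messages below are shown
logging.basicConfig(format=FORMAT, level=logging.DEBUG)
d = {'clientip': '192.168.0.1', 'user': 'fbloggs'}
logger = logging.getLogger('tcpserver')
logger.warning('Protocol problem: %s', 'connection reset', extra=d)
logger.info("This is just for your information", extra=d)

x = 4
# The message needs a %s placeholder for the argument that follows it
logger.debug("X is set to: %s", x, extra=d)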

There are several advantages to this:

  1. When you execute your code, you can set the logging level to reflect what you are trying to do. For example, if you are trying to debug your code, you can display the debugging messages as well, and if you are running your code in production, you can leave only the messages that would be critical.
  2. Unlike the print statement approach, you can leave these in your code when you are done, so that if bugs creep in later, you don’t have to mess with your code to track them down.
  3. You can redirect your logs to files so that you can keep a constant eye on execution times, data volumes, etc.
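The first and third points can be sketched as follows. The logger name and the messages are made up, and an in-memory io.StringIO buffer stands in for a log file so the example leaves nothing on disk; to write to a real file, you would attach a logging.FileHandler instead.

```python
import io
import logging

# In-memory stand-in for a log file (assumption made for this example).
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("pipeline_example")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # keep these messages out of the root logger

# While debugging: show everything from DEBUG upward.
logger.setLevel(logging.DEBUG)
logger.debug("Loaded %d rows", 100)
logger.warning("Found %d rows with missing values", 3)

# In production: same code, but only WARNING and above get through.
logger.setLevel(logging.WARNING)
logger.debug("This debug message is filtered out")
logger.error("Model training failed")

captured_lines = log_stream.getvalue().splitlines()
```

Notice that none of the logging calls had to be removed or edited to go from the debugging setup to the production setup. Only the level changed.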

TL;DR

This post ended up a bit longer than I originally intended, but I hope that my list will help you write better code and ultimately save time doing it. To recap, here’s the list:

  1. Use Descriptive Variable Names
  2. Make good use of functions to organize code
  3. Use pre-existing libraries whenever possible
  4. Avoid long method chains (One operation per line if possible)
  5. Avoid highly nuanced and complicated language constructs when simple ones will do the trick
  6. Use an IDE like PyCharm or RStudio for production code… (Sorry Jupyter, I love you, but not for production)
  7. Don’t forget to include doc strings and other language specific documentation
  8. Write unit tests as you go
  9. Adopt and adhere to a style standard like PEP8 or equivalent.
  10. Learn to use and love loggers
