The image above is a late 50’s MGA convertible. I picked this because I happen to think that this car is one of the most elegantly designed cars ever made. Certainly in the top 50. While we as people place a lot of emphasis on design when it comes to physical objects that we use, when it comes to software, a lot of our software’s design looks more like the car below: This vehicle looks like it was designed by a committee that couldn’t decide whether they were designing a door stop or a golf cart.
I’ve been doing a lot of thinking lately (for reasons that will become apparent in a few weeks) about the lack of good software engineering practices in the data science space. The more I think about it, the more I am shocked with the questionable design that exists in the data science ecosystem, particularly in Python. What’s even more impressive is that the python language is actually designed to promote good design, and the people building these modules are all great developers.
As an example of this, I recently read an article about coding without if statements and it made me cringe. Here is a quote:
When I teach beginners to program and present them with code challenges, one of my favorite follow-up challenges is: Now solve the same problem without using if-statements (or ternary operators, or switch statements).
You might ask why would that be helpful? Well, I think this challenge forces your brain to think differently and in some cases, the different solution might be better.
There is nothing wrong with using if-statements, but avoiding them can sometimes make the code a bit more readable to humans. This is definitely not a general rule as sometimes avoiding if-statements will make the code a lot less readable. You be the judge.https://medium.com/edge-coders/coding-tip-try-to-code-without-if-statements-d06799eed231
Well, I am going to be the judge and every example the author cited made the code more difficult to follow. The way I see it, hard to follow code means that you as the developer are more likely to introduce unintended bugs and errors. Since the code is difficult to follow, not only will you have more bugs, but these bugs will take you more time to find. Even worse, if you didn’t write this code and are asked to debug it!! AAAAHHHH!! I know if that were me, I wouldn’t even bother. I’d probably just rewrite it. The only exception would be if it was heavily commented.
I’ll also add that when you have code like this, it is more difficult to write unit tests for it that take into account all the various things that could go wrong. Consider the following code:
if x > 100: # do something elif x >= 50: # do something else else: # option 3
This code is super easy to test and prove that it it works in all cases. Clearly you have to write a unit test for each if clause, and then for some edge cases, such as x is undefined or x is not a number. Even a non-coder would not have much difficulty figuring out what is going on here.
Python Encourages Good Design
The developers of Python recognized that good design is a vital component of good coding and structured the language to encourage good practices. As an example, I’m going to quote PEP20: The Zen of Python.
Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. Although that way may not be obvious at first unless you're Dutch. Now is better than never. Although never is often better than *right* now. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. Namespaces are one honking great idea -- let's do more of those!
These are very good principles to think about when writing Python code. Unfortunately despite the founders best intentions, developers found many ways to create difficult and even horrendous code with Python.
Let’s take a look at Pandas. Let me start by saying that Pandas is a wonderful module. It’s simply brilliant and a huge game changer in how you code. With that said, Pandas regularly violates the principle of “There should be one– and preferably only one– obvious way to do it.”
In pandas, you can access columns in a dataframe either with dotted notation:
OR you can use bracket notation:
The problem here is that these are not functionally equivalent! There are several major gotcha’s involved with the dot notation, but the fact that both exist is problematic. Additionally, there are many methods such as
drop_duplicates() which seem functionally equivalent but in fact are not. Unfortunately, usually you don’t discover the difference between these kinds of functions until 2AM when you’ve been coding all day and are trying to debug some code that looks like it should work, but doesn’t.
Let’s consider another bit of awesomeness: Matplotlib. On the one hand, this library is awesome. On the other hand, the design is incredibly confusing. For me, MatPlotLib was a complete mystery until I read Effectively Using Matplotlib by Chris Moffitt, which explained that there are two interfaces for MatPlotLib: a state based interface and an object oriented interface. WHAT??!?
The image on the left illustrates all the components of a visualization rendered with MatPlotLib. There’s a lot here, but the issue is that there are many ways to modify each one. If there was one, and preferably one obvious way to change these, I suspect that MatPlotLib would make a lot more sense to a lot more people.
I almost forgot, I have another gripe with Pandas and that is the json.normalize function. This is an awesome package that, from my experience, a lot of people don’t know about. The ability to properly read nested JSON data is hugely important, and yet, in order to do that, you have to know about this package and specially import it.
Design is Important, Especially for Software
It seems silly to state this, but good design is really important. From a practical perspective, good design saves money. How so you might ask? Well, let’s start with the obvious, a well designed product will take less time for a user to learn and adopt, thereby increasing productivity. A well designed product will take less training to teach users how to use it and fewer resources to support.
For the developers of open source software, you have a vested interest in good design as well. If you create a well designed product, more people will use it, and fewer people will bother you with “how do I?” questions 😉
Learn More at my Upcoming Workshop!
If you’re interested in learning more about how to write well designed code that will get through code reviews and into production easier, then you might want to check out my upcoming workshop on Safari Online: https://learning.oreilly.com/live-training/courses/first-steps-writing-cleaner-code-for-data-science-using-python-and-r/0636920391661/