Let me start out by saying this is purely hypothetical as I’ve never attended a data science bootcamp as a student, but I have taught them and have reviewed MANY curricula. I’ve also mentored a decent number of bootcamp graduates. In general, what I see is that bootcamps tend to place a lot of emphasis on machine learning, but there’s a lot more to being a successful data scientist. Below are five areas which I believe would benefit any aspiring data scientist.
SQL
Let’s start with an easy one. SQL. Despite all the trashing that it gets, SQL is an enormously useful language to know. For all the hype one hears about NoSQL and other non-relational datastores, SQL is still in widespread use and is not likely to go anywhere anytime soon. Let me tell you why you should learn SQL.
Firstly, a lot of data is in relational databases, but on top of that, there are more and more systems that query big data stores using SQL or SQL-like languages. This would include platforms like Drill, Presto and Impala, but also tools like Cassandra, Hive and Apache Spark. So knowing SQL means that you can interact with these systems.
The second reason is that you can do a lot of data processing and cleanup directly on the source system if you are good at SQL. One thing I have observed with many new data scientists is that they will query a database (or some other SQL-based system), pull back their data, and only then perform aggregation and other summarization locally. The issue with this approach is that it will often be WAY easier and more efficient to do that cleanup and summarization directly on the database, but you have to know SQL to do that. This also lets you take full advantage of the database’s processing power instead of having to download the data to your local machine to process it.
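To make the difference concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The orders table and its values are made up for illustration; a real source system would be a remote database, where the savings from not shipping every row over the network are much larger.

```python
import sqlite3

# Hypothetical in-memory table standing in for a real source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("east", 15.0), ("west", 7.5), ("west", None)],
)

# Approach 1: pull every row back, then clean and summarize on the client.
totals_client = {}
for region, amount in conn.execute("SELECT region, amount FROM orders"):
    if amount is not None:  # cleanup done by hand
        totals_client[region] = totals_client.get(region, 0.0) + amount

# Approach 2: let the database filter and aggregate, so only the
# summary rows travel back to the client.
query = """
    SELECT region, SUM(amount) AS total
    FROM orders
    WHERE amount IS NOT NULL
    GROUP BY region
    ORDER BY region
"""
totals_db = dict(conn.execute(query).fetchall())
conn.close()
```

Both approaches produce the same totals, but the second one returns one row per region instead of one row per order.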
Finally, SQL makes it easy to share data with others. If I have a large dataset that I want to share, it is much easier to share a short SQL query with someone rather than sending the entire data set. If you are interested in learning SQL, I recommend SQL for Mere Mortals by John Viescas or Learning SQL by Alan Beaulieu.
Good Software Design Practices
This is a big topic, but from what I’ve seen of data science bootcamps, most of the students do not have a computer science background, and you really see this hurt them in their coding practices. Most data science programs use Python or R, which are amazing languages. However, being able to code well is vital if your code is going to find its way into production systems. I wrote a post last year about good coding practices for data scientists and I’m not going to rehash it here, but I would strongly recommend that any aspiring data scientist learn about good coding practices.
Good coding practices will save you time, decrease frustration (and f-bombs), help you write more maintainable code, and in general help you deliver value faster to your stakeholders. I’m planning on releasing a video series about this topic so stay tuned.
Data Engineering
Maybe this should be higher up on the list, but one of the things that I dislike about Kaggle is that you are handed data on a virtual silver platter and a data scientist (or automated ML program) only has to find the optimal algorithm and tuning parameters to win the contest. From my experience, this is not where data scientists struggle and indeed, more and more tools are being introduced that automate this work. Where data scientists struggle is getting data from raw sources in a real world environment. In the real world, data is often incomplete, corrupt or otherwise difficult. It can come from a variety of systems and formats, and is virtually never as simple as pd.read_csv().
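As a small illustration of what “incomplete or corrupt” can look like, here is a sketch using only Python’s standard library. The RAW export and the load_rows helper are hypothetical, invented for this example; the idea is simply to set bad rows aside for inspection rather than crash on them or silently keep them.

```python
import csv
import io

# Hypothetical raw export: a blank amount, a non-numeric amount,
# a row with a missing column, and inconsistent country casing.
RAW = """id,amount,country
1,10.5,US
2,,DE
3,n/a,FR
4,7
5,3.25,us
"""

def load_rows(text):
    """Split rows into (clean, rejected) instead of failing on the first bad one."""
    clean, rejected = [], []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            clean.append({
                "id": int(row["id"]),
                "amount": float(row["amount"]),
                "country": row["country"].strip().upper(),
            })
        except (TypeError, ValueError, AttributeError):
            # Missing or malformed fields end up here for later review.
            rejected.append(row)
    return clean, rejected

clean, rejected = load_rows(RAW)
```

In this made-up file only two of the five rows survive, which is a fair picture of why the loading step deserves as much attention as the modeling step.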
Now, I’m not suggesting that data science bootcamps haze their new students with awful data, but I do think it is extremely valuable for new data scientists to have an understanding of how things work in real enterprises. For example, a basic understanding of relational databases and their function in the enterprise is very helpful as many organizations have data stored in databases. I won’t dwell on this point, but to all aspiring data scientists, I would make sure that you have at least a conceptual understanding of how various data platforms work because you never know when you may have to work with one of them.
Enterprise IT Management
This is another area which I think could be better taught in bootcamps: enterprise IT management. I suspect that a lot of data scientists graduate from their programs, get jobs with large organizations and find themselves in conflict with the IT groups in their company for one reason or another. New data scientists often don’t understand why IT doesn’t have the latest version of software package X or why they have to work on a Windows 7 machine with Internet Explorer 8. (Ok… I wonder that also).
On the flip side, the IT managers and engineers probably look at the data scientists as troublesome pests who are trying to subvert their carefully orchestrated systems with their fancy “Jupyter notebooks”. </sarcasm>
In all seriousness, I’ve primarily worked with large organizations that existed before the age of the internet, so I think I’m a little biased here, but from my experience, data scientists need to understand how to work with enterprise IT. Firstly, be prepared for the fact that older organizations may have to deal with pesky rules and regulations about how their data is used. In many cases, if this data is misused or leaked to the wrong parties within the company or to the public, it could result in massive fines, legal ramifications or, in the case of medical organizations, even deaths. Needless to say, organizations like this take their IT and data very seriously as the stakes are a lot higher if something goes wrong. As you might expect, they also have many controls in place to make sure that their systems stay secure and maintain the required levels of uptime. The thing to understand is that these processes are fairly standard and well understood across industries. Bottom line: understanding these processes will help you work WITH your engineers and IT staff instead of against them.
Creating Business Value
My final topic is understanding how to actually create business value. This requires an in-depth understanding of the business in which you are working, the data available to it and the problems it faces. In my experience, quite often, businesses don’t need a whiz-bang deep learning solution right away. I would encourage data scientists, especially those who are entering established organizations without a long history of using data, to adopt an entrepreneurial approach to their work. Since many stakeholders will not be familiar with the tools and techniques of data science, it is incumbent upon the data scientist to educate them in a non-confrontational, non-condescending manner so that they will support the work.
I hope I’ve given you something to think about as you go forward in your data science career. If I can be of help, please don’t hesitate to contact me.