I saw this image on LinkedIn a few days ago and realized that it is proof of the future of data science. The image is of the leaderboard from a Kaggle competition, which isn’t particularly remarkable, but what is remarkable is the competitor in 2nd place: Google AutoML. Not only did AutoML come in 2nd, but it did so in fewer entries and the score was 0.00093 off of first place.
You might be thinking, “Well that’s Google, and they have the best stuff. Technology like that will not be available to the masses anytime soon, and if it is, it will require massive clusters.” Au contraire mon frère. There are a slew of new Python modules which automate various phases of the machine learning process. My personal favorite of which is called TPOT. TPOT is a Python module which automates the entire machine learning process, and generates python code for your entire pipeline.
I did a little experiment with TPOT and was able to build a model with data from Kaggle that scored in the top 10 for a simple exercise.
At some point, Google will likely make their AutoML available to the public if it isn’t already, and data scientists will have to prove that their value over automated machine learning tools.
So What?
The significance of this is enormous. Since the coining of the term data science, many people have focused very heavily on the math and machine learning aspects of data science. These aspects are certainly important, but these steps can be automated, as you can see, with ever improving performance. What this means in the long run is that as available computing power increases and these tools get better and faster, the understanding of the inner workings of the algorithms will become less and less important. (This is not true if you are working at a really cutting edge company that is developing new algorithms, or doing academic research.)
Therefore, if you are a data scientist or an aspiring data scientist, should you quit now? Hardly. Automated ML is really exciting because it will enable you to focus on the things that computers can’t do, and likely won’t ever be able to do, which are: conceiving and defining data problems, communicating the results to stakeholders, as well as the data cleaning/feature engineering steps. Automated machine learning will enable or force data scientists to focus on tasks that truly require human thought and using data science to add value to their organizations.