People often ask me about starting a career in data science, or for advice on which tech skills they should acquire. When I get asked these questions, I try to have a conversation with the person to see what their goals and aspirations are, as there’s no advice I can give that is universal. That said, here are five pointers that I would say are generally helpful for anyone starting a career in data science or data analytics.
Tip 1: Data Science is a Big Field: You Can’t Know Everything About Everything
When you start getting into data science, the breadth of the field can be overwhelming. It seems that you have to be an expert in big data systems, relational databases, computer science, linear algebra, statistics, machine learning, data visualization, data engineering, SQL, Docker, Kubernetes, and much more, to say nothing of subject matter expertise. One of the big misunderstandings I see is the perception that you have to be an expert in all these areas to get your first job.
The first thing that I tell many aspiring data scientists is that the most important thing a data scientist can do is extract value from data. Make that your mantra and let it guide you in your skill development. If your company is a big user of tool X, then learn tool X. But don’t feel that you must be an expert in tools X, Y and Z to call yourself a data scientist. You don’t, and in my view, it’s better to have a solid footing in a few tools and techniques than a shallow understanding of many.
There are a few exceptions. I do believe that anyone who wants a career in data science should have a solid understanding of SQL, a few basic machine learning algorithms, and one of the scripting languages commonly used in data science, like Python or R. Really though, as I said earlier, data science is about getting value from data, and what is more important than all of that is the ability to understand a business problem and apply data science techniques to solve it.
Tip 2: Solve for Efficiency
The skills and tools you should focus on are the ones that enable you to solve business problems as quickly and efficiently as possible. Take automated machine learning as an example. If you’re not familiar with AutoML, I’d encourage you to take a look at TPOT, which is an open-source AutoML library. Once you’ve extracted your features, TPOT uses genetic programming to find the best machine learning pipeline and even generates Python code for this pipeline.
What’s important here is that TPOT and similar commercial offerings are making it easier and easier to build machine learning models. What this means for aspiring data scientists is that unless you are interested in working on algorithm development, building and tuning models by hand is an area you probably shouldn’t spend too much time on, as it is very likely to be automated in the foreseeable future. I suspect that many data scientists realize this and are perhaps a little scared of this reality. While TPOT and other automated solutions won’t always get you the best model, they will come very close, and the question becomes whether it is a good use of time to go after the last 0.02% improvement in model performance.
Data ingestion, cleaning, and ETL in general are a major drain on most data scientists’ time. For a long time, I have been a big fan of the Apache Drill project, which enables you to query self-describing data using SQL. Since there is a Python module that can query Drill and seamlessly import the data into a pandas DataFrame, it suddenly becomes trivial (and time efficient) to query arbitrary data and get it into a vectorized data structure. What’s more, you can couple this with auto-summarizing libraries such as pandas-profiling, so you can go from raw data to exploratory summaries in about 2-3 lines of code. Combine this with the aforementioned automated machine learning tools and you can be building models in considerably less time than if you were doing all this manually.
Tip 3: Data Is Never Clean: Deal With It
I have witnessed many a newly minted data scientist start on a project only to discover with horror that the data is corrupted, incomplete, difficult to access, or requires considerable effort to use, far more than a Kaggle dataset or the standard ones used in data science bootcamps.
The impure state of data is, was, and always will be one of the major challenges of data science, so my advice to aspiring data scientists is to get good at dealing with imperfect data. What I mean is that as you develop your skills, you should focus some effort on tools and techniques that enable you to work with difficult datasets. I am a big fan of Apache Drill because it enables me to access and query large amounts of difficult data quickly without having to write code. There are certainly other tools out there, but as you develop your skills, do so with the goal of finding the most efficient ways of accessing and manipulating data of all varieties.
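To make this a little more concrete, here is a minimal sketch of what defensive handling of imperfect data can look like, using only Python’s standard library. Everything in it is invented for illustration: the sample records, the column names, and the two accepted date formats (I am assuming month-first dates for the slash format). Real data will need its own rules. The idea is simply to normalize what you can and record a missing value, rather than crash, on what you can’t.

```python
import csv
import io
from datetime import datetime

# Hypothetical messy export: inconsistent date formats, stray whitespace,
# missing values, and a currency symbol mixed into a numeric column.
RAW = """customer,signup_date,monthly_spend
 Alice ,2021-03-04,$120.50
Bob,03/07/2021,
carol,not available,88
,2021-04-01,42.10
"""

# Assumed formats for this example: ISO dates and US-style month/day/year.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format; return None instead of raising on bad input."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_amount(text):
    """Strip currency symbols and commas; return None for blanks or junk."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def clean_rows(raw_text):
    """Yield normalized rows, skipping records with no customer name."""
    for row in csv.DictReader(io.StringIO(raw_text)):
        name = row["customer"].strip().title()
        if not name:
            continue  # nothing identifies this record, so drop it
        yield {
            "customer": name,
            "signup_date": parse_date(row["signup_date"]),
            "monthly_spend": parse_amount(row["monthly_spend"]),
        }


rows = list(clean_rows(RAW))
for row in rows:
    print(row)
```

Whether to drop, flag, or impute a bad value is a judgment call that depends on the business problem; the sketch keeps partial records and only discards rows it cannot identify.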
Tip 4: Data Science is More than Machine Learning
Often when you look at data science curricula at either universities or bootcamps, you see that there is a heavy focus on machine learning. Clearly machine learning is a key component of data science, but data science is so much more than that. Really it is about identifying the correct technique to get value from the data. Sometimes that solution is quite simple statistics; other times it involves complex machine learning models. The point is that you, as the data scientist, need to prescribe the right solution for your stakeholders.
A personal story: I was working for a client, and it turned out that one of the most valuable analytics I built for them basically took two datasets and joined them together. I cannot discuss the details, and the mechanics were complex, but this simple analytic drove policy and involved absolutely zero machine learning.
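I can’t show the real thing, but to give a feel for how little machinery a join needs, here is a toy sketch in plain Python. The datasets, field names, and the `left_join` helper are all hypothetical and have nothing to do with the client work; they only illustrate the general idea of matching records on a shared key and looking at what falls out.

```python
# Hypothetical datasets: names and fields are invented for illustration.
shipments = [
    {"vendor_id": "V1", "item": "widgets", "quantity": 500},
    {"vendor_id": "V2", "item": "gears", "quantity": 120},
    {"vendor_id": "V9", "item": "bolts", "quantity": 75},
]
vendors = [
    {"vendor_id": "V1", "name": "Acme", "status": "approved"},
    {"vendor_id": "V2", "name": "Globex", "status": "suspended"},
]


def left_join(left, right, key):
    """Attach matching right-hand fields to every left-hand row.

    Rows with no match are kept and flagged, since unmatched records
    are often the interesting finding.
    """
    # Assumes `key` is unique in `right`; later duplicates would overwrite.
    lookup = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = lookup.get(row[key])
        joined.append({**row, **(match or {}), "matched": match is not None})
    return joined


joined = left_join(shipments, vendors, "vendor_id")

# Shipments from vendors that are unknown or not approved.
flagged = [r for r in joined if not r["matched"] or r["status"] != "approved"]
for r in flagged:
    print(r["vendor_id"], r["item"])
```

In practice you would more likely do this in SQL or with a pandas merge, but the point stands: no model is involved, only two tables and a key.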
Tip 5: Don’t Tell Me Your Worth, Prove It!
I’ve spoken with many people who have just finished a bootcamp or other data science training program, and their questions generally revolve around how to get that first job. If you don’t have a lot of professional experience, my suggestion is to find a passion project that can be shared, and share it! Use your newly acquired skills on something that you are genuinely interested in. I’ve seen projects about sports analytics, restaurant data, you name it. Document your journey on GitHub and/or a blog. It really doesn’t matter what the problem is, but work on it and use it when you go on interviews.
As an employer, this shows me a few things. First, it shows that you are capable of solving non-scripted problems. This is important, as real-world problems don’t come with a script to follow. It also shows that you are capable of carrying a project from end to end to create actual value for a stakeholder. Again, this is really important, as this is what data scientists are actually supposed to do. Finally, you can showcase your technical skills in a meaningful way.
Bonus Tip: Be Kind and Help Others
One of the unfortunate downsides of the data science profession is that there are many arrogant people in it. Having an understanding of machine learning does not make you superior to anyone. So, my final tip is this: now that you have joined the data club, instead of looking down at people who aren’t experts in data, take the opportunity to help them and educate them, not from a perspective of superiority, but from a perspective of knowledge sharing. Personally, I mentor as many people as my time allows and view myself as an ambassador of the profession. I believe that this is a good practice for data scientists to adopt, as many people are not familiar with our discipline, and you don’t want to leave them with the impression that data scientists are arrogant jerks.