I am currently attending Splunk .conf in Orlando, and a director at Accenture asked me a question that I thought merited a blog post: why don’t data scientists use or like Splunk? The inner child in me was thinking, “Splunk isn’t good at data science”, but the more seasoned professional in me articulated a more logical and coherent answer, which I thought I’d share whilst waiting for a talk to start. Here goes:
I cannot pretend to speak for any community of “data scientists”, but I do know a decent number of them, some very accomplished and some beginners, and not one would claim Splunk as one of their preferred tools. Indeed, when the topic of available tools comes up among my colleagues and the word Splunk is mentioned, it elicits groans and eye rolls. So let’s look at why that is the case:
Reason 1: Splunk Isn’t Open Source
Let’s start with the easy ones. The broader data science community, which I’m defining to include big data engineers as well as those who lean more towards mathematics and statistics, tends to gravitate towards open source platforms. Indeed, nearly all of the cutting-edge data science tools are open source (scikit-learn, TensorFlow, Jupyter, Hadoop, Spark, Drill, Keras, R, etc.). Every time I see a “new” Splunk capability, my immediate reaction is to ask myself what open source tool can do the same thing. Usually there is one that will do the same or better. At that point, the conversation drifts back to the question: why should I pay for something that I can get for free?
The closed-source nature of Splunk has other implications as well. If you want a particular feature and it isn’t available, you are at Splunk’s mercy to develop it. This is in striking contrast to open source tools, where if a particular feature doesn’t exist, you can build it and contribute it to the community, or, if there is enough interest, the community will build it and you can benefit. The upshot is that Splunk tends to lag behind the latest developments in data science.
On top of this, much of the more advanced functionality that a data scientist would be interested in, such as Splunk’s Machine Learning Toolkit (MLTK), is based on open source libraries. The MLTK, for instance, is essentially a limited wrapper for scikit-learn, which leads me back to the question: why would I want to pay to use a tool I can get for free?
Reason 2: Splunk is Expensive
Splunk is a proprietary tool, and its pricing is based on how much data you ingest. This means that the more data you use, the more it will cost you. This is directly antithetical to how data scientists think.
In general, data scientists want to use whatever data is available, and quite often merge multiple disparate data sets together. As a data scientist, I don’t want to have to worry that incorporating data set X is going to cost me more money. It’s my data, dammit, I want to use it! Doctors don’t consider the costs of the treatment as they are treating a patient, and likewise data scientists don’t want to have to think about the license costs of using their own data. Oh… and that license cost is quite expensive for large projects. (https://www.learnsplunk.com/splunk-pricing—splunk-licensing-model.html)
Now you might argue that you pay for data indirectly with any tool, in the form of compute and storage costs. However, those costs apply whether or not you use Splunk, so the license fee comes on top of them.
Reason 3: Splunk Isn’t Taught in Data Science Schooling
I think you could argue that most people who are employed in data science jobs today come from either a CS- or math-dominated background. If you have a CS background, you will be comfortable writing code, and so languages like Java or Python will appeal to you. If you come from a more academic or science/math background, you have probably worked with either MATLAB or R, and hence these will be your tools of choice.
From an institutional perspective, I teach data science classes for Metis as well as my own company, GTK Cyber, and all these programs use the Python/Pandas/scikit-learn ecosystem as their technical stack. I am program chair for Brandeis University’s Masters Program in Strategic Analytics, and our program uses a mixture of R and Python, as do most academic programs. While we could examine why this is the case, for the purposes of this article, let’s just accept the fact that it IS the case. As a result, new data scientists aren’t taught to use Splunk, and so Splunk isn’t penetrating the data science community. In general, academic institutions have an aversion to teaching commercial products anyway, so I don’t think Splunk will make much headway in academia.
Reason 4: Splunk’s Query Language is Proprietary
I don’t find SPL (Splunk’s query language) particularly intuitive, and my time is limited. If I am going to invest the time to learn a language, there had better be value in it for me, and if that language is only used by one tool, it is difficult for me to justify learning it.
The lack of transferability also applies to reason 3. If a language is only used in one proprietary tool, it is difficult for an educational institution to justify teaching it. In contrast, SQL is used by many proprietary and open source databases, and hence is taught in academia.
Reason 5: It Can Be Difficult to Get Data into Splunk
In order to analyze data in Splunk, you first must ingest that data into your Splunk installation (and pay for it). In simple cases, this is not difficult. However, if your data is ugly, or complex in some way, or lives in a variety of systems, then this can get complicated really quickly, to the point where you will need a Splunk engineer or professional services to do it.
Splunk works very well with certain kinds of data–particularly time series data and log files. However, if your data doesn’t fit that description, you very well may encounter serious difficulties getting your data into Splunk. In contrast, both Python and R have a robust collection of modules which will enable you to parse all kinds of data quickly, easily and without having to pay licensing costs or move the data around.
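To make that last point a bit more concrete, here is a minimal sketch using nothing but Python’s standard library. Everything in it is made up for illustration: the sample records, the field names (user_id, country, logins) and the left_join helper are my own assumptions, not anything from a real system. It simply reads a CSV source and a JSON source where they sit and joins them on a shared key.

```python
import csv
import io
import json

# Hypothetical sample data standing in for two unrelated sources.
CSV_TEXT = """user_id,country
1,US
2,DE
3,FR
"""

JSON_TEXT = '[{"user_id": 1, "logins": 14}, {"user_id": 3, "logins": 2}]'


def load_csv(text):
    """Parse CSV text into a list of dicts, converting user_id to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["user_id"] = int(row["user_id"])
        rows.append(row)
    return rows


def load_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def left_join(left, right, key):
    """Left join two lists of dicts on `key`; unmatched left rows are kept."""
    lookup = {record[key]: record for record in right}
    return [{**record, **lookup.get(record[key], {})} for record in left]


merged = left_join(load_csv(CSV_TEXT), load_json(JSON_TEXT), "user_id")
for row in merged:
    print(row)
```

In practice I would reach for Pandas rather than hand-rolling the join, but the point stands either way: the data is parsed in place, with no ingestion step and no license meter running.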
Reason 6: Splunk Doesn’t Really Offer Any Advantage over Open Source Tools
Ok, this may be controversial and is totally my opinion, but I haven’t seen Splunk do anything that made me say that I really want to use it. For me, it is faster to use a programming language such as Python for my work. If I am exploring unknown data, I’ll use Drill or Python/Pandas. For data science specifically, there are many other platforms, both open source and proprietary, that handle a data scientist’s tasks much better than Splunk.
I’m not writing this to suggest that Splunk is a bad tool. However, it has not penetrated the data science community, and I wanted to put forward some reasons as to why that might be the case. What do you think?