Why don’t Data Scientists use Splunk?

Published October 5, 2018

I am currently attending the Splunk .conf in Orlando, and a director at Accenture asked me this question, which I thought merited a blog post: why don’t data scientists use or like Splunk? The inner child in me was thinking, “Splunk isn’t good at data science”, but the more seasoned professional in me actually articulated a more logical and coherent answer, which I thought I’d share whilst waiting for a talk to start. Here goes:

I cannot pretend to speak for any community of “data scientists” but it is true that I know a decent number of data scientists, some very accomplished and some beginners, and not a one would claim to use Splunk as one of their preferred tools. Indeed, when the topic of available tools comes up among most of my colleagues and the word Splunk is mentioned, it elicits groans and eye rolls. So let’s look at why that is the case:

Reason 1: Splunk Isn’t Open Source

Let’s start with the easy ones. The broader data science community, which I’m defining to include big data engineers as well as those who lean more towards mathematics and statistics, tends to gravitate towards open source platforms. Indeed, all of the cutting edge data science tools that are available are open source (scikit-learn, TensorFlow, Jupyter, Hadoop, Spark, Drill, Keras, R, etc.). I know that every time I see a “new” Splunk capability, my immediate reaction is to ask myself what open source tool can do the same thing. Usually there is one that will do the same or better. So at that point, the conversation drifts back to: why should I pay for something that I can get for free?

The closed-source nature of Splunk has other implications as well. If you want a particular feature, and it isn’t available, you are at Splunk’s mercy to develop it. This is in striking contrast to open source tools, where if a particular feature doesn’t exist you can build it and contribute it to the community. Or if there is enough interest, the community will build it and you can benefit. What this ultimately boils down to is that Splunk is behind the latest developments in data science.

On top of this, much of the more advanced functionality that a data scientist would be interested in, such as Splunk’s Machine Learning Toolkit (MLTK), is based on open source libraries. The MLTK for instance is just a limited wrapper for scikit-learn. Which leads me back to the question of why would I want to pay to use a tool I can get for free?

Reason 2: Splunk is Expensive

Splunk is a proprietary tool and their pricing is based on how much data you ingest into Splunk. This means that the more data you use, the more it will cost you. This is directly antithetical to how data scientists think.

In general, data scientists want to use whatever data is available, and quite often that means merging multiple disparate data sets together. As a data scientist, I don’t want to have to worry that incorporating data set X is going to cost me more money. It’s my data dammit, I want to use it! Doctors don’t consider the costs of the treatment as they are treating a patient, and likewise data scientists don’t want to have to think about the license costs of using our own data. Oh… and that license cost is quite expensive for large projects. (https://www.learnsplunk.com/splunk-pricing—splunk-licensing-model.html)

Now you might argue that you have to pay for the data indirectly anyway, in the form of compute and storage costs. However, those costs apply whether or not you use Splunk, so the Splunk license is an additional cost on top of them.

Reason 3: Splunk Isn’t Taught in Data Science Schooling

I think you could argue that most people who are employed in data science jobs today come from either a CS or a math dominated background. If you have a CS background, you will be comfortable writing code and so languages like Java or Python will appeal to you. If you come from a more academic or science/math background, you have probably worked with either Matlab or R and hence these will be your tools of choice.

From an institutional perspective, I teach data science classes for Metis as well as my own company GTK Cyber, and all these programs use the Python/Pandas/Scikit-Learn ecosystem as their technical stack. I am program chair for Brandeis University’s Masters Program in Strategic Analytics and our program uses a mixture of R and Python, as do most academic programs. While we could examine why this is the case, for the purposes of this article, let’s just accept the fact that it IS the case. As a result, new data scientists aren’t taught to use Splunk, and so Splunk isn’t penetrating the data science community. In general, academic institutions have an aversion to teaching commercial products anyway, so I don’t think Splunk will make much headway in academia.

Reason 4: Splunk’s Query Language is Proprietary

I am lazy. I’ve put in my time learning programming languages and have written code in Java, C, R, Python, PHP, JavaScript and others. I’m also pretty good at SQL. Personally, I don’t want to learn yet another language, especially if that doesn’t transfer to any other tool.

I don’t find SPL particularly intuitive and my time is limited, so personally, if I am going to invest the time to learn a language, there had better be value in it for me and if that language is only used by one tool, it is difficult for me to justify learning it.

The lack of transferability also applies to reason 3. If a language is only used in one proprietary tool, it is difficult for an educational institution to justify teaching it. In contrast, SQL is used by many proprietary and open source databases, and hence is taught in academia.
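To make the transferability point a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and sample rows are made up for illustration. The query itself is plain SQL (GROUP BY, COUNT, ORDER BY) of the kind that most relational databases should accept with little or no change.

```python
import sqlite3

# Hypothetical sample data: (host, HTTP status) pairs invented for this sketch.
SAMPLE_ROWS = [("web01", 200), ("web02", 500), ("db01", 200)]


def count_by_status(rows):
    """Load rows into an in-memory SQLite table and count events per status."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (host TEXT, status INTEGER)")
        conn.executemany("INSERT INTO events (host, status) VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT status, COUNT(*) AS n "
            "FROM events GROUP BY status ORDER BY status"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(count_by_status(SAMPLE_ROWS))
```

The skill being exercised here is the SQL, not the database: swapping SQLite for another engine mostly means changing the connection line.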

Reason 5: It can be difficult to get data into Splunk

In order to analyze data in Splunk, you first must ingest that data into your Splunk installation (and pay for it). In simple cases, this is not difficult. However, if your data is ugly, or complex in some way, or lives in a variety of systems, then this can get complicated really quickly, to the point where you will need a Splunk engineer or professional services to do it.

Splunk works very well with certain kinds of data–particularly time series data and log files. However, if your data doesn’t fit that description, you very well may encounter serious difficulties getting your data into Splunk. In contrast, both Python and R have a robust collection of modules which will enable you to parse all kinds of data quickly, easily and without having to pay licensing costs or move the data around.
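As a rough illustration of what I mean, the sketch below uses only modules from the Python standard library (csv, json and xml.etree.ElementTree) to read three differently formatted inputs into one common structure. The sample records and field names are invented for this example, and real data will of course be messier.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample records in three formats (made up for illustration).
CSV_TEXT = "host,status\nweb01,200\nweb02,500\n"
JSON_TEXT = '[{"host": "db01", "status": 200}]'
XML_TEXT = '<events><event host="app01" status="404"/></events>'


def parse_csv(text):
    """Parse CSV text into a list of dicts with an integer status."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"host": row["host"], "status": int(row["status"])} for row in reader]


def parse_json(text):
    """Parse a JSON array of objects into the same list-of-dicts shape."""
    return [{"host": r["host"], "status": int(r["status"])} for r in json.loads(text)]


def parse_xml(text):
    """Parse <event host=... status=.../> elements into the same shape."""
    root = ET.fromstring(text)
    return [
        {"host": e.get("host"), "status": int(e.get("status"))}
        for e in root.iter("event")
    ]


def load_all():
    """Combine all three sources into one list of records."""
    return parse_csv(CSV_TEXT) + parse_json(JSON_TEXT) + parse_xml(XML_TEXT)


if __name__ == "__main__":
    for record in load_all():
        print(record)
```

Once everything is in one shape, it can be handed to whatever analysis tool you prefer, without first moving the data into a separate platform.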

Reason 6: Splunk Doesn’t Really Offer any Advantage over Open Source Tools

Ok, this may be controversial and is totally my opinion, but I haven’t really seen Splunk do anything that made me say that I really want to use Splunk. For me, it is faster to use a coding language such as Python for my work. If I am exploring unknown data, I’ll use Drill or Python/Pandas. For data science specifically, there are so many other platforms, both open source and proprietary, that do the work of a data scientist much better than Splunk.

Conclusion

I’m not writing this to suggest that Splunk is a bad tool. However, it has not penetrated the data science community, and I wanted to put forward some reasons as to why that might be the case. What do you think?

Published in Data Science, General Thoughts and Uncategorized

9 Comments

msdhonivenky

Inspirational content. I gained good knowledge from the above content on Data Science, which is useful for all aspirants of Data Science training.

February 12, 2019
Vadim

The fact that they don’t allow code review leads to the thought that it could be another spying tool, which does not encourage anyone to start using it.

March 6, 2019
bucweat

Your argument above is a very good one and I agree with all your points. However, one thing to consider is, at least for the US Government, open source software is frowned upon and in many cases not allowed due to the perception (true or not) that open source contains security flaws. IA/IT folks constantly push for commercial software like Splunk because they believe that the vendors who they pay lots of money to are ensuring that their software does not contain such security flaws.

March 6, 2019
- Charles Givre
  
  My experience in the USG is quite the opposite. I believe in fact that there are directives that agencies maximize the use of open source software. Also my experience as a PMC member on the Apache Drill project has taught me a lot about open source governance, specifically, for Apache products at least, how much review actually does go into open source software.
  
  March 17, 2019
PinkyandtheBain

I personally do not agree with most of the points you described here.

1) Why should I care that the tool is not open source? So far, that has not prevented any data analyst from working with and querying data from Oracle or Microsoft products, for example (DB, data warehouse, ETL etc.). Regarding Splunk itself, it does provide a way to ingest data from many servers, applications etc., figuring out the way to ingest it in an automated fashion without having to sit down and find the pattern yourself (which can be obvious or absolutely not). That’s not something that you will easily achieve yourself, and the main reasons why Splunk is used are to ingest machine data and to serve as a SIEM, not as a data science toolbox.
You said that if you want a specific feature, you will have to wait for Splunk to implement it. Once again this is not true. You can enrich the language yourself or develop your own add-on.
2) It is expensive for several reasons. You can argue that you can use ELK to perform the same tasks, but ELK does not provide the add-ons that Splunk does, and does not perform as well as Splunk when it comes to clustering and data consistency / availability. Plus, who cares if a company pays for a product if it fits the way to go forward with log management? People tend to forget that developing features does cost a lot of money (how many man hours are necessary???). Furthermore, I have seen many big data projects (let’s say that you want to ingest all this data in Hadoop) fail because the complexity and the timeline to create something are just crazy. Even Big Data companies like Big Data Partnership have spent thousands of hours to build something based on Hadoop whose results were just, hmm, useless. I have also seen many “Data Scientists” claiming to provide something awesome when the same thing could have been achieved with less complexity and no fancy maths and algorithms.
3) I do agree with this point, but Splunk will sooner or later go into data science. The first step is Machine Learning. Since they can ingest a lot of data, the natural direction for Splunk will be to provide the platform and the ingested data, with integration with open source data science products like they did for MLTK.
4) I honestly do not care. People learning C# or VB.net do not think about the fact that the languages are proprietary. The only thing that matters here is how big the community using this language is. Splunk now has many “developers” using SPL. It’s in fact not that intuitive when you are used to SQL, but that’s just a matter of habit. I do use SPL every single day and the learning curve is steep but not that bad.
5) Getting data into Splunk is far easier than trying to ingest the data, find the way to cut it etc. in any other available tool on the market. Python and R will not help you ingest the data; they will only help you cut and parse the data.
6) It depends on what you are talking about, but it does provide a lot of things when it comes to data ingestion and data analysis. You can say whatever you want. If you would like to compare how fast someone can ingest data then create a dashboard vs you using Python and R, you will lose for sure.

Conclusion:
My experience with Splunk vs Data Science is that machine data is going to Splunk while business data is retrieved for platforms managed by the Big Data engineers and Data Scientists. There is however something that I hate about Data Scientists. They think that they are the kings of this world, that their way is the best way; they think open source while I think about delivery and efficiency. And, of course, they absolutely do not want to share their knowledge, or they do not want to provide interfaces to other products or even developers to see their data. While Splunk provides a platform for anyone to consume the data, the data scientists will never share this data with someone else for consumption (how would they anyway?).

PS: I’m a Data Analyst using SQL, Python, some R snippets and Splunk on a day to day basis. I’m absolutely not a data scientist who would rather use other tools like the ones you named.

October 9, 2019
David Noah Guarneri

Good info.

I think it comes down to use case. If you’re having to analyze in real time a large amount of data from a large number of disparate sources located in different places geographically, then Splunk would be your solution. You do need at least one or two Splunk admins to onboard the data and support the system, so it’s not cheap.

If you’re just importing a spreadsheet or two with thousands of records, then it’s overkill.

January 2, 2020
Anonymous

Keep in mind that the large Splunk app community is largely open source. Anyone can develop an app or an add-on for Splunk, so data scientists are not at the mercy of Splunk to develop new features–they can develop them themselves, for free.

SPL is, in my view, quite intuitive and easy to learn. I didn’t have a background in coding at all (and still don’t), but I took a class on Udemy and found it to be quite easy to get Splunking.

Splunk has a free tier.

If anyone cares, this is the class I took on Udemy. (no, I don’t work for Udemy, or Splunk.)

https://www.udemy.com/course/splunker/

January 22, 2020
Rich Brandt

Splunk consists of a simple back-end with an array of half-baked front-end apps. It’s great as a Swiss Army knife, albeit overly expensive for that use case. Much better options exist at the enterprise level.

March 2, 2021