Skip to content

Splunk and SQL, Together at last?

For the last few years, I’ve been involved with Splunk engineering. I found this to be somewhat ironic since I’ve haven’t used Splunk as a user for a really long time. I was never a fan of the Splunk query language (SPL) for a variety of reasons, the main one being that I didn’t want to spend the time to learn a proprietary language that is about as elegant as a 1974 Ford Pinto. I had worked on a few projects over the years that involved doing machine learning on data in Splunk. which presented a major challenge. While Splunk does have a machine learning toolkit (MLTK), which is basically a wrapper for scikit-learn, doing feature engineering in SPL is a nightmare.

So, how DO you do machine learning in Splunk?

You don’t.

No,Really… How do you do machine learning in Splunk?

Ok… People actually do machine learning in Splunk, but IMHO, this is not the best way to do it, for several reasons. Most of the ML work I’ve seen done in Splunk involves getting the data out of Splunk. With that being the case, the first thing I would recommend to anyone attempting to do ML in Splunk is to take a look at huntlib (https://github.com/target/huntlib) which is a python module to facilitate getting data out of Splunk. Huntlib makes this relatively easy and you can get your data from Splunk right into a Pandas DataFrame. But, you still have to know SPL or you do a basic SPL search and do all your data wrangling in Python. Could there be a better way?

The SQL/Splunk Connector

One of the projects I’ve been working on is a connector for Apache Drill to Splunk which will allow a user to query Splunk using ANSI SQL. This approach has several advantages:

  • SQL: Let’s face it. SQL is probably one of the best languages for expressing complex queries. It’s been widespread use since the 1980s and is implemented in countless systems. From a skill perspective, it is a much better investment to learn SQL than to learn a proprietary query language that could be obsolete in a few years. Bottom line, SQL is also extremely expressive and efficient for defining tables. My thought is that if you could query Splunk with SQL, it means that data scientists who don’t know SQL, can now access data in Splunk.
  • Query Optimization: When you start with Splunk, one of the first thing you will learn is that Splunk does not optimize queries. This means that this task falls on you as the query author. The problem here is that this requires you to have an understanding of how Splunk actually works. This is in contrast to most SQL databases which do this optimization for you. Furthermore, many SQL engines can make recommendations as to how to improve your queries. Therefore, the Drill/SQL connector will perform these optimizations for you so you don’t have to learn how to optimize Splunk queries.
  • External Data Sets: One of the big issues with Splunk is what to do when some of your data is in Splunk and some isn’t. Now, the Splunk folks will say, “simple.. put it into Splunk. Problem solved.” But in the enterprise, it’s not so simple. In an enterprise, a user will most likely never have permission to do this and a user will have to involve an engineering team to get external data into Splunk. Since Splunk charges by data volume ingested, an enterprise cannot have users putting random, unvetted data into a production system as it affects budget. This also takes time. My observation is that it will usually take an enterprise weeks to get a data source into Splunk. With Drill and SQL, you can easily join external data sets with the data in Splunk with a simple SQL JOIN statement.

What does all this have to do with machine learning?

The first phase of any ML project is gathering your data and extracting features. Once that’s done, you can explore the data. Splunk is pretty good at data exploration, but IMHO, extracting features using Splunk is a nightmare. There’s a lot of inefficiency as well but we’ll save that for another time. Anyway, when you combine Splunk/Drill and John Omernik’s awesome Jupyter Notebook integration for Drill (which I will discuss in a future blog post) you can simply:

  • Type a SQL Query into a Jupyter notebook cell
  • The integration will query Drill and pipe your data directly into a DataFrame!

Done! You’re off to the races!

How Does the Drill/Splunk Connector Work?

The complete documentation is available here in the pull request, but it is fairly straightforward. You will need to configure your Drill installation to connect to Splunk. Splunk uses port 8089 for programmatic access, so this must be open. Anyway, to configure Drill, simply navigate to the Storage plugin page, click on add new plugin, name it splunk and paste in the following configuration:

{
   "type":"splunk",
   "username": "admin",
   "password": "changeme",
   "hostname": "localhost",
   "port": 8089,
   "earliestTime": "-14d",
   "latestTime": "now",
   "enabled": false
}

Once you click submit, you should now be able to query Splunk via Drill.

The Drill/Splunk Data Model

Drill treats Splunk indexes as tables. Splunk’s access model does not seem to restrict access to the catalog, but does restrict access to the actual data. It is therefore possible that you can see the names of indexes to which you do not have access. You can view the list of available indexes with a SHOW TABLES IN splunk query.

apache drill> SHOW TABLES IN splunk;
+--------------+----------------+
| TABLE_SCHEMA |   TABLE_NAME   |
+--------------+----------------+
| splunk       | summary        |
| splunk       | splunklogger   |
| splunk       | _thefishbucket |
| splunk       | _audit         |
| splunk       | _internal      |
| splunk       | _introspection |
| splunk       | main           |
| splunk       | history        |
| splunk       | _telemetry     |
+--------------+----------------+
9 rows selected (0.304 seconds)

To query Splunk from Drill, simply use the following format:

SELECT <fields>
FROM splunk.<index>

Drill will execute the query and return the results in a nice clean table!

Bounding Your Queries

When you learn to query Splunk via their interface, the first thing you learn is to bound your queries so that they are looking at the shortest time span possible. When using Drill to query Splunk, it is advisable to do the same thing, and Drill offers two ways to accomplish this: via the configuration and at query time.

Bounding your Queries at Query Time

The easiest way to bound your query is to do so at querytime via special filters in the WHERE clause. There are two special fields, earliestTime and latestTime which can be set to bound the query. If they are not set, the query will be bounded to the defaults set in the configuration.

You can use any of the time formats specified in the Splunk documentation here:
https://docs.splunk.com/Documentation/Splunk/8.0.3/SearchReference/SearchTimeModifiers

So if you wanted to see your data for the last 15 minutes, you could execute the following query:

SELECT <fields>
FROM splunk.<index>
WHERE earliestTime='-15m' AND latestTime='now'

The variables set in a query override the defaults from the configuration.

Sending Arbitrary SPL to Splunk

You may have a Splunk query that doesn’t map to this model, and for those situations, there is a special table called spl which you can use to send arbitrary SPL queries to Splunk. If you use this table, you must include a query in the spl filter as shown below:

SELECT *
FROM splunk.spl
WHERE spl='<your SPL query>'

This is very much a work in progress, so if you try this out, I would welcome feedback for how it worked for you. The documentation explains how to deal with nested data from Splunk and all that. Good luck!

Share the joy

4 Comments

  1. FWIW – Splunk does have a query optimizer. It has had one since 2016. I joined in 2014 and it was my first major achievement at Splunk to get an AST and basic optimizer created (predicate pushdown, projection elimination etc). That optimizer has been improved since then.

    • Charles Givre Charles Givre

      I figured it had something but the issue that I see is that SPL still puts a heavy burden on the user in terms of understanding the dataflow. For instance, let’s say that you want to do a simple query like querying an index, selecting some fields, limiting the results and mapping to a table. The order in which you place these commands in the SPL, even for something that simple, will affect the query performance. Whereas with SQL, you this doesn’t matter. Now, there are things with SQL, such as querying on indexed fields, JOIN order etc. where you DO need to know about the underlying infrastructure, but IMHO, the burden is a lot less.

Leave a Reply

Your email address will not be published. Required fields are marked *