
Drilling Security Data

Last Friday, the Apache Drill project released Drill version 1.14, which has several significant features (plus a few that are really cool!) that enable you to use Drill for analyzing security data.  Drill 1.14 introduced:

  • A logRegex reader which enables Drill to read anything you can describe with a Regex
  • An image metadata reader, which enables you to query images
  • A suite of GIS functionality
  • A collection of phonetic and string distance functions which can be used for approximate string matching.  

This suite of functionality really expands what is possible with Drill and makes analysis of many different types of data possible.  This brief tutorial will walk you through configuring Apache Drill to query log files, or really any file that can be matched with a regex.

Setting up The Format Plugin

The logRegex format plugin can be used with any kind of file in which each observation is contained on one line and the fields can be broken up by a regex.  Unlike the other format plugins that exist for Drill, the logRegex reader allows you to define a schema for your data.  On the surface this violates Drill’s philosophy of schema-less querying, but in this case it is necessary in order to interpret the data.

In order to configure the plugin, start Drill, navigate to the Storage plugin page, and click on the storage plugin where your log files are, most likely dfs.  Once you’re there, scroll down to the formats section.   

This plugin has a lot of power and hence has many parameters which are listed below:

  • type: This tells Drill which format reader to use. In this case, it must be logRegex. This field is mandatory.
  • regex: This is the regular expression which defines how the log file lines will be split. You must enclose the parts of the regex that you wish to extract in grouping parentheses. Note that this plugin uses Java regular expressions and requires that shortcuts such as \d have an additional backslash, i.e. \\d. This field is mandatory.
  • extension: This option tells Drill which file extensions should be mapped to this configuration. Note that you can have multiple configurations of this plugin to allow you to query various log files. This field is mandatory.
  • maxErrors: Log files can be inconsistent and messy. The maxErrors variable allows you to set how many errors the reader will ignore before halting execution and throwing an error. Defaults to 10.
  • schema: The schema field is where you define the structure of the log file. This section is optional. If you do not define a schema, all fields will be assigned a column name of field_n where n is the index of the field. The undefined fields will be assigned a default data type of VARCHAR.

Of these fields, the two that are most important are the schema and the regex fields.   The regex field is where you define how your logs will be parsed.  In order for fields to be extracted, you must use grouping parentheses around the fields. 
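To see what the grouping parentheses do, here is a quick sketch outside of Drill using Python's re module. The pattern is the same one used in the MySQL config later in this post, with the doubled backslashes reduced to single ones since this is a raw Python string rather than JSON; the log line itself is a made-up MySQL-style example.

```python
import re

# Same pattern as the Drill config below, but with single backslashes,
# since this is a raw Python string rather than a JSON value.
pattern = re.compile(r"(\d{6})\s(\d{2}:\d{2}:\d{2})\s+(\d+)\s(\w+)\s+(.+)")

# A made-up MySQL-style log line for illustration.
line = "070823 21:00:32       1 Connect     root@localhost on test1"

match = pattern.match(line)

# Each grouping parenthesis becomes one extracted field.
print(match.groups())
# → ('070823', '21:00:32', '1', 'Connect', 'root@localhost on test1')
```

Anything outside the grouping parentheses (the whitespace separators here) is matched but discarded, which is exactly how the plugin decides which parts of each line become columns.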

The other crucial field is the schema field.  Unlike all other format plugins, this one allows you to explicitly define a schema for your data.  If you don’t, Drill will assume all the fields are VARCHARs and will assign them a name of field_n, where n is the index of the field.  An example schema is shown below.

The schema is an array of field objects which contain:

  • fieldName:  The name of the field
  • fieldType:  The data type of the field.  At the time of writing, Drill supports INT, FLOAT, DATE, TIME, and VARCHAR.  If you don’t specify a data type, Drill defaults to VARCHAR. 
  • format:  Mandatory when using a DATE or TIME data type, this is the format string which tells Drill how to parse the dates.  Uses Joda-Time format patterns. 
"schema": [
        {
          "fieldName": "eventDate",
          "fieldType": "DATE",
          "format": "yyMMdd"
        },
        {
          "fieldName": "eventTime",
          "fieldType": "TIME",
          "format": "HH:mm:ss"
        },
        {
          "fieldName": "PID",
          "fieldType": "INT"
        }
  ]

Thus, a complete entry for this format plugin to read MySQL logs would look something like this:

"log" : {
      "type" : "logRegex",
      "extension" : "log",
      "regex" : "(\\d{6})\\s(\\d{2}:\\d{2}:\\d{2})\\s+(\\d+)\\s(\\w+)\\s+(.+)",
      "maxErrors": 10,
      "schema": [
        {
          "fieldName": "eventDate",
          "fieldType": "DATE",
          "format": "yyMMdd"
        },
        {
          "fieldName": "eventTime",
          "fieldType": "TIME",
          "format": "HH:mm:ss"
        },
        {
          "fieldName": "PID",
          "fieldType": "INT"
        },
        {
          "fieldName": "action"
        },
        {
          "fieldName": "query"
        }
      ]
   }

Once you’ve done that, save your settings and you’re ready to query your log files!  This format plugin has two implicit columns: _raw and _unmatched_rows.  _raw returns the full text of the rows that match the regex, and _unmatched_rows returns the rows in your file that did not match.  This can be useful for debugging your regex.
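Conceptually, these two implicit columns split your file the way the sketch below does. This is a simplification in Python, not Drill's actual implementation, and the log lines are made up for illustration:

```python
import re

pattern = re.compile(r"(\d{6})\s(\d{2}:\d{2}:\d{2})\s+(\d+)\s(\w+)\s+(.+)")

# Made-up file contents: one well-formed line, one malformed one.
log_lines = [
    "070823 21:00:32       1 Connect     root@localhost on test1",
    "this line does not look like a MySQL log entry",
]

# Lines that match the regex are what _raw would return;
# lines that do not match are what _unmatched_rows would return.
raw = [line for line in log_lines if pattern.match(line)]
unmatched_rows = [line for line in log_lines if not pattern.match(line)]

print(len(raw), len(unmatched_rows))  # → 1 1
```

Querying _unmatched_rows after loading a new log source is a quick way to spot line formats your regex doesn't cover yet.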


If this is of interest to you, you can read about this and a lot more in my forthcoming book about Apache Drill, co-authored with Drill master developer Paul Rogers.  I believe that Drill has a lot of potential uses for security data.  I’m currently working on a PR to get Drill to natively read syslog data, which hopefully will be done for Drill 1.15.  Stay tuned!
