
The Dataist Posts

Splunk and SQL, Together at Last?

For the last few years, I’ve been involved with Splunk engineering. I find this somewhat ironic, since I haven’t used Splunk as a user in a very long time. I was never a fan of the Splunk query language (SPL) for a variety of reasons, the main one being that I didn’t want to spend the time learning a proprietary language that is about as elegant as a 1974 Ford Pinto. Over the years I had worked on a few projects that involved doing machine learning on data in Splunk, which presented a major challenge. While Splunk does have a Machine Learning Toolkit (MLTK), which is basically a wrapper for scikit-learn, doing feature engineering in SPL is a nightmare.

So, how DO you do machine learning in Splunk?

You don’t.

No, Really… How do you do machine learning in Splunk?

Ok… People actually do machine learning in Splunk, but IMHO this is not the best way to do it, for several reasons. Most of the ML work I’ve seen done in Splunk starts with getting the data out of Splunk. With that being the case, the first thing I would recommend to anyone attempting to do ML on Splunk data is to take a look at huntlib (https://github.com/target/huntlib), a Python module that facilitates getting data out of Splunk. Huntlib makes this relatively easy: you can pull your data from Splunk right into a Pandas DataFrame. But you still have to know SPL, or else run a basic SPL search and do all your data wrangling in Python. Could there be a better way?
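If you do go this route, here is a minimal sketch of pulling search results straight into a DataFrame with huntlib; the hostname, credentials, and SPL query below are placeholders:

from huntlib.splunk import SplunkDF

# Connect to the Splunk server (host and credentials are placeholders)
s = SplunkDF(host="splunk.example.com", username="myuser", password="mypass")

# Run an SPL search and land the results directly in a Pandas DataFrame
df = s.search_df(
    spl="search index=main | head 1000",
    start_time="-2d@d",  # Splunk relative time: start of two days ago
    end_time="@d",       # start of today
)
print(df.head())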


How to Manage a Data Breach Incident

Facebook’s big data breach in 2018 sent millions of users into a flurry to reset their Facebook passwords, hoping and praying that their information hadn’t found its way into the wrong hands. Some 50 million Facebook accounts were compromised in an attack discovered three days before the company’s announcement, granting hackers access to personal data and landing the social media giant in hot water. Despite damage-control efforts, this large-scale hack called into question the overall safety of online accounts.

Big companies are big targets, so it’s no surprise that we only hear about high-profile data breaches. In reality, however, smaller businesses are the ones more at risk from a breach, as they ultimately have much more to lose. When it comes to detecting an attack, researchers from IBM estimate that it takes small businesses about 200 days to discover a breach, and by then the attack is well under way. Recovering from a breach can cost a business a great deal of money and reputational damage, especially when taking into account the time and manpower required to rebuild a company’s operational systems.

All in all, predicting and preventing a data breach is tricky. And while some argue that breaches are inevitable in this day and age, businesses should still arm themselves with the proper tools and information on not only how to prevent a data breach, but also what to do when defenses fail and an attack occurs.

But first, it’s important to know what you might be dealing with. While not every cybersecurity attack results in a breach, this article by Chief Executive outlines seven types of data breaches, ranging from hacking via malware and phishing to employee negligence and physical theft. Being aware of these attacks can help you prevent them, or at least know what to do in the off chance that your company falls victim.


5.5 Tips on Starting a Career in Data Science

People often ask me about starting a career in data science, or for advice on what tech skills they should acquire. When I get asked this question, I try to have a conversation with the person to understand their goals and aspirations, as there’s no advice I can give that is universal. That said, here are five pointers that are generally helpful for anyone starting a career in data science or data analytics.

Tip 1: Data Science Is a Big Field: You Can’t Know Everything About Everything

When you start getting into data science, the breadth of the field can be overwhelming. It seems that you have to be an expert in big data systems, relational databases, computer science, linear algebra, statistics, machine learning, data visualization, data engineering, SQL, Docker, Kubernetes, and much more. To say nothing of subject matter expertise. One of the big misunderstandings I see is the perception that you have to be an expert in all these areas to get your first job.


Explore REST APIs without Code!

One of the big challenges a data scientist faces is the amount of data that is not available in convenient formats. REST APIs are a prime example. In large enterprises, these are especially problematic for several reasons. Often a considerable amount of reference data is accessible only via REST API, which means users are required to learn enough Python or R to retrieve it. But what if you don’t want to code?
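To see the barrier in concrete terms, here is a minimal sketch (the endpoint URL and payload fields are hypothetical) of the kind of Python a user would otherwise need to write:

import requests
import pandas as pd

# Pull reference data from a (hypothetical) internal REST endpoint
resp = requests.get("https://internal.example.com/api/v1/assets", timeout=30)
resp.raise_for_status()

# Assuming the endpoint returns a JSON list of records
df = pd.DataFrame(resp.json())
print(df.head())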

The typical way I’ve seen this dealt with is to create a duplicate of the reference data in an analytic system such as Splunk or Elasticsearch. The problems with this approach are manifold. First, there is the engineering effort (which costs time and money) to set up the data flow. On top of that, you now have duplicate storage costs and are maintaining multiple versions of the same information.

Another approach I’ve seen is for each API owner to provide a graphical interface for their API, which is good, but then the data is stove-piped and can’t be joined with other data, which defeats the purpose altogether. There has to be a better way…

Take a look at the video tutorial below to see a demo!


On the Importance of Good Design

The image above is a late-1950s MGA convertible. I picked it because I happen to think this car is one of the most elegantly designed cars ever made. Certainly in the top 50. While we place a lot of emphasis on design when it comes to the physical objects we use, a lot of our software’s design looks more like the car below: a vehicle that looks like it was designed by a committee that couldn’t decide whether it was designing a door stop or a golf cart.

I’ve been doing a lot of thinking lately (for reasons that will become apparent in a few weeks) about the lack of good software engineering practices in the data science space. The more I think about it, the more I am shocked by the questionable design that exists in the data science ecosystem, particularly in Python. What’s even more surprising is that the Python language is actually designed to promote good design, and the people building these modules are all great developers.

As an example of this, I recently read an article about coding without if statements, and it made me cringe. Here is a quote:

When I teach beginners to program and present them with code challenges, one of my favorite follow-up challenges is: Now solve the same problem without using if-statements (or ternary operators, or switch statements).

You might ask why would that be helpful? Well, I think this challenge forces your brain to think differently and in some cases, the different solution might be better.

There is nothing wrong with using if-statements, but avoiding them can sometimes make the code a bit more readable to humans. This is definitely not a general rule as sometimes avoiding if-statements will make the code a lot less readable. You be the judge.

https://medium.com/edge-coders/coding-tip-try-to-code-without-if-statements-d06799eed231

Well, I am going to be the judge, and every example the author cited made the code more difficult to follow. The way I see it, hard-to-follow code means that you, as the developer, are more likely to introduce unintended bugs and errors. Since the code is difficult to follow, not only will you have more bugs, but those bugs will take more time to find. Even worse, imagine being asked to debug such code when you didn’t write it!! AAAAHHHH!! I know if that were me, I wouldn’t even bother. I’d probably just rewrite it. The only exception would be if it were heavily commented.
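To make the debate concrete, here is an illustration of my own (in the spirit of such rewrites, not an example taken from the article) of replacing a simple if/elif chain with dictionary dispatch; judge the readability for yourself:

# A plain if/elif version:
def describe_if(n):
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"

# The "no if-statements" version, using nested dict dispatch:
def describe_dispatch(n):
    return {True: "negative",
            False: {True: "zero", False: "positive"}[n == 0]}[n < 0]

assert describe_if(-5) == describe_dispatch(-5) == "negative"
assert describe_if(0) == describe_dispatch(0) == "zero"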


Pandemics, Birthday, and Life Before Snark

Today marks roughly the 45th day I’ve been stuck in the house, and my birthday happened to be last week, so I’ve been doing a lot of reading and reflecting. The last few weeks have been full of ups and downs. I’ve been doing a lot of puttering around the house and working on silly projects like replacing the headlight gaskets on my MGA, which also involved painting the headlight buckets, cutting off rusty screws, and redoing wiring, but I digress. Despite being home all the time, I’m finding it very difficult to get any meaningful work done.

[Before and after photos of the MGA headlight buckets]

On the upside, I’ll be doing some new online classes with O’Reilly starting around the end of May! The topics relate to coding practices and data visualization, so stay tuned!


Public Data Still Lacking on COVID-19 Outbreak

As you are reading this, you are probably (like me) under quarantine or shelter-in-place orders due to the COVID-19 outbreak. As a data scientist who has been stuck in the house since 10 March, I wanted to take a look at the data and see what I could figure out. I’m not an epidemiologist and claim no expertise in health care, but I do know data science, so please take what I am saying with a grain of salt.

Why is there no data?

My first observation is that very little data is actually being made publicly available. I am not sure why this is the case, but I spent a considerable amount of time digging through the websites and APIs of the WHO, the CDC, and other agencies, and found little usable data. For example, the World Health Organization (WHO) posts daily situation reports containing data, but the files are in PDF format. I attempted to extract the tables from the PDFs, but this proved extremely difficult because the formatting was not consistent. It would be trivial to post this data in CSV, HDF5, or some other format conducive to data analysis, but the WHO chose not to. I found generally the same situation at the other major health institutions, such as the CDC.
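For what it’s worth, here is a sketch of the kind of extraction attempt described; the library choice (tabula-py) and the filename are my assumptions, as no specific tool is named here:

import tabula

# read_pdf returns a list of DataFrames, one per table detected in the PDF
tables = tabula.read_pdf("who-sitrep-2020-03-10.pdf", pages="all")

# Inconsistent PDF formatting shows up as ragged, mismatched table shapes
for i, df in enumerate(tables):
    print(f"Table {i}: {df.shape}")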

Health related information in the United States is regulated by the Health Insurance Portability and Accountability Act (HIPAA), which imposes draconian fines and restrictions on private health information, so some of the secrecy may be due to this law.


Easy Analysis of HDF5 Data

There is a data format called HDF5 (Hierarchical Data Format) which is used extensively in scientific research. HDF5 is an interesting format in that it is like a file system within a file, and it is extremely performant. However, it can be quite difficult to actually get at the data encoded in an HDF5 file. But, as the title suggests, this post will walk you through how to easily access and query HDF5 datasets using my favorite tool: Apache Drill.
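To illustrate the difficulty, here is a minimal sketch (using the h5py module; the file and dataset names are hypothetical) of what manual HDF5 access looks like; you have to know the internal layout up front:

import h5py

with h5py.File("measurements.h5", "r") as f:
    # Walk the file-system-like hierarchy to discover what's inside
    f.visit(print)
    # Read one dataset into a numpy array; the path must be known in advance
    data = f["/experiment1/readings"][:]
    print(data.shape)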

As of version 1.18, Drill natively supports reading HDF5 files.


5 Things Data Science Bootcamps Should Teach

Let me start out by saying this is purely hypothetical, as I’ve never attended a data science bootcamp, but I have taught them and have reviewed MANY curricula. I’ve also mentored a decent number of bootcamp graduates. In general, what I see is that bootcamps tend to place a lot of emphasis on machine learning, but there’s a lot more to being a successful data scientist. Below are five areas which I believe would benefit any aspiring data scientist.

SQL

Let’s start with an easy one: SQL. Despite all the trashing it gets, SQL is an enormously useful language to know. For all the hype one hears about NoSQL and other non-relational datastores, SQL is still in widespread use and is not likely to go anywhere anytime soon. Let me tell you why you should learn SQL….
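As a taste, here is a minimal, self-contained sketch (my own toy example) of how little it takes to start practicing SQL with Python’s built-in sqlite3 module, no database server required:

import sqlite3

# An in-memory database: nothing to install or configure
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (host TEXT, severity INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("web01", 3), ("web02", 5), ("web01", 1)])

# Aggregate with plain SQL: event counts per host, busiest first
for row in conn.execute("SELECT host, COUNT(*) AS n FROM events "
                        "GROUP BY host ORDER BY n DESC"):
    print(row)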


When Categorical Data Goes Wrong

I ran into an issue whilst doing a machine learning project involving some categorical data and thought I would write a brief tutorial about what I learned. I was working on a model with a considerable amount of categorical data and ran into several issues, which can briefly be summarized as:

  • Categories that were present in the training set were not always present in the testing data
  • Categories that were present in the testing set were not always present in the training data
  • Categories from “real world” (i.e., neither testing nor training) data were not present in the training or testing data

Handling Categorical Data: A Brief Tutorial

In Python, one of the unfortunate things about the scikit-learn and pandas modules is that they don’t really deal with categorical data very well. In the last few years, the Pandas community has introduced a “categorical” datatype. Unfortunately, this datatype does not carry over to scikit-learn, so if you have categorical data, you still have to encode it. There are tons of tutorials on the interweb about how to do this, so in the interest of time, I’ll show you the main methods:

get_dummies in Pandas

The most conventional approach, and perhaps the easiest, is pandas’ get_dummies() function, which takes a given column or columns as input and returns a dummy column for each category value. (Full docs here.) Thus you can do the following:

df = pd.get_dummies(df)

Before:

   data
0     a
1     b
2     c
3     a
4     a
5     a
6     c
7     c
8     c

After:

   data_a  data_b  data_c
0       1       0       0
1       0       1       0
2       0       0       1
3       1       0       0
4       1       0       0
5       1       0       0
6       0       0       1
7       0       0       1
8       0       0       1

Which turns the first table into the second.

As you can see, each category is encoded into a separate column whose name is the original column name followed by an underscore and the category value. If a row is a member of that category, the column has a value of 1; otherwise it is 0, hence the name one-hot encoding.

In general this works, but the pandas method has the problem of not working as part of a Scikit-Learn pipeline. Fortunately, scikit-learn has a OneHotEncoder which you can use to do basically the same thing.

Personally, I find scikit’s OneHotEncoder a bit more difficult to use, so I never used it much; however, in my recent project I realized that I actually had to, for a reason I’ll get to in a bit.

Scikit Learn’s OneHotEncoder

Scikit-Learn has the OneHotEncoder() (docs here), which does more or less the same thing as the pandas version. It does have several limitations and quirks. The first is that the data types of your categories must be the same; i.e., if you have ints and strings mixed together, no go. Secondly, scikit’s encoder returns either a numpy array or a sparse matrix as a result. This was annoying for me because I wanted to see which categories were useful as features, and to do so you have to reconstruct a DataFrame, which is a headache. In general, the code follows scikit’s usual fit()/transform() pattern. Here is example code of how to use scikit’s one hot encoder:

from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' skips categories not seen during fit()
encoder = OneHotEncoder(handle_unknown='ignore')
category_columns = ['data']  # your categorical column names
encoded_data = encoder.fit_transform(df[category_columns])  # sparse matrix

There are two advantages I see in scikit’s method over pandas’. The first is that when you fit the scikit encoder, it “remembers” what categories it has seen, and you can set it to ignore unknown categories, whereas pandas has no recall and will just convert whatever columns it is given into dummy variables. The second is that you can include the OneHotEncoder in a pipeline, which seemed advantageous as well. However, these advantages did not outweigh the difficulty of getting the data back into a DataFrame with column labels. Also, I kept getting errors relating to datatypes and got really frustrated.
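For completeness, here is a sketch of that reconstruction step; note that the method name depends on your scikit-learn version (get_feature_names_out() on 1.0 and later, get_feature_names() on older releases):

import pandas as pd

encoded_df = pd.DataFrame(
    encoded_data.toarray(),  # densify the sparse matrix
    columns=encoder.get_feature_names_out(category_columns),
    index=df.index,
)
print(encoded_df.head())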

The original problem I was having was that I couldn’t guarantee that all categories would be present in both the training and testing sets, so the solution I came up with was to write a function that switched the category value to “OTHER” if the category was not one of the top few. But I didn’t like this approach because it required me to maintain a list of categories, and what happens if that list changes over time? Surely there’s a better way…
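For reference, here is a sketch of that workaround (the column name and cutoff are placeholders); it collapses everything outside the top N categories into “OTHER”:

def collapse_rare(series, top_n=10):
    # Find the top_n most frequent categories
    top = series.value_counts().nlargest(top_n).index
    # Keep those values; replace everything else with "OTHER"
    return series.where(series.isin(top), other="OTHER")

df['data'] = collapse_rare(df['data'], top_n=3)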

Feature-Engine: A Better Solution

So what if I told you there was a way to encode categorical data such that you could:

  • Handle missing categories in either testing, training or real world data
  • Export the data to a DataFrame for easy analysis of the newly created features
  • Automatically aggregate categories with few values into an “other” category

Well you can’t, so get over it. Ok, just kidding. I wouldn’t write a whole blog post just to have it end like that… or would I? As it turns out, I stumbled upon a really useful module called feature-engine, which contains some extremely useful tools for feature engineering that frankly should be included in Scikit-Learn. The module contains a collection of really useful stuff, but I’m just going to focus on the OneHotCategoricalEncoder. (Docs here)

Let’s say you wanted to encode the data above. Using OneHotCategoricalEncoder(), you could create an encoder object as shown below:

from feature_engine import categorical_encoders as ce
import pandas as pd

# set up the encoder: keep the 3 most frequent categories per variable
encoder = ce.OneHotCategoricalEncoder(
    top_categories=3,
    drop_last=False)

# fit the encoder to the data, then transform it
encoder.fit(df)
df_encoded = encoder.transform(df)

Now, once we have the encoder object, we can encode our data using the fit()/transform() or fit_transform() methods as shown above. Our toy data set only has 3 categories, but what if it had 300? Feature-engine provides an option in the constructor, top_categories, which has the effect of collapsing the categories into a more manageable number. For example, you could set top_categories to 10: that would get you columns for the 10 most frequently occurring categories, and all others would be collapsed into an “other” column. That’s a nice feature! Well done!

There’s more. In our previous example, we had three categories when we fit the data: ‘a’, ‘b’ and ‘c’. So what happens if a category appears in the data that did not appear in the training data? Good question, and one that is not explicitly addressed in the documentation. I tried this out, and if you have top_categories set, the encoder simply ignores the unknown categories. Whether this is good design is debatable, but it does mean the encoder will work much better in real-world applications.
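Here is a quick sketch of that behavior, reusing the encoder fit above (the unseen category ‘d’ is my own toy addition):

# 'd' never appeared when the encoder was fit; with top_categories set,
# it is simply ignored rather than raising an error
new_df = pd.DataFrame({'data': ['a', 'd']})
print(encoder.transform(new_df))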

Since the OneHotCategoricalEncoder follows scikit-learn’s fit()/fit_transform()/transform() conventions, it can be used in a Pipeline object. Finally, and perhaps most important to me, the OneHotCategoricalEncoder returns a pandas DataFrame rather than a numpy array or sparse matrix. This mattered to me because I wanted to see which categorical columns actually add value to the model and which do not; doing that from a numpy array without column references is exceedingly difficult.
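For instance, here is a minimal sketch (the classifier choice and the X_train/y_train variables are hypothetical) of dropping the encoder into a Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from feature_engine import categorical_encoders as ce

pipe = Pipeline([
    # Encode categoricals, keeping the 3 most frequent values per column
    ('encode', ce.OneHotCategoricalEncoder(top_categories=3, drop_last=False)),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)  # X_train and y_train are assumed to exist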

TL;DR

In conclusion, both scikit-learn’s and pandas’ traditional ways of encoding categorical variables have significant disadvantages, so if you have categorical data in your model, I would strongly recommend taking a look at feature-engine’s OneHotCategoricalEncoder.
