
The Dataist Posts

Doing More with IP Addresses

IP addresses can be one of the most useful data artifacts in any analysis, but over the years I’ve seen a lot of people overlook key attributes of IP addresses that make analysis easier.

What is an IP Address?

First of all, an IP address is a numerical label assigned to a network interface that uses the Internet Protocol for communications.  IPv4 addresses are typically written in dotted decimal notation like this: 128.26.45.188.  There are two versions of IP addresses in use today, IPv4 and IPv6.  The address shown above is a v4 address, and I’m going to write the rest of this article about v4 addresses, but virtually everything applies to v6 addresses as well.  The difference between v4 and v6 isn’t just the formatting.  IP addresses have to be unique within a given network, and the reason v6 came into being was that we were rapidly running out of IP addresses!  In networking protocols, IPv4 addresses are 32-bit unsigned integers, which allows for 2^32, or approximately 4.3 billion, possible addresses.  IPv6 increased the size from 32 bits to 128 bits, resulting in 2^128 possible IP addresses.
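To make the "an IPv4 address is really a 32-bit integer" point concrete, here is a minimal sketch using Python's standard `ipaddress` module.  The helper names `ipv4_to_int` and `int_to_ipv4` are my own, chosen for illustration only.

```python
import ipaddress


def ipv4_to_int(dotted: str) -> int:
    """Convert a dotted decimal IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(dotted))


def int_to_ipv4(value: int) -> str:
    """Convert a 32-bit integer back to dotted decimal notation."""
    return str(ipaddress.IPv4Address(value))


# Each octet is one byte of the integer, most significant byte first.
as_int = ipv4_to_int("128.26.45.188")
by_hand = (128 << 24) + (26 << 16) + (45 << 8) + 188

# The largest 32-bit unsigned value corresponds to 255.255.255.255.
MAX_IPV4 = 2**32 - 1

# Size of the IPv6 address space, for comparison.
IPV6_SPACE = 2**128

print(as_int == by_hand)
print(int_to_ipv4(MAX_IPV4))
```

Storing addresses as integers makes sorting and range comparisons straightforward, which is much harder to do correctly on the dotted decimal strings.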

What do you do with IP Addresses?

If you are doing cyber security analysis, you will likely be looking at log files or perhaps entries in a database containing the IP address in the dotted decimal notation.  It is very common to count which IPs are querying a given server, and what these hosts are doing, etc.
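As a rough illustration of that kind of counting, the sketch below tallies the IPv4 addresses that appear in a handful of log lines.  The log format and the sample lines are made up for the example, and the regular expression is deliberately simple: it matches anything shaped like a dotted decimal address without checking that each octet is 255 or less.

```python
import re
from collections import Counter

# Matches four dot-separated groups of 1-3 digits. This is a loose pattern
# and does not validate that each octet falls in the range 0-255.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def count_ips(lines):
    """Count how often each IPv4-looking string appears across the lines."""
    counts = Counter()
    for line in lines:
        counts.update(IPV4_PATTERN.findall(line))
    return counts


# Hypothetical log lines, invented for this example.
sample_log = [
    '128.26.45.188 - - "GET /index.html" 200',
    '10.0.0.7 - - "GET /login" 401',
    '128.26.45.188 - - "POST /login" 200',
]

counts = count_ips(sample_log)
top = counts.most_common(1)
print(top)
```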


Thoughts and Goals for the Upcoming Year

It’s been an interesting year both career-wise and generally.  This last year, I’ve had the amazing opportunity to speak at numerous conferences and teach classes in data science and Apache Drill all over the world.  I’ve also learned a lot about the internals of Drill and even contributed to the codebase.  With that said, one can never rest on one’s laurels, and as such I have a lot in store for the coming year.


Off Topic: Why I simultaneously love and hate Apple

I confess, I’m an Apple user.  My first computer when I was a wee lad was a Wozniak edition Apple IIGS, and then a Mac Plus.  I went over to PCs for a while but returned to the Mac years ago and love every minute of it.  I love the attention to detail that Apple puts into their products and the emphasis on good design.  I’m not an Apple snob or anything, but I do think it is a shame that Apple hasn’t made better penetration into large enterprises, and I admit that I do wince a little every time I go into a client site and see row after row of Windows PCs… but I digress.

Why I love Apple…

In the last update of iTunes for Mac, Apple did something that truly knocked my socks off:  Apple finally figured out how to handle classical music.

Let me explain.  If you are cataloging pop music, you will want to store the performer and the song title.  The album info may or may not be relevant, but the key fields are the performer and song title.  In classical music, these fields are significantly less important because the title of a piece might be “Symphony #5” or something like that.  What you really care about is the combination of composer AND title.  (For example, Tchaikovsky’s violin concerto is a different work than Mendelssohn’s violin concerto.)  But wait… there’s more.


My Best Days at Work

December is always a quiet month at my job, and as such I’ve had a few quiet days to geek out and work on a few projects that have been on the proverbial back burner.  I’ve had a lot of great days at work, but my particular favorites are the days when I learn something new that knocks my socks off.  I had one of these days last week during a geek-out day, and I wanted to share what I learned.

I do a decent amount of coding and I tend to use Python for data manipulation and preparation.  I also do a lot of teaching and my personal preference is also to use Python for teaching because I find the syntax to be very easy for non-coders to grasp. I’m also a big fan of all the various libraries that have been written for Python which enable data scientists to focus on what they are trying to do without having to worry about how to do it.


A Data Scientist’s Perspective on the Election and What Went Wrong

I originally drafted a version of this article in August, but decided not to post it because I didn’t want my blog to be political commentary.  However, given the shocking election results and the epic failure of the political polling/predictive analytics industries I couldn’t resist sharing my thoughts on the matter.  As a data scientist, I have been watching this election with a lot of anticipation and curiosity.  Back in August, my original draft was entitled “What happens to Data Science if Trump wins?”  and in it, I wrote some thoughts about what the impact would be to the data science world if Trump won.  The main premise was that a Trump victory would be disruptive in how political campaigns are run and most importantly, how the analytics used to measure political campaigns would be called into question.  I also thought that the value of the super-creepy targeted advertising that Facebook and other social media sites are using might get called into question.  But more on that later…  Lastly, I’m attempting to write this article without infusing my own political opinions into the central arguments.  If I am successful, the reader will have no idea what my political views are.

Ultimately, there are two questions which need answers:

  1. How is it that nearly every reputable news source and polling agency incorrectly predicted the election results?
  2. How can data science be used to avoid repeated errors of this scale?

The first point really bears some fleshing out.  It wasn’t just that everyone predicted a Clinton victory; it was that nearly every source, including the vaunted Nate Silver, predicted a massive Clinton victory.

For this discussion, I will presuppose (perhaps naively) that the pollsters and other political analytics professionals are not themselves biased, and are in fact trying to give as accurate a prediction as possible without allowing their own opinions about the candidates to influence their analysis.  With that said, I hypothesize that this election was a perfect storm of polling biases, groupthink, and poor use of data that in the end resulted in the massive failures that occurred on election day.


The Biggest Problem in Data Science and How to Fix It

Imagine you have some process in your organization’s workflow that consumes 50%-90% of your staff’s time and contributes no value to the end result.  If you work in the data science or data analytics fields you don’t have to imagine that, because I’ve just described what is, in my view, the biggest problem in advanced analytics today: the Extract/Transform/Load (ETL) process.  This range doesn’t come from thin air.  Studies from a few years ago from various sources concluded that data scientists were spending between 50% and 90% of their time preparing their data for analysis.  (See, for example, Forbes, DatascienceCentral, and the New York Times.)  Furthermore, 76% of data scientists consider data preparation the least enjoyable part of their job, according to Forbes.

If you go to any trade show and walk the expo halls, you’ll see the latest whiz-bang tools to “leverage big data analytics in the cloud”, and you’ll be awed by some amazing visualization tools.  Many of these new products can do amazing things… once they have data, which brings us back to our original problem…that in order to use the latest whiz-bang tool you still have to invest considerable amounts of time in data prep.  The tools seem to skip that part, and focus on the final 10% of the process.


Tips for Debugging Code without F-Bombs – Part 2

This post is a continuation of my previous tutorial about debugging code in which I discuss how preventing bugs is really the best way of debugging.  In this tutorial, we’re going to cover more debugging techniques and how to avoid bugs.

Types of Errors:

Ok, you’re testing frequently and using good coding practices, but you’ve STILL got bugs.  What next?  Let’s talk about what kind of error you are encountering, because that will determine the response.  Errors can be reduced to three basic categories: syntax errors, runtime errors, and, most insidious of all, intent errors.  Let’s look at syntax errors first.
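To make the three categories concrete, here is a small Python sketch.  The functions `mean` and `buggy_mean` are invented for this illustration.  The syntax error is triggered through `compile()` so that the rest of the example can still run.

```python
# 1. Syntax error: the code cannot even be parsed.
#    compile() is used here so the error can be caught and inspected.
try:
    compile("print('hello'", "<example>", "exec")
    syntax_error = None
except SyntaxError as exc:
    syntax_error = exc


# 2. Runtime error: the code parses, but fails while it is running.
def mean(values):
    return sum(values) / len(values)


try:
    mean([])  # dividing by len([]) == 0 raises ZeroDivisionError
    runtime_error = None
except ZeroDivisionError as exc:
    runtime_error = exc


# 3. Intent error: the code runs without complaint, but the result is not
#    what the author meant. Here the divisor is off by one.
def buggy_mean(values):
    return sum(values) / (len(values) - 1)  # intended: len(values)


print(type(syntax_error).__name__)
print(type(runtime_error).__name__)
print(mean([2, 4, 6]), buggy_mean([2, 4, 6]))
```

The first two kinds announce themselves with an exception.  The third does not, which is why only a test against a known answer will catch it.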


Why Musicians Make Good Analysts

I recently read Taming the Big Data Tidal Wave by Bill Franks of Teradata, and in the book (which is going on my recommended reading list) he has a section about the ideal analyst.  While I am admittedly very biased on this one, Mr. Franks makes a very good point that in many instances the best analysts have a musical or other creative ability in addition to math and computer science skills.  In my experience, the best data scientists that I’ve worked with have all had some creative side to them, be it music, art or whatever.  Thus, here is my case for why playing an instrument is perhaps some of the best preparation for thinking like an analyst.

Musicians are trained in ETL

This may seem out of place, but consider what happens when a musician receives a piece of music to play for the first time.  Most musicians will read through the sheet music and either sing through it via solfege or otherwise hear it in their head.  Every musician has their own method, but basically, they’ll transform the notes on the page into a mental version of the music.


Drill UDF List

I’ve been working on developing some custom functions, or User Defined Functions (UDFs), for Drill, and I realized that there really should be a repository for Drill UDFs.  I’ve decided to create a page with links to all the UDFs that I know of.  I’ll keep this updated, so if you have Drill UDFs that you want to share, please email me a link and I’ll put it up.
