IP addresses can be one of the most useful data artifacts in any analysis, but over the years I’ve seen a lot of people miss out on key attributes of IP addresses to facilitate analysis.
What is an IP Address?
First of all, an IP address is a numerical label assigned to a network interface that uses the Internet Protocol for communications. Typically they are written in dotted decimal notation like this: 188.8.131.52. There are two versions of IP addresses in use today, IPv4, and IPv6. The address shown before is a v4 address, and I’m going to write the rest of this article about v4 addresses, but virtually everything applies to v6 addresses as well. The difference between v4 and v6 isn’t just the formatting. IP addresses have to be unique within a given network and the reason v6 came into being was that we were rapidly running out of IP addresses! In networking protocols, IPv4 addresses are 32bit unsigned integers with a maximum value of approximately 2 billion. IPv6 increased that from 32bit to 128 bits resulting in 2128 possible IP addresses.
What do you do with IP Addresses?
If you are doing cyber security analysis, you will likely be looking at log files or perhaps entries in a database containing the IP address in the dotted decimal notation. It is very common to count which IPs are querying a given server, and what these hosts are doing, etc.
I’m going to share an anecdote about a client that I worked with. The client had a system which stored IP addresses in a MySQL database and a rudimentary GUI to build queries. If you wanted to query a range of IP address or a CIDR Block, there was a form which generated a ridiculously long regular expression for the comparison, which leads me to the mistake that many people make with IP addresses: storing IP addresses as strings or VARCHARs.
You might be asking at this point, how else should I store an IP address? Well, remember that an IP address is a 32 or 128 bit unsigned integer and storing the IP as an integer can have many advantages which will facilitate analysis:
- Integers use less space: If you are using a database, the 32bit INT takes up 4 bytes whereas you will need 15bytes to store an IP address as a string. This may seem insignificant, but if you are storing billions of the things, it can make a big difference in the amount of disk space you need.
- Performance is better: It will take less time to index, scan or compare an integer index than a string index.
- Analysis is Easier: Here’s the real kicker… it’s much easier to compare, sort or otherwise analyze numbers than it is to analyze strings.
Just as some examples, if you are using MySQL and wanted to find all the IP addresses greater than 184.108.40.206, you could execute the following query:
SELECT * FROM <data> WHERE INET_ATON( ip_string_field ) > INET_ATON( '220.127.116.11' )
If the data was already an int, you could use the following query:
SELECT * FROM <data> WHERE ip_int_field > INET_ATON( '18.104.22.168' )
Likewise, if you had data containing IP addresses, and you wanted to sort them, if the addresses are strings, sorting them will get you ASCII order which is definitely not what you are looking for. Just as an example–this time in Apache Drill–I ran the following query:
SELECT connection_client_host FROM dfs.drillworkshop.`log_files/small-server-log.httpd` ORDER BY connection_client_host
which yielded the following result:
+-------------------------+ | connection_client_host | +-------------------------+ | 22.214.171.124 | | 126.96.36.199 | | 188.8.131.52 | | 184.108.40.206 | | 220.127.116.11 | +-------------------------+
WRONG!!! However, watch what happens if we convert the IPs to integers before sorting them.
0: jdbc:drill:zk=local> SELECT connection_client_host . . . . . . . . . . . > FROM dfs.drillworkshop.`log_files/small-server-log.httpd` . . . . . . . . . . . > ORDER BY INET_ATON( connection_client_host ); +-------------------------+ | connection_client_host | +-------------------------+ | 18.104.22.168 | | 22.214.171.124 | | 126.96.36.199 | | 188.8.131.52 | | 184.108.40.206 | +-------------------------+ 5 rows selected (0.248 seconds)
You can see in the above example that the IPs are correctly sorted.
Converting IP Addresses to Integers
Hopefully, you’re wondering at this point about how to convert IP addresses to integers and back to strings. Here’s the good news, most databases and programming languages have functions to easily accomplish this task for you. MySQL has the INET_ATON() and INET_NTOA() functions to perform this function for you. If you are an Apache Drill user you can use my networking functions (available here) to do this. PHP, Python and Java all have networking libraries which have this functionality built in and a quick google search can show you how to do it (or contact me). The good news also is that this operation can be accomplished with simple bit-shifting so that if whatever language you are using doesn’t have a built in function (like R!!), you can put on your developer’s hat and write one yourself.
Accessing Additional IP Metadata
Manipulating IP addresses is extremely useful, but IPs also contain a wealth of useful metadata which can be extracted if you know how to do it. For example, IP addresses can be associated with a particular geographic location. Although geo-locating IPs is inherently imprecise, it can give you at least an idea of where in the world the traffic originated. Most programming languages such as Python, Java, PHP, and even R have APIs to query the ubiquitous MaxMind geoip database which can give you a wealth of information about a given IP address. MaxMind has a paid and a free version of their service and both can be extremely useful for getting additional metadata about IP addresses.
Come to BlackHat and learn even more!
If this is of interest to you, this is one of many topics that will be covered at my upcoming data science class at BlackHat USA 2017! Register now for the early bird pricing.