
Category: Data Science

The End of Privacy As We Know It

In the news on Friday I saw a series of articles about the Senate's vote to reject a recent change in communication rules that would have prohibited ISPs from selling your browsing histories.  I understand why ISPs would want to monetize this data; after all, it would be extremely valuable to online advertisers seeking to serve ads more accurately.  But I think it should give us pause to ask whether this is, in fact, ethical.

While there really is no one-to-one comparison, the closest analogues would be either the telephone company selling your call records, or the post office (or other courier services such as UPS) aggregating and selling the information on the outside of your mail.  I would strongly suspect that most people, if asked, would certainly not want their communication records sold to the highest bidder, and yet that is precisely what Congress is allowing.

What Does This Mean for Privacy?

If ISPs are allowed to sell your browsing histories, I don't believe that it is overstating things to say that this represents the end of privacy on the internet.  We didn't have much privacy on the internet these days anyway, but if ISPs are allowed to sell browsing records, it's pretty much over.

With that said, it is difficult to discern exactly what is going to be allowed under the new rule change, but if I'm reading the news articles correctly, it will allow ISPs to sell metadata records of your web browsing.  To a competent analyst, this data would be a virtual gold mine for targeted advertising and all sorts of other services, none of which are really beneficial to the individual.  As I've shown in my Strata talks about IoT data (here and here), if you gather enough seemingly innocuous data about an individual, it is entirely possible to put together a very accurate picture of their life.  From my own experience, if you were to look at my browsing history for a few months, you could very easily determine things like when my bills are due, what companies I do business with, when I go to work and to bed, what chat services I use, things I may be interested in buying, what places I'm interested in visiting, and so on.  The bottom line is that I consider my web browsing to be personal.  I don't want to share it with anyone, not because I have something to hide, but because I want the choice.  I see no benefit whatsoever to the consumer in this rule change.
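To make this concrete, here is a minimal sketch of how even bare timestamp-and-domain metadata reveals routines.  The records and the `mybank.com` domain are entirely made up for illustration; this is the kind of data an ISP could collect even for encrypted traffic.

```python
from collections import Counter
from datetime import datetime

# Hypothetical browsing-metadata records: (timestamp, domain) pairs.
records = [
    ("2017-03-01 23:45", "netflix.com"),
    ("2017-03-01 08:10", "mybank.com"),
    ("2017-03-15 08:05", "mybank.com"),
    ("2017-04-01 08:12", "mybank.com"),
    ("2017-03-02 23:50", "netflix.com"),
]

# On which days of the month does this user visit their bank?
# Regular visits near the 1st and 15th suggest bill-paying dates.
bank_days = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M").day
    for ts, domain in records
    if domain == "mybank.com"
)
print(bank_days.most_common())   # [(1, 2), (15, 1)] – bills likely due near the 1st
```

A few dozen lines of this kind of aggregation, run over months of real records, is all the "competent analyst" above would need.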

What can you do to protect your privacy?

Unfortunately, there really aren't a lot of options.  From a technical perspective, there are several options–none great–to preserve your privacy.  It is not possible to keep ISPs from getting your data, but you can make that data far less useful with tools like TOR and VPNs.

  • Virtual Private Network (VPN):  VPNs have traditionally been used by corporations to allow remote access into private networks over the public internet.  A VPN creates a secure tunnel between your computer and a proxy server; your web traffic then passes through that server, which can be anywhere in the world.  For those of you who don't work for large corporations, there are free and paid VPNs that you can use to access the web; however, I would avoid any free VPN service, as they are likely making money by, you guessed it, collecting your web traffic and analyzing it.  VPNs may seem like an ideal countermeasure, but there are issues with them as well.  For starters, you are adding bottlenecks and complexity, and hence losing speed.  Secondly, many sites–particularly sites with geographically based licensing, such as Netflix–block traffic from VPNs.  VPNs don't make you anonymous, but they can make your data much more difficult to collect.
  • TOR:  TOR stands for The Onion Router (https://en.wikipedia.org/wiki/Tor_(anonymity_network)).  It is similar to a VPN, but instead of using one proxy server, TOR routes your traffic through a series of encrypted relays, making it much more difficult to trace.  TOR has been used in many countries to successfully evade internet censorship.  It has the added benefit of allowing anonymous browsing; however, it does introduce additional complexity into your browsing.  There is also a speed penalty for using TOR, and you will find that you cannot access certain services through it.

Depending on how protective of your privacy you are, this may or may not matter, but it is important to understand that when using these technologies, guaranteeing your privacy depends on properly configuring them.  One small misconfiguration can expose your personal data.

I should also mention here that the so-called privacy modes that most browsers include do absolutely nothing to protect your privacy over the network.  Privacy mode erases your browsing history and cookies on your local machine, but you are still vulnerable to snooping over the network.

What else can I do?

This rule change represents a complete failure of government to do the one thing it is really supposed to do: protect the rights of its citizens.  It's sad that the whole world was up in arms in response to Snowden's revelations, and yet the silence is deafening in response to unlimited, widespread corporate surveillance.  Indeed, you have to read the hacker blogs (and my site) to find any kind of discussion of this issue.  This story got virtually zero coverage in the news media.

What is a real shame is that this appears to have become a partisan issue in that the vote in the Senate was a strict party-line vote.  It is entirely possible that the new Congress voted to repeal these rules simply because they were put in place by the previous administration.

At this point, the government is not looking out for its citizens' interests in this regard, and therefore it falls upon individual citizens to take action to preserve our privacy.  In addition to the technical measures listed above, here are some suggestions for what you can do:

  1. Contact your Congressional Representative(s) and Senator(s):  The Congressional switchboard number is 202-224-3121.  Always be courteous, professional and polite when speaking with Congressional Staff.   Be sure to convey why you are calling.  While it is unlikely that you will speak directly to your Senator or Congressman, their Staff have enormous influence and you should be respectful to them.  Make it clear that you do not welcome corporate surveillance.
  2. Educate Others:  I suspect that the reason this received so little attention is that the average person doesn’t really understand security, privacy and the consequences of this kind of data collection.  Therefore, it is incumbent upon those of us who work in data analytics and security to explain the implications of these policies in an understandable manner to non-technical people.

I would strongly urge everyone to do what they can to protest this rule change.  If we do nothing, we might wake up one day and find that our online privacy has ceased to exist.

 


Doing More with IP Addresses

IP addresses can be one of the most useful data artifacts in any analysis, but over the years I've seen a lot of people miss out on key attributes of IP addresses that can facilitate analysis.

What is an IP Address?

First of all, an IP address is a numerical label assigned to a network interface that uses the Internet Protocol for communications.  Typically, they are written in dotted-decimal notation, like this: 128.26.45.188.  There are two versions of IP addresses in use today: IPv4 and IPv6.  The address shown above is a v4 address, and I'm going to write the rest of this article about v4 addresses, but virtually everything applies to v6 addresses as well.  The difference between v4 and v6 isn't just the formatting.  IP addresses have to be unique within a given network, and the reason v6 came into being was that we were rapidly running out of IP addresses!  In networking protocols, IPv4 addresses are 32-bit unsigned integers, with a maximum value of approximately 4.3 billion.  IPv6 increased the address size from 32 bits to 128 bits, resulting in 2^128 possible IP addresses.
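As a quick illustration, Python's standard ipaddress module converts between the dotted-decimal display format and the underlying integer, reusing the 128.26.45.188 example above:

```python
import ipaddress

# Dotted-decimal notation is just a display format for a 32-bit integer.
ip = ipaddress.ip_address("128.26.45.188")
as_int = int(ip)   # 128*2**24 + 26*2**16 + 45*2**8 + 188
print(as_int)      # 2149199292

# The conversion is reversible...
assert ipaddress.ip_address(as_int) == ip

# ...and the same module handles 128-bit IPv6 addresses transparently.
print(int(ipaddress.ip_address("::1")))   # 1
```

Having the integer form is what makes things like range checks and subnet math cheap, which is where the "key attributes" mentioned above come in.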

What do you do with IP Addresses?

If you are doing cyber security analysis, you will likely be looking at log files, or perhaps entries in a database, containing IP addresses in dotted-decimal notation.  It is very common to count which IPs are querying a given server, what those hosts are doing, and so on.
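A minimal sketch of that kind of counting, using a couple of made-up Apache-style log lines in which the client IP is the first field:

```python
from collections import Counter

# Hypothetical web-server log lines; the client IP is the first
# whitespace-separated field, as in the common Apache access-log format.
log_lines = [
    '10.0.0.5 - - [12/Jan/2017:10:15:01] "GET /index.html HTTP/1.1" 200',
    '10.0.0.9 - - [12/Jan/2017:10:15:03] "GET /login HTTP/1.1" 200',
    '10.0.0.5 - - [12/Jan/2017:10:15:07] "POST /login HTTP/1.1" 302',
]

# Tally requests per client IP.
ip_counts = Counter(line.split()[0] for line in log_lines)
print(ip_counts.most_common())   # [('10.0.0.5', 2), ('10.0.0.9', 1)]
```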


Thoughts and Goals for the Upcoming Year

It's been an interesting year, both career-wise and generally.  This last year, I've had the amazing opportunity to speak at numerous conferences around the world, as well as to teach classes in data science and Apache Drill.  I've also learned a lot about the internals of Drill and even contributed to the codebase.  With that said, one can never rest on one's laurels, and as such I have a lot in store for the coming year.


My Best Days at Work

December is always a quiet month at my job, and as such I've had a few days of quiet to geek out and work on a few projects that have been on the proverbial back burner.  I've had a lot of great days at work, but my particular favorites are the days when I learn something new that knocks my socks off.  I had one of those days last week during a geek-out day, and I wanted to share what I learned.

I do a decent amount of coding and I tend to use Python for data manipulation and preparation.  I also do a lot of teaching and my personal preference is also to use Python for teaching because I find the syntax to be very easy for non-coders to grasp. I’m also a big fan of all the various libraries that have been written for Python which enable data scientists to focus on what they are trying to do without having to worry about how to do it.


A Data Scientist’s Perspective on the Election and What Went Wrong

I originally drafted a version of this article in August, but decided not to post it because I didn't want my blog to become political commentary.  However, given the shocking election results and the epic failure of the political polling and predictive analytics industries, I couldn't resist sharing my thoughts on the matter.  As a data scientist, I have been watching this election with a lot of anticipation and curiosity.  Back in August, my original draft was entitled “What happens to Data Science if Trump wins?” and in it, I wrote some thoughts about the impact a Trump victory would have on the data science world.  The main premise was that a Trump victory would be disruptive to how political campaigns are run and, most importantly, would call into question the analytics used to measure political campaigns.  I also thought that the value of the super-creepy targeted advertising that Facebook and other social media sites use might get called into question.  But more on that later.  Lastly, I'm attempting to write this article without infusing my own political opinions into the central arguments.  If I am successful, the reader will have no idea what my political views are.

Ultimately, there are two questions which need answers:

  1. How is it that nearly every reputable news source and polling agency incorrectly predicted the election results?
  2. How can data science be used to avoid repeated errors of this scale?

The first point really bears some fleshing out.  It wasn't just that everyone predicted a Clinton victory; it was that nearly every source–including the vaunted Nate Silver–predicted a massive Clinton victory.

For this discussion, I will presuppose–perhaps naively–that the pollsters and other political analytics professionals are not themselves biased, that they are in fact trying to give as accurate a prediction as possible, and that they are not allowing their own opinions about the candidates to influence their analysis.  With that said, I hypothesize that this election was a perfect storm of polling biases, groupthink, and poor use of data, which in the end resulted in the massive failures that occurred on election day.


The Biggest Problem in Data Science and How to Fix It

Imagine you have some process in your organization's workflow that consumes 50%-90% of your staff's time and contributes no value to the end result.  If you work in the data science or data analytics fields, you don't have to imagine it, because I've just described what is, in my view, the biggest problem in advanced analytics today: the Extract/Transform/Load (ETL) process.  This range doesn't come from thin air: studies from a few years ago concluded that data scientists were spending between 50% and 90% of their time preparing their data for analysis (examples from Forbes, Data Science Central, and the New York Times).  Furthermore, according to Forbes, 76% of data scientists consider data preparation the least enjoyable part of their job.
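To give a tiny, hypothetical taste of why prep eats so much time: even a single date field often arrives in several formats and has to be normalized before any analysis can start.  A minimal sketch, with made-up records:

```python
from datetime import datetime

# Hypothetical raw records: the same "date" field arrives in three formats,
# plus the inevitable junk row.
raw = ["2017-01-05", "01/06/2017", "Jan 7, 2017", "not a date"]

FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y")

def normalize(value):
    """Try each known format; return an ISO date string, or None for junk."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            pass
    return None

clean = [d for d in (normalize(v) for v in raw) if d is not None]
print(clean)   # ['2017-01-05', '2017-01-06', '2017-01-07']
```

Multiply this by dozens of columns and sources, and the 50%-90% figure stops looking surprising.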

If you go to any trade show and walk the expo halls, you'll see the latest whiz-bang tools to “leverage big data analytics in the cloud,” and you'll be awed by some amazing visualization tools.  Many of these new products can do amazing things… once they have data, which brings us back to our original problem: in order to use the latest whiz-bang tool, you still have to invest considerable time in data prep.  The tools seem to skip that part and focus on the final 10% of the process.


Tips for Debugging Code without F-Bombs – Part 2

This post is a continuation of my previous tutorial about debugging code in which I discuss how preventing bugs is really the best way of debugging.  In this tutorial, we’re going to cover more debugging techniques and how to avoid bugs.

Types of Errors:

Ok, you're testing frequently and using good coding practices, but you've STILL got bugs.  What next?  Let's talk about what kind of error you are encountering, because that will determine the response.  Errors can be reduced to three basic categories: syntax errors, runtime errors, and the most insidious of all, intent errors.  Let's look at syntax errors first.
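Here is a minimal illustration of all three categories in Python (the `average` function is a made-up example of an intent error):

```python
# 1. Syntax error: the interpreter rejects the code before it ever runs.
bad_syntax = "total = 1 +"            # incomplete expression
try:
    compile(bad_syntax, "<example>", "exec")
except SyntaxError as e:
    print("syntax error:", e.msg)

# 2. Runtime error: syntactically valid code that fails during execution.
try:
    result = 1 / 0
except ZeroDivisionError:
    print("runtime error: division by zero")

# 3. Intent (logic) error: runs cleanly but computes the wrong thing.
def average(values):
    return sum(values) / len(values) - 1   # stray "- 1": no crash, wrong answer

print(average([2, 4, 6]))   # prints 3.0, but the correct mean is 4.0
```

The first two announce themselves; the third is why you still need tests even when the code "works."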


Why Musicians Make Good Analysts

I recently read Taming the Big Data Tidal Wave by Bill Franks of Teradata, and in the book (which is going on my recommended reading list) he has a section about the ideal analyst.  While I am admittedly very biased on this one, Mr. Franks makes a very good point that in many instances the best analysts have a musical or other creative ability in addition to math and computer science skills.  In my experience, the best data scientists I've worked with have all had some creative side to them–be it music, art, or whatever.  Thus, here is my case for why playing an instrument is perhaps some of the best preparation for thinking like an analyst.

Musicians are trained in ETL

This may seem out of place, but consider what happens when a musician receives a piece of music to play for the first time.  Most musicians will read through the sheet music and either sing through it via solfège or otherwise work through it mentally.  Every musician has their own method, but in each case they transform the notes on the page into their own internal version of the music.


Drill UDF List

I've been working on developing some custom functions for Drill, known as User Defined Functions (UDFs), and I realized that there really should be a repository for them.  I've decided to create a page with links to all the Drill UDFs that I know of.  I'll keep it updated, so if you have Drill UDFs that you want to share, please email me a link and I'll put it up.
