I had the honor of doing a podcast with O’Reilly Media about the importance of training all levels of staff in Data Science. Here is the complete podcast:
A big interest of mine is how to impart what little I know of the tools and techniques of data science to others. When I was at Booz Allen, I taught numerous classes both for internal staff and for various clients. I’ve also taught for Metis, O’Reilly Publishing and, for the last three years, at BlackHat, so I do have some experience in the matter. I’ve looked at MANY data science programs to see if what they are teaching lines up with what I’m teaching, and I’d like to share some things I’ve noticed which will hopefully help you build a better data science program. My goal here is to share my mistakes and experiences over the years so that, if you are building a data science training program, you can learn from what I learned the hard way. I make no claims to be the perfect data science instructor, and I’ve made plenty of mistakes along the way.
While I’m at it, I’ll put in a plug for an upcoming data science class which I am teaching with Jay Jacobs of BitSight Security at the O’Reilly Security Conference in NYC, October 29-30th.
Really, data science instruction is an optimization problem: as an instructor, your goal is to minimize confusion whilst maximizing understanding. To do this, you must remove as many obstacles as possible from the students’ path which create dissonance. This may seem silly, but I have observed that if you have small errata in your code, or your code doesn’t work on their machine, even due to something they did, it significantly detracts from their learning experience and their opinion of you as an instructor. Therefore, removing all these obstacles to understanding is vital to your success as an instructor.
I am fortunate enough to get regular messages from recruiters on LinkedIn asking to speak with me about software development jobs. Here’s the thing… I’m not a software developer; I do data science and data analytics. For the last seven years, my job title has included the words “data” and “scientist”. I have never held a position with the words “Software” and “Developer” in the title. I have taught and am currently teaching classes with titles such as “Data Science for Security Professionals” and “Applied Data Science for Security”. All of this is on my LinkedIn profile, yet despite this, the messages continue.
On some level, it makes sense. If you look at my resume, you’d see that I have a degree in computer science, experience with various coding languages, and projects on github. Hell, I’m a committer for Apache Drill…
So what’s the difference between a data scientist and software developer?
One of the big issues I’ve encountered in my teaching is explaining how to evaluate the performance of machine learning models. Simply put, it is relatively trivial to generate the various performance metrics–accuracy, precision, recall, etc.–but if you wanted to visualize any of these metrics, there wasn’t really an easy way to do that. Until now…
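To make the "relatively trivial" part concrete, here is a minimal sketch of computing accuracy, precision, and recall for a binary classifier in plain Python. The function name `classification_metrics` and the sample labels are made up for illustration; in practice you would more likely use the equivalent functions in scikit-learn.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells relative to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        # Guard each ratio against an empty denominator.
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical ground truth and predictions, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Getting the numbers is the easy part; turning them into a picture that a student can read at a glance is where the effort usually goes.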
Recently, I learned of a new Python library called YellowBrick, developed by Ben Bengfort at District Data Labs, that implements many different visualizations that are useful for building machine learning models and assessing their performance. Many visualization libraries require you to write a lot of “boilerplate” code, i.e. generic and repetitive code. What impressed me about YellowBrick, however, is that it largely follows the scikit-learn API, and therefore if you are a regular user of scikit-learn, you’ll have no problem incorporating YellowBrick into your workflow. YellowBrick appears to be relatively new, so there still are definitely some kinks to be worked out, but overall, this is a really impressive library.
I love my MacBook Pro. Quite honestly, it’s the best laptop I’ve ever owned. However, my one regret is not buying the larger hard drive. Anyway, over the last few months, I’ve noticed that my free disk space kept on shrinking. I did all the usual stuff, deleted unneeded applications, ran various disk cleaning tools, etc until finally, I hit the motherlode… I discovered that brew, everyone’s favorite package manager was archiving old versions every time you ran brew update!!
To fix this problem… simply run:
brew cleanup
I did this and voila! 10 GB of hard disk space cleaned up!
My colleagues Austin Taylor and Melissa Kilby are proud to announce the first stable release of Griffon: A Virtual Machine for Data Science. Griffon is a virtual machine which contains many data science tools pre-configured, installed and linked up to make it so that you don’t have to be a Linux expert to try them out. If you are teaching a class, or if you are simply wanting to learn more about a particular tool, then Griffon is perfect for you.
You can download Griffon here: https://github.com/gtkcyber/griffon-vm.
data.world is rapidly establishing itself as the premier site for data scientists and analysts to host and collaborate on datasets. I have been impressed with data.world’s growth and interested in starting to use the platform in my professional projects. On data.world, datasets can be open and visible to the general public or they can be private, with visibility limited to select contributors. That is sufficient to guarantee the privacy of the data most of the time. However, in some cases, you may be explicitly prohibited from uploading data to the cloud.
Would it be possible to use data.world in a project even when part of the data must not live in the cloud?
It didn’t take me long to answer that question. Fortunately, I also have been doing a meaningful amount of experimentation and development with Apache Drill over the last few years. What impresses me about Drill is its versatility and potential to dramatically increase analytic productivity, open up previously inaccessible data sources, query across data silos, and do so with the common language of ANSI SQL.
As I began experimenting with both, I couldn’t help but wonder if it might be possible to somehow combine the two.
Well, it turns out, it is…
I received the following comment on an article, Let’s Stop Using the Term Fake Data Scientist, and thought it merited a response. Usually the comments I receive are constructive even if they disagree with what I wrote, but this particular comment demonstrated an arrogance which I believe is a huge problem in the data science world.
You can of course read the original article here, but the basic point was that data science is an interdisciplinary field–consisting of a mixture of computer science, applied mathematics, and subject matter expertise, with a smattering of data visualization and communication skills. I believe that it is inappropriate to label someone as a fake simply because their skillset is proportioned differently than that of many math-centric data scientists. I’m also a believer in Dr. Carol Dweck’s thesis on having a growth-oriented mindset (as stated in her book Mindset) and that people who are working in data science but whose skills need development in a certain area should be given instruction and assistance rather than derogatory labels.
In the news on Friday I saw a series of articles about a recent Senate vote to reject communication rules that would have prohibited ISPs from selling your browsing histories. I understand why ISPs would want to monetize this data; after all, it would be extremely valuable to online advertisers looking to serve ads more accurately. But I think it should give us pause to ask the question: is this in fact ethical?
While there really is no 1 to 1 comparison, the closest thing(s) would be either the telephone company selling your call records, or the post office (or other courier services such as UPS) aggregating and selling the information on the outside of your mail. I would strongly suspect that most people, if asked, would certainly not want their communication records sold to the highest bidder and yet that is precisely what Congress is allowing.
What Does This Mean for Privacy?
If ISPs are allowed to sell your browsing histories, I don’t believe that it is overstating things to say that this represents the end of privacy on the internet. While we don’t have much privacy on the internet these days anyway, if the ISPs are allowed to sell browsing records, it’s pretty much over.
With that said, it is difficult to discern exactly what is going to be allowed under the new rule change, but if I’m reading the news articles correctly it will allow ISPs to sell records of metadata of your web browsing. To a competent analyst, this data would be a virtual gold mine for targeted advertising and all sorts of other services, none of which are really beneficial to the individual. As I’ve shown in my Strata talks about IoT data, (here and here) if you gather enough seemingly innocuous data about an individual, it is entirely possible to put together a very accurate picture of their life. From my own experience, if you were to look at my browsing history for a few months, you could very easily determine things like when my bills are due, what companies I do business with, when I go to work/bed, what chat services I use, things I may be interested in buying, what places I’m interested in visiting, etc. The bottom line is that I consider my web browsing to be personal. I don’t want to share that with anyone, not because I have something to hide, but rather because I want the choice. I see no benefit whatsoever to the consumer in this rule change.
What can you do to protect your privacy?
Unfortunately, there really aren’t a lot of options. From the technical perspective, there are several options–none great–to preserve your privacy. It is not possible to keep the ISPs from getting your data, but you can make that data useless with TOR and VPNs.
- Virtual Private Network (VPN): VPNs have been traditionally used by corporations to allow remote access into private networks using the public internet. VPNs create a secure tunnel between your computer and a proxy server then your web traffic passes through that server–which can be anywhere in the world. For those of you who don’t work for large corporations, there are free and paid VPNs that you can use to access the web, however, I would avoid any free VPN service as they are likely making money by, you guessed it, collecting web traffic and analyzing it. VPNs may seem like an ideal countermeasure, however there are issues with VPNs as well. For starters, you are adding bottlenecks and complexity and hence losing speed. Secondly many sites–particularly sites that have geographically based licensing such as Netflix–block traffic from VPNs. VPNs don’t make you anonymous but they can make your data much more difficult to collect.
- TOR: TOR stands for The Onion Router (https://en.wikipedia.org/wiki/Tor_(anonymity_network)) and it is similar to a VPN, but instead of using one proxy server, TOR uses a series of encrypted relays and makes traffic much more difficult to trace. TOR has been used in many countries to successfully evade internet censorship. TOR has the added benefit of allowing anonymous browsing; however, it does introduce additional complexity into your browsing. There is also a speed penalty for using TOR, and you will find that you will not be able to access certain services using TOR.
Depending on how protective of your privacy you are, this may or may not matter, but it is important to understand that when using these technologies, guaranteeing your privacy depends on properly configuring them. One small misconfiguration can expose your personal data.
I should also mention here that the so-called privacy modes that most browsers include do absolutely nothing to protect your privacy over the network. Privacy mode erases your browsing history and cookies on your local machine, but you are still vulnerable to snooping over the network.
What else can I do?
This rule change represents a complete failure of government to do the thing it is really supposed to do–protecting the rights of its citizens. It’s sad that the whole world was up in arms in response to Snowden’s revelations, and yet the silence is deafening in response to unlimited, widespread corporate surveillance. Indeed, you have to read the hacker blogs (and my site) to find any kind of discussion of this issue. This story got virtually zero coverage in the news media.
What is a real shame is that this appears to have become a partisan issue in that the vote in the Senate was a strict party-line vote. It is entirely possible that the new Congress voted to repeal these rules simply because they were put in place by the previous administration.
At this point, the government is not looking out for its citizens’ interests in this regard and therefore it is upon individual citizens to take action to preserve our privacy. In addition to the technical measures listed above here are some suggestions for what you can do:
- Contact your Congressional Representative(s) and Senator(s): The Congressional switchboard number is 202-224-3121. Always be courteous, professional and polite when speaking with Congressional Staff. Be sure to convey why you are calling. While it is unlikely that you will speak directly to your Senator or Congressman, their Staff have enormous influence and you should be respectful to them. Make it clear that you do not welcome corporate surveillance.
- Educate Others: I suspect that the reason this received so little attention is that the average person doesn’t really understand security, privacy and the consequences of this kind of data collection. Therefore, it is incumbent upon those of us who work in data analytics and security to explain the implications of these policies in an understandable manner to non-technical people.
I would strongly urge everyone to do what they can to protest this rule change. If we do nothing, we might wake up one day and find that our online privacy has ceased to exist.
IP addresses can be one of the most useful data artifacts in any analysis, but over the years I’ve seen a lot of people miss out on key attributes of IP addresses that facilitate analysis.
What is an IP Address?
First of all, an IP address is a numerical label assigned to a network interface that uses the Internet Protocol for communications. Typically they are written in dotted decimal notation like this: 188.8.131.52. There are two versions of IP addresses in use today, IPv4 and IPv6. The address shown before is a v4 address, and I’m going to write the rest of this article about v4 addresses, but virtually everything applies to v6 addresses as well. The difference between v4 and v6 isn’t just the formatting. IP addresses have to be unique within a given network, and the reason v6 came into being was that we were rapidly running out of IP addresses! In networking protocols, IPv4 addresses are 32-bit unsigned integers with a maximum value of approximately 4.3 billion. IPv6 increased that from 32 bits to 128 bits, resulting in 2^128 possible IP addresses.
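As a quick illustration of the "IP addresses are really integers" point, here is a small sketch using Python's standard `ipaddress` module. The addresses are documentation examples I picked for illustration (192.0.2.1 and 2001:db8::1), not addresses from any real analysis.

```python
import ipaddress

# Parse a dotted decimal IPv4 address and view it as a 32-bit unsigned integer.
addr = ipaddress.IPv4Address("192.0.2.1")
as_int = int(addr)

# The conversion is reversible: the integer maps back to the same address.
round_trip = ipaddress.IPv4Address(as_int)

# Size of each address space.
ipv4_space = 2 ** 32    # about 4.3 billion addresses
ipv6_space = 2 ** 128   # the IPv6 address space

# ip_address() picks the right class (v4 or v6) based on the string.
v6 = ipaddress.ip_address("2001:db8::1")

print(as_int, round_trip, ipv4_space, v6.version)
```

Storing addresses as integers rather than strings tends to make sorting and range checks much simpler, which is one of the attributes people often overlook.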
What do you do with IP Addresses?
If you are doing cyber security analysis, you will likely be looking at log files or perhaps entries in a database containing the IP address in the dotted decimal notation. It is very common to count which IPs are querying a given server, and what these hosts are doing, etc.
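Counting which IPs are hitting a server might look something like the sketch below. The log lines and their layout (source IP as the first whitespace-separated field) are assumptions for illustration; real log formats vary, so the parsing step would need to be adapted.

```python
from collections import Counter
import ipaddress

# Hypothetical log lines; the source IP is assumed to be the first field.
log_lines = [
    '198.51.100.7 - - [10/Oct/2017:13:55:36] "GET /index.html" 200',
    '203.0.113.9 - - [10/Oct/2017:13:55:40] "GET /login" 401',
    '198.51.100.7 - - [10/Oct/2017:13:56:01] "GET /about" 200',
    'not-an-ip - - [10/Oct/2017:13:56:05] "GET /" 400',
    '203.0.113.9 - - [10/Oct/2017:13:56:10] "POST /login" 200',
    '198.51.100.7 - - [10/Oct/2017:13:57:12] "GET /contact" 200',
    '192.0.2.44 - - [10/Oct/2017:13:58:00] "GET /index.html" 200',
]


def count_source_ips(lines):
    """Count requests per source IP, skipping lines without a valid IP."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            # Validate and normalize the first field as an IP address.
            ip = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # malformed or missing address
        counts[str(ip)] += 1
    return counts


ip_counts = count_source_ips(log_lines)
for ip, hits in ip_counts.most_common(2):
    print(ip, hits)
```

Validating each address with `ipaddress` before counting is a design choice that keeps garbage fields out of the tally at the cost of a little speed.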