I have been working on using Apache Drill for security purposes and I wanted to demonstrate how you can use Drill in a real security challenge. I found this contest which included a PCAP file of an actual attack, as well as a series of questions you would want to answer in order to do the analysis. (https://www.honeynet.org/node/504)
My thought here is that Drill’s advanced ETL capabilities are not terribly useful if you can’t use Drill to do basic stuff that tools like Wireshark can do already, so I wanted to see if it would work in real life. This example was good because I also had “the answers” so I could see how Drill stacks up to the contest winners.
First, I had to see if Drill could actually read the PCAP. The PCAP reader can be a bit wonky, but fortunately, Drill read it without issues! (Whew!). For these examples, I will be using Drill and Superset.
Part one will contain a demonstration of how to use Drill to answer the questions in the first part of the challenge.
I saw this image on LinkedIn a few days ago and realized that it is proof of the future of data science. The image is of the leaderboard from a Kaggle competition, which isn’t particularly remarkable, but what is remarkable is the competitor in 2nd place: Google AutoML. Not only did AutoML come in 2nd, but it did so in fewer entries and the score was 0.00093 off of first place.
You might be thinking, “Well that’s Google, and they have the best stuff. Technology like that will not be available to the masses anytime soon, and if it is, it will require massive clusters.” Au contraire mon frère. There are a slew of new Python modules which automate various phases of the machine learning process. My personal favorite is TPOT, a Python module which automates the entire machine learning process and generates Python code for your entire pipeline.
I did a little experiment with TPOT and was able to build a model with data from Kaggle that scored in the top 10 for a simple exercise.
At some point, Google will likely make their AutoML available to the public if it isn’t already, and data scientists will have to prove their value over automated machine learning tools.
The significance of this is enormous. Since the coining of the term data science, many people have focused very heavily on the math and machine learning aspects of data science. These aspects are certainly important, but these steps can be automated, as you can see, with ever improving performance. What this means in the long run is that as available computing power increases and these tools get better and faster, the understanding of the inner workings of the algorithms will become less and less important. (This is not true if you are working at a really cutting edge company that is developing new algorithms, or doing academic research.)
Therefore, if you are a data scientist or an aspiring data scientist, should you quit now? Hardly. Automated ML is really exciting because it will enable you to focus on the things that computers can’t do, and likely won’t ever be able to do, which are: conceiving and defining data problems, communicating the results to stakeholders, as well as the data cleaning/feature engineering steps. Automated machine learning will enable or force data scientists to focus on tasks that truly require human thought and using data science to add value to their organizations.
Happy New Year everyone! I’ve been taking a bit of a blog break after completing Learning Apache Drill, teaching a few classes, and some personal travel, but I’m back now and have a lot planned for 2019! One of my long-standing projects is to get Apache Drill to work with various open source visualization and data flow tools. I attended the Strata conference in San Jose in 2016, where I saw Maxime Beauchemin’s talk (slides available here) presenting the tool then known as Caravel, and I was impressed, really really impressed. I knew that my mission after the conference would be to get this tool to work with Drill. A little over two years later, I can finally declare victory. Caravel went through a lot of evolution. It is now an Apache Incubating project and the name has changed to Apache (Incubating) Superset.
UPDATE: The changes to Superset have been merged, so you can just install Superset as described on their website.
Happy belated New Year everyone! I’ve been taking a bit of a blog break as I’ve been quite busy between work, personal travel, and working on my startup GTK Cyber. But I’m back now and have some exciting news! My team and I have been accepted to teach our Applied Data Science course once again at BlackHat in Las Vegas! This year we’ve made a major change to our course: it’s now a full four days instead of two!
Well, we did it. I finally finished the book that I had been working on with my co-author for the last two years. I thought I’d write a short post on my experiences writing a technical book and getting it published. I know many people think about writing books, and I’d like to share my experiences so that others might learn from lessons that I learned the hard way. Overall, it was an absolutely amazing experience and I have a feeling that the adventure is only beginning….
I am currently attending the Splunk .conf in Orlando, and a director at Accenture asked me this question, which I thought merited a blog post: Why don’t data scientists use or like Splunk? The inner child in me was thinking, “Splunk isn’t good at data science”, but the more seasoned professional in me actually articulated a more logical and coherent answer, which I thought I’d share whilst waiting for a talk to start. Here goes:
I cannot pretend to speak for any community of “data scientists” but it is true that I know a decent number of data scientists, some very accomplished and some beginners, and not a one would claim to use Splunk as one of their preferred tools. Indeed, when the topic of available tools comes up among most of my colleagues and the word Splunk is mentioned, it elicits groans and eye rolls. So let’s look at why that is the case:
Someone recently asked me for assistance with a university project whereby they were asked to predict whether a given article was fake news or not. They had a target accuracy of 70%. Since the topic of fake news has been in the news a lot, it made me think about how I would approach this problem and whether it is even possible to use machine learning to identify fake news. At first glance, this problem might be comparable to spam detection; however, the problem is actually much more complicated. In an article on The Verge, Dean Pomerleau of Carnegie Mellon University states:
“We actually started out with a more ambitious goal of creating a system that could answer the question ‘Is this fake news, yes or no?’ We quickly realized machine learning just wasn’t up to the task.”
Last Friday, the Apache Drill project released Drill version 1.14, which has a few significant features (plus a few that are really cool!) that will enable you to use Drill for analyzing security data. Drill 1.14 introduced:
A logRegex reader which enables Drill to read anything you can describe with a Regex
An image metadata reader, which enables you to query images
A suite of GIS functionality
A collection of phonetic and string distance functions which can be used for approximate string matching.
This suite of functionality really expands what is possible with Drill, and makes analysis of many different types of data possible. This brief tutorial will walk you through how to configure Apache Drill to query log files, or any file really that can be matched with a regex.
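To make the idea of "anything you can describe with a regex" a little more concrete, here is a minimal Python sketch of the underlying concept: each capture group in a regex becomes a column, and each matching line becomes a row. This is not Drill's configuration or API, just an illustration using Python's standard `re` module. The log format, the field names, and the `parse_line` helper are all assumptions made up for this example.

```python
import re

# Hypothetical log format and field names, chosen only for illustration.
# Each named capture group plays the role of a column.
LOG_PATTERN = re.compile(
    r"(?P<event_date>\d{4}-\d{2}-\d{2}) "
    r"(?P<event_time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up sample data: two well-formed lines and one that does not match.
sample_lines = [
    "2018-08-05 13:22:01 ERROR Failed login for user admin",
    "2018-08-05 13:22:07 INFO Session opened for user jdoe",
    "this line does not follow the format",
]

# Keep only the lines the regex could describe, as rows of named fields.
rows = [row for row in map(parse_line, sample_lines) if row is not None]

for row in rows:
    print(row["event_time"], row["level"], row["message"])
```

Once every line is a row of named fields, filtering or aggregating it is ordinary query work, which is the part Drill takes care of with SQL.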
I recently completed Technically Wrong by Sara Wachter-Boettcher. Let me start by saying that I’m glad that Ms. Wachter-Boettcher wrote this book. The tech industry has a lot of issues which need to be brought out into the open and it is definitely a positive development that people such as Ms. Wachter-Boettcher are bringing these issues to the forefront. It really is only recently that people are discussing the continuous erosion of privacy, misogyny in the tech industry, lack of diversity and many other issues. Whilst I would not deny any of these issues, I felt Wachter-Boettcher’s analysis was somewhat lacking and didn’t really get at the realities of working in the tech industry. Wachter-Boettcher cites numerous examples of tech gone wrong, such as a smart scale telling a two-year-old that he needs to lose weight, Facebook denying a Native American person an account because it felt that their name was not legitimate, and the abhorrent use of proprietary, black box algorithms to make parole recommendations.
Again, it is definitely a positive development that Wachter-Boettcher and others are writing about these issues, but the alternatives and solutions she proposes seem a bit simplistic. While she doesn’t state this directly, much of the book seems to suggest that all of technology’s woes are caused by the lack of diversity in the tech industry, specifically that “white guys” from elite universities are running everything. I don’t have an electronic copy of the book, but about halfway through, I wanted to count the number of times the phrase “white guys” appears in the book. Sometimes this phrase includes Asians, sometimes not.
In the last week, beneath all the Trump and Kim Jong Un reporting, were several stories stating that Apple has in effect declared war on data collectors. Make no mistake, what Apple is doing is making it significantly harder for companies big and small to collect your personal data. The significance of this cannot be overstated: the revenue of many companies, like Google and Facebook, is based on selling targeted advertising, and if gathering this data becomes significantly more difficult, it could affect their bottom lines.
The First Volley: No More Comments and Share Buttons
Last week, I was listening to the keynotes at the WWDC, and overall was pretty unimpressed as exec after exec droned on about new animojis or some other feature that I really didn’t care about, and then Craig Federighi launched the first volley: Safari is going to block Facebook and other social media like and share buttons as well as shared comment sections. Facebook, Twitter and other sites use these buttons to track your activity when you are visiting other sites. While it isn’t that big of a deal that this is happening on macOS, it is VERY significant that Apple is instituting this change on iOS as well. When I heard this, I was pretty shocked, but that was only the first volley; there were more to come.