Skip to content

Category: Data Science

Drilling Security Data

Last Friday, the Apache Drill released Drill version 1.14 which has a few significant features (plus a few that are really cool!) that will enable you to use Drill for analyzing security data.  Drill 1.14 introduced:

  • A logRegex reader which enables Drill to read anything you can describe with a Regex
  • An image metadata reader, which enables you to query images
  • A suite a of GIS functionality
  • A collection of phonetic and string distance functions which can be used for approximate string matching.  

These suite of functionality really expands what is possible with Drill, and makes analysis of many different types of data possible.  This brief tutorial will walk you through how to configure Apache Drill to query log files, or any file really that can be matched with a regex.

Leave a Comment

Apple’s Newly Declared War on Data Collection (and Facebook?)

In the last week, beneath all the Trump and Kim Jong Un reporting, were several stories that state that Apple has in effect declared war on data collectors.  Make no mistake, what Apple is doing is making it significantly harder for companies big and small to collect your personal data.  The significance of this cannot be overstated in that many companies like Google and Facebook’s revenue is based on selling targeted advertising and if gathering this data becomes significantly more difficult, it could affect their bottom lines.

The First Volley:  No More Comments and Share Buttons

Last week, I was listening to the keynotes at the WWDC, and overall was pretty unimpressed as exec after exec droned on about new animojis or some other feature that I really didn’t care about, and then, Craig Federighi launched the first volley: Safari is going to block FaceBook and other social media like and share buttons as well as shared comment sections.  Facebook, Twitter and other sites use these buttons to track your activity when you are visiting other sites.  While it isn’t that big of a deal that this is happening on MacOS, it is VERY significant that Apple is instituting this change on iOS as well.  When I heard this, I was pretty shocked, but that was only the first volley, there were more to come.

Leave a Comment

Adventures and Misadventures in Data Science Interviews

I’ve been waiting for some time to publish this, but I wanted to write about my experiences interviewing for data science jobs. Here’s my story, I worked at Booz Allen for nearly seven years but I felt it was time for a change. I very much like Booz Allen as a company and if anyone is interested in working there, please don’t hesitate to contact me.  But I felt I was ready for different challenges and started looking for work elsewhere.

Now that I started a new position, I thought I’d share some observations about what I learned from interviewing at numerous companies. I wasn’t tracking how many companies I interviewed with, but it was a lot. I have a lot of government experience and got a number of offers from government contracting firms. However, I came to the conclusion that in terms of career progression, joining another government contracting firm was not what I was looking for.

So here’s what I learned…

1 Comment

Why more women don’t code: A heartbreaking story with a good ending

I’ve been reading a lot lately about the ills of the tech industry, with a few book reviews in my queue to finish, and I posted a question on LinkedIn about what inspired people to get into tech.  My motivation was to see if there was a difference in men and women.  My hypothesis is that there are societal and cultural factors which discourage girls and women from studying tech (math, Computer science, engineering etc) and hence there aren’t enough qualified women to fill the tech jobs, and ultimately we end up with the current state of affairs where men outnumber women 3 or 4 to 1 in most tech companies.

Anyway, I received the following private response from a former student to whom I shall refer as S.  S was a student in one of my recent classes and a delight to work with.  Her story is absolutely heartbreaking and needs to be heard.  I lightly edited it, only to remove some details which would identify her.

You can share my story but I am not ready to have my name on it. I am going to be looking for a job soon and not everyone will appreciate it. They will see me as slow and too old. I was 54 before I gave myself to permission to study tech and coding languages. Growing up, girls were not encouraged to study math. I was teased about my abilities in math, because I could not recite the times tables. I grew up believing that I couldn’t do math. At home, my brother received an early TI calculator. It was supposed to be shared between us but that didn’t happen. Besides being annoying, it was clear that electronics were not for girls.

I began my university studies in psychology which seemed the only science that did not require math. I was bored and I tried to learn math on my own. Actual classes involved grades and that was disastrous. I somehow passed all the mathematics prerequisites and ended up in graduate school for chemistry. During my quantum mechanics class, I struggled. I went for help from the professor. He realized that I couldn’t do the times tables verbally and completely humiliated me.

At my job, I learned about agile, innovation and human centered design. I loved that these ideas as they provided a framework and a fresh vocabulary to talk about science and problem solving instead of just math. I excelled at facilitating these techniques. Many of the prototypes we needed to wireframe involved a website or an application. I became curious about data and technology, but I would never let myself work in this area. The risk of humiliation was too great. My supervisor already realized that I stumbled verbally with numbers. I did not want to be in a position to lose my job while trying out new skills.

About the same time, I had a routine hearing evaluation and I was diagnosed with great hearing but a serious auditory processing disorder. The audiologist predicted that I probably had terrible problems with spoken arithmetic and verbal math. I was thunderstruck. How could he know that? I had been punished as a child for exactly this issue. I internalized it as part of my self image. Although I was a great reader, I was unable to recite the times tables and do my arithmetic. I couldn’t explain why, maybe I really was a bad kid. After digesting the audiologist’s report, I allowed myself to become more interested in data and technology.

I fight paralyzing “imposter syndrome” every time I sit in front of my computer. I began to take free classes on-line and go to meet-ups and learn even, when I couldn’t talk about it well. I joined groups for women who code.  I continue to learn and I just signed up for an intensive software engineering boot camp. I currently volunteer as a teaching assistant for introductory python at two different community women’s coding groups. I continue to attend meet-ups.  I am not yet where I want to be but I am finally allowed to move ahead. Data is going to change our world and I don’t want to miss out.

Leave a Comment

A New Threat: Stalkerware

What would you do if you attended a political event or protest and the next day, you receive targeted adverts for that political cause?  Would that be cause for concern?  After all, you don’t post about your political views, how did the advertisers know?  You didn’t sign any rosters or register, so how did they know you were there?

I recently became aware of a new category of computer-evil: stalkerware.  I thought I was being clever and would have the privilege of coining a new term, but a few other people have already coined the term.  However, I would like to propose a slightly different definition.  In an article originally appearing on Motherboard, stalkerware is defined as:

Stalkerware is defined as invasive applications running on computers and smartphones that basically send every bit of information about you to another person. This covers the gamut from programs that can be purchased online to give third parties access to basically everything on your computer from photos, text messages and emails to individual keystrokes, to apps that activate your Mac’s webcam without your knowledge.

I’m not really seeing the difference between this definition and “traditional” spyware, but stalkerware as I define it is:

Software that automatically reports your location on a regular basis without your knowledge or consent.

The stalkerware that Motherboard writes about are dedicated programs or apps that someone deliberately installs on a target’s mobile device in order to track their activity for whatever reason.  Stalkerware as I define it is a little different, in that it is not targeted at one individual.  These are applications that are installed on mobile devices that track your every move–literally stalking you–most likely without your knowledge.

1 Comment

Thoughts on Teaching Data Science

A big interest of mine is how to impart what little I know of the tools and techniques of data science to others.  When I was at Booz Allen, I taught numerous classes both for internal staff and for various clients.  I’ve also taught for Metis, O’Reilly Publishing and for the last three years, at BlackHat so I do have some experience in the matter.   I’ve looked at MANY data science programs to see if what they are teaching lines up what I’m teaching and I’d like to share some things which I’ve noticed which will hopefully help you build a better data science program.  My goal here is to share my mistakes and experiences over the years and hopefully if you are building a data science training program, you can learn from what I learned the hard way.  I make no claims to be the perfect data science instructor, and I’ve made plenty of mistakes along the way.

While I’m at it, I’ll put in a plug for an upcoming data science class which I am teaching with Jay Jacobs of BitSight Security at the O’Reilly Security Conference in NYC, October 29-30th.

Really, data science instruction is an optimization problem: as an instructor, your goal is to minimize confusion whilst maximizing understanding.  To do this, you must remove as many obstacles as possible from the students’ path which create dissonance.  This may seem silly, but I have observed that if you have small errata in your code, or your code doesn’t work on their machine, even due to something they did, it significantly detracts from their learning experience and their opinion of you as an instructor.  Therefore, removing all these obstacles to understanding is vital to your success as an instructor.

Leave a Comment

The Difference between Software Development and Data Science

I am fortunate enough to get regular messages from recruiters on LinkedIn asking to speak with me about software development jobs.  Here’s the thing… I’m not a software developer, I do data science and data analytics.  For the last seven years, my job title has included the words “data” and “scientist” in the title.  I have never held a position with the words “Software” and “Developer” in the title.  I have taught and am currently teaching classes with titles such as “Data Science for Security Professionals” and “Applied Data Science for Security”.   All of this is on my LinkedIn profile, yet despite this, the messages continue.

On some level, it makes sense.  If you look at my resume, you’d see that I have a degree in computer science, experience with various coding languages, and projects on github.  Hell, I’m a committer for Apache Drill…

So what’s the difference between a data scientist and software developer?

5 Comments

Tutorial: Visualizing Machine Learning Models

One of the big issues I’ve encountered in my teaching is explaining how to evaluate the performance of machine learning models.  Simply put, it is relatively trivial to generate the various performance metrics–accuracy, precision, recall, etc–if you wanted to visualize any of these metrics, there wasn’t really an easy way to do that.  Until now….

Decision Boundary

Recently, I learned of a new python library called YellowBrick, developed by Ben Bengfort at District Data Labs, that implements many different visualizations that are useful for building machine learning models and assessing their performance.   Many visualization libraries require you to write a lot of “boilerplate” code:  IE just generic and repetitive code, however what impressed me about YellowBrick is that it largely follows the scikit-learn API, and therefore if you are a regular user of scikit-learn, you’ll have no problem incorporating YellowBrick into your workflow.  YellowBrick appears to be relatively new, so there still are definitely some kinks to be worked out, but overall, this is a really impressive library.

Leave a Comment

Tip of the Day: How I reclaimed 10GB of Hard Disk Space on my MacBook Pro

I love my MacBook Pro. Quite honestly, it’s the best laptop I’ve ever owned. However, my one regret is not buying the larger hard drive. Anyway, over the last few months, I’ve noticed that my free disk space kept on shrinking. I did all the usual stuff, deleted unneeded applications, ran various disk cleaning tools, etc until finally, I hit the motherlode… I discovered that brew, everyone’s favorite package manager was archiving old versions every time you ran brew update!!

To fix this problem… simply run: brew cleanup. I did this and voila! 10 GB of hard disk space cleaned up!

2 Comments