
The Dataist Posts

I took the #DeleteFacebook Challenge

In the last few weeks, Facebook has been in the news a lot for its aggressive data gathering.  What has surprised me is not that Facebook is in the news, but that it didn’t happen much sooner.  Facebook is possibly the most invasive data-gathering, privacy-invading platform the world has ever seen, despite the fact that it is cloaked behind a veil of childish logos and thumbs-up buttons.  Additionally, Facebook has engaged in some truly abhorrent practices, such as gathering text messaging and phone metadata from Android users, conducting secret psychological tests on over 700,000 users in 2012, and running ad programs that track users’ web activity off of Facebook, to say nothing of how Facebook was and most likely is being used to propagate fake news.

As someone who has worked in various regulated industries (banking, government), it appalls me how companies like Facebook abuse their users’ privacy.  My biggest issue is that Facebook hides its data gathering efforts under a slick veneer of innocence which disguises their true intent.  Much like tobacco adverts of yore, Facebook and its “family” are targeted primarily towards younger people who don’t understand what they are giving up in exchange for the privilege of sharing their photos with their friends.

An extremely egregious example of this occurs on election days in the US.  Facebook will ask users a question: “Did you vote today?” and give you a little sticker on your profile if you answer that you did.  Now why do you think they would do that?  To encourage people to vote?  Hardly, though that may be a side benefit.  No, the real reason they do this is to gather information about people’s voting history, which Facebook then uses in their targeted political campaigns.  Don’t believe me?  You can read about it here: https://politics.fb.com.

The problem here is that Facebook doesn’t ask their users for consent in a way that a typical user will understand.  I am not trying to mock Facebook users, but most people who don’t work in data analytics don’t really understand the implications of mass data gathering.  The image above is how Facebook Messenger asks for permission to gain access to your contacts, SMS and phone call logs. (Courtesy of ArsTechnica)  Nowhere in this image does it say anything about collecting SMS, phone logs or anything for that matter.  It looks cute and most people wouldn’t think twice about clicking OK.

Silicon Valley’s Culture Needs to Change

The biggest issue I have with some of what Facebook has been caught doing is that enough of the company felt it was acceptable for them to do it.  Most likely, some manager at Facebook decided, why don’t we gather all our Android users’ text data and mine it!  And nobody said a bloody thing.  No leaks to the news media, no disgruntled employees writing blog posts about it, nothing….  Which ultimately means that everyone involved felt it was totally acceptable to take their users’ SMS and phone logs.  This practice only ended when Android disabled the functionality, so it wasn’t as if Facebook execs had some crisis of conscience.

But I’m a realist.  Facebook’s revenue is generated by selling targeted advertising, and the way it targets its ads is by gathering data about its audience.  Whilst Mr. Zuckerberg can write pithy non-apologies about it, nothing will change because this is how Facebook makes money.  The only way this changes is for people like you to get off of Facebook (and Instagram, and WhatsApp) in significant numbers and for advertisers to stop spending money on Facebook ads.  As long as there is a market for this data, the sad reality is that there will be more and more companies trying to invade your privacy and sell it to the highest bidder.

Educate Yourself About How Companies Monetize Your Data

You need to understand how companies are using your data and make a conscious choice about whether that company provides enough value to justify that loss in privacy.  Frankly, this is why I prefer using companies whose primary revenue stream is not derived from data monetization.  This is why I choose to use iPhones instead of Android and iMessage instead of WhatsApp, and to socialize with real friends instead of on Facebook.  You can generally tell this is the case by whether you have to pay for a service.  Generally speaking, companies which charge for their services are not looking to invade your privacy to the same degree as companies that offer their services “for free”.  As the saying goes: “If you aren’t paying for it, YOU are the product.”


A New Threat: Stalkerware

What would you do if you attended a political event or protest and the next day, you receive targeted adverts for that political cause?  Would that be cause for concern?  After all, you don’t post about your political views, how did the advertisers know?  You didn’t sign any rosters or register, so how did they know you were there?

I recently became aware of a new category of computer-evil: stalkerware.  I thought I was being clever and would have the privilege of coining a new term, but a few other people have already coined the term.  However, I would like to propose a slightly different definition.  In an article originally appearing on Motherboard, stalkerware is defined as:

Stalkerware is defined as invasive applications running on computers and smartphones that basically send every bit of information about you to another person. This covers the gamut from programs that can be purchased online to give third parties access to basically everything on your computer from photos, text messages and emails to individual keystrokes, to apps that activate your Mac’s webcam without your knowledge.

I’m not really seeing the difference between this definition and “traditional” spyware, but stalkerware as I define it is:

Software that automatically reports your location on a regular basis without your knowledge or consent.

The stalkerware that Motherboard writes about consists of dedicated programs or apps that someone deliberately installs on a target’s mobile device in order to track their activity for whatever reason.  Stalkerware as I define it is a little different, in that it is not targeted at one individual.  These are applications that are installed on mobile devices that track your every move–literally stalking you–most likely without your knowledge.


Going Back to BlackHat!

For the last three years, I have had the honor and privilege of teaching a data science class at the BlackHat conference in Las Vegas.  Well, I found out yesterday that I’ll be going back for a fourth year!   Together with my amazing colleague Austin Taylor (@HuntOperator), we will be teaching Applied Data Science and Machine Learning for Cyber Security.  It turns out that this is the only class at BlackHat this year about data science or machine learning!

Teaching a class at BlackHat is really a great experience, and quite terrifying at the same time.  You’re presenting a class to the best of the best in security, so you really have to know your stuff.  From my experience, the students are really on top of their game so it makes for very interesting and engaging sessions.

What’s New for This Year

This year’s class, I have to say, will be the best one yet.  We’ve developed a lot of new material including a lesson about improving the performance of models, beaconing detection with Austin’s Flare library, anomaly detection with K-Means clustering and more.  I’ll be posting more about the course as we get closer to the event, but if you have any questions or requests, please let me know!  If you’re interested, don’t wait, register now!

 

 


Home Automation Update

One of my most popular posts is a tutorial I wrote two years ago about automating a gas fireplace, and I get a lot of questions about home automation, so I thought I’d write an update to that tutorial and review some products I’ve bought in the last two years.

My Original Goal

When I first started seeing all the home automation products that were coming out, my original interest in them was to see what kind of data these devices gathered about their owners, and I ended up giving two presentations on this topic at the Strata Conferences in New York and London.  With all that said, for my research I wanted to be able to control the functionality of my home with my phone.  After buying a bunch of devices, I was really disappointed.  At the time, I wasn’t able to automate any lighting because my home was built in the 1920s and none of the light switches had a neutral wire–a requirement for Z-Wave switches.  After we moved to a newer home, and I started automating lights and installing other automation devices, the thing that really frustrated me was the difficulty in getting all the devices to work together.  I had high hopes for IFTTT (and still do), but at the time, it seemed like a half-assed workaround.

I also was very disappointed with the available security systems.  At the time, the choices were pretty much limited to systems which required you to pay rather steep monthly “monitoring fees” to some company and use rather low-tech devices, or half-baked products that did appear promising but seemed to be MVPs at best.

So what changed?  I can answer that in one word: Alexa.


Thoughts on Teaching Data Science

A big interest of mine is how to impart what little I know of the tools and techniques of data science to others.  When I was at Booz Allen, I taught numerous classes both for internal staff and for various clients.  I’ve also taught for Metis, O’Reilly Publishing and, for the last three years, at BlackHat, so I do have some experience in the matter.  I’ve looked at MANY data science programs to see if what they are teaching lines up with what I’m teaching, and I’d like to share some things I’ve noticed which will hopefully help you build a better data science program.  My goal here is to share my mistakes and experiences over the years so that, if you are building a data science training program, you can learn from what I learned the hard way.  I make no claims to be the perfect data science instructor, and I’ve made plenty of mistakes along the way.

While I’m at it, I’ll put in a plug for an upcoming data science class which I am teaching with Jay Jacobs of BitSight Security at the O’Reilly Security Conference in NYC, October 29-30th.

Really, data science instruction is an optimization problem: as an instructor, your goal is to minimize confusion whilst maximizing understanding.  To do this, you must remove as many obstacles as possible from the students’ path which create dissonance.  This may seem silly, but I have observed that if you have small errata in your code, or your code doesn’t work on their machine, even due to something they did, it significantly detracts from their learning experience and their opinion of you as an instructor.  Therefore, removing all these obstacles to understanding is vital to your success as an instructor.


The Difference between Software Development and Data Science

I am fortunate enough to get regular messages from recruiters on LinkedIn asking to speak with me about software development jobs.  Here’s the thing… I’m not a software developer, I do data science and data analytics.  For the last seven years, my job title has included the words “data” and “scientist” in the title.  I have never held a position with the words “Software” and “Developer” in the title.  I have taught and am currently teaching classes with titles such as “Data Science for Security Professionals” and “Applied Data Science for Security”.   All of this is on my LinkedIn profile, yet despite this, the messages continue.

On some level, it makes sense.  If you look at my resume, you’d see that I have a degree in computer science, experience with various coding languages, and projects on GitHub.  Hell, I’m a committer for Apache Drill…

So what’s the difference between a data scientist and software developer?


Tutorial: Visualizing Machine Learning Models

One of the big issues I’ve encountered in my teaching is explaining how to evaluate the performance of machine learning models.  Simply put, it is relatively trivial to generate the various performance metrics–accuracy, precision, recall, etc.–but if you wanted to visualize any of these metrics, there wasn’t really an easy way to do that.  Until now….

Decision Boundary

Recently, I learned of a new Python library called YellowBrick, developed by Ben Bengfort at District Data Labs, that implements many different visualizations that are useful for building machine learning models and assessing their performance.  Many visualization libraries require you to write a lot of “boilerplate” code, i.e. generic and repetitive code.  What impressed me about YellowBrick, however, is that it largely follows the scikit-learn API, so if you are a regular user of scikit-learn, you’ll have no problem incorporating YellowBrick into your workflow.  YellowBrick appears to be relatively new, so there are definitely still some kinks to be worked out, but overall, this is a really impressive library.
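
As a reminder of what the metrics themselves measure, here is a minimal sketch in plain Python that computes accuracy, precision and recall for a binary classifier by hand.  The function name classification_metrics and the sample labels are made up for illustration; they are not part of scikit-learn or YellowBrick, and in practice you would most likely rely on a library rather than this code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for binary labels.

    Hypothetical helper for illustration only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # Count the cells of the confusion matrix.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against division by zero when a denominator is empty.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Getting the numbers is the easy part; turning them into a picture that a class can read at a glance is where a visualization library earns its keep.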


Tip of the Day: How I reclaimed 10GB of Hard Disk Space on my MacBook Pro

I love my MacBook Pro. Quite honestly, it’s the best laptop I’ve ever owned. However, my one regret is not buying the larger hard drive. Anyway, over the last few months, I’ve noticed that my free disk space kept on shrinking. I did all the usual stuff: deleted unneeded applications, ran various disk cleaning tools, etc., until finally, I hit the motherlode… I discovered that brew, everyone’s favorite package manager, was archiving old versions every time you ran brew update!!

To fix this problem… simply run: brew cleanup. I did this and voila! 10 GB of hard disk space cleaned up!
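
If you want to check how much space you actually get back, here is a small sketch that reads the free space on a volume using Python’s standard library.  The helper name free_gb is my own invention for this example, and it assumes that the root volume ("/") is the disk you care about.  Run it once before and once after brew cleanup and compare the two numbers.

```python
import shutil


def free_gb(path="/"):
    """Return the free space on the volume containing `path`, in gigabytes.

    Hypothetical helper for illustration; uses decimal GB (1 GB = 10**9 bytes).
    """
    usage = shutil.disk_usage(path)
    return usage.free / 1e9


# Take a reading now; take another after running `brew cleanup` in a terminal.
before = free_gb()
print(f"Free space: {before:.2f} GB")
```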


Announcing the First Release of Griffon: A Virtual Environment for Data Science

My colleagues Austin Taylor and Melissa Kilby are proud to announce the first stable release of Griffon:  A Virtual Machine for Data Science.  Griffon is a virtual machine which contains many data science tools pre-configured, installed and linked up, so that you don’t have to be a Linux expert to try them out.  If you are teaching a class, or if you simply want to learn more about a particular tool, then Griffon is perfect for you.

You can download Griffon here: https://github.com/gtkcyber/griffon-vm.
