The Dataist Posts

Drill UDF List

I’ve been working on developing some custom functions for Drill, or User Defined Functions (UDFs), and I realized that there really should be a repository for Drill UDFs.  I’ve decided to create a page with links to all the UDFs that I know of.  I’ll keep this updated, so if you have Drill UDFs that you want to share, please email me a link and I’ll put it up.

Fixing STEM Education

To both of my loyal readers, I apologize for not writing anything in a while, but I have been absolutely slammed with classes and conference presentations.  Anyway, I’ve been doing a lot of thinking about my earlier post about Teaching Data Science in English.   The post provoked a decent response, mostly positive.

One reader sent me the following comment about my post which I’ve decided to quote (with permission) in its entirety because I think it accurately reflects why people get so frustrated when they try to learn mathematical concepts. What interested me was that this individual took action and “translated from mathspeak to English” and all of a sudden she was able to understand the underlying concepts.

Awhile ago I read a piece you had written on LinkedIn about making ‘mathspeak’ and ‘techspeak’ (i.e. coding) more accessible to regular people, by decreasing mathematical notation usage and increasing the use of real words in explanations of formulas and concepts. It was something that stayed with me because I’ve always understood broader mathematical concepts but have always had trouble with the mechanics, and I think a lot of that has had to do with the amount of notation used…math seems like a foreign language sometimes, and there are 2 levels of understanding: the first is merely deciphering the ‘foreign language’, which already puts me out of my comfort zone (think reading Spanish or French if you are a native English speaker) and then understand the underlying concepts, which becomes harder due to the fact that it’s written in a ‘non-native’ language. Recently I’ve started taking an online course in machine learning on XXX. Already in the second lesson, he dove straight into notation-filled formulas, and I was starting to get that overwhelmed feeling that I’m familiar with from previous years of math. But I had what you wrote in my mind, and I thought I’d give it a shot and manually ‘translate’ the formulas and equations into English, and stick with that. Well, I did that, and it worked so well. I feel that I am able to follow along with the underlying theory of the class and by extension, the formulas and algorithms he presented in ‘mathese’ whereas before I would have shut-down and assumed it was beyond my grasp. Thanks so much for highlighting this aspect of the math/English understanding divide. It is continuously helpful for me. (Emphasis mine)

I’d like to share another related story.  One of my first paying jobs was working for KUAT public television as the web developer (www.kuat.org) and I wanted to do some things that required automating a data flow from an archaic DOS based database.  I was teamed up with a programmer who helped me build the process and in doing so, I learned how to write regular expressions.  I got so into it, I nearly automated myself out of a job.

Fast forward a year or so: when I was nearly done with my CS degree, I had to take an upper-level CS course about Automata, Grammars and Languages, which included regular expressions in the course description.  I was pretty excited because by this point I had become a master at regular expressions and was looking forward to a class where I already knew some of the material going in.  Boy, was I in for a shock.  When we got to the regular expressions section, it degenerated into a plethora of Greek letters and assorted jargon to the point where I truly loathed going to class.

Theory Should Not Be Taught at the Expense of Application

What I also realized in that CS class was that while most of my fellow students may have passed the tests, they did not have any clue how to use regular expressions in real life, or why you would want to use them in the first place.  While we were spending time writing expressions that match ‘aaaaaaabababaaaaa’ and drawing the automata that “implement” them, the knowledge of how to apply this to a real life problem, such as extracting data artifacts from raw data, was completely lost on the class.

What if the instructor had started the class by showing us this:

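import re

text = 'Contact: jane.doe@example.com'  # sample input, made up for illustration
pattern = r'([a-zA-Z0-9_.]+)@([a-zA-Z0-9_.]+\.\w{2,3})'
matchObj = re.search(pattern, text)  # pattern first, then text; search scans the whole string
if matchObj:
    email = matchObj.group(0)    # the full address
    account = matchObj.group(1)  # the part before the @
    domain = matchObj.group(2)   # the part after the @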

If you’re not familiar, this brief example in Python-esque pseudo-code demonstrates how to match and extract email addresses, accounts and domains from text.
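To pull every address out of a longer piece of text rather than just the first one, the same pattern can be applied repeatedly. The following is a minimal sketch, not a production-grade email validator: the function name `extract_emails` and the sample sentence are made up for illustration, and the pattern is the simple one shown above, so it will miss many valid real-world addresses.

```python
import re

# Same simple pattern as above: group 1 is the account, group 2 is the domain.
EMAIL_PATTERN = re.compile(r"([a-zA-Z0-9_.]+)@([a-zA-Z0-9_.]+\.\w{2,3})")


def extract_emails(text):
    """Return a list of (email, account, domain) tuples found in text.

    Hypothetical helper for illustration only.
    """
    return [
        (match.group(0), match.group(1), match.group(2))
        for match in EMAIL_PATTERN.finditer(text)
    ]


# Made-up sample text.
sample = "Contact alice@example.com or bob.smith@mail.example.org for details."

for email, account, domain in extract_emails(sample):
    print(email, account, domain)
```

Using `finditer` rather than `match` matters here: `re.match` only looks at the very start of the string, while `finditer` walks through the whole text and yields one match object per address.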

I don’t think I’m saying anything new here, but too many technical classes, both in academia and out, spend a disproportionate amount of time on the underlying theory whilst ignoring or downplaying the actual application of the concepts being taught.  The result is that many students walk away frustrated, not understanding the actual use of what they are learning, and while professors and instructors may pat themselves on the back for preserving the “purity” of their curricula, I would argue that they have utterly failed in their task of educating their students.

The bottom line here is that some people are really interested in theory; however, for knowledge to be translated into something useful, students should be exposed early and often to a theory’s application.  In conclusion, if you are designing STEM training or classes at a university, don’t forget the importance of demonstrating how to apply the concepts you are teaching.

Conference Reflections Part 1: Open Data Science Conference East

My employer is amazing and in the last two months, they’ve allowed me to attend a lot of data science conferences and I thought I’d share some general reflections on my experiences.

Open Data Science Conference: A Great Value for Newcomers

I gave two presentations this year at Open Data Science Conference (ODSC) East and I just wanted to put it out there that if you are new to data science or are just interested in learning more about data science, then ODSC is a really great venue to meet incredibly talented individuals as well as attend high quality technical talks.

Something is Rotten in the State of Data

I’m writing this blog post in the departure lounge at Heathrow, on my way back from Strata + Hadoop World, London.  Whilst at Strata, speakers kept coming back to the idea that an ever growing number of large businesses are not really happy with the investments they have made in analytics and data science.  One of the speakers quoted a Forrester 2016 survey which claimed that:

  • 29% of firms are good at translating analytics results into measurable business outcomes
  • -20% change in satisfaction with analytics initiatives between 2014 and 2015
  • 50% of firms expecting to see stagnation or a decrease in big data/data lake investments in 2016.

These are very disappointing numbers, though not completely surprising.  Using the “5 why” technique, an intra-management dialogue might go something like this:

A Social Contract for Data Collection

I just returned from Strata + Hadoop World in San Jose, where I gave a talk entitled “Kosher Collection: Best Practices in Data Handling”.  I really had an amazing time at Strata this year and major kudos to the organizers for putting on a great show.

The central premise of my talk is that in today’s world, there is a social contract between data collectors and consumers.  Essentially the agreement is that consumers give their personal data to a data collector in exchange for mutual benefit.  The problem is that consumers, in general, lack an understanding of the technology as well as data collection and as a result, are unable to provide informed consent.  Furthermore, this issue is likely to be exacerbated in the future as the opportunity to opt-out of mass data collection is disappearing.

The Two Most Important Skills for a Data Scientist

I saw an article a little while ago on LinkedIn (which at the time of writing I cannot find) whose basic premise was that problem solving is the most important skill for data scientists to be effective at their job.  (If anyone can find the article, please send me a PM as I’d like to credit the author.)  The article stood out to me because most articles titled “Top x Skills for a Data Scientist” usually feature some list like this one:

  1. Education
  2. SAS or R
  3. Machine Learning
  4. Advanced Statistics
  5. Python
  6. Hadoop
  7. SQL
  8. Unstructured Data
  9. Intellectual Curiosity
  10. Business Acumen
  11. Communication Skills

While many people like to focus on various technical aspects of data science, such as data engineering, or machine learning, at the end of the day, if you are unable to apply these skills in a practical way towards real life problems, you will not be an effective data scientist.  In my view, effective data scientists must be masters at both problem solving and critical thinking.  There are clear relationships between the two but I think it is fair to say that there are differences in that not all problems require critical thinking and vice versa.

Data Science is the Union of Critical Thinking and Problem Solving

To illustrate this union, there is a famous example from WWII.  In an effort to minimize the number of bombers shot down over Europe, the Center for Naval Analyses (CNA) had conducted a study of the damage done to aircraft that had returned from missions, and had recommended that armor be added to the areas that showed the most damage.  This would seem like a relatively easy optimization problem…put more armor where there are a lot of bullet holes.  Indeed, this was the conclusion at which the CNA arrived.

However, a mathematician named Dr. Avraham Wald was given the same data consisting of how bullet holes were distributed across aircraft which returned from sorties and realized that there was a problem with these conclusions.  Dr. Wald realized that this study only considered aircraft which had returned from sorties and not those which had been shot down and therefore, the correct conclusion was to put additional armor where there was little damage.  In effect, the bullet holes demonstrated that the aircraft could take damage in those areas and still fly, whereas areas where there were few bullet holes demonstrated the opposite.   The Allied forces adopted this conclusion and significantly improved their survival rates.

To me this demonstrates the perfect blend of problem solving and critical thinking.  The original analysis was simple problem solving.  Someone was tasked with figuring out where to put more armor.  They did some analysis, came up with an answer and, in their eyes at least, solved the problem.  Unfortunately, the conclusion was completely incorrect.  The critical thinking came in when Dr. Wald asked the question “What is missing in this data?” (or most likely “Was fehlt in diesen Daten?”), which ultimately led him to the correct conclusion.
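The difference between the two analyses can be sketched in a few lines of code. This is a toy illustration only: the `sorties` records below are invented, and unlike Dr. Wald, who had to reason about the aircraft that never came back, this sketch cheats by including them so that both conclusions can be compared side by side.

```python
from collections import Counter

# Invented toy data: (section that was hit, did the aircraft return?).
sorties = [
    ("fuselage", True), ("fuselage", True), ("fuselage", True),
    ("wings", True), ("wings", True), ("wings", False),
    ("engine", True), ("engine", False), ("engine", False), ("engine", False),
]

# What the original analysis could see: hits on aircraft that came back.
hits_on_returned = Counter(section for section, returned in sorties if returned)

# What was missing from the data: hits on every aircraft, returned or not.
hits_on_all = Counter(section for section, _ in sorties)


def survival_rate(section):
    """Fraction of aircraft hit in this section that still made it home."""
    return hits_on_returned[section] / hits_on_all[section]


# Naive conclusion: armor the section with the most visible bullet holes.
naive_choice = hits_on_returned.most_common(1)[0][0]

# Wald-style conclusion: armor the section where a hit is least survivable.
wald_choice = min(hits_on_all, key=survival_rate)

print("Naive choice:", naive_choice)
print("Wald-style choice:", wald_choice)
```

In this made-up fleet the returning aircraft show the most holes in the fuselage, yet every aircraft hit there came home, while most of those hit in the engine did not.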

Applying Critical Thinking to Data Science

My first real introduction to critical thinking came not in the university, but rather during my first few months working as an intelligence analyst at the Central Intelligence Agency.  At the time, after about four months, new analysts would be pulled out of their offices and enrolled in a four-month-long analytic training program.  I can honestly say that I learned more about analysis in this program than I did in my entire college career.  (Unfortunately, this says a lot about the state of higher education, but we’ll leave that for another time.)  While I unfortunately cannot get into the details, some of the most interesting lessons centered around examining intelligence failures, and how, when critical thinking is replaced by “group-think” or other cognitive biases, the results are usually not good.  The course forced you to look at situations with a critical eye, examining all assumptions and putting hypotheses through an extremely high degree of rigor.  I believe that the thought processes and critical thinking skills I learned at the Agency are what enabled me to approach data problems with a unique perspective and, as a result, come up with effective solutions.

Many people ask me how they can become a data scientist, and while I can recommend various courses that will help you develop your technical skills, the application of these skills, and with it your success as a data scientist, ultimately depends on your critical thinking and problem solving abilities.  Can you look at an analysis and find the logical flaws?  Can you think critically about problems?  In the classes which I teach, I always try to infuse critical thinking into the exercises so that students must think about their answers and ask things like “Does this answer make sense?”  (Naturally, I’ll put in a plug for my upcoming classes at BlackHat in Las Vegas in August: Crash Course in Data Science for Hackers, and the Crash Course in Machine Learning.)

Learning Critical Thinking Skills

Hopefully, by now I’ve convinced you of the need for critical thinking to be a part of a data scientist’s toolkit. You might be asking yourself, how do I hone this skill?  Well… stay tuned, as that sounds like the subject of my next post!

On Data Science is Not A Binary Condition

There are a lot of bytes floating about the internet making the case for who is a data scientist, including a whole subset of posts about who are “real” data scientists.  I have observed that this philosophy manifests itself in data science training programs as well.  Right now, if you want to become a data scientist, it’s an all-or-nothing proposition.  I would place the available programs in three categories:

  1. Online training from companies such as Coursera, Udacity and others
  2. 12-14 week bootcamp-style data science courses
  3. University programs

Many of these programs look amazing and if I had the time I would love to take one (or all) of them.  With that said, all these programs take the approach that the students will emerge from these programs as data scientists and as a result, they teach a comprehensive set of skills to their students.

A Different Approach

There are certain professions in which you either are a member or not, and non-members practicing the profession could have disastrous consequences.  Medicine for instance.  You either are a medical doctor or you are not, and when individuals who are not qualified to practice medicine are dispensing medical advice, the results are often disastrous–as best illustrated by the anti-vaccine movement which at this point is being pushed by many celebrities rather than legitimate medical professionals.  Engineering would be another example.  You wouldn’t want to drive over a bridge that I designed because well… I have no idea how to do it.

However, these professions all have professional associations and government regulators who determine who has the requisite skills to call themselves members of that profession.  This does not exist for data science, for good reason.  Data science as practiced today is a mixture of several other disciplines, and can be applied to nearly any other discipline.  The breadth of skills that fall under the data science umbrella is enormous.

In 2013, O’Reilly published a short booklet entitled Analyzing the Analyzers in which they identified four main groupings of individuals who considered themselves data scientists which were:

  • Data Creative
  • Data Developer
  • Data Researcher
  • Data Businessperson

You can see from the graphic on the left that the skill breakdown is far from homogeneous.  Data science differs from other professions in that in other professions, such as medicine, practitioners MUST have a core set of knowledge in order to be a member of that profession.  As illustrated in the chart on the left, that is not really the case for data science.  One could be a machine learning master, yet have no experience with big data and still legitimately be said to be doing data science work.  Therefore, I’d like to propose that data science be viewed as a spectrum of skills rather than a binary condition.

In short, I believe that data scientists spill too much ink trying to label individuals as “real” (or as “fake”) data scientists.  Instead, data scientists should spend time determining what skills actually make up data science.

Teaching Data Science Skills Rather than Data Science

This slight twist on the approach has implications for training.  If data science is no longer an exclusive club, but rather a collection of skills, those skills, or portions thereof, can be taught to individuals who have no interest in becoming data scientists.  In other words, data science training could be about teaching anyone who does analytic work data science skills which they can incorporate into their workflow.  The goal of this kind of training would not be to mint full data scientists, but rather to teach individuals data science skills relevant to their profession.  This kind of training could be a lot shorter and less comprehensive, but in the end, I believe that it would be more practical for the thousands of individuals out there who are seeking to incorporate data science into their work but don’t really have the time to put into it or the desire to become data scientists.  Mind you, I am not proposing dumbing data science down, but rather I am suggesting that in addition to what is currently offered, data science training can be effective if it is tailored to specific audiences, and focuses on the techniques that would be directly relevant to those audiences.

TL;DR

The data science world today views data science as a binary condition: either you are a data scientist or you are not.  However, data science should be viewed as a collection of skills which virtually any professional can incorporate into their workflow.  If data science training were viewed in this context, organizations could increase their use of data science by training their current staff in the data science skills that are relevant to their work.

Teaching Data Science in English (not in Math)

I spend most of my time now teaching others about data science, and as such I do a lot of research into what is going on with respect to data science education.  To that end, I decided to take an online machine learning course, and it led me to a serious question: why don’t we use pseudo-code to teach math concepts?

Consider the following:
[Image: the formula for the Residual Sum of Squares]

This is the formula for Residual Sum of Squares, which if you aren’t familiar, is a metric used to measure the effectiveness of regression models.

Now consider the following pseudo-code:

residuals_squared = (actual_values - predictions) ^ 2
RSS = sum( residuals_squared )

This example expresses the exact same concept, and while it does take up more space on the page, it is, in my mind at least, much easier to understand.  I don’t have any empirical data to back this up, but I would suspect that many of you would agree.
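The pseudo-code above is only a short step away from something that actually runs. Here is a minimal sketch in plain Python; the function name `residual_sum_of_squares` and the sample numbers are made up for illustration. Note that in real Python the exponent operator is `**`, since `^` means bitwise XOR.

```python
def residual_sum_of_squares(actual_values, predictions):
    """Sum of the squared differences between actual and predicted values."""
    if len(actual_values) != len(predictions):
        raise ValueError("actual_values and predictions must have the same length")
    # Square each residual (actual minus predicted), then add them all up.
    residuals_squared = [
        (actual - predicted) ** 2
        for actual, predicted in zip(actual_values, predictions)
    ]
    return sum(residuals_squared)


# Made-up numbers: three observations and a model's predictions for them.
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]

rss = residual_sum_of_squares(actual, predicted)
print(rss)
```

The variable names carry the meaning that the notation leaves to the reader, which is the whole point of the pseudo-code approach.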

Greek Letters are Jargon

Another thing I’ve realized is that part of the reason math becomes so difficult for people is that it is entirely taught in jargon, shorthand, and shorthand for shorthand.  The Greek letter sigma represents a sum, but if you don’t know that, then it represents confusion.  If you aren’t familiar with this formula, then the other Greek letters could be meaningless, yet if we used pseudo-code, any part of this formula could be rewritten using English words (or any other language) and thus easily understood by anyone.

Crash Course in Machine Learning

I’m working on developing a short course in Machine Learning called Crash Course in Machine Learning which I will be teaching at the BlackHat conference in August.  I’m curious as to what people think about presenting algorithms using pseudo-code instead of math jargon.  I suspect it will make it easier for people to understand without diluting the rigor.

Data Science Classes at BlackHat 2016!

I’m very pleased to announce that this year, my team and I got two classes accepted for the BlackHat conference in Las Vegas!  I believe that data science and machine learning have a huge role to play in infosec/cybersecurity, and in a way, it really is a domain which is crying out for data science to be used.  There are ever expanding amounts of data, the actors are becoming more sophisticated, and the security professionals are almost always strained for resources.  Our classes won’t turn you into data scientists, but you will learn how to directly apply data science techniques to cybersecurity.  If this sounds interesting to you, please check out our Crash Course in Data Science and our Crash Course in Machine Learning.  Both are two-day classes and will be offered July 30-31 and August 1-2, 2016.

Crash Course in Data Science (for Hackers)

This interactive course will teach network security professionals how to use data science techniques to quickly write scripts to manipulate and analyze network data. Students will learn techniques to rapidly write scripts to improve their work. Participants will learn how to read in data in a variety of common formats, then write scripts to analyze and visualize that data. A non-exhaustive list of what will be covered includes:

  • How to write scripts to read CSV, XML, and JSON files
  • How to quickly parse log files and extract artifacts from them
  • How to make API calls to merge datasets
  • How to use the Pandas library to quickly manipulate tabular data
  • How to effectively visualize data using Python
  • How to apply simple machine learning algorithms to identify potential threats

Finally, we will introduce the students to cutting edge Big Data tools including Apache Spark and Apache Drill, and demonstrate how to apply these techniques to extremely large datasets.

Crash Course in Machine Learning (for Hackers)

This interactive course will teach network security professionals machine learning techniques and applications for network data. This course is a continuation of the skills taught in the Crash Course in Data Science for Hackers. Students will learn various machine learning methods, applications, model selection, testing, and interpretation. Participants will write code to prepare and explore their data and then apply machine learning methods for discovery.

A non-exhaustive list of what will be covered includes:

  • Machine Learning Introduction and Terminology
  • Foundations of Statistics
  • Python Machine Learning Packages Introduction
  • Data Exploration and Presentation
  • Supervised Learning Methods
  • Unsupervised Learning Methods
  • Model Selection and Testing
  • Machine Learning Applications for Network Data

Data Driven Security Podcast

I recently had the opportunity to speak on the Data Driven Security Podcast with Jay Jacobs and Bob Rudis about data science training.  You can listen to the podcast here.

To underscore a few points from the interview:

Data science is not a binary condition.  Many people with whom I have spoken, or whose work I have read, talk about “real” data science and/or “fake” data scientists.  Unlike medicine, or law, in data science one need not be a “data scientist” to employ data science in one’s work.  In practical terms, this means that data science can be viewed as a spectrum of skills which can range from beginner to expert, and most importantly, you don’t need to be a “real data scientist” to use data science techniques.  In fact, it is my opinion that in the next few

When designing training sessions for working professionals, I try to approach them with that in mind and build courses that teach the thought process behind data science, as well as practical skills which students can directly apply to their jobs.  The objective of the classes is not to convert students into data scientists, but again, to teach useful data science skills which are relevant to their work.

If you view training development as teaching a professional a series of relevant skills instead of a new discipline, that translates into developing short, focused classes rather than lengthy bootcamps.

Data Science is more than Machine Learning

I’ve reviewed a lot of data science courses, and many focus very heavily on machine learning and statistics.  While this is certainly an important aspect of data science, study after study shows that data scientists spend 50-90% of their time doing data preparation and cleansing.  With that in mind, when designing courses, I try to spend a decent amount of time on data wrangling techniques.

Anyway, please listen to the podcast here and enjoy!  Questions/comments are welcome!
