The Dataist Posts

On Data Science is Not A Binary Condition

There are a lot of bytes floating about the internet making the case for who is a data scientist, including a whole subset of posts about who are “real” data scientists.  I have observed that this philosophy manifests itself in data science training programs as well.  Right now, if you want to become a data scientist, it’s an all-or-nothing proposition.  I would place the available programs in three categories:

  1. Online training from companies such as Coursera, Udacity and others
  2. 12-14 week bootcamp-style data science courses
  3. University programs

Many of these programs look amazing, and if I had the time I would love to take one (or all) of them. That said, they all assume that students will emerge as data scientists, and as a result they teach a comprehensive set of skills.

A Different Approach

There are certain professions in which you either are a member or not, and non-members practicing the profession could have disastrous consequences.  Medicine for instance.  You either are a medical doctor or you are not, and when individuals who are not qualified to practice medicine are dispensing medical advice, the results are often disastrous–as best illustrated by the anti-vaccine movement which at this point is being pushed by many celebrities rather than legitimate medical professionals.  Engineering would be another example.  You wouldn’t want to drive over a bridge that I designed because well… I have no idea how to do it.

However, these professions all have professional associations and government regulators who determine who has the requisite skills to call themselves members of that profession. This does not exist for data science, for good reason. Data science, as practiced today, is a mixture of several other disciplines and can be applied to nearly any other discipline. The breadth of skills that fall under the data science umbrella is enormous.

In 2013, O’Reilly published a short booklet entitled Analyzing the Analyzers in which they identified four main groupings of individuals who considered themselves data scientists which were:

  • Data Creative
  • Data Developer
  • Data Researcher
  • Data Businessperson

You can see from the graphic on the left that the skill breakdown is far from homogeneous. Data science differs from other professions in that in other professions, such as medicine, practitioners MUST have a core set of knowledge in order to be a member of that profession. As illustrated in the chart on the left, that is not really the case for data science. One could be a machine learning master, yet have no experience with big data, and still legitimately be said to be doing data science work. Therefore, I’d like to propose that data science be viewed as a spectrum of skills rather than a binary condition.

In short, I believe that data scientists spill too much ink trying to label individuals as “real” (or as “fake”) data scientists. Instead, data scientists should spend time determining what skills actually make up data science.

Teaching Data Science Skills Rather than Data Science

This slight twist on the approach has implications for training. If data science is no longer an exclusive club but rather a collection of skills, then those skills, or portions thereof, can be taught to individuals who have no interest in becoming data scientists. In other words, data science training could be about teaching anyone who does analytic work data science skills which they can incorporate into their workflow. The goal of this kind of training would not be to mint full data scientists, but rather to teach individuals the data science skills relevant to their profession. This kind of training could be a lot shorter and less comprehensive, but in the end I believe that it would be more practical for the thousands of individuals out there who want to incorporate data science into their work but don’t have the time to put into a full program, or the desire to become data scientists. Mind you, I am not proposing dumbing data science down. Rather, I am suggesting that, in addition to what is currently offered, data science training can be effective if it is tailored to specific audiences and focuses on the techniques that are directly relevant to those audiences.

TL;DR

The data science world today views data science as a binary condition: either you are a data scientist or you are not. However, data science should be viewed as a collection of skills which virtually any professional can incorporate into their workflow. If data science training were viewed in this context, organizations could increase their use of data science by training their current staff in the data science skills that are relevant to their work.

Teaching Data Science in English (not in Math)

I spend most of my time now teaching others about data science, so I do a lot of research into what is going on in data science education. As part of that, I decided to take an online machine learning course, and it led me to a serious question: why don’t we use pseudo-code to teach math concepts?

Consider the following:

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This is the formula for Residual Sum of Squares, which, if you aren’t familiar with it, is a metric used to measure the effectiveness of regression models.

Now consider the following pseudo-code:

residuals_squared = (actual_values - predictions) ^ 2
RSS = sum( residuals_squared )

This example expresses the exact same concept and, while it does take up more space on the page, is in my mind at least much easier to understand. I don’t have any empirical data to back this up, but I suspect that many of you would agree.
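
For readers who want to run it, the pseudo-code above translates almost line for line into Python. This is only a minimal sketch using plain lists; the function name and the sample numbers are made up for illustration.

```python
def residual_sum_of_squares(actual_values, predictions):
    """Return the sum of squared differences between actual and predicted values."""
    if len(actual_values) != len(predictions):
        raise ValueError("actual_values and predictions must have the same length")
    # Square each residual (actual minus predicted), one pair at a time
    residuals_squared = [
        (actual - predicted) ** 2
        for actual, predicted in zip(actual_values, predictions)
    ]
    return sum(residuals_squared)


# Made-up numbers, purely for illustration
actual_values = [3.0, 5.0, 7.0]
predictions = [2.5, 5.0, 8.0]
rss = residual_sum_of_squares(actual_values, predictions)
print(rss)  # prints 1.25
```

One small trap when moving from pseudo-code to Python: the pseudo-code writes the square as ^ 2, but in Python the power operator is ** and ^ means something else entirely (bitwise XOR).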

Greek Letters are Jargon

Another thing I’ve realized is that part of the reason math becomes so difficult for people is that it is taught entirely in jargon, shorthand, and shorthand for shorthand. The Greek letter sigma represents a sum, but if you don’t know that, then it represents confusion. If you aren’t familiar with this formula, the other symbols could be just as meaningless. If we used pseudo-code instead, any part of this formula could be rewritten using English words (or words in any other language) and thus be easily understood by anyone.

Crash Course in Machine Learning

I’m working on developing a short course in Machine Learning called Crash Course in Machine Learning which I will be teaching at the BlackHat conference in August.  I’m curious as to what people think about presenting algorithms using pseudo-code instead of math jargon.  I suspect it will make it easier for people to understand without diluting the rigor.

Data Science Classes at BlackHat 2016!

I’m very pleased to announce that this year, my team and I got two classes accepted for the BlackHat conference in Las Vegas! I believe that data science and machine learning have a huge role to play in infosec/cybersecurity, and in a way, it really is a domain which is crying out for data science to be used. There are ever-expanding amounts of data, the actors are becoming more sophisticated, and the security professionals are almost always strained for resources. Our classes won’t turn you into data scientists, but you will learn how to directly apply data science techniques to cybersecurity. If this sounds interesting to you, please check out our Crash Course in Data Science and our Crash Course in Machine Learning. Both are two-day classes and will be offered on July 30-31 and Aug 1-2, 2016.

Crash Course in Data Science (for Hackers)

This interactive course will teach network security professionals how to use data science techniques to quickly write scripts to manipulate and analyze network data. Students will learn techniques to rapidly write scripts to improve their work. Participants will learn how to read in data in a variety of common formats, then write scripts to analyze and visualize that data. A non-exhaustive list of what will be covered includes:

  • How to write scripts to read CSV, XML, and JSON files
  • How to quickly parse log files and extract artifacts from them
  • How to make API calls to merge datasets
  • How to use the Pandas library to quickly manipulate tabular data
  • How to effectively visualize data using Python
  • How to apply simple machine learning algorithms to identify potential threats

Finally, we will introduce the students to cutting edge Big Data tools including Apache Spark and Apache Drill, and demonstrate how to apply these techniques to extremely large datasets.

Crash Course in Machine Learning (for Hackers)

This interactive course will teach network security professionals machine learning techniques and applications for network data. This course is a continuation of the skills taught in the Crash Course in Data Science for Hackers. Students will learn various machine learning methods, applications, model selection, testing, and interpretation. Participants will write code to prepare and explore their data and then apply machine learning methods for discovery.

A non-exhaustive list of what will be covered includes:

  • Machine Learning Introduction and Terminology
  • Foundations of Statistics
  • Python Machine Learning Packages Introduction
  • Data Exploration and Presentation
  • Supervised Learning Methods
  • Unsupervised Learning Methods
  • Model Selection and Testing
  • Machine Learning Applications for Network Data

Data Driven Security Podcast

I recently had the opportunity to speak on the Data Driven Security Podcast with Jay Jacobs and Bob Rudis about data science training. You can listen to the podcast here.

To underscore a few points from the interview:

Data science is not a binary condition. Many people with whom I have spoken, or whose work I have read, talk about “real” data science and/or “fake” data scientists. Unlike medicine or law, in data science one need not be a “data scientist” to employ data science in one’s work. In practical terms, this means that data science can be viewed as a spectrum of skills which can range from beginner to expert, and most importantly, you don’t need to be a “real data scientist” to use data science techniques. In fact, it is my opinion that in the next few

When designing training sessions for working professionals, I try to approach them with that in mind and build courses that teach the thought process behind data science, as well as practical skills which students can directly apply to their jobs. The objective of the classes is not to convert students into data scientists but, again, to teach useful data science skills which are relevant to their work.

If you view training development as teaching a professional a series of relevant skills instead of a new discipline, that translates into developing short, focused classes rather than lengthy bootcamps.

Data Science is more than Machine Learning

I’ve reviewed a lot of data science courses, and many focus very heavily on machine learning and statistics. While these are certainly important aspects of data science, study after study shows that data scientists spend 50-90% of their time doing data preparation and cleansing. With that in mind, when designing courses, I try to spend a decent amount of time on data wrangling techniques.

Anyway, please listen to the podcast here and enjoy!  Questions/comments are welcome!

Let’s Stop Using The “Fake Data Scientist” Label

There was a post on KDNuggets yesterday entitled 20 Questions to Detect Fake Data Scientists by Andrew Fogg, and after reading the questions, I had to wonder what “real” data science is. All 20 questions in the article focused on statistics/machine learning or data visualization, and even the stats questions seemed to be narrowly focused on particular areas of emphasis. I would argue that this post was an excellent example of mirroring bias, or in other words: I am a data scientist, and these are the fundamental skills which I deem important; therefore, in order for me to deem you worthy of the title Data Scientist, you must have these skills.

Here are the questions:

  1. Explain what regularization is and why it is useful.
  2. Which data scientists do you admire most? which startups?
  3. How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression.
  4. Explain what precision and recall are. How do they relate to the ROC curve?
  5. How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything?
  6. What is root cause analysis?
  7. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
  8. What is statistical power?
  9. Explain what resampling methods are and why they are useful. Also explain their limitations.
  10. Is it better to have too many false positives, or too many false negatives? Explain
  11. What is selection bias, why is it important and how can you avoid it?
  12. Give an example of how you would use experimental design to answer a question about user behavior.
  13. What is the difference between “long” and “wide” format data?
  14. What method do you use to determine whether the statistics published in an article (e.g. newspaper) are either wrong or presented to support the author’s point of view, rather than correct, comprehensive factual information on a specific subject?
  15. Explain Edward Tufte’s concept of “chart junk.”
  16. How would you screen for outliers and what should you do if you find one?
  17. How would you use either the extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
  18. What is a recommendation engine? How does it work?
  19. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
  20. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimension in a chart (or in a video)?  (From 20 Questions to Detect Fake Data Scientists  by Andrew Fogg)

The trouble is that data science is interdisciplinary: a mixture of domain expertise, computer science and applied mathematics. Therefore, “true” data scientists have expertise in all three disciplines that make up data science. However, these questions almost completely ignore the domain and computer science disciplines, to say nothing of big data, unstructured data, etc. For example, if someone were to ask these questions of an expert in computer vision, that candidate might do poorly because their skills, which are certainly in the realm of data science, do not fall neatly on this list. Likewise for someone who is an expert in streaming text analytics.

If I were going to construct such a list, I might take about five of these questions and add questions like:

  1. What are the advantages of NoSQL systems compared with traditional databases?
  2. Which tools do you use to manipulate data?  Why?
  3. How do you determine the efficiency of an algorithm?
  4. Explain some common methods for analyzing free text.

However, I would not construct such a list in the first place.  The bottom line is that virtually any data scientist could probably come up with a list that would mis-label other data experts as fakes simply by asking questions about their weak areas.

Data Science is about Solving Problems

Data Science is about extracting useful and actionable information from data.   As such, when I interview people the most important thing for me is the candidate’s problem solving and analytical abilities.  I’ll pick people to interview who have a background which would lead me to believe that they would have experience in the data science realms, and then ask them to solve open ended problems.  My real interest is not whether they arrive at a solution, but rather to see how they think about problems.   A good (or “real”) data scientist will be able to identify the problem and use the skills listed above to solve the problem and therefore there is no need to pepper the candidate with a pop quiz of stats questions.

There is a great talk from Daniel Tunkelang about hiring data scientists in which he discusses his process of hiring data scientists and comes to a similar conclusion.

Let’s use a more positive, less exclusive term

Since data science covers so many areas, I think we can take it as a given that virtually nobody can truly be a master of everything. Therefore, instead of using the label “Fake Data Scientist”, I would use the label “Novice Data Scientist” or “Junior Data Scientist”. I believe that everyone has the capacity to grow and learn new skills. We weren’t all born with an understanding of deep learning or Markov chains, and if someone lacks certain requisite knowledge, instead of labeling them as a phony, it is better to view that candidate as a beginner who needs additional skills and to give them suggestions as to how to acquire those skills.

Tips for Debugging Code without F-Bombs – Part 1

Debugging code is a large part of actually writing code, yet unless you have a computer science background, you probably have never been exposed to a methodology for debugging code.  In this tutorial, I’m going to show you my basic method for debugging your code so that you don’t want to tear your hair out.

In Programming Perl, Larry Wall, the creator of the Perl programming language, said that the attributes of a great programmer are Laziness, Impatience and Hubris:

  • Laziness:  The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful, and document what you wrote so you don’t have to answer so many questions about it. Hence, the first great virtue of a programmer.  (p.609)
  • Impatience:  The anger you feel when the computer is being lazy. This makes you write programs that don’t just react to your needs, but actually anticipate them. Or at least pretend to. Hence, the second great virtue of a programmer. See also laziness and hubris. (p.608)
  • Hubris:  Excessive pride, the sort of thing Zeus zaps you for. Also the quality that makes you write (and maintain) programs that other people won’t want to say bad things about. Hence, the third great virtue of a programmer. See also laziness and impatience. (p.607)

These attributes also apply to how to write good code so that you don’t have to spend hours and hours debugging code.

The Best Way to Avoid Errors is Not to Make Them

Ok… so that seems obvious, but really, I’m asking another question, and that is: “How can you write code that decreases your likelihood of making errors?” I do have an answer for that. The first thing to remember is that bugs are easy to find when they are small. To find bugs when they are small, write code in small chunks and test your code frequently. If you are writing a large program, write a few lines and test what you have written to make sure it is doing what you think it is supposed to do. Test often. If you are writing a script that is 100 lines, it is MUCH easier to find errors if you test your code every 10 lines rather than write the whole thing and test at the end. The better you get, the less frequently you will need to test, but still test your code frequently.
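
As a rough illustration of “write a little, test a little”, here is a sketch in Python. The function names and the sample log lines are hypothetical; the point is only that each small piece gets checked before the next one is written.

```python
def extract_ip(log_line):
    """Return the first whitespace-separated field, assumed here to be an IP address."""
    return log_line.split()[0]


# Check the first small piece before writing anything else
sample_line = "192.0.2.10 - - [27/Nov/2015:01:23:06] GET /index.html 200"
assert extract_ip(sample_line) == "192.0.2.10"


def count_requests_by_ip(log_lines):
    """Count how many log lines came from each IP address."""
    counts = {}
    for line in log_lines:
        ip_address = extract_ip(line)
        counts[ip_address] = counts.get(ip_address, 0) + 1
    return counts


# Now check the second piece, which builds on the first
sample_lines = [sample_line, sample_line, "198.51.100.7 - - [..] GET / 200"]
assert count_requests_by_ip(sample_lines) == {"192.0.2.10": 2, "198.51.100.7": 1}
```

If the second check fails, the bug is almost certainly in the handful of lines written since the first check passed, which is exactly why small chunks are easier to debug.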

Good Coding Practices Will Help You Prevent Errors

This probably also seems obvious, but I’ve seen (and written) a lot of code that leaves a lot to be desired in the way of good practices. Good coding practices mean that your code should be readable and that someone who has never seen your code before should be able to figure out what it is supposed to do. Now, I know a lot of people have the attitude that since they are the only one working on a particular piece of code, they don’t need to put in comments. WRONG WRONG WRONG. In response, I would ask: in 6 months, if you haven’t worked on this, will you remember what this code did? You don’t need to go overboard, but you should include enough comments so that you’ll remember the code’s purpose.

Here are some other suggestions:

  1. Adopt a coding standard and stick to it:  It doesn’t matter which one you use, but pick one and stick to it.  That way, you will notice when things aren’t correct.  Whatever you do, don’t mix conventions, i.e., don’t have column_total, columnTotal and ColumnTotal as variables in the same script.
  2. Use descriptive variable names:  One of my pet peeves about a lot of machine learning code is that they use X, Y as variable names.  Don’t do that.  This isn’t calculus class.  Use descriptive variable names such as test_data, or target_sales, and please don’t use X, Y, or even worse, i, I, l and L as variable names.
  3. Put comments in your code:  Just do it.
  4. Put one operation per line:  I know that especially in Python and JavaScript, it is fashionable (and “Pythonic”) to cram as many operations onto one line as possible via method chaining.  I personally think in series of steps and it is easier to see the logic (and hence any mistakes) if you have one action per line.
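
To make suggestions 2 through 4 concrete, here is a small sketch in Python. Both versions compute the same total; the data and all of the names are invented for illustration.

```python
# Invented sample data: one sales figure per entry, with stray whitespace and a blank entry
raw_sales_lines = [" 100.50 ", "200.25", "", "  50.25"]

# Harder to debug: everything crammed onto one line, with cryptic names
t = sum(float(l.strip()) for l in raw_sales_lines if l.strip())

# Easier to debug: descriptive names, one operation per line, a comment per step
stripped_lines = [line.strip() for line in raw_sales_lines]   # remove surrounding whitespace
non_empty_lines = [line for line in stripped_lines if line]   # drop blank entries
sales_amounts = [float(line) for line in non_empty_lines]     # convert text to numbers
total_sales = sum(sales_amounts)                              # add them up

assert t == total_sales
print(total_sales)  # prints 351.0
```

When the longer version goes wrong, you can print any of the intermediate variables and see which step produced the bad value. With the one-liner, you have to take it apart first.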

Plan your program BEFORE you write it

I learned this lesson the hard way, but if you want to spend many hours writing code that doesn’t work, when faced with a tough problem, just dive right in and start coding.  If you want to avoid that, get a piece of paper and a pen (or whatever system you like) and:

  1. Break the problem down into the smallest, most atomic steps you can think of
  2. Write pseudo-code that implements these steps.
  3. Look for extant code that you can reuse

Once you’ve found reusable code, and you have a game plan of pseudo code, now you can begin writing your code.  When you start writing, check every step against your pseudo code to make sure that your code is doing what you expect it to do.
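
As a sketch of this workflow, the pseudo-code plan can live in the file as comments, with the real code filled in underneath each step. The problem here (finding the most common words in some text) and all of the names are hypothetical.

```python
# Reuse proven standard-library code instead of writing a counter by hand
from collections import Counter


def most_common_words(text, how_many):
    # Plan step 1: lowercase the text so "Data" and "data" are counted together
    lowercase_text = text.lower()
    # Plan step 2: split the text into individual words
    words = lowercase_text.split()
    # Plan step 3: count how often each word appears
    word_counts = Counter(words)
    # Plan step 4: return the most frequent words along with their counts
    return word_counts.most_common(how_many)


result = most_common_words("Data science is data plus science plus data", 1)
print(result)  # prints [('data', 3)]
```

Because each comment is one step of the plan, checking the code against the pseudo-code is just a matter of reading down the function.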

Don’t Re-invent the Wheel

Another way to save yourself a lot of time and frustration is to re-use proven code to the greatest extent possible. For example, Python has a myriad of libraries available on PyPI and elsewhere which really can save you a lot of time. It is another huge pet peeve of mine to see people writing custom code for things which are publicly available. This means that before you start writing code, you should do some research as to what components are out there and available.

After all, if I were to ask you if you would rather:

  1.  Use prewritten, pretested and proven code to build your program OR
  2. Write your own code that is unproven, untested and possibly buggy

the logical thing to do would of course be to do the first.

In Conclusion

Great programmers never sit down at the keyboard and just start banging out code without having a game plan and without understanding the problem they are trying to solve. Hopefully by now you see that the first step in writing good code that you won’t have to debug is to plan out what you are trying to do, reuse extant code and test frequently. In the next installment, I will discuss the different types of errors and go through strategies for fixing them.

The Case for Generalist Data Scientists

I recently read an article by Daniel Tunkelang entitled Data Scientists: Generalists or Specialists? and it resonated with me.  I’ve been involved with hiring data scientists for some time now and I also get a lot of recruiters contacting me about various data science jobs.  My general observation is that when companies search for data scientists, they tend to use the equation (Machine Learning = Data Science), and tend to play down all the other skills that make up data science, such as creativity, critical thinking, data preparation etc.

Tunkelang writes:

Early days

Generalists add more value than specialists in a company’s early days, since you’re building most of your product from scratch and something is better than nothing. Your first classifier doesn’t have to use deep learning to achieve game-changing results. Nor does your first recommender system need to use gradient-boosted decision trees. And a simple t-test will probably serve your A/B testing needs.


Later stage

Generalists hit a wall as your products mature: they’re great at developing the first version of a data product, but they don’t necessarily know how to improve it. In contrast, machine learning specialists can replace naive algorithms with better ones and continuously tune their systems. At this stage in a company’s growth, specialists help you squeeze additional opportunity from existing systems. If you’re a Google or Amazon, those incremental improvements represent phenomenal value.

So, should you hire generalists or specialists? It really does depend—and the largest factor in your decision should be your company’s stage of maturity. But if you’re still not sure, then I suggest you favor generalists, especially if your company is still in a stage of rapid growth. Your problems are probably not as specialized as you think, and hiring generalists reduces your risk. Plus, hiring generalists allows you to give them the opportunity to learn specialized skills on the job. Everybody wins.

Read the complete post here on O’Reilly.com. What needs to be noted here is that companies will need more specific skills as their analytics mature and evolve. In the beginning, however, creativity, competence and critical thinking are most likely the most important skills. I tend to agree with a lot of what Tunkelang writes, and I do get the sense that a lot of hiring managers believe their projects are a lot more mature and advanced than they really are. Thoughts?

Off Topic: How to Automate Your Gas Fireplace

Home automation is a hobby of mine, and in our new home, I really wanted to automate our Heatilator gas fireplace.  However, this isn’t as straightforward as it might seem, and I really haven’t found any good tutorials out there as to how to do this.  This tutorial will show you how to connect your fireplace to your Wink Hub or any other Z-Wave controller.  I got this working and actually found that it is one of the easier things to automate.  I really like being able to set the fireplace to go on and off on a schedule.

Safety Considerations

Before you start this project, you should be comfortable with working with wiring and electricity.  If you are not, get someone else to do this.  Secondly, you will be working with wires that run near gas lines, so multiply every safety concern by at least a factor of three.  If you don’t know what you are doing, this is not the project to figure it out.  I take no responsibility for any damage or injury that may result from this tutorial.  It goes without saying that BEFORE you start cutting wires, make sure that you have either disconnected all power, or shut off the electricity at the circuit breaker. 

The wisdom of automating a gas fireplace is also debatable. However, I left the manual switch in place, so you can always turn off the fireplace the “old fashioned” way using the original switch.

What You Will Need:

With all that said, this really isn’t a difficult project to complete in a safe manner.  Here’s what you’ll need:

A Few Exciting Tool Announcements!

I hope everyone is enjoying Thanksgiving! This week, there were several new developments in data science tools which I would like to highlight. I am a big believer in staying up to date on new tools, because you can make yourself much more efficient by making better use of what is available. Both tools highlighted here have significant potential, both for getting data more efficiently and for presenting it more effectively.

Apache Drill Releases Version 1.3

On 23 November, the Drill team released Drill version 1.3.  The complete release notes are available here, but for me, the biggest improvement is the text file header parsing.

In my opinion, one of the things Drill did very poorly in previous versions was CSV parsing. In prior versions, when you queried a CSV file, Drill would store each row in an array called columns, and you had to index into that array and assign each column a name yourself:

SELECT columns[0] AS firstName, columns[1] as lastName
FROM cp.`somefile.csv`

This was clearly a less than optimal solution and resulted in very convoluted queries. However, as of version 1.3, Drill can be configured to derive the column names from the header row of the CSV file. You can still configure Drill to operate in the old manner, though I can’t imagine you’d want to. With header parsing enabled, you can write queries like this for CSV files:

SELECT firstName, lastName
FROM cp.`somefile.csv`

Drill will still work with data that has no headers.  It treats this kind of data as it used to in the past.

The HTTPD log parser still hasn’t made it into a stable version, but I’m following the conversation between the developers closely and it looks like it will be included in version 1.4.

Plot.ly Now Open Source

If you are into data visualization (and what data scientist shouldn’t be?) you’ll be pleased to know that as of a few days ago, the JavaScript library Plot.ly is now completely free and open source. I teach a lot of data science classes, and data visualization is naturally one of the subjects we feature in our training. The unfortunate reality that I have encountered is that if you want to create really nice visualizations quickly, you have to either:

  1. Pay a lot of money for BI tools such as Tableau or RShiny, OR
  2. Learn to code in JavaScript and create them using D3.

It is true that several easy to use libraries such as Bokeh, Seaborn, Vincent and a few others are getting a lot better.  Also Apache Zeppelin is a promising notebook-like tool which enables quick, interactive data visualization, but I digress…

What is Plot.ly and Why Should I Care?

Plot.ly is a JavaScript framework for easily making beautiful interactive visualizations. However, you don’t actually have to know JavaScript to use it. While Plot.ly is a JavaScript library, it also has APIs for Python / Jupyter Notebooks, R, Excel and a few others. Most of this was already open source, but until last week, the JavaScript library that actually generated the visualizations was closed. No longer!
In any event, just as a quick demonstration, the code below generates a very nice interactive stacked area chart. (The code is from a Plot.ly tutorial and available here.)

 
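import cufflinks as cf
import pandas as pd
# pandas.io.data has since been removed from pandas; it now lives in the pandas-datareader package
import pandas_datareader.data as web
from datetime import datetime

start = datetime(2008, 1, 1)
end = datetime(2008, 11, 28)

# Download daily prices for each ticker from Yahoo Finance
df_gis = web.DataReader("GIS", 'yahoo', start, end)
df_fdo = web.DataReader("FDO", 'yahoo', start, end)
df_sp = web.DataReader("GSPC", 'yahoo', start, end)
# Combine the opening prices into one DataFrame, one column per stock
df = pd.DataFrame({'General Mills': df_gis.Open, 'Family Dollar Stores': df_fdo.Open, 'S&P 500': df_sp.Open})

df.head()
# Draw an interactive filled line chart; world_readable=True makes the hosted plot public
df.iplot(kind='line', fill=True,
    yTitle='Open Price', title='Top Recession Stocks',
    filename='cufflinks/stock data', world_readable=True)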
 

Here is the output for that code:
[Screenshot: the resulting interactive chart]

There is a very thorough tutorial about Plot.ly available here.   Installing Plot.ly is very easy as well.  All you have to do is:

pip install plotly

That’s it!  Enjoy!
