
The Dataist Posts

The Case for Generalist Data Scientists

I recently read an article by Daniel Tunkelang entitled Data Scientists: Generalists or Specialists? and it resonated with me.  I’ve been involved with hiring data scientists for some time now, and I also hear from a lot of recruiters about various data science jobs.  My general observation is that when companies search for data scientists, they tend to use the equation (Machine Learning = Data Science) and play down all the other skills that make up data science, such as creativity, critical thinking and data preparation.

Tunkelang writes:

Early days

Generalists add more value than specialists in a company’s early days, since you’re building most of your product from scratch and something is better than nothing. Your first classifier doesn’t have to use deep learning to achieve game-changing results. Nor does your first recommender system need to use gradient-boosted decision trees. And a simple t-test will probably serve your A/B testing needs.


Later stage

Generalists hit a wall as your products mature: they’re great at developing the first version of a data product, but they don’t necessarily know how to improve it. In contrast, machine learning specialists can replace naive algorithms with better ones and continuously tune their systems. At this stage in a company’s growth, specialists help you squeeze additional opportunity from existing systems. If you’re a Google or Amazon, those incremental improvements represent phenomenal value.

So, should you hire generalists or specialists? It really does depend—and the largest factor in your decision should be your company’s stage of maturity. But if you’re still not sure, then I suggest you favor generalists, especially if your company is still in a stage of rapid growth. Your problems are probably not as specialized as you think, and hiring generalists reduces your risk. Plus, hiring generalists allows you to give them the opportunity to learn specialized skills on the job. Everybody wins.

Read the complete post here on O’Reilly.com.  The point worth noting is that companies will need more specialized skills as their analytics mature and evolve, but in the beginning, creativity, competence and critical thinking are most likely the most important skills.  I agree with a lot of what Tunkelang writes, and I get the sense that many hiring managers believe their projects are more mature and advanced than they really are.  Thoughts?


Off Topic: How to Automate Your Gas Fireplace

Home automation is a hobby of mine, and in our new home, I really wanted to automate our Heatilator gas fireplace.  However, this isn’t as straightforward as it might seem, and I really haven’t found any good tutorials out there as to how to do this.  This tutorial will show you how to connect your fireplace to your Wink Hub or any other Z-Wave controller.  I got this working and actually found that it is one of the easier things to automate.  I really like being able to set the fireplace to go on and off on a schedule.

Safety Considerations

Before you start this project, you should be comfortable with working with wiring and electricity.  If you are not, get someone else to do this.  Secondly, you will be working with wires that run near gas lines, so multiply every safety concern by at least a factor of three.  If you don’t know what you are doing, this is not the project to figure it out.  I take no responsibility for any damage or injury that may result from this tutorial.  It goes without saying that BEFORE you start cutting wires, make sure that you have either disconnected all power, or shut off the electricity at the circuit breaker. 

The wisdom of automating a gas fireplace is also debatable.  However, I left the manual switch in place, so you can always turn off the fireplace the “old fashioned” way using the original switch.

What You Will Need:

With all that said, this really isn’t a difficult project to complete in a safe manner.  Here’s what you’ll need:

[Image: Remotec Zwave Dry Contact Fixture Module]


A Few Exciting Tool Announcements!

I hope everyone is enjoying Thanksgiving!  This week brought several new developments in data science tools that I would like to highlight.  I am a big believer in staying up to date on new tools, because you can make yourself much more efficient by making better use of what is available.  Both tools highlighted here have significant potential: one for getting at data more efficiently, the other for presenting it more effectively.

Apache Drill Releases Version 1.3

On 23 November, the Drill team released Drill version 1.3.  The complete release notes are available here, but for me, the biggest improvement is the text file header parsing.

In my opinion, one of the things Drill did very poorly in previous versions was CSV parsing.  In prior versions, when you queried a CSV file, Drill would store each row in an array called columns, so you had to index into that array and assign each column a name yourself:

SELECT columns[0] AS firstName, columns[1] as lastName
FROM cp.`somefile.csv`

This was clearly a less than optimal solution and resulted in very convoluted queries.  As of version 1.3, however, Drill can be configured to derive the column names from the header row of the CSV file.  You can still configure Drill to operate in the old manner, though I can’t imagine you’d want to.  With header parsing enabled, you can write queries like this for CSV files:

SELECT firstName, lastName
FROM cp.`somefile.csv`

Drill will still work with data that has no headers.  It treats this kind of data as it used to in the past.
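
As a rough analogy only (this is plain Python, not Drill), the difference between the two query styles resembles the difference between reading a CSV positionally and reading it with header-derived names.  The sample data and function names below are made up for illustration:

```python
import csv
import io

# Hypothetical sample data, standing in for somefile.csv
SAMPLE_CSV = "firstName,lastName\nAda,Lovelace\nAlan,Turing\n"


def rows_by_position(text):
    """Old style: every line is a positional list, like columns[0], columns[1].
    Note that the header line comes back as just another row."""
    return list(csv.reader(io.StringIO(text)))


def rows_by_header(text):
    """New style: the first line supplies the field names for every later row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


positional = rows_by_position(SAMPLE_CSV)
named = rows_by_header(SAMPLE_CSV)

print(positional[1][0])       # first data row, addressed by index
print(named[0]["firstName"])  # same value, addressed by column name
```

The positional form forces every consumer to remember which index means what, which is exactly what made the old Drill queries convoluted.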

The HTTPD log parser still hasn’t made it into a stable version, but I’m following the conversation between the developers closely and it looks like it will be included in version 1.4.

Plot.ly Now Open Source

If you are into data visualization (and what data scientist shouldn’t be?) you’ll be pleased to know that as of a few days ago, the JavaScript library Plot.ly is completely free and open source.  I teach a lot of data science classes, and data visualization is naturally a subject we feature in our training.  The unfortunate reality that I have encountered is that if you want to create really nice visualizations quickly, you either:

  1. Have to pay a lot of money for BI tools such as Tableau or RShiny.  OR
  2. Learn to code in JavaScript and create them using D3.

It is true that several easy to use libraries such as Bokeh, Seaborn, Vincent and a few others are getting a lot better.  Also Apache Zeppelin is a promising notebook-like tool which enables quick, interactive data visualization, but I digress…

What is Plot.ly and Why Should I Care?

Plot.ly is a JavaScript framework for easily making beautiful interactive visualizations, but you don’t actually have to know JavaScript to use it.  While Plot.ly is a JavaScript library, it also has APIs for Python / Jupyter Notebooks, R, Excel and a few others.  Most of this was open source, but until last week, the JavaScript library that actually generated the visualization was closed.  No longer!

In any event, just as a quick demonstration, the code below generates a very nice interactive stacked area chart. (The code is from a Plot.ly tutorial and available here.)

 
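import cufflinks as cf
import pandas as pd
# Note: in newer pandas releases this module lives in the separate pandas-datareader package
import pandas.io.data as web
from datetime import datetime

# Date range to pull
start = datetime(2008, 1, 1)
end = datetime(2008, 11, 28)

# Daily price data from Yahoo Finance for each ticker
df_gis = web.DataReader("GIS", 'yahoo', start, end)
df_fdo = web.DataReader("FDO", 'yahoo', start, end)
df_sp = web.DataReader("GSPC", 'yahoo', start, end)

# Combine the opening prices into one DataFrame, one column per series
df = pd.DataFrame({'General Mills': df_gis.Open, 'Family Dollar Stores': df_fdo.Open, 'S&P 500': df_sp.Open})

df.head()

# Draw an interactive filled line (area) chart
df.iplot(kind='line', fill=True,
    yTitle='Open Price', title='Top Recession Stocks',
    filename='cufflinks/stock data', world_readable=True)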
 

Here is the output for that code:

[Screenshot: the resulting interactive stacked area chart]

There is a very thorough tutorial about Plot.ly available here.   Installing Plot.ly is very easy as well.  All you have to do is:

pip install plotly

That’s it!  Enjoy!


Tutorial: Using Apache Zeppelin with MySQL

I’ve been playing with Apache Zeppelin for a little while now, and have been really impressed.  If you aren’t familiar with Zeppelin, it is a tool for creating interactive notebooks to visualize data.  The latest version of Zeppelin includes an interpreter for PostgreSQL, and I discovered that you can use this interpreter to connect Zeppelin to a MySQL server and quickly visualize your data.


Apache Zeppelin Releases Version 0.5.5

The developers of Apache Zeppelin just released a new version of Apache Zeppelin.  The release notes are here, but it doesn’t look like anything too exciting.  I’d really like to see an interpreter for Apache Drill that works with Zeppelin as well as either a generic ODBC/JDBC interpreter OR a MySQL interpreter.  Both would be incredibly useful.


Querying Apache Drill via the RESTful API

Here is a quick IPython notebook I wrote up which demonstrates how to execute queries in Apache Drill using Drill’s RESTful interface.  I’ve had a lot of difficulty getting Drill to “talk” to Python via JDBC and ODBC.  I think the problems are related to my computer’s configuration, but in any event, this code works.

Querying Apache Drill via RESTful Interface
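
For readers who just want the gist, here is a minimal sketch of the idea using only the Python standard library.  It is not the notebook's code.  It assumes a Drill instance on localhost with the web interface on port 8047, and it posts a JSON body with queryType and query fields to /query.json.  The function names are made up for illustration:

```python
import json
import urllib.request

# Assumption: Drill is running locally with its web interface on port 8047
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) the POST request carrying one SQL query."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_drill_response(body):
    """Pull the column names and row dictionaries out of a JSON response body."""
    result = json.loads(body)
    return result.get("columns", []), result.get("rows", [])


def run_query(sql, url=DRILL_URL):
    """Send the query to Drill and return (columns, rows). Needs a live server."""
    with urllib.request.urlopen(build_drill_request(sql, url)) as response:
        return parse_drill_response(response.read().decode("utf-8"))
```

With Drill running, calling run_query with a statement such as SELECT firstName, lastName FROM cp.`somefile.csv` should hand back the column names and a list of rows, each row being a dictionary keyed by column name.  Keeping the request building and response parsing in separate functions makes both easy to check without a server.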


Strata talk featured on ProPublica!

It looks like my Strata talk sparked some conversation and an article at ProPublica!

http://www.propublica.org/article/your-smart-home-knows-a-lot-about-you

[Image: Smart Home Knows a Lot About You]

After reflecting on the matter more, I hope that people will start to understand that these home automation devices really are data collection devices for the manufacturer of the device.  The Automatic, in my opinion, while it is a very neat device, provides little information that the driver wouldn’t already know about themselves and hence little benefit to the customer.  However, to the Automatic company, when you start aggregating this data, it provides a wealth of data to them.  Therefore, devices should have some sort of ranking as to benefit to consumer vs. benefit to company.
