Category: New Data Science Tools

Announcing the First Release of Griffon: A Virtual Environment for Data Science

My colleagues Austin Taylor and Melissa Kilby are proud to announce the first stable release of Griffon: A Virtual Machine for Data Science. Griffon is a virtual machine that comes with many data science tools installed, pre-configured, and linked together, so you don’t have to be a Linux expert to try them out. If you are teaching a class, or simply want to learn more about a particular tool, Griffon is perfect for you.

You can download Griffon here: https://github.com/gtkcyber/griffon-vm.

The Best Of Both Worlds: Joining Online And Local Datasets With Apache Drill

data.world is rapidly establishing itself as the premier site for data scientists and analysts to host and collaborate on datasets. I have been impressed with data.world’s growth and interested in starting to use the platform in my professional projects.  On data.world, datasets can be open and visible to the general public or they can be private, with visibility limited to select contributors. That is sufficient to guarantee the privacy of the data most of the time. However, in some cases, you may be explicitly prohibited from uploading data to the cloud.

Would it be possible to use data.world in a project even when part of the data must not live in the cloud? 

It didn’t take me long to answer that question. Fortunately, I also have been doing a meaningful amount of experimentation and development with Apache Drill over the last few years. What impresses me about Drill is its versatility and potential to dramatically increase analytic productivity, open up previously inaccessible data sources, query across data silos, and do so with the common language of ANSI SQL.

As I began experimenting with both, I couldn’t help but wonder if it might be possible to somehow combine the two.

Well, it turns out, it is…

The Biggest Problem in Data Science and How to Fix It

Imagine you have some process in your organization’s workflow that consumes 50% to 90% of your staff’s time and contributes no value to the end result. If you work in data science or data analytics, you don’t have to imagine it, because I’ve just described what is, in my view, the biggest problem in advanced analytics today: the Extract/Transform/Load (ETL) process. This range doesn’t come from thin air. Studies from a few years ago, from various sources, concluded that data scientists were spending between 50% and 90% of their time preparing their data for analysis. (Examples: Forbes, DatascienceCentral, New York Times.) Furthermore, 76% of data scientists consider data preparation the least enjoyable part of their job, according to Forbes.

If you go to any trade show and walk the expo halls, you’ll see the latest whiz-bang tools to “leverage big data analytics in the cloud”, and you’ll be awed by some amazing visualization tools.  Many of these new products can do amazing things… once they have data, which brings us back to our original problem…that in order to use the latest whiz-bang tool you still have to invest considerable amounts of time in data prep.  The tools seem to skip that part, and focus on the final 10% of the process.

Drill UDF List

I’ve been working on developing some custom functions for Drill, known as User Defined Functions (UDFs), and I realized that there really should be a repository for Drill UDFs. I’ve decided to create a page with links to all the UDFs that I know of. I’ll keep this updated, so if you have Drill UDFs that you want to share, please email me a link and I’ll put it up.

A Few Exciting Tool Announcements!

I hope everyone is enjoying Thanksgiving! This week there were several new developments in data science tools that I would like to highlight. I am a big believer in staying up to date on new tools, because you can make yourself much more efficient by making better use of what is available. Both tools highlighted here have significant potential to help you get data more efficiently and present it more effectively.

Apache Drill Releases Version 1.3

On 23 November, the Drill team released Drill version 1.3.  The complete release notes are available here, but for me, the biggest improvement is the text file header parsing.

In my opinion, one of the things Drill did very poorly in previous versions was CSV parsing. In prior versions, when you queried a CSV file, Drill would store each row in an array called columns, so you had to index into the columns array and assign each column a name yourself:

SELECT columns[0] AS firstName, columns[1] as lastName
FROM cp.`somefile.csv`

This was clearly a less than optimal solution, and it resulted in very convoluted queries. As of version 1.3, however, Drill can be configured to derive the column names from the header row of the CSV file. You can still configure Drill to operate in the old manner, though I can’t imagine you’d want to. With header parsing enabled, you can write queries like this for CSV files:

SELECT firstName, lastName
FROM cp.`somefile.csv`
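
Header parsing is switched on in the text format section of a storage plugin’s configuration, which is JSON. The sketch below builds such a fragment in Python and prints it. It assumes the extractHeader option from Drill’s text format documentation; the extension and delimiter values are illustrative and should be checked against your own plugin configuration.

```python
import json

# Assumed fragment of a storage plugin's "formats" section.
# "extractHeader" is the text-format option that tells Drill to take
# column names from the first line of each matching file.
csv_format = {
    "csv": {
        "type": "text",
        "extensions": ["csv"],   # illustrative: file extensions this format applies to
        "delimiter": ",",
        "extractHeader": True,
    }
}

# json.dumps renders Python's True as JSON true
config_json = json.dumps(csv_format, indent=2)
print(config_json)
```

Setting the option back to false should restore the older columns array behavior.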

Drill will still work with data that has no headers.  It treats this kind of data as it used to in the past.
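
For readers who think in Python rather than SQL, the difference between the two modes resembles the difference between csv.reader and csv.DictReader in the standard library. This is only an analogy, not how Drill is implemented, and the sample data is made up for illustration.

```python
import csv
import io

# Made-up sample data standing in for somefile.csv
sample = "firstName,lastName\nAda,Lovelace\nGrace,Hopper\n"

# Positional access: every row is a plain list, much like Drill's columns array.
# The header line comes back as an ordinary row, so it is skipped by hand.
rows = list(csv.reader(io.StringIO(sample)))
positional = [(row[0], row[1]) for row in rows[1:]]

# Header-derived access: the first line supplies the field names.
named = [
    (row["firstName"], row["lastName"])
    for row in csv.DictReader(io.StringIO(sample))
]

print(positional)
print(named)
```

Both lists hold the same values; only the way a column is addressed changes.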

The HTTPD log parser still hasn’t made it into a stable version, but I’m following the conversation between the developers closely and it looks like it will be included in version 1.4.

Plot.ly Now Open Source

If you are into data visualization (and what data scientist shouldn’t be?) you’ll be pleased to know that as of a few days ago, the JavaScript library Plot.ly is completely free and open source. I teach a lot of data science classes, and data visualization is naturally a subject we feature in our training. The unfortunate reality I have encountered is that if you want to create really nice visualizations quickly, you have to either:

  1. Pay a lot of money for BI tools such as Tableau or RShiny, OR
  2. Learn to code in JavaScript and create them using D3.

It is true that several easy to use libraries such as Bokeh, Seaborn, Vincent and a few others are getting a lot better.  Also Apache Zeppelin is a promising notebook-like tool which enables quick, interactive data visualization, but I digress…

What is Plot.ly and Why Should I Care?

Plot.ly is a JavaScript framework for easily making beautiful interactive visualizations, but you don’t actually have to know JavaScript to use it. While Plot.ly is a JavaScript library, it also has APIs for Python / Jupyter Notebooks, R, Excel and a few others. Most of this was already open source, but until last week the JavaScript library that actually generates the visualization was closed. No longer!

In any event, just as a quick demonstration, the code below generates a very nice interactive stacked area chart. (The code is from a Plot.ly tutorial and available here.)

 
import cufflinks as cf
import pandas as pd
# Note: later pandas releases moved this module to the separate pandas-datareader package
import pandas.io.data as web
from datetime import datetime

# Date range to download
start = datetime(2008, 1, 1)
end = datetime(2008, 11, 28)

# Daily price data for each ticker from Yahoo
df_gis = web.DataReader("GIS", 'yahoo', start, end)
df_fdo = web.DataReader("FDO", 'yahoo', start, end)
df_sp = web.DataReader("GSPC", 'yahoo', start, end)

# Combine the opening prices into one DataFrame, one column per series
df = pd.DataFrame({'General Mills': df_gis.Open, 'Family Dollar Stores': df_fdo.Open, 'S&P 500': df_sp.Open})

df.head()
df.iplot(kind='line', fill=True,
    yTitle='Open Price', title='Top Recession Stocks',
    filename='cufflinks/stock data', world_readable=True)
 

Here is the output for that code:
[Screenshot: the interactive stacked area chart, "Top Recession Stocks"]
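
As a rough illustration of why the Python API needs no JavaScript on your part: the language bindings describe a chart as a JSON figure (a list of traces plus a layout), and the JavaScript library renders it. The sketch below builds such a figure by hand using only the standard library. The prices are made up, and the attribute names (type, mode, fill, layout title) are assumed from Plot.ly’s figure reference, so treat it as a sketch rather than what cufflinks actually emits.

```python
import json

# Made-up opening prices, for illustration only
dates = ["2008-01-02", "2008-01-03", "2008-01-04"]
series = {
    "Stock A": [10.0, 11.0, 12.0],
    "Stock B": [5.0, 4.0, 6.0],
}

traces = []
running_total = [0.0] * len(dates)
for name, values in series.items():
    # Stack each series on top of the running total of the ones before it
    running_total = [total + value for total, value in zip(running_total, values)]
    traces.append({
        "type": "scatter",
        "mode": "lines",
        "name": name,
        "x": dates,
        "y": running_total,
        # First trace fills down to zero; later traces fill down to the previous trace
        "fill": "tozeroy" if not traces else "tonexty",
    })

figure = {
    "data": traces,
    "layout": {
        "title": {"text": "Stacked area sketch"},
        "yaxis": {"title": {"text": "Open Price"}},
    },
}

figure_json = json.dumps(figure, indent=2)
print(figure_json)
```

The stacking here is done by hand with cumulative sums, which keeps the example independent of any particular library version.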

There is a very thorough tutorial about Plot.ly available here.   Installing Plot.ly is very easy as well.  All you have to do is:

pip install plotly

That’s it!  Enjoy!

Tutorial: Using Apache Zeppelin with MySQL

I’ve been playing with Apache Zeppelin for a little while now, and have been really impressed.  If you aren’t familiar with Zeppelin,  it is a tool for creating interactive notebooks to visualize data.  With the latest version, Zeppelin includes an interpreter for PostgreSQL and I discovered that you can use this interpreter to connect Zeppelin to a MySQL server and quickly visualize your data.

Apache Zeppelin Releases Version 0.5.5

The developers of Apache Zeppelin just released a new version. The release notes are here, but it doesn’t look like anything too exciting. I’d really like to see an interpreter for Apache Drill, as well as either a generic ODBC/JDBC interpreter or a MySQL interpreter. Both would be incredibly useful.
