I hope everyone is enjoying Thanksgiving! This week brought several new developments in data science tools that I would like to highlight. I am a big believer in staying up to date on new tools, because using the available tools well can make you much more efficient. Both tools highlighted here have real potential to help you get at data more efficiently and present it more effectively.
Apache Drill Releases Version 1.3
On 23 November, the Drill team released Drill version 1.3. The complete release notes are available here, but for me, the biggest improvement is the text file header parsing.
In my opinion, one of the things Drill did very poorly in previous versions was CSV parsing. When you queried a CSV file, Drill stored each row in an array called columns, so you had to reference every field by its index and assign it a name yourself:
SELECT columns[0] AS firstName, columns[1] as lastName
FROM cp.`somefile.csv`
This was clearly a less-than-optimal solution and resulted in very convoluted queries. As of version 1.3, however, Drill can be configured to derive the column names from the header row of the CSV file. You can still configure Drill to operate in the old manner, though I can’t imagine you’d want to. With header parsing enabled, you can write queries like this for CSV files:
SELECT firstName, lastName
FROM cp.`somefile.csv`
Drill will still work with data that has no headers. It treats such files as it did before, exposing each row through the columns array.
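To make the difference between the two access styles concrete, here is a small Python analogy. It is not Drill code: it uses Python's standard csv module, and the sample data is made up to stand in for somefile.csv. The first function picks fields by position, the way the columns array works, and the second derives field names from the first line of the file.

```python
import csv
import io

# Hypothetical sample data standing in for somefile.csv.
CSV_TEXT = "firstName,lastName\nAda,Lovelace\nGrace,Hopper\n"


def read_by_position(text):
    """Old style: every row is an array and fields are picked by index."""
    rows = list(csv.reader(io.StringIO(text)))
    # Without header handling the header line is just another row,
    # so it has to be skipped by hand and the names assigned manually.
    return [{"firstName": r[0], "lastName": r[1]} for r in rows[1:]]


def read_by_header(text):
    """New style: field names are derived from the first line of the file."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


by_position = read_by_position(CSV_TEXT)
by_header = read_by_header(CSV_TEXT)
print(by_header)
```

Both functions return the same records, but only the positional version has to hard-code which index holds which field, which is exactly the bookkeeping the header parsing removes.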
The HTTPD log parser still hasn’t made it into a stable version, but I’m following the conversation between the developers closely and it looks like it will be included in version 1.4.
Plot.ly Now Open Source
If you are into data visualization (and what data scientist shouldn’t be?) you’ll be pleased to know that as of a few days ago, the JavaScript library Plot.ly is now completely free and open source. I teach a lot of data science classes, and data visualization is a subject we always feature in our training. The unfortunate reality that I have encountered is that if you want to create really nice visualizations quickly, you have to either:
- Pay a lot of money for BI tools such as Tableau or RShiny, OR
- Learn to code in JavaScript and create them using D3.
It is true that several easy-to-use libraries, such as Bokeh, Seaborn, Vincent and a few others, are getting a lot better. Apache Zeppelin is also a promising notebook-like tool which enables quick, interactive data visualization, but I digress…
What is Plot.ly and Why Should I Care?
Plot.ly is a JavaScript framework for easily making beautiful interactive visualizations, but you don’t actually have to know JavaScript to use it. While Plot.ly itself is a JavaScript library, it also has APIs for Python / Jupyter Notebooks, R, Excel and a few others. Most of this was already open source, but until last week, the JavaScript library that actually generated the visualizations was closed. No longer!
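Part of the reason one JavaScript library can sit behind so many language APIs is that a chart is described as plain data: a list of traces plus a layout, which the JavaScript side then renders. The sketch below builds such a description by hand using only Python's standard library. The helper name area_trace, the series names and the numbers are all made up for illustration, and in practice the Python API builds this structure for you.

```python
import json


def area_trace(name, x, y):
    """Build one filled line trace as a plain dict (hypothetical helper)."""
    return {
        "type": "scatter",   # line charts are scatter traces drawn in 'lines' mode
        "mode": "lines",
        "fill": "tozeroy",   # shade the area between the line and y = 0
        "name": name,
        "x": x,
        "y": y,
    }


# Made-up dates and prices, for illustration only.
dates = ["2008-01-02", "2008-01-03", "2008-01-04"]

figure = {
    "data": [
        area_trace("Series A", dates, [10.0, 11.5, 11.0]),
        area_trace("Series B", dates, [20.0, 19.0, 21.5]),
    ],
    "layout": {
        "title": "Example Area Chart",
        "yaxis": {"title": "Open Price"},
    },
}

# Serialize the figure description to JSON.
figure_json = json.dumps(figure, indent=2)
print(figure_json)
```

Nothing here draws a chart. The point is only that the figure is ordinary JSON, which is why the same rendering library can be driven from Python, R or anything else that can produce it.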
In any event, just as a quick demonstration, the code below generates a very nice interactive stacked area chart. (The code is from a Plot.ly tutorial and available here.)
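# Note: in newer pandas versions, pandas.io.data has moved to the separate
# pandas-datareader package (import pandas_datareader.data as web).
import cufflinks as cf
import pandas as pd
import pandas.io.data as web
from datetime import datetime
# Date range covering most of 2008
start = datetime(2008, 1, 1)
end = datetime(2008, 11, 28)
# Download daily price data from Yahoo Finance for each ticker
df_gis = web.DataReader("GIS", 'yahoo', start, end)
df_fdo = web.DataReader("FDO", 'yahoo', start, end)
df_sp = web.DataReader("GSPC", 'yahoo', start, end)
# Combine the opening prices into one DataFrame, one column per series
df = pd.DataFrame({'General Mills': df_gis.Open, 'Family Dollar Stores': df_fdo.Open, 'S&P 500': df_sp.Open})
df.head()
# Draw an interactive filled line chart and make it publicly viewable
df.iplot(kind='line', fill=True,
         yTitle='Open Price', title='Top Recession Stocks',
         filename='cufflinks/stock data', world_readable=True)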
Here is the output for that code:
There is a very thorough tutorial about Plot.ly available here. Installing Plot.ly is very easy as well. All you have to do is:
pip install plotly
That’s it! Enjoy!