Skip to content

Category: Data Visualization

Joining Difficult Data: How to Join Data on Extracted Domains (So I Launched a Startup Pt. Whatever)

One of the big challenges facing many data analysts is joining disparate data sets. Recently at DataDistillr, we assisted a prospective customer with an interesting problem which I thought I’d share. In this case, the customer had data from an internal database which had URLs in it, and was looking to create a combined report with some data from SalesForce. The only key that these data sets had in common was the domains in the URL and the domains in email addresses.

These kinds of problems are exactly the kinds of issues that we built DataDistillr for, so let’s take a look at how we might accomplish this. For this post, I generated some CSVs with customer data, one which had URLs and the other which had emails. Using DataDistillr, the process would be exactly the same whether the data was coming from a files, a SaaS platform or a database.

Leave a Comment

Explore REST APIs without Code!

One of the big challenges a data scientist faces is the amount of data that is not in convenient formats. One such format are REST APIs. In large enterprises, these are especially problematic for several reasons. Often a considerable amount of reference data is only accessible via REST API which means that to access this data, users are required to learn enough Python or R to access this data. But what if you don’t want to code?

The typical way I’ve seen this dealt with is to create a duplicate of the reference data in an analytic system such as Splunk or Elasticsearch. The problems with this approach are manifold. First, there is the engineering effort (which costs time and money) to set up the data flow. On top of that, you now have the duplicate cost of storage and now you are maintaining multiple versions of the same information.

Another way I’ve seen is for each API owner to provide a graphical interface for their API, which is good, but the issue there is that now the data is stove-piped and can’t be joined with other data, which defeats its purpose altogether. There has to be a better way…

Take a look at the video tutorial below to see a demo!

4 Comments

Public Data Still Lacking on COVID-19 Outbreak

As you are reading this, you are probably (like me) under quarantine or shelter in place due to the COVID-19 outbreak. As a data scientist who has been stuck in the house since 10 March, I wanted to take a look at the data and see what I could figure out. I’m not an epidemiologist and claim no expertise in health care, but I do know data science so please take what I am saying with a grain of salt.

Why is there no data?

My first observation is that very little data is actually being made publicly available. I am not sure why this is the case, but I spent a considerable amount of time digging through the WHO, CDC and other agencies’ websites and APIs and found little usable data. For example, the World Health Organization (WHO) posts daily situation reports with data, however the sitreps contain data, however the files are in PDF format. I attempted to extract these tables from the PDFs however this proved to be extremely difficult as the formatting was not consistent. It would be trivial to post this data in CSV, HDF5 or some other format that is conducive to data analysis, however the WHO did not choose to do that. I found generally the same situation at the other major health institutions such as the CDC.

Health related information in the United States is regulated by the Health Insurance Portability and Accountability Act (HIPAA), which imposes draconian fines and restrictions on private health information, so some of the secrecy may be due to this law.

1 Comment

Visualize Anything with Superset and Drill

Happy New Year everyone! I’ve been taking a bit of a blog break after completing Learning Apache Drillteaching a few classes, and some personal travel but I’m back now and have a lot planned for 2019! One of my long standing projects is to get Apache Drill to work with various open source visualization and data flow tools. I attended the Strata conference in San Jose in 2016 where I attended Maxime Beauchemin’s talk (slides available here) where he presented the tool then known as Caravel and I was impressed, really really impressed. I knew that my mission after the conference would be to get this tool to work with Drill. A little over two years later, I can finally declare victory. Caravel went through a lot of evolution. It is now an Apache Incubating project and the name has changed to Apache (Incubating) Superset.

UPDATE: The changes to Superset have been merged, so you can just install Superset as described on their website.

9 Comments

Tutorial: Visualizing Machine Learning Models

One of the big issues I’ve encountered in my teaching is explaining how to evaluate the performance of machine learning models.  Simply put, it is relatively trivial to generate the various performance metrics–accuracy, precision, recall, etc–if you wanted to visualize any of these metrics, there wasn’t really an easy way to do that.  Until now….

Decision Boundary

Recently, I learned of a new python library called YellowBrick, developed by Ben Bengfort at District Data Labs, that implements many different visualizations that are useful for building machine learning models and assessing their performance.   Many visualization libraries require you to write a lot of “boilerplate” code:  IE just generic and repetitive code, however what impressed me about YellowBrick is that it largely follows the scikit-learn API, and therefore if you are a regular user of scikit-learn, you’ll have no problem incorporating YellowBrick into your workflow.  YellowBrick appears to be relatively new, so there still are definitely some kinks to be worked out, but overall, this is a really impressive library.

2 Comments

A Few Exciting Tool Announcements!

I hope everyone is enjoying Thanksgiving!  This week, there were several new developments in terms of data science tools which I would like to highlight.  I am a big believer of staying up to date in terms of what new tools are being developed in that you can make yourself much more efficient by better using the available tools.  Both tools highlighted here represent significant potential in terms of being able to get data more efficiently and being able to more effectively present data.

drillLogoApache Drill Releases version 1.3.

On 23 November, the Drill team released Drill version 1.3.  The complete release notes are available here, but for me, the biggest improvement is the text file header parsing.

In my opinion, one of the things Drill did very poorly in previous versions was CSV parsing.  In prior versions, when you used Drill to query a CSV file, Drill would store each row into an array called Columns, and if you were querying a CSV file in Drill you had to use the columns array and assign each column a name:

SELECT columns[0] AS firstName, columns[1] as lastName
FROM cp.`somefile.csv`

This clearly was a less than optimal solution and results in very convoluted queries.  However, with the advent of version 1.3, Drill now can be configured to derive the column names from the original CSV file.  You can still configure drill to operate in the old manner, but I can’t imagine you’d want to, and you can write queries like this for CSV files:

SELECT firstName, lastName
FROM cp`.somefile.csv`

Drill will still work with data that has no headers.  It treats this kind of data as it used to in the past.

The HTTPD log parser still hasn’t made it into a stable version, but I’m following the conversation between the developers closely and it looks like it will be included in version 1.4.

Screen Shot 2015-11-27 at 01.23.06Plot.ly Now Open Source

If you are into data visualization (and what data scientist shouldn’t be?) you’ll be pleased to know that as of a few days ago, the JavaScript library Plot.ly is now completely free and open source.  I teach a lot of data science classes and clearly a subject which we feature in our training is data visualization.  The unfortunate reality that I have encountered is that if you want to create really nice visualizations quickly, you either:

  1. Have to pay a lot of money for BI tools such as Tableau or RShiny.  OR
  2. Learn to code in JavaScript and create them using D3.

It is true that several easy to use libraries such as Bokeh, Seaborn, Vincent and a few others are getting a lot better.  Also Apache Zeppelin is a promising notebook-like tool which enables quick, interactive data visualization, but I digress…

What is Plot.ly and Why Should I Care?

Plot.ly is a JavaScript framework for easily making beautiful interactive visualizations, however you don’t actually have to know JavaScript to use it for visualizations.  While Plot.ly is a JavaScript library, it also has APIs for Python / Jupyter Notebooks, R, Excel and a few others.  Most of this was open source, but until last week, the JavaScript library that actually generated the visualization was closed.  No Longer!
In any event, just as a quick demonstration, the code below generates a very nice interactive stacked area chart. (The code is from a Plot.ly tutorial and available here.)

 
 import cufflinks as cf
 import pandas.io.data as web
 from datetime import datetime

 start = datetime(2008, 1, 1)
 end = datetime(2008, 11, 28)

 df_gis = web.DataReader("GIS", 'yahoo', start, end)
 df_fdo = web.DataReader("FDO", 'yahoo', start, end)
 df_sp = web.DataReader("GSPC", 'yahoo', start, end)
 df = pd.DataFrame({'General Mills': df_gis.Open, 'Family Dollar Stores': df_fdo.Open, 'S&P 500': df_sp.Open})
 
df.head()
df.iplot(kind='line', fill=True,
    yTitle='Open Price', title='Top Recession Stocks',
    filename='cufflinks/stock data', world_readable')
 

Here is the output for that code:
Screen Shot 2015-11-27 at 12.45.41

There is a very thorough tutorial about Plot.ly available here.   Installing Plot.ly is very easy as well.  All you have to do is:

pip install plotly

That’s it!  Enjoy!

Leave a Comment