I’ve been writing about my startup for the last five weeks and haven’t said a word about what we’re actually building, or even what my company is called. (No, it isn’t Stealth Startup.) Let’s start with the company name: it’s called DataDistillr. You can check out our website at DataDistillr.com. What are we building? We are building the ultimate data analytics tool. The goal is to make the world’s data easy to use and query. How are we going to do that, you ask? Simple. We’re going to tackle the hardest problem in data science: the data itself.
As a data scientist, imagine you could just work with any data you could get access to without having to worry about what format it is in or what system it comes from. Whether it is coming from Elasticsearch, electronic medical record (EMR) systems, APIs, Excel files, PCAP, or QuickBooks, it wouldn’t matter. You could just work with the data. What’s more, imagine you could seamlessly join these datasets without having to figure out complex transformations. Finally, imagine a tool that would let you seamlessly collaborate with your friends and colleagues. If that sounds interesting, you should sign up for early access.
The Hardest Problem in Data Science: Dealing with Data
At the risk of sounding like I’m full of it, DataDistillr radically changes the way you work with data and ultimately lets you, as the data scientist or business analyst, spend time actually creating value from your data rather than fighting with it. Let me give you an example. Let’s say someone from your C-suite comes down and asks you, “How did COVID affect our web traffic?”
To answer this question with any degree of rigor, you’ll need at least two data sources: the web access logs and some set of COVID statistics. Web access logs can be stored as log files, but in most cases they are sent to a system like Splunk or Elasticsearch to enable rapid analytics. The problem arises with the COVID data. To accomplish this task, you would have to join the web access data with current, up-to-date statistics AND find some key in the COVID data that matches up with the web access data. This is where things start to get complicated and where our tool shines.
For COVID data, you would need data that is updated regularly, and since our notional business isn’t in the COVID tracking business, we’ll need to pull this data from some remote source. For this, let’s say we use one of the many public APIs that provide COVID tracking data.
The problem here is that to answer the original question, you not only have to convert that data into a table, but you also have to find or create a common key and then join the two data sets. You will likely also have to perform some filtering, cleanup and aggregation on this data. Since all this data is not in the same system, you will also have to devise some way of bringing the data together. If I were doing this, I’d probably default to writing a Python script, as would many data scientists.
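To give a feel for what that script involves, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the log lines, the shape of the COVID API response, the field names (`date`, `new_cases`) and the numbers are all invented, and a real script would read the logs from wherever they live and call an actual API rather than use inline samples.

```python
import json
import re
from collections import Counter
from datetime import datetime

# Made-up sample inputs for illustration only (not real traffic or COVID figures).
ACCESS_LOG = """\
203.0.113.5 - - [01/Mar/2021:10:15:32 +0000] "GET / HTTP/1.1" 200 5120
203.0.113.9 - - [01/Mar/2021:11:02:10 +0000] "GET /pricing HTTP/1.1" 200 2048
198.51.100.7 - - [02/Mar/2021:09:45:00 +0000] "GET / HTTP/1.1" 200 5120
this line is malformed and should be skipped
"""

# Hypothetical API payload; a real API will have its own field names.
COVID_API_RESPONSE = json.dumps([
    {"date": "2021-03-01", "new_cases": 100},
    {"date": "2021-03-02", "new_cases": None},
    {"date": "2021-03-03", "new_cases": 80},
])

# Matches lines in the common access log layout used in the sample above.
LOG_PATTERN = re.compile(
    r'^\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) \S+$'
)


def daily_hits(log_text):
    """Count successful (2xx) requests per ISO date, skipping malformed lines."""
    hits = Counter()
    for line in log_text.splitlines():
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # cleanup: ignore lines that do not parse
        if not match.group("status").startswith("2"):
            continue  # filtering: keep successful requests only
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        hits[ts.date().isoformat()] += 1  # the date becomes the common key
    return hits


def join_on_date(hits, covid_json):
    """Inner-join daily hit counts with COVID records on the date key."""
    rows = []
    for record in json.loads(covid_json):
        day = record["date"]
        if record["new_cases"] is None or day not in hits:
            continue  # drop null values and days with no traffic data
        rows.append(
            {"date": day, "hits": hits[day], "new_cases": record["new_cases"]}
        )
    return rows


result = join_on_date(daily_hits(ACCESS_LOG), COVID_API_RESPONSE)
for row in result:
    print(row)
```

Even in this toy version, most of the lines are parsing, key creation and null handling rather than analysis.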
Controversial Statement: You shouldn’t code to explore data
My belief is that data scientists SHOULD NOT have to write code to explore data. Period.
Why, you ask? There are many reasons, but the main one is speed. I’m not talking about the speed and efficiency of the data processing, though that is an issue as well, but rather the development time. Consider the following: which will take you less time to write, a SQL query that joins several datasets, filters them, and then calculates summary stats, or comparable Python or R code?
Depending on how good a coder you are and the complexity of the datasets, things like this can take anywhere from a few minutes to many hours. There will inevitably be hiccups, such as null rows, incorrect data types, malformed data, etc., that you will have to deal with. The bottom line is that this kind of data exploration takes time, and if you are having to write scripts to explore this data, you will have to debug these scripts. What’s worse is that each of these scripts is essentially one query, so you are custom coding every single query you write, AND you probably aren’t thinking about efficiency either.
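For comparison, here is a rough sketch of the SQL side of that question. It uses an in-memory SQLite database from the Python standard library purely as a stand-in so the query can actually run; it is not DataDistillr’s engine or dialect, and the table names, column names and rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for "some engine that speaks SQL".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE web_hits (day TEXT, path TEXT, status INTEGER);
    CREATE TABLE covid_stats (day TEXT, new_cases INTEGER);
""")

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO web_hits VALUES (?, ?, ?)",
    [
        ("2021-03-01", "/", 200),
        ("2021-03-01", "/pricing", 200),
        ("2021-03-01", "/missing", 404),
        ("2021-03-02", "/", 200),
    ],
)
conn.executemany(
    "INSERT INTO covid_stats VALUES (?, ?)",
    [("2021-03-01", 100), ("2021-03-02", None), ("2021-03-03", 80)],
)

# One query does the join, the filtering and the summary statistic.
QUERY = """
    SELECT w.day, COUNT(*) AS hits, c.new_cases
    FROM web_hits AS w
    JOIN covid_stats AS c ON c.day = w.day
    WHERE w.status = 200 AND c.new_cases IS NOT NULL
    GROUP BY w.day, c.new_cases
    ORDER BY w.day
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The setup code here exists only to load the sample rows; the part an analyst would actually write is the six-line query.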
So, what’s the alternative?
How Does DataDistillr Help?
OK, at this point you’ve either stopped reading, think I’m full of it, or are mildly curious. Have no fear, brave data scientist, read on!
DataDistillr is a data exploration tool which uses what we call a data abstraction layer. The basic idea is that you connect your data sources to our tool, put a SQL query into a black box and you get data back. It does not matter what the underlying data source is because your experience is exactly the same! You also do not have to define the schema because we figure that out for you.
You want to join an Excel file with Elasticsearch or Splunk? No problem. You want to explore multi-gigabyte XML files? No problem. You want to query a random API and join it with HDF5? No problem. You want to join Google Analytics with Facebook pixel data? No problem. You want to… you get the idea. Our tool uses a superset of ANSI SQL, so if you are comfortable with SQL, you already know how to use our tool. This is great for people like me who have to use tools like Splunk but don’t want to learn proprietary query languages.
Now, we are still early, so there is a lot of functionality that we’re still building, but after using DataDistillr for a little while, you’ll find it so easy, fast and intuitive that you’ll never want to go back to writing code for these tasks.
But I don’t want a graphical tool. I like Jupyter!
So do we! In fact, we have an interface for Jupyter notebooks that allows you to pipe query results directly into a pandas data frame in a Jupyter notebook. What’s more, DataDistillr allows you to create APIs from your query results so you can easily send these results to downstream systems.
I know there is an inherent resistance among data scientists to using graphical tools at all. But I’d encourage you to ask yourself: Do I enjoy data wrangling? If the answer is no (and a recent study shows that around 75% of data scientists say it is the least favorite part of their job), what do you have to lose by using a tool that will make this task much less painful? If that sounds interesting, you should sign up for early access.
What Else Does DataDistillr Do?
Even though this wretched pandemic is slowly winding down, data scientists still struggle to collaborate with their colleagues on data projects, and this is another area in which DataDistillr can help. I’ll share a story. At one of my previous jobs, I built a machine learning model for a particular task. I used a Jupyter notebook for this and wanted to demonstrate it to my colleagues in IT, who would have to deploy it, and also to management, so they could see the project. Turns out, none of them had Jupyter, so they couldn’t open the notebook. Worse, I couldn’t save it as a PDF either because we weren’t allowed to install the necessary drivers for that…
Anyway, our tool features advanced collaboration tools which facilitate data teams working together on projects. The user experience is incredibly intuitive and you’ll find that after working with DataDistillr for a little bit, you’ll never want to email datasets to anyone again.
So that’s it! That’s my startup! I hope you’ll consider trying it out.