Well, that day has finally come! After months of testing and speaking with customers and investors, our public beta is live! Almost exactly two years ago, I quit my job at JP Morgan and launched DataDistillr, and last week we turned on our app for the world to try. I would be honored if you gave it a spin. You can use it for free at https://app.datadistillr.io.
For a founder, this is really the big moment. I’ve always envisioned our product as a virtual “GitHub” for data, and this was finally the moment where vision meets reality. What will people say when they use your tool? Will all that work you spent on UI flows pay off, or will people just look at your product as if it is the next Crystal Pepsi? The closest comparison I can think of is the feeling you get when you send your child out into the world.
What Can You Do With DataDistillr?
A lot, actually! DataDistillr is like GitHub for your data. You can connect to virtually any data source (in real time), query it, join it with other datasets, clean it and so much more. While we’re far from done building it, my goal is to build the tool I would have wanted when I was doing data analytics on a day-to-day basis.
It still honestly shocks me how much more quickly you can do things with DataDistillr than by writing code or using other tools.
What Are the Use Cases?
If you’ve read my posts, you know I really don’t like this question, but I’m going to answer it anyway. DataDistillr can be used for anything that involves data. Where it really shines is in situations where you want to develop end-to-end analytics that involve multiple sources. For example, let’s say that you are an e-commerce business and you want to develop analytics that pull data from your marketing, sales and other platforms. Without DataDistillr, you would have to pull this data, then figure out a way to merge it (which often involves writing code). In many cases, smaller businesses simply don’t have the technical resources to do that. That’s where DataDistillr comes in. Our tool enables users to pull these complex data sets together without having to set up infrastructure or build complex data pipelines.
Ultimately, this means that users can get value from data much more quickly and efficiently. Here’s a link to our documentation and getting started tutorial: https://docs.datadistillr.com/getting-started/.
What Took So Long?
Well… one word: security. DataDistillr is based on an open source (OSS) query engine (Apache Drill). While Drill is really powerful, what it wasn’t was secure. To be fair, I don’t think that Drill was any better or worse than the other federated query engines out there, but let me explain a bit. We wanted our platform to enable a user to connect to their data and query it. The first issue is that you have to store credentials securely. By default, most federated query engines do a pretty poor job of this, so we added a HashiCorp Vault integration to OSS Drill. This enables user secrets to be stored in encrypted form. There are a lot of other tangential benefits, but we’ll get to those later.
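To make the idea concrete, here is a minimal sketch of keeping credentials out of the connection config and looking them up only when a query runs. It is not the actual Drill and Vault integration: `InMemorySecretStore`, `resolve_connection`, the `secret_path` key and the example URL are all hypothetical names, and the in-memory store is only a stand-in for a real secrets manager that would encrypt what it holds.

```python
class InMemorySecretStore:
    """Hypothetical stand-in for a secrets manager such as Vault.

    A real secrets manager encrypts what it stores; this one only
    keeps secrets in a dict so the example is runnable.
    """

    def __init__(self):
        self._secrets = {}

    def write(self, path, secret):
        # Store a copy so later changes by the caller do not leak in.
        self._secrets[path] = dict(secret)

    def read(self, path):
        # Return a copy so callers cannot mutate the stored secret.
        return dict(self._secrets[path])


def resolve_connection(config, store):
    """Return a new dict where the secret reference is replaced by credentials.

    The saved config only carries `secret_path` (an assumed key name),
    never the username or password themselves.
    """
    resolved = {k: v for k, v in config.items() if k != "secret_path"}
    resolved.update(store.read(config["secret_path"]))
    return resolved


store = InMemorySecretStore()
store.write("users/userA/mydb", {"username": "alice", "password": "s3cret"})

# What gets saved with the connection: a pointer to the secret, not the secret.
saved_config = {
    "url": "jdbc:postgresql://db.example.com/mydb",
    "secret_path": "users/userA/mydb",
}

# Credentials are filled in only at the moment the connection is used.
live_config = resolve_connection(saved_config, store)
```

The point of the design is that anything that can read the saved connection config learns where the secret lives, but not what it is.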
Securing Access To Data Sources
So now that the credentials are secure, we also had to make sure the connections to the data were secure. This is the really tricky part. Let’s say that userA has a database called mydb. First of all, we have to make sure that userA can access it with their credentials. Most other federated query engines require users to use service accounts to access resources. While this is fine for some use cases, there are many where that just won’t work. DataDistillr uses what we call user translation and user impersonation. User impersonation is where our query engine executes queries on file systems using the active user’s credentials. This approach simplifies access controls because the controls on the downstream file system are preserved and nothing new is really needed. This capability already existed in open source Drill, but we had to integrate it with our tech stack.
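As a rough illustration of the difference between a service account and impersonation, the sketch below models a file system with per-file read permissions. It is a toy model, not Drill’s implementation: `FILE_READERS`, `effective_user`, `can_read` and the `svc_query` account are hypothetical names invented for this example.

```python
# Hypothetical file permissions: which users may read each file.
FILE_READERS = {
    "/data/sales.csv": {"userA", "svc_query"},
    "/data/payroll.csv": {"userB", "svc_query"},
}


def effective_user(active_user, impersonation, service_account="svc_query"):
    """Pick the identity the engine uses when it touches the file system."""
    return active_user if impersonation else service_account


def can_read(path, active_user, impersonation):
    """Check the file's own permissions against the effective identity."""
    user = effective_user(active_user, impersonation)
    return user in FILE_READERS.get(path, set())


# With a shared service account, userA can read payroll data,
# because the check is made against the service account.
service_account_result = can_read("/data/payroll.csv", "userA", impersonation=False)

# With impersonation, the file system's existing permissions decide.
impersonated_result = can_read("/data/payroll.csv", "userA", impersonation=True)
```

In this model, nothing new has to be configured for impersonation to work: the permissions that already exist on the files are the access controls.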
What didn’t exist in OSS Drill was the same capability for non-file systems like databases. In these cases, OSS Drill only allowed users to use service accounts to access these systems. For an enterprise, that really isn’t a good approach because it means that they have to create global credentials for their data. In general, it is better for security when users access data sources using their own credentials. To accomplish this, we added support for user translation to Drill.
User translation enables an administrator to set up a connection to a data source, but then requires each user to provide their own credentials for that data source. This means that if the organization administrator sets up a connection to a database, userA will have to add their credentials to it before they can do anything with the database. What’s more important is that userA’s queries will run with userA’s permissions and userB’s queries will run with userB’s permissions. We also had to build a good amount of UI around this to make the process flow easily. I’ve seen many tools (Splunk cough cough) that do a really poor job of access management. What happens in those tools is that userA creates a dashboard and shares it with userB. UserB doesn’t realize that they aren’t seeing the same numbers as userA because the two have different permissions. Stuff like that…
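Here is a small sketch of the user translation idea, under the assumption that a shared connection simply keeps a separate set of credentials for each user. The class and function names (`SharedConnection`, `run_query`, `MissingCredentialsError`), the database account names and the example URL are hypothetical, and the query is not actually executed; the function only reports which database account it would run as.

```python
from dataclasses import dataclass, field


class MissingCredentialsError(Exception):
    """Raised when a user has not added credentials for a connection."""


@dataclass
class SharedConnection:
    # The administrator defines the connection once, without credentials.
    name: str
    url: str
    # Each user adds their own credentials, keyed by platform username.
    user_credentials: dict = field(default_factory=dict)

    def add_credentials(self, user, db_user, db_password):
        self.user_credentials[user] = (db_user, db_password)

    def credentials_for(self, user):
        if user not in self.user_credentials:
            raise MissingCredentialsError(
                f"{user} has not added credentials for {self.name}"
            )
        return self.user_credentials[user]


def run_query(conn, user, sql):
    """Look up the calling user's own credentials before running anything."""
    db_user, _db_password = conn.credentials_for(user)
    # A real engine would open a database session here with these
    # credentials; the sketch only reports who the query would run as.
    return f"{sql} [as {db_user}]"


# The admin creates the connection; userA then adds their own credentials.
mydb = SharedConnection("mydb", "jdbc:postgresql://db.example.com/mydb")
mydb.add_credentials("userA", "alice_db", "secret-a")
```

Calling `run_query(mydb, "userA", ...)` uses userA’s database account, while the same call for userB raises `MissingCredentialsError` until userB adds credentials of their own.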
But that wasn’t even the hardest part…
The Shared Platform
The hardest part, and the part that we kept proprietary, was segregating access to data sources. Let me explain a bit here as well. Federated query engines don’t really have the notion of user groups. For an individual organization, this usually isn’t an issue. Where it is an issue is on our shared platform. If I’m creating a shared platform, I need a way to enable different users to create data sources, but make sure that users outside their organization have no access to those data sources. Drill uses a lot of shared resources, so it was quite tricky to make this work while maintaining a stable Drill cluster. I won’t bore you with the details, but suffice it to say, this was the most complicated piece, by far. But… we did it!
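The requirement itself is easy to state in code, even though making it work inside a shared Drill cluster is not. The sketch below only illustrates the requirement and says nothing about how DataDistillr actually implements it: `DataSourceRegistry`, `AccessDenied` and the organization names are hypothetical.

```python
class AccessDenied(Exception):
    """Raised when an organization asks for a data source it does not own."""


class DataSourceRegistry:
    """Toy registry where every data source belongs to one organization."""

    def __init__(self):
        # Keyed by (organization, name), so two organizations can both
        # have a source called "mydb" without seeing each other's.
        self._sources = {}

    def register(self, org, name, config):
        self._sources[(org, name)] = config

    def visible_to(self, org):
        """List only the data source names owned by this organization."""
        return sorted(name for (owner, name) in self._sources if owner == org)

    def resolve(self, org, name):
        """Return the config only if this organization owns the source."""
        try:
            return self._sources[(org, name)]
        except KeyError:
            # Same error whether the source is missing or owned by someone
            # else, so callers cannot probe for other tenants' sources.
            raise AccessDenied(f"{name} is not available to {org}") from None


registry = DataSourceRegistry()
registry.register("acme", "mydb", {"url": "jdbc:postgresql://acme.example.com/mydb"})
registry.register("globex", "mydb", {"url": "jdbc:postgresql://globex.example.com/mydb"})
registry.register("globex", "sales", {"url": "jdbc:postgresql://globex.example.com/sales"})
```

Here each organization resolves `mydb` to its own database, and `registry.resolve("acme", "sales")` raises `AccessDenied` because `sales` belongs to a different organization.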
Please Give It a Try!
So if you’ve read this far, congrats! I really do want to thank you for your interest in DataDistillr. Please check us out at https://app.datadistillr.com.