I just got back from a data conference organized by the Data Council, which took place in Austin, TX. It was the first data conference I’ve attended since the pandemic started and it was really great to be able to attend and meet many colleagues from around the country and the world. I know I’ve said this many times, but I always come back from conferences feeling very inspired and usually with a wealth of new ideas. Oh, and in case you missed the title, VIRTUAL CONFERENCES SUCK! There, I said it!
I was invited to attend by one of our investors and it was really great to meet some of the people with whom I’ve been working for the last year and a half. I went to this conference with a mission: I wanted to learn how other early companies position themselves in a very crowded data space. One of the big challenges we’ve faced at DataDistillr is how we position ourselves. If you haven’t read earlier installments or read our website, DataDistillr is a tool which can connect to any data source and query data without ETL. The goal of our tool is to enable people who work with data, data scientists and data analysts, to be able to work with data without the support of a data engineering team.
When we show our tool to data scientists, they love it. Almost immediately. More importantly, they get it. No explanation necessary. (Most of the time anyway… that’s a subject for a future post). When we show it to data engineers or IT professionals, the reaction can be almost visceral, even hostile. I’ve been on several such calls when the potential customer, usually someone from a large company, starts lecturing us about how they don’t see a use case, that existing tools already meet this need (News flash… they don’t), or that their environment is SOOO complicated that our tool will never work there. I really enjoy hearing about how our tool can’t possibly work or do the things we claim it does. In all honesty, I’ve been on enough calls that when I see this happening, I mentally check out. There clearly will be no meeting of the minds, and we’re both wasting each other’s time at that point. (Don’t worry… these calls are more than balanced out by calls where people love what they see)
Pioneers and Settlers
I wanted to write about perspectives a bit because I’ve been trying to understand why we are getting this visceral reaction from data engineers but not from data scientists. This is where attending conferences is really awesome. I don’t usually attend keynotes at conferences, usually because they are useless and boring, but not this one. On the first day, Peter Wang, co-founder of Anaconda, gave a brilliant presentation where he talked about the difference between data exploration and production. Or, as he put it, pioneers vs settlers. His presentation put into words a concept which I had been really struggling to articulate.
Basically, when I did cyber data analysis, I was usually the first one to start working on a project, and frequently I was looking for things that were ambiguous at best. For example, when I worked at Deutsche Bank building analytics to discover money laundering attempts, I could start with known rules, but then I had to get creative and start looking for weird stuff. You can’t just go to your data lake and run a query like this:
SELECT * FROM data WHERE weird = true
Spoiler alert: that won’t work. You won’t find much that way. Let’s actually look at a real-world scenario: you are on a cyber team and you want to identify suspicious login attempts. One way you might approach this problem is to flag logins when the user’s mobile phone is in a different locality than their physical workstation. If you think this is a valid thing to look for, you might craft a query something like this (pseudo query):
SELECT * FROM mobile_login_logs AS mll INNER JOIN workstation_logins AS wl ON wl.emp_id = mll.emp_id
Now this might work, but we will also need to know the location of the mobile device AND the location of the workstation. This probably means that we need to bring in a third data source: workstation records. We also might want HR records to see what the employee’s regular location is. You can see that in order to answer this question, we need 3-4 different datasets. To build an analytic like this, you need someone with the creativity to think of it, but you also need the flexibility to rapidly combine these datasets in a way that they most likely were never intended to be combined.
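To make that concrete, here is a rough sketch of what the combined query might look like once all four datasets are within reach. It uses Python's built-in sqlite3 module with an in-memory database so it can be run anywhere. Only the table names mobile_login_logs and workstation_logins come from the pseudo query above; the workstation_records and hr_records tables, every column name, and the sample rows are assumptions made up for illustration. A real analytic would also compare login timestamps, which this sketch leaves out.

```python
import sqlite3

# In-memory database standing in for four separate data sources.
# All schemas and rows below are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mobile_login_logs (emp_id INTEGER, city TEXT);
CREATE TABLE workstation_logins (emp_id INTEGER, workstation_id TEXT);
CREATE TABLE workstation_records (workstation_id TEXT, city TEXT);
CREATE TABLE hr_records (emp_id INTEGER, home_city TEXT);

INSERT INTO mobile_login_logs VALUES (1, 'Austin'), (2, 'Austin');
INSERT INTO workstation_logins VALUES (1, 'WS-1'), (2, 'WS-2');
INSERT INTO workstation_records VALUES ('WS-1', 'Austin'), ('WS-2', 'New York');
INSERT INTO hr_records VALUES (1, 'Austin'), (2, 'New York');
""")

# Flag employees whose mobile device and workstation are in different
# cities, and pull in the HR "regular location" for context.
QUERY = """
SELECT mll.emp_id,
       mll.city     AS mobile_city,
       wr.city      AS workstation_city,
       hr.home_city AS regular_city
FROM mobile_login_logs AS mll
INNER JOIN workstation_logins AS wl
        ON wl.emp_id = mll.emp_id
INNER JOIN workstation_records AS wr
        ON wr.workstation_id = wl.workstation_id
LEFT JOIN hr_records AS hr
       ON hr.emp_id = mll.emp_id
WHERE mll.city <> wr.city
ORDER BY mll.emp_id
"""

suspicious = conn.execute(QUERY).fetchall()
for row in suspicious:
    print(row)
```

With this sample data, only employee 2 is flagged, because their phone is in Austin while their workstation is in New York. The point is less the query itself than the fact that it touches four datasets that probably live in four different systems.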
If these results were useful, the data person would want to deploy this to a production system, where an optimized data pipeline could be built for this data. Here’s where the difference between pioneers and settlers comes into play.
According to Peter Wang, in the data world, a pioneer is someone who is looking to maximize learning and innovation in as short a time as possible. The data pioneer needs as much speed and flexibility as the organization will allow. Now, let’s contrast this with the settler. The data settlers are the teams who run production infrastructure. The settler is looking to maximize stability and minimize risk. Settlers manage known knowns, whereas the data pioneer deals in unknown unknowns. Settlers care about things like feed stability and monitoring, data quality and the like. Pioneers just want to get their hands on the data.
You Can’t Have Settlers Without Pioneers
A constant source of frustration throughout my career has been dealing with people who didn’t understand this concept. I realized last week that the real problem was that I was not explaining this in terms that people could understand. I’ve worked as a data pioneer. I’ve done things with data that others didn’t think were possible. The reason was that they weren’t possible with a settler’s mindset.
This may seem like I am being disparaging of settlers, but I’m not. Both are necessary. You can’t have production data systems unless someone was first the pioneer who figured out what needed to be done. If you want real data innovation, you need pioneers.
I remember a very frustrating conversation I had with an exec at JP Morgan who kept asking me what use case I was trying to solve for. I kept telling her I didn’t know, because we were doing discovery, and so we went in circles.
What Does This Have to do With DataDistillr?
I’m glad you asked… Well, technically I asked, but let’s not split hairs. I cofounded DataDistillr because I wanted to build the ultimate data discovery tool. As a data pioneer, I feel that existing data discovery tools leave a lot to be desired, and data professionals often end up writing code to do what they need to do. To use an analogy, this is comparable to a mechanic forging their own wrenches because they can’t find ones that fit their needs. The fact that data scientists have to write code should be viewed as a failing of the data tools industry, not a success. Data professionals should be able to accomplish their analysis without having to build their own tools, yet building their own tools is exactly what is happening today. DataDistillr is the tool that I would have wanted 5 years ago when I was doing cyber analysis on a day-to-day basis.
Speaking of DataDistillr, we are pleased to announce our invitation-only beta. If you’re interested in kicking the tires, request an invitation.