
The Dataist Posts

ChatGPT, Meet DataDistillr! You’ll have lots to discuss!

Happy New Year everyone! I’m pretty excited about this. Like every other tech geek out there, I was experimenting with ChatGPT when it was announced in December of 2022.

Initially I was amazed at how well the AI appeared to work, and somewhat terrified of what people could actually do with it. I teach a database class at the University of Maryland Baltimore County (UMBC) and I was really worried that students could use ChatGPT to generate answers to essay questions on my exams. I wanted to see if there were ways of phrasing questions that would make it obvious that a person did not write the answers. After using ChatGPT for a while, I do think it would be possible to detect if a student was using AI to write their papers, as the quality and style are fairly distinct.

But I digress…

What really intrigued me was that these models can write SQL queries using natural language. Of course the fact that you can write a SQL query isn’t necessarily useful unless you understand the schema of the underlying data and you have a query engine or database capable of executing that query.

Well… guess what…

My team and I have been hard at work incorporating this powerful feature into DataDistillr. Today, I am happy to announce that we’ve added natural language AI capability to DataDistillr!


Five Technologies That I Think Are Bullshit

This is going to piss people off. I took a road trip a few weeks ago to New York and listened to an interview with Mark Zuckerberg where he discussed the Metaverse and Meta’s plans for it. The whole time I was thinking… this is complete bullshit. I feel that in the tech world there is so much bullshit out there, that I really needed to write a post about it and share my views on the subject.

My criteria for bullshit tech are:

  • Overhyped relative to its actual usefulness
  • Overhyped relative to the current state of the technology, and unlikely to realize the vision in a reasonable amount of time
  • Unlikely to provide any real value in the near term
  • A gigantic waste of time and money

These are listed in no particular order…


We Launched! Our Beta is Now Live!! (So I Launched a Startup Part 18)

Well, that day has finally come! After months of testing and speaking with customers and investors, our public beta is finally live! Almost exactly two years ago, I quit my job at JP Morgan and launched DataDistillr, and last week we turned on our app for the world to try out. I would be honored if you tried it out. You can try it for free at https://app.datadistillr.io.

For a founder, this is really the big moment. I’ve always envisioned our product as a virtual “GitHub” for data and this was finally the moment where vision meets reality. What will people say when they use your tool? Will all that work you spent on UI flows pay off, or will people just look at your product as if it is the next Crystal Pepsi? The closest comparison I can think of is the feeling you get when you send your child out into the world.


Joining Difficult Data: How to Join Data on Extracted Domains (So I Launched a Startup Pt. Whatever)

One of the big challenges facing many data analysts is joining disparate data sets. Recently at DataDistillr, we assisted a prospective customer with an interesting problem which I thought I’d share. In this case, the customer had data from an internal database which had URLs in it, and was looking to create a combined report with some data from Salesforce. The only key the two data sets had in common was the domain: one had it inside URLs, the other inside email addresses.

These kinds of problems are exactly the kinds of issues that we built DataDistillr for, so let’s take a look at how we might accomplish this. For this post, I generated some CSVs with customer data, one with URLs and the other with emails. Using DataDistillr, the process would be exactly the same whether the data was coming from files, a SaaS platform or a database.
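To make the idea concrete, here is a minimal sketch of the underlying technique in plain Python. It is not how DataDistillr does it. The function names, the `url` and `email` column names, and the sample rows are all hypothetical, made up for illustration. The sketch pulls the domain out of each URL and each email address, normalizes both, and then joins the rows on that shared key.

```python
from urllib.parse import urlparse


def domain_from_url(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    # urlparse only recognizes the host when a scheme or '//' is present,
    # so prepend '//' to bare values such as 'example.com/page'.
    if "//" not in url:
        url = "//" + url
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_from_email(email):
    """Return the lowercased text after the last '@', or '' if there is none."""
    _, sep, domain = email.rpartition("@")
    return domain.strip().lower() if sep else ""


def join_on_domain(url_rows, email_rows):
    """Inner-join two lists of dicts on the domain extracted from each row."""
    # Index the email rows by domain so each URL row is a single lookup.
    by_domain = {}
    for row in email_rows:
        key = domain_from_email(row["email"])
        if key:
            by_domain.setdefault(key, []).append(row)

    joined = []
    for row in url_rows:
        key = domain_from_url(row["url"])
        for match in by_domain.get(key, []):
            joined.append({**row, **match})
    return joined


# Hypothetical sample rows standing in for the two CSV files.
url_rows = [
    {"company": "Acme", "url": "https://www.acme.test/pricing"},
    {"company": "Globex", "url": "globex.test/about"},
]
email_rows = [
    {"contact": "Jo", "email": "Jo@ACME.test"},
    {"contact": "Sam", "email": "sam@other.test"},
]

report = join_on_domain(url_rows, email_rows)
```

Normalizing both sides the same way (lowercasing, dropping `www.`) is what makes the join work. Without it, `www.acme.test` and `ACME.test` would never match.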


5 Ways Google Sheets SDK Could be better. A Tutorial on How to Integrate with Google Sheets. (Startup Part 17)

Google Sheets (GS) is one of those data sources that I think most data scientists use and probably dread a little. Using GS is easy enough, but what if a client gives you data in GS? Or worse, what if they have a lot of data in GS and other data that isn’t? Personally, whenever I’ve encountered data in GS, I’ve usually just downloaded it as a CSV and worked on it from there. This works fine, but what if you have to do something that requires you to pull this data programmatically? This is where it gets a lot harder. This article will serve as both a rant and tutorial for anyone who is seeking to integrate Google Sheets into their products.

I decided that it would be worthwhile to write a connector (plugin) for Apache Drill to enable Drill to read and write Google Sheets. After all, next to Excel, Google Sheets is probably one of the most common ways people store tabular data. We’ve really wanted to integrate Google Sheets into DataDistillr as well, so this seemed like a worthy project. You can see this in action here:

So where to start?

Aha! You say… Google provides an API to access GS documents! So what’s the problem? The problem is that Google has designed what is quite possibly one of the worst SDKs I have ever seen. It is a real tour de force of terrible design, poor documentation, inconsistent naming conventions, and general WTF.

To say that this was an SDK designed by a committee is giving too much credit to committees. It’s more like a committee that spoke one language hired a second committee that spoke another language, which in turn retained development teams that would never speak with each other to build out the API in fragments.

As you’ll see in a minute, the design decisions are horrible, but this goes far beyond bad design. The documentation, when it exists, is often sparse, incomplete, incoherent or just plain wrong. This really could be used as an exemplar of how not to design SDKs. I remarked to my colleague James who did the code review on the Drill side, that you can tell when I developed the various Drill components as the comments get snarkier and snarkier.

Let’s begin on this tour of awfulness.


What’s in a name? How to Split and Enrich People’s Names. (So I launched a Startup, Pt: 16)

One of the biggest challenges in data science and analytics is… well… the data. I did a podcast and the interviewer (Lee Ngo) asked me a question about challenges in data science. We were talking and I told him that it reminded me a lot of a story I heard about immigrants to America in the early 20th century. They came here thinking the streets were paved with gold, but when they arrived, they discovered that not only were the streets not paved with gold, they were not paved at all, and they were expected to pave them. What does this have to do with data?

Well, it always surprises me when new data scientists or data analysts are surprised when they start a project and the data is rubbish. Especially when organizations are early in their data journey, it is very common to have data that is extremely difficult to work with. When this happens, you can either complain about it, as many data scientists are wont to do, or you can roll up your sleeves and start cleaning.

With that said, one of the things that always surprises me is how little data cleaning functionality most data tools offer, which in turn has spawned an entire industry of data cleaning or data quality tools which need to be bolted onto your data stack. I’ve always felt this was silly and that this is basic functionality that analytic tools should just have. So DataDistillr does! But I digress, let’s get back to the original topic of cleaning names.


Using DataDistillr to Clean and Analyze Data (So I launched a Startup: Pt. 15)

My Medium feed thankfully has returned to tech articles and one popped up that caught my attention: Data Analysis project- Using SQL to Clean and Analyse Data. I created my startup DataDistillr to help in such situations, and I wondered whether we could accomplish the same tasks with less time and effort. In this article, the author takes some data from Real World Fake Data in a CSV file and does the following:

  • Step 1. Create a MySQL database from the CSV file
  • Step 2. Load the CSV file into the database
  • Step 3. Clean the data
  • Step 4. Exploratory Data Analysis
  • Step 5. Create a dashboard

I thought it would be a great example of how DataDistillr can accelerate your time to value simply by walking through this use case using DataDistillr.


Excel vs. The Data Lake (So I Launched a Startup Pt. 14)

It seems I always start my posts with “it’s been a while since I last posted…”. Well, it has. The truth is that during the month of April, I did something I hadn’t done in a while: took a vacation. That was nice, but then I got COVID (not fun), which everyone in my household got. Then I went to Singapore. More on that later. I’ve also been very heads down with some coding and technical stuff relating to security on our startup. I’ll tell you all about that in part 15. Anyway… I wanted to share a long overdue update.

A week ago, I posted a LinkedIn poll where I asked the question: “How many of you use Excel or GoogleSheets as a database?” In my extremely unscientific poll, about 45% of the respondents said that they did. Indeed, this poll mirrors my experience in talking with customers: despite massive corporate investments in platforms like Snowflake and Splunk, and in data lakes, delta lakes and lake houses, tons of work is still done in Excel. The result is that for many organizations, a significant amount of their most valuable data isn’t in their data lake.


Data Pioneers vs. Data Settlers (Startup Pt. 13)

I just got back from a data conference organized by the Data Council, which took place in Austin, TX. It was the first data conference I’ve attended since the pandemic started and it was really great to be able to attend and meet many colleagues from around the country and the world. I know I’ve said this many times, but I always come back from conferences feeling very inspired and usually with a wealth of new ideas. Oh, and in case you missed the title, VIRTUAL CONFERENCES SUCK! There, I said it!

I was invited to attend by one of our investors and it was really great to meet some of the people with whom I’ve been working for the last year and a half. I went to this conference with a mission: I wanted to learn how other early companies position themselves in a very crowded data space. One of the big challenges we’ve faced at DataDistillr is how we position ourselves. If you haven’t read earlier installments or read our website, DataDistillr is a tool which can connect to any data source and query data without ETL. The goal of our tool is to enable people who work with data, data scientists and data analysts, to be able to work with data without the support of a data engineering team.


So I Launched a Startup, Pt 12: MLOps and Burst Coding

I’ve written about this before but as a technical CEO and Co-Founder, my days are usually filled with meetings of various types. My day starts with a daily standup about sales and growth and can take any number of directions. Mondays usually have sprint planning meetings, Tuesdays have exec meetings, Thursdays have meetings with investors, and so on. The unfortunate result is that I don’t have large amounts of uninterrupted time for tech work and other work that requires intense concentration.

Burst Coding: Coding for Those With No Time

Given my insane schedule, if I’m going to do any kind of technical work, it means that I have to do it in VERY short increments of time. This approach flies in the face of the commonly accepted view of software development, which is that developers need long stretches of uninterrupted time to be productive. Given that I don’t have long stretches of uninterrupted time, I had to develop a way to be productive and still sleep and spend time with my family. I call it Burst Coding and here’s how it works.
