Skip to content

Category: Startup

5 Things I Learned Writing SQL with Gen AI

ChatGPT has been all over the news for the last few months and again with the release of GPT-4. At DataDistillr, we added a query assistant using this technology which allows a user to simply ask a question and DataDistillr will generate a query that answers that question. I decided to write an article about what I’ve learned and my experiences with this. I started drafting this an about a month ago, and last Friday, there was a fairly significant development whereby OpenAI removed two of the models we were using.

Now candidly, I’ve never been a fan of natural language interfaces for data. When I was at Booz Allen, there was a project called Sailfish which attempted to do just that. As I recall, it was a natural language interface over data lakes. I’ll be charitable and say that it didn’t work very well. What I observed was that it worked well enough IF the data was clean, IF the question being asked was simple and IF the question didn’t involve multiple data sources. I think it also required a lengthy onboarding process as well. What became apparent was that if you wanted it to get you an answer you had to start writing statements that were something like this:

Show me the name, average age and average purchase amount using the customers and products table for customers that were born between 1980 and 1990.

Now, if you’re like me and looking at that, you’re already thinking:

SELECT name, avg(datediff(now(), birthday)), avg(purchase)
FROM customers 
JOIN orders ON customers.id = orders.customerid
...
GROUP BY customers.id.

What I observed then was that the more complex a question you had , the more the request started to look like a weird SQL statement. At a certain point, every time I used it, I couldn’t help but think that there was really no point to this and that the user would be better served just learning SQL.

Fast forward a few years, and ChatGPT came out and it turns out that it actually does an amazingly good job of translating intent into SQL. So with that said, here are some observations I have after working with it for several months to write SQL queries.

1 Comment

It’s The Assumptions That Get You

I’ve had a number of conversations recently that have highlighted to me how not understanding people’s assumptions can really hamper conversations. I’m going to highlight questions from two recent conversations, one was with a VC and the other was from a grant for which we applied. My biggest frustration in all this is that how wrong assumptions on both parties prevented deals from moving forward. I assumed that the other parties understood what I understood about SQL, which was wrong. The other parties’ assumptions about what you couldn’t do with SQL led them to assume other things about DataDistillr that was also wrong. In any event, it was the assumptions that bit us both.

Today’s SQL is not What You Learned in 1996

We applied for a US Government grant which unfortunately we did not win. The feedback seemed to center around whether or not SQL was capable of dealing with multi-dimensional data. The reviewers seemed to think that this was not possible and would be extremely difficult. Here’s where the assumptions hurt. I assumed that the reviewer would know that modern SQL tools already support multidimensional data structures. The reviewer assumed the opposite based on their understanding of SQL. It was a lesson learned for me in that I should have put more explanatory language explaining current state. I wish however, that we could have spoken with the reviewer and explained this.

Yes, SQL Supports Nested Data

Unfortunately, SQL hasn’t coalesced around a solid standard for this, but many SQL based systems such as Drill, Spark, Presto, Postgres, MySQL and many others support querying nested data using SQL.

Leave a Comment

Hanukah of Data: Solving a Data Challenge with AI and SQL

A colleague of mine recently gave me a data challenge called the Hanukkah of Data (https://hanukkah.bluebird.sh/about/) which has 8 challenges. I decided to try them out in DataDistillr. The challenges use a fictional data set which consists of four tables: a customer table, a product table, an order table and a table linking orders to products. The data was very representative of what a customer database might look like. I did, however make some modifications to the data to facilitate querying. The address line was not split up into city, state, zip. So I imported this data into a database, and then split these fields up into separate columns.

Overall, this was a really well done challenge and my compliments to the authors.

SPOILER ALERT

If you choose to read beyond this point, it will have my answers so if you want to try it yourself, stop right here. I warned you… Really … go no further.

Leave a Comment

ChatGPT, Meet DataDistillr! You’ll have lots to discuss!

Happy New Year everyone! I’m pretty excited about this. Like every other tech geek out there, I was experimenting with ChatGPT when it was announced in December of 2022.

Initially I was amazed at how well the AI appeared to work, and somewhat terrified with what people could actually do with it. I teach a database class at the University of Maryland Baltimore County (UMBC) and I was really worried that students could use ChatGPT to generate answers to essay questions on my exams. I wanted to see if there were ways of phrasing questions that would make it obvious that a person did not write them. After using ChatGPT for a while, I do think it would be possible to detect if a student was using AI to write their papers, as the quality and style are fairly distinct.

But I digress…

What really intrigued me was that these models can write SQL queries using natural language. Of course the fact that you can write a SQL query isn’t necessarily useful unless you understand the schema of the underlying data and you have a query engine or database capable of executing that query.

Well… guess what…

My team and I have been hard at work at incorporating this powerful feature into our DataDistillr. Today, I am happy to announce that we’ve added natural language AI capability to DataDistillr!

Leave a Comment

We Launched! Our Beta is Now Live!! (So I Launched a Startup Part 18)

Well, that day has finally come! After months of testing, speaking with customers and investors, our public beta is finally live! Almost exactly two years ago, I quit my job at JP Morgan, and launched DataDistillr and last week, we turned on our app for the world to try out. I would be honored if you tried it out. You can try it for free at https://app.datadistillr.io.

For a founder, this is really the big moment. I’ve always envisioned our product as a virtual “github” for data and this was finally the moment where vision meets reality. What will people say when they use your tool? Will all that work you spent on UI flows pay off or will people just look at your product as if it is the next Crystal Pepsi or something similar. The closest comparison I can think of is the feeling you get when you send your child out into the world.

Leave a Comment

Joining Difficult Data: How to Join Data on Extracted Domains (So I Launched a Startup Pt. Whatever)

One of the big challenges facing many data analysts is joining disparate data sets. Recently at DataDistillr, we assisted a prospective customer with an interesting problem which I thought I’d share. In this case, the customer had data from an internal database which had URLs in it, and was looking to create a combined report with some data from SalesForce. The only key that these data sets had in common was the domains in the URL and the domains in email addresses.

These kinds of problems are exactly the kinds of issues that we built DataDistillr for, so let’s take a look at how we might accomplish this. For this post, I generated some CSVs with customer data, one which had URLs and the other which had emails. Using DataDistillr, the process would be exactly the same whether the data was coming from a files, a SaaS platform or a database.

Leave a Comment

5 Ways Google Sheets SDK Could be better. A Tutorial on How to Integrate with Google Sheets. (Startup Part 17)

Googlesheets (GS) is one of those data sources that I think most data scientists use and probably dread a little. Using GS is easy enough, but what if a client gives you data in GS? Or worse, what if they have a lot of data in GS and other data that isn’t? Personally, whenever I’ve encountered data in GS, I’ve usually just downloaded it as a CSV and worked on it from there. This works fine, but if you have to do something that requires you to pull this data programmatically? This is where it gets a lot harder. This article will serve as both a rant and tutorial for anyone who is seeking to integrate GoogleSheets into their products.

I decided that it would be worthwhile to write a connector (plugin) for Apache Drill to enable Drill to read and write Google Sheets. After all, after Excel, they are probably one of the most common ways people store tabular data. We’ve really wanted to integrate GoogleSheets into DataDistillr as well, so this seemed like a worthy project. You can see this in action here:

So where to start?

Aha! You say… Google provides an API to access GS documents! So what’s the problem? The problem is that Google has designed what is quite possibly one of the worst SDKs I have ever seen. It is a real tour de force of terrible design, poor documentation, inconsistent naming conventions, and general WTF.

To say that this was an SDK designed by a committee is giving too much credit to committees. It’s more like a committee who spoke one language, hired a second committee which spoke another language to retain development teams which would never speak with each other to build out the API in fragments.

As you’ll see in a minute, the design decisions are horrible, but this goes far beyond bad design. The documentation, when it exists, is often sparse, incomplete, incoherent or just plain wrong. This really could be used as an exemplar of how not to design SDKs. I remarked to my colleague James who did the code review on the Drill side, that you can tell when I developed the various Drill components as the comments get snarkier and snarkier.

Let’s begin on this tour of awfulness.

Leave a Comment

What’s in a name? How to Split and Enrich People’s Names. (So I launched a Startup, Pt: 16)

One of the biggest challenges in data science and analytics is… well… the data. I did a podcast and the interviewer (Lee Ngo) asked me a question about challenges in data science. We were talking and I told him that it reminded me a lot of a story I heard about immigrants to America in the early 20th century. They came here thinking the streets were paved with gold, but when they arrived, they discovered that not only were the streets not paved with gold, they were not paved at all, and they were expected to pave them. What does this have to do with data?

Well, it always surprises me when new data scientists or data analysts are surprised when they start a project and the data is rubbish. Especially when organizations are early in their data journey, it is very common to have data that is extremely difficult to work with. When this happens, you can either complain about it, as many data scientists are wont to do, or you can roll up your sleeves and start cleaning.

With that said, one of the things that always surprises me is how little data cleaning is often offered in most data tools, which in turn has spawned an entire industry of data cleaning or data quality tools which need to be bolted into your data stack. I’ve always felt this was silly and that this is a basic functionality that analytic tools should just have. So DataDistillr does! But I digress, let’s get back to the original topic of cleaning names.

Leave a Comment

Using DataDistillr to Clean and Analyze Data (So I launched a Startup: Pt. 15)

My Medium feed thankfully has returned to tech articles and one popped up that caught my attention: Data Analysis project- Using SQL to Clean and Analyse Data. I created my startup DataDistillr to help in such situations and I was wondering could we accomplish the same tasks in less time and effort. In this article, the author takes some data from Real World Fake Data in a CSV file and does the following:

  • Step 1. Create a MySQL database from the CSV file
  • Step 2. Load the CSV file into the database
  • Step 3. Clean the data
  • Step 4. Exploratory Data Analysis
  • Step 5. Create a dashboard

I thought it would be a great example of how DataDistillr can accelerate your time to value simply by walking through this use case using DataDistillr.

Leave a Comment

Excel vs. The Data Lake (So I Launched a Startup Pt. 14)

It seems I always start my posts with “it’s been a while since I last posted…”. Well, it has. The truth is that during the month of April, I did something I hadn’t done in a while: took a vacation. That was nice, but then I got COVID (not fun), which everyone in my household got. Then I went to Singapore. More on that later. I’ve also been very heads down with some coding and technical stuff relating to security on our startup. I’ll tell you all about that in part 15. Anyway… I wanted to share a long overdue update.

A week ago, I posted a LinkedIn poll where I asked the question: “How many of you use Excel or GoogleSheets as a database?” In my extremely unscientific poll, about 45% of the responses said that they did. Indeed, this poll mirrors my experience in talking with customers in that despite massive corporate investments in platforms like Snowflake, Splunk, data lakes, delta lakes, lake houses, tons of work is done in Excel. The result is that for many organizations, a significant amount of their most valuable data isn’t in their data lake.

Leave a Comment