ChatGPT has been all over the news for the last few months and again with the release of GPT-4. At DataDistillr, we added a query assistant using this technology which allows a user to simply ask a question and DataDistillr will generate a query that answers that question. I decided to write an article about what I’ve learned and my experiences with this. I started drafting this an about a month ago, and last Friday, there was a fairly significant development whereby OpenAI removed two of the models we were using.
Now candidly, I’ve never been a fan of natural language interfaces for data. When I was at Booz Allen, there was a project called Sailfish which attempted to do just that. As I recall, it was a natural language interface over data lakes. I’ll be charitable and say that it didn’t work very well. What I observed was that it worked well enough IF the data was clean, IF the question being asked was simple and IF the question didn’t involve multiple data sources. I think it also required a lengthy onboarding process as well. What became apparent was that if you wanted it to get you an answer you had to start writing statements that were something like this:
Show me the name, average age and average purchase amount using the customers and products table for customers that were born between 1980 and 1990.
Now, if you’re like me and looking at that, you’re already thinking:
SELECT name, avg(datediff(now(), birthday)), avg(purchase)
FROM customers
JOIN orders ON customers.id = orders.customerid
...
GROUP BY customers.id.
What I observed then was that the more complex a question you had , the more the request started to look like a weird SQL statement. At a certain point, every time I used it, I couldn’t help but think that there was really no point to this and that the user would be better served just learning SQL.
Fast forward a few years, and ChatGPT came out and it turns out that it actually does an amazingly good job of translating intent into SQL. So with that said, here are some observations I have after working with it for several months to write SQL queries.
1. It’s not as easy as you’d think
On the one hand, it is very easy to get a GenAI model to generate a SQL query. You may even get lucky and that query may actually work. DataDistillr uses Apache Drill as our query engine, and unfortunately, the GenAI from OpenAI does a pretty poor job of generating queries in Drill’s dialect of SQL. This is to be expected as the various LLMs are trained by scraping the Internet for examples of SQL queries. There simply aren’t as many examples of Drill queries as say MySQL, so it is to be expected that GPT models will do much better producing MySQL or Postgres queries than Drill.
The good news here is that I was able to write a microservice that processes the GPT output from a random SQL dialect into Drill’s dialect. That pretty much guarantees that the the query will run. What I also found was that GPT would fail in rather predictable ways.
One of the very tricky parts of Drill queries come into play with files in the FROM clause. Let’s say that you have a MySQL database, the FROM
clause will look something like this: mysql.db.table
. This may or may not be escaped with backticks. In either case, escaping these is easy because you can split the text by the dots and then enclose the pieces in backticks. With Drill, you may have files which have an extension and the queries that GPT would generate would have something like this:
FROM `myfiles`.`somefile`.`csv`
The issue here is that it’s actually difficult to know whether the data source SHOULD in fact have three pieces enclosed by backticks or whether it is a file and should be escaped differently. We solved this issue, but this is one of many little gotchas that I found when using GPT to generate SQL queries.
There are security issues
In my experiments, I found that ChatGPT would occasionally generate SQL queries WITH OTHER RANDOM PEOPLE’s file paths in the FROM clause. This was a prime example of a rather serious security issue with ChatGPT. Again, it shouldn’t come as a surprise that it would do that, but it really isn’t good, especially if you can figure out where the query came from with some creative Google searches.
2. Natural Language isn’t precise enough
Many companies in the past have attempted to build natural language interfaces over data. Candidly, none have really caught on. It’s worth asking why that is the case. Is it a limit of the technology, or is this something that actually can be done?
I think the first question is what can we realistically expect the AI to be able to do? To really understand what I’m getting at, let’s actually take the situation in reverse. Let’s say that I have a complex dataset, and I want to write a statement that would describe it. How would you do that? We’re working with a customer who is looking to analyze donor data for campaigns and they calculate about 16 different metrics for each donor. I can’t even being to think about what question I’d ask to generate that data set. The best I could do is actually describe the data that I want. This leads me to point number three.
3. You have to know how to ask for what you want.
The more I’ve done this, the more I realize that if you have a simple statement, GPT works really well at generating SQL that answer the question. For instance:
- Which customers bought 10 products in 2022?
- Which products were sold the most?
- What cities have the most customers?
Questions like this are easy to translate into SQL with very little ambiguity. However, more complex questions or statements can be very difficult to distill into a question or statement like that. What I’ve found works best is describing the data that you want instead of asking a question. For instance, instead of: “Which stocks had the biggest change throughout the day and what was the range?” you’d say something like: “Make me a report with the day’s starting price, ending price, max price, min price and trading volume, broken down by stock symbol. Don’t forget to include… ” Statements like these are much more likely to get the desired result.
4. Feedback is Important
I’m a big believer that working with data is an iterative process. What I’ve observed is that this view holds true when using GenAI to generate SQL queries. When I tell a data scientist about what our tool does, they’ll often ask how does it create really complex queries. I’ve come to realize that GenAI generally can’t do that on the first try, not because of a lack of capability but rather because it’s very difficult to actually express an idea in English that requires a super complex query in one shot.
Yes it can be done, and in general, I’ve found that GenAI will make a reasonable guess at it. I have successfully created queries using GenAI in one shot that have involved 5+ tables. So, yes it can do that, but the next step for DataDistillr is to make the experience more iterative and conversational. I believe that users will get more success that way.
5. You still have to know your data
I don’t know if I’d call this a strength or a weakness, but in general, GenAI will generate a SQL query no matter what you ask it. Now what this means is that if you want to get the best results you still have to have an understanding of what is in your data. For instance, let’s say that you have a database of customer information, and you ask a question about customers you’ll likely get a relevant and correct query. Now, if you ask the GenAI a question about the weather, the GenAI will usually generate a query. Clearly, that query won’t work or be relevant for your data, but it will still generate one. Rarely will GenAI just throw an error and essentially say “I can’t generate a query with that information.”
The other thing is that it does help to have some understanding of how your data is labeled, especially if your data uses specialized terminology. For instance, we work with an insurance broker, and in their data, they refer to their sales representatives as producers. Their data reflects this as well. Thus if I ask GenAI a question like “Which sales reps sold the most auto policies in Q1”, the AI will fail. But… if I ask “Which producers sold the most auto policies in Q1 2022”, I’ll get what I was looking for.
The other thing is that you have to be really precise in what you ask. For instance if I ask “Which product sold the most in 2022”, I will get a different query than if I ask: “Which products sold the most in 2022.”
So where’s all this going?
I grow tired of reading that XXX tech is the future, but if ever there was a technology that is the future, GenAI is it. Like any other technology, it will be overhyped and much BS will be generated. With DataDistillr, I think our use case for GenAI is very interesting and compelling. Since our tool has a way of abstracting a user’s data, what it means is that GenAI can translate a user’s intent into an action, regardless of where their data is or what format it is in.
In plain English, it’s like Alexa for your data. You can literally just ask “Hey DataDistillr, how many widgets did we sell last year?” or “Which states had the most customers?” or whatever and DataDistillr will get you data that answers that question. If you want to try, you can sign up for free at https://app.datadistillr.io.
Your content provide a very well information on SQL with Gen AI. Thanks for posting.