ChatGPT has been all over the news for the last few months, and again with the release of GPT-4. At DataDistillr, we added a query assistant built on this technology: a user simply asks a question, and DataDistillr generates a query that answers it. I decided to write an article about what I’ve learned and my experiences with this. I started drafting this about a month ago, and last Friday there was a fairly significant development: OpenAI removed two of the models we were using.
Now candidly, I’ve never been a fan of natural language interfaces for data. When I was at Booz Allen, there was a project called Sailfish which attempted to do just that. As I recall, it was a natural language interface over data lakes. I’ll be charitable and say that it didn’t work very well. What I observed was that it worked well enough IF the data was clean, IF the question being asked was simple, and IF the question didn’t involve multiple data sources. I think it also required a lengthy onboarding process. What became apparent was that if you wanted it to get you an answer, you had to start writing statements that looked something like this:
Show me the name, average age and average purchase amount using the customers and products table for customers that were born between 1980 and 1990.
Now, if you’re like me and looking at that, you’re already thinking:
SELECT name, avg(datediff(now(), birthday)), avg(purchase)
FROM customers
JOIN orders ON customers.id = orders.customerid
WHERE birthday BETWEEN '1980-01-01' AND '1990-12-31'
GROUP BY customers.id
What I observed then was that the more complex your question, the more the request started to look like a weird SQL statement. At a certain point, every time I used it, I couldn’t help but think that there was really no point to this and that the user would be better served just learning SQL.
Fast forward a few years: ChatGPT came out, and it turns out that it actually does an amazingly good job of translating intent into SQL. So with that said, here are some observations I have after working with it for several months to write SQL queries.