The Dataist

All Great Things Part 2: The Founder’s Dilemma

All Great Things Part 2: The Founder’s Dilemma

Published December 13, 2023

I recently posted an article about the demise of DataDistillr. It was painful to write and I was worried that by doing so, it would make me look very foolish. After all, I was documenting what mistakes I felt I made that caused the failure of DataDistillr. I didn’t want to point fingers at anyone other than myself, but I did want to share some lessons which I learned that hopefully will help people in their startup journey. I did get some heat from some folks in an online community which I joined. I won’t name names, or print quotes but it wasn’t kind to say the least. (One even called me a troll!!)

With that said, after reading my list, I couldn’t help but notice that these lessons are all the things that every founder is supposed to “know”. Concepts such as stay focused, hire slow, fire fast, etc. Every founder should know these things… How is it that I didn’t?

The answer is that it isn’t so simple. When you are in the thick of things, you have to make choices while juggling many different priorities, and the answer isn’t always obvious.

I’m going to present you with a situation that actually happened to us and give you the chance to decide what you would do if you were in charge. I’m changing the details slightly so that nobody gets embarrassed but the substance of the interaction is something that really happened to us.

Share the joy

1 Comment

All Great Things…

Published December 4, 2023

Well, this is the post I’d hoped to never write, but alas, we’ve reached the conclusion that it’s time to shut down DataDistillr. We gave it our best, but in the end we weren’t able to achieve product-market fit.

What Happened?

I think several things led to our demise. Basically, it can be boiled down to changing market conditions, not building the right product and not selling what we had effectively. As CEO, I own all of this and accept responsibility for all this except for changing market conditions.

What I Learned

This is the painful part to write as it kind of makes me sound like a fool…. Ok, so maybe I am a bit of a fool, but it’s very easy to look at all this in retrospect but much more difficult to do so when you’re in the middle of it all. Also, we did do a lot of things well, like building a solid product that worked really well.

Share the joy

Leave a Comment

Why You Shouldn’t Rely on GPT to Write Code

Why You Shouldn’t Rely on GPT to Write Code

Published July 26, 2023

A lot of people have tried out ChatGPT and other LLMs for code their code writing abilities. My theory was that the LLMs would be really good at writing code to do things they’ve seen before, but not so good at things that were completely new. I started my experiments by asking ChatGPT to write me a function in python to geolocate a phone number. ChatGPT 3.5 did a relatively poor job of this, so I tried again using the playground and gpt-3.5-turbo. This time it was more successful.

That’s not bad. I like that it used the phonenumbers library rather than calling some external service.

Share the joy

Leave a Comment

Tests in a GenAI World

Tests in a GenAI World

Published June 2, 2023

I teach a graduate level data management class at the University of Maryland, Baltimore County (UMBC). Let me preface this by saying that the midterm and final exams for my course are all online, open book, open notes. Despite this, and despite all my warnings about the dire consequences of plagiarism, I routinely catch students cheating on exams. I really love teaching, and it is very disheartening for me when I have to dole out the consequences for this behavior. Usually, cheating comes in the form of students copying answers from the internet without attribution or from other students. This semester was different. This semester, students had AI.

Share the joy

Leave a Comment

A Tribute to My Grandmother and the Greatest Generation

Published June 2, 2023

April was very hard for me. At the beginning of the month, my grandmother, Irene Givre, passed away at the age of 95. For those who know me, I lost my mother when I was 10 to a brain tumor and the subsequent years of my life were chaotic and tumultuous to say the least. That’s a story for another day. My grandparents really stepped up and took care of me during those times and as a result, I was closer to them than maybe most children are. As a result, I’ve always felt closer to people of that generation than my parents. For example, I grew up to the sounds of Benny Goodman, Tommy Dorsey, Ziggy Elman and Glenn Miller instead of what most people my parent’s age listened to.

Let me tell you a bit about her. My grandmother was born in the late 1920s in Manhattan. I’m not entirely sure of the chronology, but I know she lived on the Lower East Side, in Harlem and in Hell’s Kitchen. Her parents, were immigrants from the old country. Her father came from a hops farming family and escaped from Poland around WWI. Her mother Sophie came from the area known as the Pale of Settlement, on the border between what is today Russia and Belarus. My grandmother was the first generation of that side of the family born in America.

Share the joy

2 Comments

5 Things I Learned Writing SQL with Gen AI

5 Things I Learned Writing SQL with Gen AI

Published March 31, 2023

ChatGPT has been all over the news for the last few months and again with the release of GPT-4. At DataDistillr, we added a query assistant using this technology which allows a user to simply ask a question and DataDistillr will generate a query that answers that question. I decided to write an article about what I’ve learned and my experiences with this. I started drafting this an about a month ago, and last Friday, there was a fairly significant development whereby OpenAI removed two of the models we were using.

Now candidly, I’ve never been a fan of natural language interfaces for data. When I was at Booz Allen, there was a project called Sailfish which attempted to do just that. As I recall, it was a natural language interface over data lakes. I’ll be charitable and say that it didn’t work very well. What I observed was that it worked well enough IF the data was clean, IF the question being asked was simple and IF the question didn’t involve multiple data sources. I think it also required a lengthy onboarding process as well. What became apparent was that if you wanted it to get you an answer you had to start writing statements that were something like this:

Show me the name, average age and average purchase amount using the customers and products table for customers that were born between 1980 and 1990.

Now, if you’re like me and looking at that, you’re already thinking:

SELECT name, avg(datediff(now(), birthday)), avg(purchase)
FROM customers 
JOIN orders ON customers.id = orders.customerid
...
GROUP BY customers.id.

What I observed then was that the more complex a question you had , the more the request started to look like a weird SQL statement. At a certain point, every time I used it, I couldn’t help but think that there was really no point to this and that the user would be better served just learning SQL.

Fast forward a few years, and ChatGPT came out and it turns out that it actually does an amazingly good job of translating intent into SQL. So with that said, here are some observations I have after working with it for several months to write SQL queries.

Share the joy

1 Comment

Announcing Drill 1.21: New Connectors, Functions and Much Better Stability

Published February 27, 2023

The Apache Drill PMC is pleased to announce a milestone release of Apache Drill. Since the last release of Drill the team has been hard at work quashing bugs and making overall functionality improvements. The TL;DR includes the following:

New connectors including Apache Iceberg, Delta Lake, Microsoft Access, GoogleSheets, and Box
Efficient cross-cloud query capability
Greatly improved access controls to include user translation support for all storage plugins
Greatly improved query planning and implicit casting.
New BI-focused SQL operators including PIVOT, UNPIVOT, EXCEPT and INTERSECT
New functions for computing regression lines and trends.
New and updated date manipulation functions.

Overall, Drill 1.21 is much more capable and stable than previous versions.

Share the joy

2 Comments

It’s The Assumptions That Get You

It’s The Assumptions That Get You

Published February 7, 2023

I’ve had a number of conversations recently that have highlighted to me how not understanding people’s assumptions can really hamper conversations. I’m going to highlight questions from two recent conversations, one was with a VC and the other was from a grant for which we applied. My biggest frustration in all this is that how wrong assumptions on both parties prevented deals from moving forward. I assumed that the other parties understood what I understood about SQL, which was wrong. The other parties’ assumptions about what you couldn’t do with SQL led them to assume other things about DataDistillr that was also wrong. In any event, it was the assumptions that bit us both.

Today’s SQL is not What You Learned in 1996

We applied for a US Government grant which unfortunately we did not win. The feedback seemed to center around whether or not SQL was capable of dealing with multi-dimensional data. The reviewers seemed to think that this was not possible and would be extremely difficult. Here’s where the assumptions hurt. I assumed that the reviewer would know that modern SQL tools already support multidimensional data structures. The reviewer assumed the opposite based on their understanding of SQL. It was a lesson learned for me in that I should have put more explanatory language explaining current state. I wish however, that we could have spoken with the reviewer and explained this.

Yes, SQL Supports Nested Data

Unfortunately, SQL hasn’t coalesced around a solid standard for this, but many SQL based systems such as Drill, Spark, Presto, Postgres, MySQL and many others support querying nested data using SQL.

Share the joy

Leave a Comment

Hanukah of Data: Solving a Data Challenge with AI and SQL

Hanukah of Data: Solving a Data Challenge with AI and SQL

Published January 23, 2023

A colleague of mine recently gave me a data challenge called the Hanukkah of Data (https://hanukkah.bluebird.sh/about/) which has 8 challenges. I decided to try them out in DataDistillr. The challenges use a fictional data set which consists of four tables: a customer table, a product table, an order table and a table linking orders to products. The data was very representative of what a customer database might look like. I did, however make some modifications to the data to facilitate querying. The address line was not split up into city, state, zip. So I imported this data into a database, and then split these fields up into separate columns.

Overall, this was a really well done challenge and my compliments to the authors.

SPOILER ALERT

If you choose to read beyond this point, it will have my answers so if you want to try it yourself, stop right here. I warned you… Really … go no further.

Share the joy

Leave a Comment

ChatGPT, Meet DataDistillr! You’ll have lots to discuss!

ChatGPT, Meet DataDistillr! You’ll have lots to discuss!

Published January 5, 2023

Happy New Year everyone! I’m pretty excited about this. Like every other tech geek out there, I was experimenting with ChatGPT when it was announced in December of 2022.

Initially I was amazed at how well the AI appeared to work, and somewhat terrified with what people could actually do with it. I teach a database class at the University of Maryland Baltimore County (UMBC) and I was really worried that students could use ChatGPT to generate answers to essay questions on my exams. I wanted to see if there were ways of phrasing questions that would make it obvious that a person did not write them. After using ChatGPT for a while, I do think it would be possible to detect if a student was using AI to write their papers, as the quality and style are fairly distinct.

But I digress…

What really intrigued me was that these models can write SQL queries using natural language. Of course the fact that you can write a SQL query isn’t necessarily useful unless you understand the schema of the underlying data and you have a query engine or database capable of executing that query.

Well… guess what…

My team and I have been hard at work at incorporating this powerful feature into our DataDistillr. Today, I am happy to announce that we’ve added natural language AI capability to DataDistillr!

Share the joy

Leave a Comment

The Dataist Posts

What Happened?

What I Learned

Today’s SQL is not What You Learned in 1996

Yes, SQL Supports Nested Data

SPOILER ALERT