The Apache Drill PMC is pleased to announce a milestone release of Apache Drill. Since the last release of Drill the team has been hard at work quashing bugs and making overall functionality improvements. The TL;DR includes the following:
New connectors including Apache Iceberg, Delta Lake, Microsoft Access, GoogleSheets, and Box
Efficient cross-cloud query capability
Greatly improved access controls to include user translation support for all storage plugins
Greatly improved query planning and implicit casting.
New BI-focused SQL operators including PIVOT, UNPIVOT, EXCEPT and INTERSECT
New functions for computing regression lines and trends.
New and updated date manipulation functions.
Overall, Drill 1.21 is much more capable and stable than previous versions.
I’ve had a number of conversations recently that have highlighted to me how not understanding people’s assumptions can really hamper conversations. I’m going to highlight questions from two recent conversations, one was with a VC and the other was from a grant for which we applied. My biggest frustration in all this is that how wrong assumptions on both parties prevented deals from moving forward. I assumed that the other parties understood what I understood about SQL, which was wrong. The other parties’ assumptions about what you couldn’t do with SQL led them to assume other things about DataDistillr that was also wrong. In any event, it was the assumptions that bit us both.
Today’s SQL is not What You Learned in 1996
We applied for a US Government grant which unfortunately we did not win. The feedback seemed to center around whether or not SQL was capable of dealing with multi-dimensional data. The reviewers seemed to think that this was not possible and would be extremely difficult. Here’s where the assumptions hurt. I assumed that the reviewer would know that modern SQL tools already support multidimensional data structures. The reviewer assumed the opposite based on their understanding of SQL. It was a lesson learned for me in that I should have put more explanatory language explaining current state. I wish however, that we could have spoken with the reviewer and explained this.
Yes, SQL Supports Nested Data
Unfortunately, SQL hasn’t coalesced around a solid standard for this, but many SQL based systems such as Drill, Spark, Presto, Postgres, MySQL and many others support querying nested data using SQL.
A colleague of mine recently gave me a data challenge called the Hanukkah of Data (https://hanukkah.bluebird.sh/about/) which has 8 challenges. I decided to try them out in DataDistillr. The challenges use a fictional data set which consists of four tables: a customer table, a product table, an order table and a table linking orders to products. The data was very representative of what a customer database might look like. I did, however make some modifications to the data to facilitate querying. The address line was not split up into city, state, zip. So I imported this data into a database, and then split these fields up into separate columns.
Overall, this was a really well done challenge and my compliments to the authors.
If you choose to read beyond this point, it will have my answers so if you want to try it yourself, stop right here. I warned you… Really … go no further.
Happy New Year everyone! I’m pretty excited about this. Like every other tech geek out there, I was experimenting with ChatGPT when it was announced in December of 2022.
Initially I was amazed at how well the AI appeared to work, and somewhat terrified with what people could actually do with it. I teach a database class at the University of Maryland Baltimore County (UMBC) and I was really worried that students could use ChatGPT to generate answers to essay questions on my exams. I wanted to see if there were ways of phrasing questions that would make it obvious that a person did not write them. After using ChatGPT for a while, I do think it would be possible to detect if a student was using AI to write their papers, as the quality and style are fairly distinct.
But I digress…
What really intrigued me was that these models can write SQL queries using natural language. Of course the fact that you can write a SQL query isn’t necessarily useful unless you understand the schema of the underlying data and you have a query engine or database capable of executing that query.
Well… guess what…
My team and I have been hard at work at incorporating this powerful feature into our DataDistillr. Today, I am happy to announce that we’ve added natural language AI capability to DataDistillr!
This is going to piss people off. I took a road trip a few weeks ago to New York and listened to an interview with Mark Zuckerberg where he discussed the Metaverse and Meta’s plans for it. The whole time I was thinking… this is complete bullshit. I feel that in the tech world there is so much bullshit out there, that I really needed to write a post about it and share my views on the subject.
My criteria for bullshit tech are:
Over hyped in relationship to actual usefulness
Over hyped in terms of current state of technology and unlikely to realize the vision in a reasonable amount of time.
Unlikely to provide any real value in the near term
Well, that day has finally come! After months of testing, speaking with customers and investors, our public beta is finally live! Almost exactly two years ago, I quit my job at JP Morgan, and launched DataDistillr and last week, we turned on our app for the world to try out. I would be honored if you tried it out. You can try it for free at https://app.datadistillr.io.
For a founder, this is really the big moment. I’ve always envisioned our product as a virtual “github” for data and this was finally the moment where vision meets reality. What will people say when they use your tool? Will all that work you spent on UI flows pay off or will people just look at your product as if it is the next Crystal Pepsi or something similar. The closest comparison I can think of is the feeling you get when you send your child out into the world.
One of the big challenges facing many data analysts is joining disparate data sets. Recently at DataDistillr, we assisted a prospective customer with an interesting problem which I thought I’d share. In this case, the customer had data from an internal database which had URLs in it, and was looking to create a combined report with some data from SalesForce. The only key that these data sets had in common was the domains in the URL and the domains in email addresses.
These kinds of problems are exactly the kinds of issues that we built DataDistillr for, so let’s take a look at how we might accomplish this. For this post, I generated some CSVs with customer data, one which had URLs and the other which had emails. Using DataDistillr, the process would be exactly the same whether the data was coming from a files, a SaaS platform or a database.
Googlesheets (GS) is one of those data sources that I think most data scientists use and probably dread a little. Using GS is easy enough, but what if a client gives you data in GS? Or worse, what if they have a lot of data in GS and other data that isn’t? Personally, whenever I’ve encountered data in GS, I’ve usually just downloaded it as a CSV and worked on it from there. This works fine, but if you have to do something that requires you to pull this data programmatically? This is where it gets a lot harder. This article will serve as both a rant and tutorial for anyone who is seeking to integrate GoogleSheets into their products.
I decided that it would be worthwhile to write a connector (plugin) for Apache Drill to enable Drill to read and write Google Sheets. After all, after Excel, they are probably one of the most common ways people store tabular data. We’ve really wanted to integrate GoogleSheets into DataDistillr as well, so this seemed like a worthy project. You can see this in action here:
So where to start?
Aha! You say… Google provides an API to access GS documents! So what’s the problem? The problem is that Google has designed what is quite possibly one of the worst SDKs I have ever seen. It is a real tour de force of terrible design, poor documentation, inconsistent naming conventions, and general WTF.
To say that this was an SDK designed by a committee is giving too much credit to committees. It’s more like a committee who spoke one language, hired a second committee which spoke another language to retain development teams which would never speak with each other to build out the API in fragments.
As you’ll see in a minute, the design decisions are horrible, but this goes far beyond bad design. The documentation, when it exists, is often sparse, incomplete, incoherent or just plain wrong. This really could be used as an exemplar of how not to design SDKs. I remarked to my colleague James who did the code review on the Drill side, that you can tell when I developed the various Drill components as the comments get snarkier and snarkier.
One of the biggest challenges in data science and analytics is… well… the data. I did a podcast and the interviewer (Lee Ngo) asked me a question about challenges in data science. We were talking and I told him that it reminded me a lot of a story I heard about immigrants to America in the early 20th century. They came here thinking the streets were paved with gold, but when they arrived, they discovered that not only were the streets not paved with gold, they were not paved at all, and they were expected to pave them. What does this have to do with data?
Well, it always surprises me when new data scientists or data analysts are surprised when they start a project and the data is rubbish. Especially when organizations are early in their data journey, it is very common to have data that is extremely difficult to work with. When this happens, you can either complain about it, as many data scientists are wont to do, or you can roll up your sleeves and start cleaning.
With that said, one of the things that always surprises me is how little data cleaning is often offered in most data tools, which in turn has spawned an entire industry of data cleaning or data quality tools which need to be bolted into your data stack. I’ve always felt this was silly and that this is a basic functionality that analytic tools should just have. So DataDistillr does! But I digress, let’s get back to the original topic of cleaning names.
My Medium feed thankfully has returned to tech articles and one popped up that caught my attention: Data Analysis project- Using SQL to Clean and Analyse Data. I created my startup DataDistillr to help in such situations and I was wondering could we accomplish the same tasks in less time and effort. In this article, the author takes some data from Real World Fake Data in a CSV file and does the following:
Step 1. Create a MySQL database from the CSV file
Step 2. Load the CSV file into the database
Step 3. Clean the data
Step 4. Exploratory Data Analysis
Step 5. Create a dashboard
I thought it would be a great example of how DataDistillr can accelerate your time to value simply by walking through this use case using DataDistillr.