Skip to content

Excel vs. The Data Lake (So I Launched a Startup Pt. 14)

It seems I always start my posts with “it’s been a while since I last posted…”. Well, it has. The truth is that during the month of April, I did something I hadn’t done in a while: took a vacation. That was nice, but then I got COVID (not fun), which everyone in my household got. Then I went to Singapore. More on that later. I’ve also been very heads down with some coding and technical stuff relating to security on our startup. I’ll tell you all about that in part 15. Anyway… I wanted to share a long overdue update.

A week ago, I posted a LinkedIn poll where I asked the question: “How many of you use Excel or GoogleSheets as a database?” In my extremely unscientific poll, about 45% of the responses said that they did. Indeed, this poll mirrors my experience in talking with customers in that despite massive corporate investments in platforms like Snowflake, Splunk, data lakes, delta lakes, lake houses, tons of work is done in Excel. The result is that for many organizations, a significant amount of their most valuable data isn’t in their data lake.

Example: Non-Profit Data Analysis

DataDistillr was working with a major Israeli non-profit organization named ADI to assist with analysis of their donor base. It turned out that while they were keeping careful track of all their donors, all this data was being tracked in Excel spreadsheets. Without getting into the gory details, they did have an Excel template which they used for all their campaigns, so all the documents were of the same basic format, and thankfully that format was relatively clean. While ADI was able to track individual campaigns and create reports on them, it was difficult to do strategic analysis across all their campaigns.

What were their options? Since all their data was stored in Excel tables that were well structured, we had a few options:

  1. Move all this data into a relational database or other centralized data store like a Snowflake or Splunk
  2. Write a collection of Python scripts to create new spread sheets of aggregate data
  3. Write VBA (eewwww) scripts and Excel macros to pull all the data together.
  4. Cut/paste everything into one massive Excel spreadsheet

If we had unlimited time and budget, option one would probably be the best choice. However, as a non-profit, ADI didn’t have an unlimited budget. Options two and three would require the constant engagement of a skilled coder, and also would be the least flexible. In situations like these, organizations often resort to option 4, which works, but isn’t really scalable or flexible. It is also very time consuming.

Creating a Virtual Data Lake

There is a fifth option which I didn’t mention. My startup, DataDistillr! One of DataDistillr’s really powerful features is not just that it can query Excel files individually, but it can also query collections of Excel files all at once. So, all we had to do is put all these files in a directory which was accessible to DataDistillr and we were off to the races! From ADI’s perspective, this was extremely cost-effective because it involved precisely zero new infrastructure

As an example, if we wanted to create a report of everyone who donated in the last two years, we could write a query like this: (Pseudo query)

SELECT donor_first_name, donor_last_name, 
AVG(donation_amount) AS avg_donation,
COUNT(donation_amount) AS donation_count,
SUM(donation_amount) AS total_amount_donated
FROM <data_directory>.`*.xlsx`
WHERE donation_date > '2020-05-31'

The above query will iterate (recursively) over every Excel file in the data directory as if it was one big file!

While data engineers will likely cringe at this and immediately call their Snowflake representative and tell them to call ADI, there were two big advantages to this approach. The first was time savings. We were able to have these queries and reports up and running in about 6 hours. Since there were no ETL pipelines to create, we didn’t have to acquire additional infrastructure or anything really. All we had to do was connect DataDistillr to their shared folder and start writing queries.

The other big advantage was that it is easy to maintain. All the non-technical users have to do is copy new files in the appropriate shared folder and they will be able to continue to keep their reports up to date. That shared folder can be anywhere including systems like Dropbox.

What’s Next for ADI?

In many data maturity models, the first steps are simply to collect and describe past events. While ADI was able to do this on a limited basis, by working with DataDistillr, we have helped to advance ADI’s data maturity to the point where they can now discover new insights from their data and ultimately make predictions. We hope to continue to work with them as they go on their data journey and gain even more value from their data.

The biggest thing is that DataDistillr did all this by using what they already had. Aside from using DataDistillr, ADI was able to gain actionable insights from their data without ETL, without creating (and maintaining) new infrastructure, or hiring data engineers.

Are You The Next ADI?

If your organization has tons of Excel files, splattered all over the place and are struggling to get value out of this, shoot me an email. We’d love for you to try out DataDistillr and see if we can help you.

Share the joy
Leave a Reply

Your email address will not be published.