One of the big challenges facing many data analysts is joining disparate data sets. Recently at DataDistillr, we assisted a prospective customer with an interesting problem which I thought I’d share. In this case, the customer had data from an internal database containing URLs, and was looking to create a combined report with some data from SalesForce. The only key these data sets had in common was the domain portion of the URLs on one side and of the email addresses on the other.
These kinds of problems are exactly the kinds of issues that we built DataDistillr for, so let’s take a look at how we might accomplish this. For this post, I generated some CSVs with customer data, one of which had URLs and the other emails. Using DataDistillr, the process would be exactly the same whether the data was coming from files, a SaaS platform or a database.
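To make the join key concrete, here is a minimal sketch in plain Python of extracting the domain from a URL on one side and from an email address on the other, then joining on it. The sample rows and column names are my own invention, not the customer's actual data:

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the two CSVs described above.
crm_rows = [
    {"company": "Acme Corp", "url": "https://www.acme.com/products"},
    {"company": "Globex", "url": "http://globex.io"},
]
salesforce_rows = [
    {"contact": "jdoe@acme.com", "stage": "Closed Won"},
    {"contact": "mary@globex.io", "stage": "Prospecting"},
]

def domain_from_url(url: str) -> str:
    """Extract the host from a URL, stripping a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def domain_from_email(email: str) -> str:
    """Everything after the '@' is the domain."""
    return email.split("@", 1)[1].lower()

# Build a lookup keyed on domain, then join the two data sets on it.
by_domain = {domain_from_url(r["url"]): r for r in crm_rows}
joined = [
    {**by_domain[d], **sf}
    for sf in salesforce_rows
    if (d := domain_from_email(sf["contact"])) in by_domain
]
```

In DataDistillr itself this would be a single SQL query rather than Python, but the shape of the solution is the same: normalize both sides down to a bare domain, then join.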
Googlesheets (GS) is one of those data sources that I think most data scientists use and probably dread a little. Using GS is easy enough, but what if a client gives you data in GS? Or worse, what if they have a lot of data in GS and other data that isn’t? Personally, whenever I’ve encountered data in GS, I’ve usually just downloaded it as a CSV and worked on it from there. This works fine, but what if you have to do something that requires pulling this data programmatically? That is where it gets a lot harder. This article will serve as both a rant and a tutorial for anyone seeking to integrate GoogleSheets into their products.
I decided that it would be worthwhile to write a connector (plugin) for Apache Drill to enable Drill to read and write Google Sheets. After all, next to Excel, Google Sheets is probably one of the most common ways people store tabular data. We’ve really wanted to integrate GoogleSheets into DataDistillr as well, so this seemed like a worthy project. You can see this in action here:
So where to start?
Aha! You say… Google provides an API to access GS documents! So what’s the problem? The problem is that Google has designed what is quite possibly one of the worst SDKs I have ever seen. It is a real tour de force of terrible design, poor documentation, inconsistent naming conventions, and general WTF.
To say that this SDK was designed by a committee gives committees too much credit. It’s more like a committee that spoke one language hired a second committee that spoke another, which in turn retained development teams that never spoke with each other, each building out the API in fragments.
As you’ll see in a minute, the design decisions are horrible, but this goes far beyond bad design. The documentation, when it exists, is often sparse, incomplete, incoherent, or just plain wrong. This really could be used as an exemplar of how not to design an SDK. I remarked to my colleague James, who did the code review on the Drill side, that you can tell when I developed the various Drill components, as the comments get snarkier and snarkier.
One of the biggest challenges in data science and analytics is… well… the data. I did a podcast and the interviewer (Lee Ngo) asked me a question about challenges in data science. We were talking and I told him that it reminded me a lot of a story I heard about immigrants to America in the early 20th century. They came here thinking the streets were paved with gold, but when they arrived, they discovered that not only were the streets not paved with gold, they were not paved at all, and they were expected to pave them. What does this have to do with data?
Well, it always surprises me when new data scientists or data analysts start a project and are shocked to find the data is rubbish. Especially when organizations are early in their data journey, it is very common to have data that is extremely difficult to work with. When this happens, you can either complain about it, as many data scientists are wont to do, or you can roll up your sleeves and start cleaning.
With that said, one of the things that always surprises me is how little data-cleaning functionality most data tools offer, which in turn has spawned an entire industry of data cleaning and data quality tools that need to be bolted onto your data stack. I’ve always felt this was silly and that this is basic functionality that analytic tools should just have. So DataDistillr does! But I digress; let’s get back to the original topic of cleaning names.
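As a flavor of the kind of name cleaning being discussed, here is a tiny Python sketch. The rules (collapse whitespace, title-case, keep common particles lowercase) and the sample value are my own illustration, not DataDistillr's actual cleaning logic:

```python
import re

# Particles that conventionally stay lowercase in surnames (illustrative list).
PARTICLES = {"de", "van", "von", "da", "la"}

def clean_name(raw: str) -> str:
    """Trim, collapse internal whitespace, and normalize casing."""
    name = re.sub(r"\s+", " ", raw.strip())
    parts = [p if p in PARTICLES else p.capitalize()
             for p in name.lower().split(" ")]
    return " ".join(parts)

clean_name("  jOHN   van  DOE ")  # normalizes spacing and case
```

Real-world name cleaning gets much hairier than this (hyphenation, suffixes, non-Latin scripts), which is exactly why it belongs inside the analytics tool rather than in a bolt-on.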
My Medium feed has thankfully returned to tech articles, and one popped up that caught my attention: Data Analysis project- Using SQL to Clean and Analyse Data. I created my startup DataDistillr to help in exactly such situations, and I wondered whether we could accomplish the same tasks with less time and effort. In this article, the author takes some data from Real World Fake Data in a CSV file and does the following:
Step 1. Create a MySQL database from the CSV file
Step 2. Load the CSV file into the database
Step 3. Clean the data
Step 4. Exploratory Data Analysis
Step 5. Create a dashboard
I thought it would be a great example of how DataDistillr can accelerate your time to value simply by walking through this use case using DataDistillr.
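To give a sense of why collapsing steps 1 and 2 saves so much time: once you can point SQL at the data, cleaning and exploratory analysis become one-liners. Here is a self-contained sketch using Python's built-in sqlite3 with an in-memory table standing in for the CSV; the column names and values are invented for illustration, not taken from the Real World Fake Data set:

```python
import csv
import io
import sqlite3

# A stand-in for the CSV file (columns invented for illustration).
csv_text = """order_id,region,amount
1,East,100.50
2,West,
3,East,75.25
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")
rows = [
    (r["order_id"], r["region"], float(r["amount"]) if r["amount"] else None)
    for r in csv.DictReader(io.StringIO(csv_text))
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Step 3 (drop bad rows) and step 4 (aggregate) collapse into one query.
result = conn.execute(
    "SELECT region, COUNT(*) AS n, SUM(amount) AS total "
    "FROM sales WHERE amount IS NOT NULL "
    "GROUP BY region ORDER BY region"
).fetchall()
```

With a tool that queries the CSV in place, even the `CREATE TABLE` and `INSERT` ceremony above disappears; you go straight from file to the final `SELECT`.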
It seems I always start my posts with “it’s been a while since I last posted…”. Well, it has. The truth is that during the month of April, I did something I hadn’t done in a while: took a vacation. That was nice, but then I got COVID (not fun), which everyone in my household got. Then I went to Singapore. More on that later. I’ve also been very heads down with some coding and technical stuff relating to security on our startup. I’ll tell you all about that in part 15. Anyway… I wanted to share a long overdue update.
A week ago, I posted a LinkedIn poll where I asked the question: “How many of you use Excel or GoogleSheets as a database?” In my extremely unscientific poll, about 45% of respondents said that they did. Indeed, this poll mirrors my experience talking with customers: despite massive corporate investments in platforms like Snowflake, Splunk, data lakes, delta lakes, and lake houses, a ton of work is still done in Excel. The result is that for many organizations, a significant amount of their most valuable data isn’t in their data lake.
I just got back from a data conference organized by the Data Council, which took place in Austin, TX. It was the first data conference I’ve attended since the pandemic started and it was really great to be able to attend and meet many colleagues from around the country and the world. I know I’ve said this many times, but I always come back from conferences feeling very inspired and usually with a wealth of new ideas. Oh, and in case you missed the title, VIRTUAL CONFERENCES SUCK! There, I said it!
I was invited to attend by one of our investors and it was really great to meet some of the people with whom I’ve been working for the last year and a half. I went to this conference with a mission: I wanted to learn how other early-stage companies position themselves in a very crowded data space. One of the big challenges we’ve faced at DataDistillr is how we position ourselves. If you haven’t read earlier installments or read our website, DataDistillr is a tool which can connect to any data source and query data without ETL. The goal of our tool is to enable people who work with data, data scientists and data analysts, to be able to work with data without the support of a data engineering team.
I’ve written about this before, but as a technical CEO and Co-Founder, my days are usually filled with meetings of various types. My day starts with a daily standup about sales and growth and can take any number of directions. Mondays usually have sprint planning meetings, Tuesdays exec meetings, Thursdays meetings with investors, and so on. The unfortunate result is that I don’t have large amounts of uninterrupted time for tech work and other work that requires intense concentration.
Burst Coding: Coding for Those With No Time
Given my insane schedule, if I’m going to do any kind of technical work, I have to do it in VERY short increments of time. This flies in the face of the commonly accepted approach to software development, which holds that developers need long stretches of uninterrupted time to be productive. Since I don’t have long stretches of uninterrupted time, I had to develop a way to be productive and still sleep and spend time with my family. I call it Burst Coding, and here’s how it works.
Happy New Year! It is really hard for me to believe that a little over a year ago, I quit a high-paying job at a major bank to launch a startup. So here I am, one year later, wanting to take a look back at the last year and reflect. This has, without a doubt, been the hardest job I’ve ever had. It has also, by far, been the most rewarding. But it is definitely not for the faint of heart. I have lived and breathed DataDistillr for the last 1.25 years.
First, some updates. After building some form of a product, an early-stage startup’s goal is to achieve some evidence of product-market fit. For non-startup types, this basically means you need to prove to your investors that you’ve built some sort of product that people are willing to use and pay for. The easiest way to measure this is through what’s called annual recurring revenue, or ARR, but the dollar amount isn’t the only way to measure it. A startup’s ARR target depends on the target customer. For instance, are you selling to large enterprises or small businesses? Is your target user an individual or a company? You get the idea.
In the last post of my startup diary I talked about the challenges of prioritizing. A lot has happened over the last few weeks which has sucked up my time for writing. One thing is that I’ve started writing a book! Alas, it is not a published book, but rather the public-facing documentation for our startup! If you want to take a look: it’s available at docs.datadistillr.com. It’s a real page turner, let me tell you!!
I’ve been writing a lot about my experience as a startup founder, and what is interesting to me is the rapid pace of change that can happen in a company. In literally one day, your entire company’s fortune can change for the better or worse.
I try to keep my blog and social media presence positive, however, I recently became aware of a scam floating around LinkedIn, and I wanted to share my experience and hopefully prevent others from falling victim to these scammers.
There is a company currently called AdvisoryCloud. In the past, this company has been known as ExecRanks, TheExecRanks, and AdvisoryCloud.
How the Scam Works
Basically the company (whatever it is called these days) posts ads like the one below on LinkedIn targeting mid level managers for board positions in their local area.
Whether these jobs actually exist at all is something for the local Attorney General to investigate, but prospective customers “apply” for these positions simply by filling out a form with their contact information. Once you do that, you are contacted by a friendly representative from the company who will tell you that you can earn around $30k annually from being on an advisory board (or get equity).
However, in order to apply, you have to subscribe to their platform and pay a $200 registration fee as well as a non-refundable $195 per month for access to their platform.