It’s The Assumptions That Get You

I’ve had a number of conversations recently that have highlighted to me how not understanding people’s assumptions can really hamper conversations. I’m going to highlight questions from two recent conversations, one was with a VC and the other was from a grant for which we applied. My biggest frustration in all this is that how wrong assumptions on both parties prevented deals from moving forward. I assumed that the other parties understood what I understood about SQL, which was wrong. The other parties’ assumptions about what you couldn’t do with SQL led them to assume other things about DataDistillr that was also wrong. In any event, it was the assumptions that bit us both.

Today’s SQL is not What You Learned in 1996

We applied for a US Government grant which unfortunately we did not win. The feedback seemed to center around whether or not SQL was capable of dealing with multi-dimensional data. The reviewers seemed to think that this was not possible and would be extremely difficult. Here’s where the assumptions hurt. I assumed that the reviewer would know that modern SQL tools already support multidimensional data structures. The reviewer assumed the opposite based on their understanding of SQL. It was a lesson learned for me in that I should have put more explanatory language explaining current state. I wish however, that we could have spoken with the reviewer and explained this.

Yes, SQL Supports Nested Data

Unfortunately, SQL hasn’t coalesced around a solid standard for this, but many SQL based systems such as Drill, Spark, Presto, Postgres, MySQL and many others support querying nested data using SQL.

I’m most familiar with the dialect for Apache Drill which comes from Apache Calcite, so this syntax should be common for many different tools. With that said, Drill supports arrays, and maps and pretty much arbitrary nesting thereof. As an example, if you have data that looks like this:

{
  "myList": [1,2,3,4],
  "myMap": {
      "foo": "v1",
      "bar": "v2"
   }
}

You can access the individual list items with a query that looks like this: SELECT myList[1] and you can access individual map components with similar notation: SELECT myMap['foo']. You can combine these to work with very heavily nested data.

There are also functions which can flatten and extract other pieces of data from these objects. Drill isn’t unique in this regard. Many other tools with SQL interfaces have similar functionality.

Other Cool Stuff You Can Do With Modern SQL

Modern SQL is much more than simple SELECT statements from databases. For instance, did you know that many systems that use SQ support pivot tables? Again, I’ll use Drill as an example, but as of version 1.21 Apache Drill supports both PIVOT and UNPIVOT operators. Even for systems that do not explicitly have support for pivot tables, this function can be recreated using conditional aggregate functions such as FILTER or CASE statements.

Many SQL-based systems can do other BI type functions such as regression or trend lines, aggregate rollup and other advanced statistical summarizations.

Bottom line here is that don’t assume that SQL is limited to 2 dimensional tables.

But Where Do You Clean The Data?

One of the questions that I get asked all the time relates to cleaning data. Specifically, how do we clean data in DataDistillr? The underlying assumption in this question is that you need a separate tool in order to clean data. This is just false.

Ted Dunning had a great quote on the back of Learning Apache Drill which was:

Drill is unique among SQL query engines in that it aspires to query data in the wild, not just data that has been tamed and domesticated.
Ted Dunning

One of the neat parts of Apache Drill is that it can read data without a predefined schema. But Drill aside, virtually every SQL base platform will have tons of data cleaning functionality. As an example, nearly every relational database will have numerous string manipulation functions as well as conditional logic so that you can clean up various data artifacts. One of DataDistillr’s attributes is that we extend this even further by adding specialized clean up functions for things like URLs, phone numbers, email addresses, IP addresses and more. But, the bottom line is that we don’t need to start with clean data, and you can clean the data directly in the tool. In fact that’s one of the points of DataDistillr.

I never assume that people will know all the nuanced SQL that we wrote for DataDistillr for cleanup, but I have made the false assumption that people are aware that you can clean up data with SQL. For the others, the opposite is true. I’m not sure why people assume that you can’t clean data using SQL… you can. Now it’s not always the best choice for that, but you certainly can do it.

How Do You Join Data If The Structure Isn’t The Same?

Ok, this the last one, but it kills me how many times I have to answer this question. First of all, if the data content isn’t the same, you can’t join it. It doesn’t matter what system you have, there’s nothing you can do. You can’t join a column containing IP Addresses with a column containing shirt sizes. Even if you could, the results would be meaningless. But let’s say you have two tables, one with a column called src_ip and another called ip_address, assuming that both of these contain IP addresses, any SQL based system can join these. The table names don’t have to be the same to join them. I assumed people knew this fact, but based on the number of times I’ve been asked, they don’t. (Repeat after me… field names don’t have to be the same to join them)

Now, what about data sources where you don’t have a common key? Ok… this is where things get interesting… This is not a feature of DataDistillr but one of SQL generally. Let me share an example of a real use case we did for a customer. Let’s say that you have a table that has email addresses and another than has URLs in it and that you want to join the two. The only commonalities are the domains in both the URLs and the email addresses. How do you do it?

Let’s look at the email address first. This is pretty easy, in that most SQL based systems will have a split function. Simply use this to split the email field on the @ sign and you now have 2 pieces. The second of which will be the domain. It looks something like this: SELECT split(<email_field>, '@')[1]) AS domain.

What about the URL? This is trickier. DataDistillr has a function called parse_url for working with domains and extracting components from them. To extract the domain portion, you’d just have to do something like this: SELECT parse_url(url_field)['host'] AS host. We’ve also done other cool things like fuzzy matching in joins and so forth.

The bottom line is that you can do all sorts of things with SQL joins. With a bit of creativity, you can join just about anything.

Share the joy