Skip to content

Hanukah of Data: Solving a Data Challenge with AI and SQL

A colleague of mine recently gave me a data challenge called the Hanukkah of Data (https://hanukkah.bluebird.sh/about/) which has 8 challenges. I decided to try them out in DataDistillr. The challenges use a fictional data set which consists of four tables: a customer table, a product table, an order table and a table linking orders to products. The data was very representative of what a customer database might look like. I did, however make some modifications to the data to facilitate querying. The address line was not split up into city, state, zip. So I imported this data into a database, and then split these fields up into separate columns.

Overall, this was a really well done challenge and my compliments to the authors.

SPOILER ALERT

If you choose to read beyond this point, it will have my answers so if you want to try it yourself, stop right here. I warned you… Really … go no further.

Challenge 1: The Private Detective

The first challenge is as follows:

Sarah brought over one of the cashiers. She said, “Joe here says that one of our customers is a skilled private investigator.”

Joe nodded, “They came in awhile ago and showed me their business card, and that’s what it said. Skilled Private Investigator. And their phone number was their last name spelled out. I didn’t know what that meant, but apparently before there were smartphones, people had to remember phone numbers or write them down. If you wanted a phone number that was easy-to-remember, you could get a number that spelled something using the letters printed on the phone buttons: like 2 has “ABC”, and 3 “DEF”, etc. And I guess this person had done that, so if you dialed the numbers corresponding to the letters in their name, it would call their phone number!

“I thought that was pretty cool. But I don’t remember their name, or anything else about them for that matter. I couldn’t even tell you if they were male or female.”

Sarah said, “This person seems like they are clever and skilled at investigation. I’d like to hire them to help me find Noah’s rug before the Hanukkah dinner. I don’t know how to contact them, but apparently they shop here at Noah’s Market.”

“Can you find this private investigator’s phone number?”

When I read this, my thinking was that we’d have to find customers whose last name are more than 7 characters long. But we also have to match up the last name with the numbers. I saw in other people’s solutions, they came up with some very clever ways to transpose letters to numbers, however DataDistillr actually has two functions that really came in handy:

  • numberize which converts words to their “phone number” representation
  • normalizePhoneNumber() which removes any non-numbers from a phone number.

With these two functions, the first challenge was pretty straightforward:

SELECT `customerid`, `name`, 
getLastName(`name`) AS last_name,
numberize(getLastName(`name`)) AS number,
`address`, `citystatezip`, `birthdate`, normalizePhoneNumber(`phone`) AS phone
FROM customers
WHERE numberize(getLastName(`name`)) = normalizePhoneNumber(`phone`)

Without the aforementioned functions, I’m not quite sure how I would have solved this in SQL. Anyway, on to challenge 2!

Challenge 2: The Contractor

Challenge 2 was a little tricky and really required you to craft a query with multiple filters. The relevant parts of the challenge is below:

When the investigator returned, they said, “Apparently, this cleaner had a special projects program, where they outsourced challenging cleaning projects to industrious contractors. As they’re right across the street from Noah’s, they usually talked about the project over coffee and bagels at Noah’s before handing off the item to be cleaned. The contractors would pick up the tab and expense it, along with their cleaning supplies.

“So this rug was apparently one of those special projects. The claim ticket said ‘2017 spec JD’. ‘2017’ is the year the item was brought in, and ‘JD’ is the initials of the contractor.

“But they stopped outsourcing a few years ago, and don’t have contact information for any of these workers anymore.”

With challenges like these, I find it helpful to list out what you know. In this case, we want to find someone whose first name starts with J, their last name starts with D, who placed an order in 2017 and who bought coffee, bagels and cleaning supplies.

There is some data prep that needs to be done here as well. The customer name is unfortunately in one field. DataDistillr comes to the rescue here as well as it has a suite of functions for manipulating human names. The ones I used in this case were getFirstName and getLastName. In addition to these functions, we also need to extract the first letter from both these artifacts. Obviously there are more than one way to do this. You could use a substring function or what I did which was to use the LIKE function in the WHERE clause. We also had to extract the year from the date, which can be accomplished with the YEAR() function. So here’s my answer for challenge 2:

SELECT *
FROM orders
JOIN customers ON customers.`customerid` = orders.`customerid`
LEFT JOIN order_items AS oi ON oi.orderid = orders.`orderid`
LEFT JOIN products ON products.sku = oi.sku 
WHERE YEAR(orders.`ordered`) = 2017 AND
getFirstName(customers.`name`) LIKE 'J%' AND 
getLastName(customers.`name`) LIKE 'D%' AND oi.sku = 'HOM8601'

Challenge 3: The Guy in the Neighborhood

Challenge 3 is the first challenge that relies on information from the previous challenge:

… “At last I couldn’t deal with the rug taking up my whole bathtub, so I gave it to this guy who lived in my neighborhood. He said that he was naturally assertive because he was a Aries born in the year of the Dog, so maybe he was able to clean it.

“I don’t remember his name. Last time I saw him, he was leaving the subway and carrying a bag from Noah’s. I swore I saw a spider on his hat.”…

What we know here is that we need to find someone who:

  • Lives in “my neighborhood”, which we know from the last challenge to be South Ozone Park, NY 11420.
  • Is an Aries which means they were born roughly between March 18 and April 20th and
  • Were born in a lunar year of the dog.

I think it was at this point that I decided to split up the citystatezip field into separate components and use the zip code as a way for determining the location. The dataset provides the birthday, so we can use a variety of SQL possibilities to extract the day and month. I ended up with something like (MONTH(birthdate) = 3 AND DAY(birthdate) > 19) OR MONTH(birthday) =4 AND DAY(birthday) < 20).

Now we still needed to figure out the year of the dog part. For this, I just did a google search to see which were the years of the dog and included something like this: YEAR(birthday) IN (...), however after I solved this and saw there’s a more elegant way of doing that. You can actually take the modulus of the year and if it is 2, then you have a year of the dog.

Thus a query would look something like this:

SELECT `name`, `phone`, `birthdate`
FROM `noahdb`.`customers`
WHERE
zip = 11420 AND 
MOD(YEAR(birthdate),12) = 2 AND
(MONTH(birthdate) = 3 AND DAY(birthdate > 21) AND  
(MONTH(birthdate) = 4 AND DAY(birthdate) < 20)

I couldn’t find my original query, so this is an approximation. I tend to build these in pieces, so it turns out that not all the logic was necessary for this.

Challenge 4: The Pastry Lover

Ok, these are starting to get more challenging! For challenge 4 we need to start

“A few weeks later my bike chain broke on the way home, and I needed to get it fixed before work the next day. Thankfully, this woman I met on Tinder came over at 5am with her bike chain repair kit and some pastries from Noah’s. Apparently she liked to get up before dawn and claim the first pastries that came out of the oven

For this one, we know that the person we’re trying to find is: female, and she purchased pastries from the store in question before 5AM. We can also infer that this person liked to do this a lot.

We also had to a bit of data exploration here. It turns out that there are multiple baked goods in the store’s inventory. But the products we want have a SKU starting with BKY.

For this one, I decided to try DataDistillr’s new AI assistant to see if it could generate a query that would answer this question. I asked it: “Show me all the customers who bought the most products where the SKU starts with BKY and who placed the order between 4 and 6 AM“.

The AI assistant ALMOST got it. It was super close. GPT doesn’t understand that when you aggregate, all projected fields must be in the GROUP BY clause or an Aggregate function. Other than messing up the GROUP BY clause, this was the final query:

SELECT
  `customers`.`name`,
  `customers`.`address`,
  `customers`.`city`,
  `customers`.`state`,
  `customers`.`zip`,
  `customers`.`birthdate`,
  `customers`.`phone`,
  COUNT(`orders_items`.`sku`) AS `total_items`
FROM `noahdb`.`customers`
JOIN `noahdb`.`orders`
  ON `customers`.`customerid` = `orders`.`customerid`
JOIN `noahdb`.`orders_items`
  ON `orders`.`orderid` = `orders_items`.`orderid`
WHERE
  `orders_items`.`sku` LIKE 'BKY%'
  AND hour(`orders`.`ordered`) BETWEEN 4 AND 6
GROUP BY
   `customers`.`name`,
  `customers`.`address`,
  `customers`.`city`,
  `customers`.`state`,
  `customers`.`zip`,
  `customers`.`birthdate`,
  `customers`.`phone`
ORDER BY
  `total_items` DESC
LIMIT 10

From this we see that Christina Booker is the winner.

Challenge 5: The Cat Lady

For challenge 5 I started to see a bit of a pattern in that you had to pick out subtle clues as to what the person had purchased to find the target. The relevant parts of the challenge are listed below:

“I listed it on Freecycle, and a woman in Queens Village came to pick it up. She was wearing a ‘Noah’s Market’ sweatshirt, and it was just covered in cat hair. When I suggested that a clowder of cats might ruin such a fine tapestry, she looked at me funny and said she only had ten or eleven cats and they were getting quite old and had cataracts now so they probably wouldn’t notice some old rug anyway.

In this case, we needed to find someone who lived in Queens Villiage, and bought cat products, specifically for older cats. At first, I did a little exploratory data analysis on the products table to find products for cats. It turns out that there are several hundred products for cats, but if we add the word Senior to the query, there are only 38. This seems like a good place to start. The goal now, would be to find someone who is female, who bought one of these products and who lives in Queens Village.

This was my final query:

SELECT
  *
FROM `noahdb`.`products` AS products
JOIN `noahdb`.`orders_items` AS oi 
	ON oi.sku = products.sku
JOIN `noahdb`.`orders` AS orders
	ON orders.orderid = oi.orderid
JOIN `noahdb`.`customers` AS customers
	ON customers.customerid = orders.customerid
WHERE
  `desc` LIKE '%Senior Cat%' AND customers.city = 'Queens Village'

The honor goes to Ms. Anita Koch!

Challenge 6: The Cheapskate

Challenge six was a good challenge that required you to calculate a few columns and use this for a filter.

“It was a nice rug and they were surely going to ruin it, so I gave it to my cousin, who was moving into a new place that had wood floors.

“She refused to buy a new rug for herself–she said they were way too expensive. She’s always been very frugal, and she clips every coupon and shops every sale at Noah’s Market. In fact I like to tease her that Noah actually loses money whenever she comes in the store.

“I think she’s been taking it too far lately though. Once the subway fare increased, she stopped coming to visit me. And she’s really slow to respond to my texts. I hope she remembers to invite me to the family reunion next year.”

Let’s pick this apart. We know that the person has shopped several times at Noah’s Market and that Noah loses money on her purchases. So how do we figure that out. We know the wholesale cost of the products, and we know the retail price and quantity of the product. So we have to take the order total, and compare it to the total cost of the order. What I did was to look to see where the order total was less than the wholesale cost of the items. Then I aggregated to find the customers who had the most such orders.

SELECT `name`, phone,  COUNT(*) AS order_count
FROM(
   SELECT customers.name, customers.city, customers.phone
   FROM `noahdb`.orders_items AS oi 
   LEFT JOIN `noahdb`.`products` ON products.sku = oi.sku 
   LEFT JOIN `noahdb`.`orders` ON orders.orderid = oi.orderid
   LEFT JOIN `noahdb`.`customers` ON customers.customerid = orders.customerid
   WHERE orders.total < (qty * wholesale_cost)
  )
GROUP BY `name`, phone
ORDER BY order_count DESC

The winner was Ms. Emily Randolph.

Challege 7: The Same Thing Buyer

Ok… this one had me stumped for a while. Here are the relevant parts:

“One day, I was at Noah’s Market, and I was just about to leave when someone behind me said ‘Miss! You dropped something!’

“Well I turned around and sure enough this cute guy was holding something I had bought. He said ‘I got almost exactly the same thing!’ We laughed about it and wound up swapping items because he had wanted the color I got. We had a moment when our eyes met and my heart stopped for a second. I asked him to get some food with me and we spent the rest of the day together.

I tried a few approaches to this, first trying to find people who bought the same product as Emily on the same day, but that didn’t work. I did a bit more EDA and found that there were certain products which were offered in different colors. The colors were noted in the description in parentheses. For example Toaster (Red) or something like that.

Once I figured this out, I decided on a strategy of trying to find people who ordered products with a color in the description on the same day as Emily Randolph. This was my query:

WITH cousin_products AS (SELECT oi.sku, orders.ordered, products.desc, TRIM(SPLIT(`desc`, '(')[0]) AS raw_product
FROM `noahdb`.`customers`
LEFT JOIN `noahdb`.`orders` ON orders.customerid = customers.customerid
LEFT JOIN `noahdb`.`orders_items` AS oi ON oi.orderid = orders.orderid
LEFT JOIN `noahdb`.`products` ON products.sku = oi.sku
WHERE customers.customerid = 8342 AND products.desc LIKE '%(%)%')

SELECT *
FROM 
`noahdb`.`orders` AS orders 
JOIN `noahdb`.`orders_items` AS oi ON oi.orderid = orders.orderid
JOIN `noahdb`.`customers` AS customers ON customers.customerid = orders.customerid
JOIN `noahdb`.`products` AS products ON oi.sku = products.sku
JOIN cousin_products ON cousin_products.raw_product = TRIM(SPLIT(products.`desc`, '(')[0]) 
AND DATE_TRUNC('DAY', cousin_products.ordered) = DATE_TRUNC('DAY', orders.ordered)
ORDER BY 1

This query returned 15 records, 9 of which were Emily Randolph. (I should have excluded her from the final result set). If I really wanted to narrow this down even further, I would have honed in on the hour as well as the day, but 6 results was pretty good. A cursory glance at the order times, and we found our winner.

Challenge 8: The Collector

For the final challenge the description insinuated that we had to find the person who had an entire collection of Noah collectables. After looking at the products table, I found that there were 81 Noah’s Collectable products, all with a prefix of COL. What we needed to do is write a query which counts the number of different products that someone has purchased AND limit that to only the collectable products.

I decided to try the AI assistant to see how much of this it could do. So, I asked it: Show me each customer and how many different products they have ordered. DataDistillr’s AI Query Assistant generated a working query that did exactly that, merging all four tables together correctly.

Next, I just had to filter by the sku. The final query was here:

SELECT
  `customers`.`name`, `customers`.`phone`,
  COUNT(DISTINCT `orders_items`.`sku`)
FROM `noahdb`.`customers`
JOIN `noahdb`.`orders`
  ON `customers`.`customerid` = `orders`.`customerid`
JOIN `noahdb`.`orders_items`
  ON `orders`.`orderid` = `orders_items`.`orderid`
WHERE SUBSTR(`orders_items`.`sku`, 1, 3) = 'COL'
GROUP BY
  `customers`.`name`, `customers`.`phone`
  HAVING COUNT(DISTINCT `orders_items`.`sku`) >= 81

Conclusion

This was a fun challenge to work on. My thanks to the developers, the Dehttps://bluebird.sh/about/#vottys, for developing such an awesome challenge!

Some tips that I have:

  • Exploring your data really helps when doing these challenges. Understanding the structure of your data will help you when reading the clues.
  • Clean your data first. When I saw the citystatezip column, I immediately thought that it was a candidate to split that up. Likewise for names. DataDistillr has special functions for splitting human names, but if your database doesn’t have that, making an effort there can make your queries a lot easier.
  • If you’re using SQL, make views of your cleaned data.
  • When reading the problem, really take a moment to think about what they are asking. Then build your query incrementally. You’ll find that sometimes you can get to the answer before you build the complete query. In any event, it’s always easier to debug something simple than complex.
Share the joy
Leave a Reply

Your email address will not be published. Required fields are marked *