
Tag: Data Science

Why You Shouldn’t Rely on GPT to Write Code

A lot of people have tried out ChatGPT and other LLMs for their code-writing abilities. My theory was that the LLMs would be really good at writing code to do things they’ve seen before, but not so good at things that were completely new. I started my experiments by asking ChatGPT to write me a function in Python to geolocate a phone number. ChatGPT 3.5 did a relatively poor job of this, so I tried again using the playground and gpt-3.5-turbo. This time it was more successful.

That’s not bad. I like that it used the phonenumbers library rather than calling some external service.


So I Launched a Startup: Pt. 11: One Year In

Happy New Year! It is really hard for me to believe that a little over a year ago, I quit a high-paying job at a major bank to launch a startup. So here I am, one year later, and I wanted to take a look back at the last year and reflect. This has, without a doubt, been the hardest job I’ve ever had. It has also been, by far, the most rewarding. But it is definitely not for the faint of heart. I have lived and breathed DataDistillr for the last 1.25 years.

First, some updates. After building some form of a product, an early-stage startup’s goal is to achieve some evidence of product-market fit. For non-startup types, what this basically means is you need to prove to your investors that you’ve built some sort of product that people are willing to use and pay for. The easiest way to measure this is through what’s called annual recurring revenue or ARR, but the dollar amount isn’t the only way to measure this. A startup’s ARR target is dependent on the target customer. For instance, are you selling to large enterprises or small businesses? Is your target user an individual or a company? You get the idea.


Public Data Still Lacking on COVID-19 Outbreak

As you are reading this, you are probably (like me) under quarantine or shelter in place due to the COVID-19 outbreak. As a data scientist who has been stuck in the house since 10 March, I wanted to take a look at the data and see what I could figure out. I’m not an epidemiologist and claim no expertise in health care, but I do know data science so please take what I am saying with a grain of salt.

Why is there no data?

My first observation is that very little data is actually being made publicly available. I am not sure why this is the case, but I spent a considerable amount of time digging through the WHO, CDC and other agencies’ websites and APIs and found little usable data. For example, the World Health Organization (WHO) posts daily situation reports (sitreps) that contain data; however, the files are in PDF format. I attempted to extract the tables from the PDFs, but this proved to be extremely difficult because the formatting was not consistent. It would be trivial to post this data in CSV, HDF5 or some other format that is conducive to data analysis, but the WHO chose not to do that. I found generally the same situation at the other major health institutions such as the CDC.

Health related information in the United States is regulated by the Health Insurance Portability and Accountability Act (HIPAA), which imposes draconian fines and restrictions on private health information, so some of the secrecy may be due to this law.


So You Want to Write a Book…

Well, we did it.  I finally finished the book that I had been working on with my co-author for the last two years.  I thought I’d write a short post on my experiences writing a technical book and getting it published.  I know many people think about writing books, and I’d like to share my experiences so that others might learn from lessons that I learned the hard way.  Overall, it was an absolutely amazing experience and I have a feeling that the adventure is only beginning….


Why don’t Data Scientists use Splunk?

I am currently attending the Splunk .conf in Orlando, and a director at Accenture asked me this question, which I thought merited a blog post: why don’t data scientists use or like Splunk? The inner child in me was thinking, “Splunk isn’t good at data science”, but the more seasoned professional in me actually articulated a more logical and coherent answer, which I thought I’d share whilst waiting for a talk to start. Here goes:

I cannot pretend to speak for any community of “data scientists” but it is true that I know a decent number of data scientists, some very accomplished and some beginners, and not a one would claim to use Splunk as one of their preferred tools.  Indeed, when the topic of available tools comes up among most of my colleagues and the word Splunk is mentioned, it elicits groans and eye rolls.  So let’s look at why that is the case:


Can you use Machine Learning to detect Fake News?

Someone recently asked me for assistance with a university project whereby they were asked to predict whether a given article was fake news or not. They had a target accuracy of 70%. Since the topic of fake news has been in the news a lot, it made me think about how I would approach this problem and whether it is even possible to use machine learning to identify fake news. At first glance, this problem might seem comparable to spam detection; however, it is actually much more complicated. In an article on The Verge, Dean Pomerleau of Carnegie Mellon University states:

“We actually started out with a more ambitious goal of creating a system that could answer the question ‘Is this fake news, yes or no?’ We quickly realized machine learning just wasn’t up to the task.” 


Announcing the First Release of Griffon: A Virtual Environment for Data Science

My colleagues Austin Taylor and Melissa Kilby are proud to announce the first stable release of Griffon:  A Virtual Machine for Data Science.   Griffon is a virtual machine which contains many data science tools pre-configured, installed and linked up to make it so that you don’t have to be a Linux expert to try them out.  If you are teaching a class, or if you are simply wanting to learn more about a particular tool, then Griffon is perfect for you.

You can download Griffon here: https://github.com/gtkcyber/griffon-vm.


Fixing STEM Education

To both of my loyal readers, I apologize for not writing anything in a while, but I have been absolutely slammed with classes and conference presentations.  Anyway, I’ve been doing a lot of thinking about my earlier post about Teaching Data Science in English.   The post provoked a decent response, mostly positive.

One reader sent me the following comment about my post which I’ve decided to quote (with permission) in its entirety because I think it accurately reflects why people get so frustrated when they try to learn mathematical concepts. What interested me was that this individual took action and “translated from mathspeak to English” and all of a sudden she was able to understand the underlying concepts.

Awhile ago I read a piece you had written on LinkedIn about making ‘mathspeak’ and ‘techspeak’ (i.e. coding) more accessible to regular people, by decreasing mathematical notation usage and increasing the use of real words in explanations of formulas and concepts. It was something that stayed with me because I’ve always understood broader mathematical concepts but have always had trouble with the mechanics, and I think a lot of that has had to do with the amount of notation used…math seems like a foreign language sometimes, and there are 2 levels of understanding: the first is merely deciphering the ‘foreign language’, which already puts me out of my comfort zone (think reading Spanish or French if you are a native English speaker) and then understand the underlying concepts, which becomes harder due to the fact that it’s written in a ‘non-native’ language. Recently I’ve started taking an online course in machine learning on XXX. Already in the second lesson, he dove straight into notation-filled formulas, and I was starting to get that overwhelmed feeling that I’m familiar with from previous years of math. But I had what you wrote in my mind, and I thought I’d give it a shot and manually ‘translate’ the formulas and equations into English, and stick with that. Well, I did that, and it worked so well. I feel that I am able to follow along with the underlying theory of the class and by extension, the formulas and algorithms he presented in ‘mathese’ whereas before I would have shut-down and assumed it was beyond my grasp. Thanks so much for highlighting this aspect of the math/English understanding divide. It is continuously helpful for me. (Emphasis mine)

I’d like to share another related story.  One of my first paying jobs was working for KUAT public television as the web developer (www.kuat.org) and I wanted to do some things that required automating a data flow from an archaic DOS based database.  I was teamed up with a programmer who helped me build the process and in doing so, I learned how to write regular expressions.  I got so into it, I nearly automated myself out of a job.

Fast forward a year or so, when I was nearly done with my CS degree, I had to take an upper level CS course about Automata, Grammars and Languages, which included regular expressions in the course description. I was pretty excited because by this point, I had become a master at regular expressions and was looking forward to a class where I already knew some of the material going in. Boy was I in for a shock. When we got to the regular expressions section, it degenerated into a plethora of Greek letters and assorted jargon to the point where I truly loathed going to class.

Theory Should Not Be Taught at the Expense of Application

What I also realized in that CS class was that while most of my fellow students may have passed the tests, they did not have any clue how to use regular expressions in real life, or why you would want to use them in the first place. While we were spending time writing expressions that match ‘aaaaaaabababaaaaa’ and drawing the automata that “implement” them, the knowledge of how to apply this to a real life problem, such as extracting data artifacts from raw data, was completely lost on the class.

What if the instructor had started the class by showing us this:

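import re

# Sample input, made up for illustration
text = 'Contact jane.doe@example.com for details'
pattern = r'([a-zA-Z0-9_.]+)@([a-zA-Z0-9_.]+\.\w{2,3})'
# re.search scans the whole string; re.match only checks the start
matchObj = re.search(pattern, text)
if matchObj:
    email = matchObj.group(0)    # the full address
    account = matchObj.group(1)  # the part before the @
    domain = matchObj.group(2)   # the part after the @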

If you’re not familiar, this brief example in Python-esque pseudo code demonstrates how to match and extract email addresses, accounts and domains from text.
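The same idea extends to the "data artifacts from raw data" case mentioned earlier. What follows is only a minimal sketch: the extract_emails helper and the sample_log text are hypothetical names and made-up data for illustration, and the pattern is deliberately simple rather than a complete email validator.

```python
import re

# Account, then @, then a domain ending in a 2-3 character suffix.
EMAIL_PATTERN = re.compile(r'([a-zA-Z0-9_.]+)@([a-zA-Z0-9_.]+\.\w{2,3})')

def extract_emails(raw_text):
    """Return a list of (email, account, domain) tuples found in raw_text."""
    return [(m.group(0), m.group(1), m.group(2))
            for m in EMAIL_PATTERN.finditer(raw_text)]

# Hypothetical log lines, made up for illustration
sample_log = """2016-03-01 login ok user=alice@example.com
2016-03-01 login failed user=bob.smith@mail.example.org
2016-03-02 no address on this line"""

for email, account, domain in extract_emails(sample_log):
    print(email, account, domain)
```

Using re.finditer rather than a single search means every address in the text is returned, and lines without an address are simply skipped.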

I don’t think I’m saying anything new here, but too many technical classes, both in academia and out, spend a disproportionate amount of time on the underlying theory whilst ignoring or downplaying the actual application of the concepts being taught. The result is that many students walk away frustrated, not understanding the actual use of what they are learning, and while professors and instructors may pat themselves on the back for preserving the “purity” of their curricula, I would argue that they have utterly failed in their task of educating their students.

The bottom line is that some people are really interested in theory; however, for knowledge to be translated into something useful, students should be exposed early and often to a theory’s application. So if you are designing STEM training or classes at a university, don’t forget the importance of demonstrating how to apply the concepts you are teaching.


Teaching Data Science in English (not in Math)

I spend most of my time now teaching others about data science, and as such I do a lot of research into what is going on with respect to data science education. To that end, I decided to take an online machine learning course, and it led me to a serious question: why don’t we use pseudo-code to teach math concepts?

Consider the following:

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This is the formula for Residual Sum of Squares, which if you aren’t familiar, is a metric used to measure the effectiveness of regression models.

Now consider the following pseudo-code:

residuals_squared = (actual_values - predictions) ^ 2
RSS = sum( residuals_squared )

This example expresses the exact same concept, and while it does take up more space on the page, it is, in my mind at least, much easier to understand. I don’t have any empirical data to back this up, but I would suspect that many of you would agree.
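For readers who want to run the pseudo-code, here is a minimal sketch in plain Python. The function name and the sample numbers are made up for illustration. Note that Python writes exponentiation as ** (the ^ operator is bitwise XOR), so the squaring step looks slightly different from the pseudo-code.

```python
def residual_sum_of_squares(actual_values, predictions):
    """Sum of squared differences between observed and predicted values."""
    if len(actual_values) != len(predictions):
        raise ValueError("actual_values and predictions must have the same length")
    # Square each residual (actual minus predicted), then add them all up
    residuals_squared = [(actual - predicted) ** 2
                         for actual, predicted in zip(actual_values, predictions)]
    return sum(residuals_squared)

# Made-up numbers for illustration
actual_values = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
rss = residual_sum_of_squares(actual_values, predictions)
print(rss)  # 1 + 0 + 4 = 5.0
```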

Greek Letters are Jargon

Another thing I’ve realized is that part of the reason math becomes so difficult for people is that it is entirely taught in jargon, shorthand, and shorthand for shorthand. The Greek letter sigma represents a sum, but if you don’t know that, then it represents confusion. If you aren’t familiar with this formula, then the other symbols could be meaningless, yet if we used pseudocode, any part of this formula could be rewritten using English words (or any other language) and thus easily understood by anyone.

Crash Course in Machine Learning

I’m working on developing a short course in Machine Learning called Crash Course in Machine Learning which I will be teaching at the BlackHat conference in August.  I’m curious as to what people think about presenting algorithms using pseudo-code instead of math jargon.  I suspect it will make it easier for people to understand without diluting the rigor.
