I’ve written about this before, but as a technical CEO and Co-Founder, my days are usually filled with meetings of various types. My day starts with a daily standup about sales and growth and can take any number of directions from there. Mondays usually have sprint planning meetings, Tuesdays exec meetings, Thursdays investor meetings, and so on. The unfortunate result is that I don’t have large amounts of uninterrupted time for technical work or anything else that requires intense concentration.
Burst Coding: Coding for Those With No Time
Given my insane schedule, if I’m going to do any kind of technical work, I have to do it in VERY short increments of time. This flies in the face of the commonly accepted view of software development, which holds that developers need long stretches of uninterrupted time to be productive. Since I don’t have those long stretches, I had to develop a way to be productive and still sleep and spend time with my family. I call it Burst Coding, and here’s how it works.
Firstly, let me say that I’ve always believed the way to write really bad code, and spend a lot of time doing it, is to simply dive right in and start coding. When I teach classes, I always encourage students to think about what they are trying to do before they actually start writing code. My newfound position has forced me to do exactly that, but to an extreme. What I’ve realized is that if I know I’ll only have 30 minutes for development work but have a large project to tackle, I set very small incremental goals for myself. Then, in that limited time, I try to achieve one of those goals. When I’m not actively coding, I’m mentally working out exactly how to achieve the next goal. This way, I may only spend a few minutes actually coding, but I can still tackle complex problems.
As an aside, if I have to work on something non-coding related, I’ve learned that I have to make sure my IDE is closed, lest I get sucked into development work. Anyway, I thought I’d share something that I’ve been working on as I think it is pretty cool.
I’ve Got a Model, Now What?
One of the major challenges in the data world is now known as MLOps, which Wikipedia defines as: “MLOps or ML Ops is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently.” From my perspective, one of the major holes in the industry is model deployment. In other words, once you’ve built and trained an ML model, what do you do with it? How do you get it out of a Jupyter Notebook and put it into production?
For the last few years, I’ve been looking for better ways to deploy models once they’ve been trained. This topic is actually a bit of a sore one for me because, a few years ago, I built a really effective model for detecting malicious administration activity on production servers. However, the company I was working for at the time had absolutely no way of deploying the model once it was built, so all my work effectively went in the bin. But that’s a story for another day.
Anyway, getting back to the model question: I’ve always thought it would be interesting for a user to be able to serialize a machine learning model and include its predictions in a SQL query. I even included a really poor example of this in Learning Apache Drill. The challenges in doing this are manifold:
- Most people who create machine learning models do so in Python, R, Spark, or a few other tools. For this to be practical, you have to let people do the modeling with whatever tools they currently use.
- You have to be able to save not just the model, but the whole data pipeline going from raw data to features.
How do you save a model?
This is something I’ve been trying to figure out for some time. When I wrote Learning Apache Drill, I had been following H2O’s machine learning libraries, which had a way of saving models created in H2O and then reusing them in different languages; i.e., you could build a model with their libraries in Python and then use the H2O Java SDK to make predictions. I experimented with this, but the code was clunky and didn’t seem to lend itself well to what I was trying to do. I also found the MLeap project (https://github.com/combust/mleap), which seems to be defunct.
In the Python ecosystem, the commonly taught approach is to pickle objects and then build Docker-based microservices around them. There’s even a module called scikit-deploy for this very purpose. Still, from my perspective, there doesn’t seem to be a widely accepted solution to this problem.
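To make that concrete, here is a minimal sketch of the pickle pattern, using scikit-learn purely for illustration (the file name and estimators are my own choices, not from any particular deployment tool):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Train a small pipeline. The point is that the *whole* pipeline
# (feature scaling + model) gets serialized, not just the estimator.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
pipeline = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression())])
pipeline.fit(X, y)

# Serialize to disk; a microservice would ship and load this file.
with open("model.pkl", "wb") as f:
    pickle.dump(pipeline, f)

# Later, e.g. inside a Docker-based service, restore it and predict.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

preds = restored.predict(X)
```

The catch, of course, is that a pickle is Python-specific: whatever loads it needs the same libraries (and compatible versions) available, which is exactly why people reach for containers.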
With all that said, a week ago I stumbled on the Predictive Model Markup Language (PMML), an XML-based language for serializing ML models and pipelines. PMML has been around for a long time but, more importantly, there are modules for saving models and pipelines to PMML. This solves the first part of the problem: saving the model.
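On the Python side, for instance, the sklearn2pmml package can export a fitted scikit-learn pipeline to this format. To give a sense of what the format looks like, here is a hand-written, purely illustrative PMML document for a trivial linear regression; the element names follow the PMML spec, but the field names and coefficients are made up:

```xml
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="Illustrative linear model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="feature1" optype="continuous" dataType="double"/>
    <DataField name="target" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="example" functionName="regression">
    <MiningSchema>
      <MiningField name="feature1"/>
      <MiningField name="target" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="feature1" coefficient="0.75"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because it is just XML, any runtime with a PMML evaluator can score the model, regardless of the language it was trained in.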
How do you include a model in a query?
Ok, so let us stipulate that you can save a model. How do you do the next piece, which is to actually pipe data through it and produce a result? I did some experiments with Drill and wrote some custom functions that allow a user to do just that. Basically, you can write a query like this:
SELECT ... predict('model.xml', feature1, feature2, feature3...) FROM <data>
The output for this function is a map with the predictions and probabilities.
Why would you want to do this?
Well… good question. The main reason I was thinking of is that, as a data scientist, this would let me share a model I built with others. What’s more, on the DataDistillr platform, I could wrap all of that in a tidy view, and a non-technical user would be able to work with the model’s output really easily. You could also publish the model’s output via an API and use it in Tableau or other tools. All of this without coding. What do you think? Good idea? Waste of time? Somewhere in between? Please let me know.