It’s been an interesting year, both career-wise and generally. This last year, I’ve had the amazing opportunity to speak at numerous conferences and teach classes all over the world in data science and Apache Drill. I’ve also learned a lot about the internals of Drill and even contributed to the codebase. With that said, one can never rest on one’s laurels, and as such I have a lot in store for the year.
Learning more about Business Development
I read a lot. I mean a lot. My reading pile used to be a lot of geek-books with animals on the cover, like the photo on the left. You can see epic best-sellers such as Programming Hive, SQL for Mere Mortals, and Learning Spark. In any event, I’ve noticed that my reading queue (aka pile) on my desk has shifted somewhat from these epic tomes to books with a distinctly more business flavor. I don’t know if it was a conscious decision or not, but I’ve set the goal for this year to learn more about the business of analytics, and to do that, I clearly must learn about business. I even considered going back to school to get an MBA, but immediately reconsidered when I saw the cost. In lieu of that, I bought The Personal MBA by Josh Kaufman and realized that while I do have a lot to learn about business, it really isn’t rocket surgery, and that getting an MBA probably wasn’t the best use of my time.
In so doing, I’ve realized that having a solid understanding of business is really useful for anyone seeking to get ahead as a data analyst or data scientist. I’ve also realized how much I have to learn on the subject. Ultimately, data science is about getting value from data, and in order to do so, you have to understand the business in which you work.
Therefore, it stands to reason that learning as much as you can about your business is a very good use of time, in that it will help you understand your business’ problems better and enable you to identify opportunities to add value with data analytics. The Personal MBA is upstairs at the moment, but here is my current book queue at right. You’ll note I still have some geek books in there, but they are definitely outnumbered by business-related books.
Promoting Drill: AKA Killing ETL
I dislike the word evangelize, but this year, I have no doubt that I will be spending a considerable amount of time promoting Apache Drill. I’m kind of passionate about this, and those who know me personally are probably sick of hearing about Drill (sorry guys), but my sense is that Drill and tools like it will really come to prominence in the years to come, because they solve a problem which nearly all businesses have: too much diverse data, with no easy way to analyze that data. I’ve written about that here.
I suspect that the reason that Drill has not caught on is that it needs more promotion and marketing. Over the course of the last few months, I’ve met a lot of people in various IT and analytic roles, and almost universally, when I ask them if they’ve heard of Apache Drill, the answer is “no”. Which leads me to my goal for the year:
I’m really excited to announce that I signed up with O’Reilly Publishing to host a series of online introductory Drill classes. The classes will be:
- February 22 & 23: Data Exploration with Apache Drill
- May 16 & 16: Data Exploration with Apache Drill
I also intend to develop more advanced classes about developing functions and file plugins for Drill, but first things first. You can read more about my Drill workshops here. Please contact me if you are interested in having me conduct a Drill workshop at your company tailored to your needs.
Drill: The Book
I’ve also been working on the Drill book with Ellen Friedman, Ted Dunning and a few others, and I’m making it my goal to get this book out the door this year. If Drill is going to be a success, it really needs a published book.
In researching Drill and the book over the last year, I’ve learned so much about Drill from digging through the source code. I know that for a tool to be successful, users can’t be expected to read the source code to learn what it does. I’ve found that Drill has all sorts of functionality that is totally undocumented, such as GIS functionality. For Drill to succeed, this needs to change, so I’m committing myself to getting this project done.
I really see Drill as having enormous possibilities in the cyber security realm, in that there aren’t a whole lot of tools available to analyze the ever-expanding datasets generated by routers, firewalls, etc. The tools that do exist are clunky, expensive, or both. In researching the book, I’ve learned a lot about how Drill actually works, and I’ve been working on developing extensions for Drill for cyber security purposes, some of which I’ve open-sourced:
- Drill Generic Log File Parser: Enables Drill to read any kind of log file
- Drill Network Functions: A bunch of functions which are useful for network analysis
- Drill User Agent Parser: I bet you’ll never guess what this does…
I’m also working on getting Drill to read PCAP and netflow data. My goal is to complete both by the end of the year.
Ok, Enough Drill… What else am I up to?
I’m always trying to learn new things and expand my toolset, and this year, I’ve decided to tackle graph databases, for two reasons. The first is that they fascinate me. The second is that I believe they are not very well understood (case in point…) and that many problems can be solved with a graph database IF you understand them. So if anyone has any suggestions for resources, I’m all ears. I already have a Packt book on Neo4j and the O’Reilly Graph Database book, which also uses Neo4j.
I’m also really interested in learning more about deep learning. I know that regrettably, it has become the buzzword du jour, but I do see a lot of potential use cases for it.
Lastly, how could I forget to mention that I’ll be teaching again at BlackHat! If you’re a security pro looking to learn some data science, register now!!