Imagine you have some process in your organization’s workflow that consumes 50%-90% of your staff’s time and contributes no value to the end result. If you work in data science or data analytics, you don’t have to imagine it, because I’ve just described what is, in my view, the biggest problem in advanced analytics today: the Extract/Transform/Load (ETL) process. That range doesn’t come from thin air: studies from a few years ago, from various sources, concluded that data scientists were spending between 50% and 90% of their time preparing their data for analysis. (Examples from Forbes, Data Science Central, and the New York Times.) Furthermore, according to Forbes, 76% of data scientists consider data preparation the least enjoyable part of their job.
If you go to any trade show and walk the expo halls, you’ll see the latest whiz-bang tools to “leverage big data analytics in the cloud”, and you’ll be awed by some impressive visualization tools. Many of these new products can do amazing things… once they have data, which brings us back to our original problem: in order to use the latest whiz-bang tool, you still have to invest a considerable amount of time in data prep. The tools seem to skip that part and focus on the final 10% of the process.
Look at most data science training programs today: nearly all of them open with material on the Python/Pandas/scikit-learn stack or its R equivalent, the tidyverse, but I’ve never seen a program with a section that really focuses on data prep. I would venture that most data science programs spend around 70-80% of their time teaching machine learning. So again there is a mismatch between what data scientists actually spend their time doing (the unsexy and difficult task of preparing data) and what data science programs focus on (the exciting machine learning part).
Some companies have gone so far as to hire dedicated ETL engineers to relieve their data scientists of this task. This is not an ideal solution either, because data scientists need to be intimately familiar with their data, especially ugly data. I’ve worked with many a data source that others dismissed as useless or too complicated but that turned out to be very valuable, and I was only able to discover that value because I was involved in the ETL process.
Fixing ETL
Really, the problem begins when data is first collected. Quite often, data analytics projects use data that was gathered for a completely different purpose, so the people who designed the collection and storage systems did not build them in a way that is conducive to analysis. Logically, one might conclude that the solution is to nip this problem in the bud and impose some sort of enterprise-wide edict on how data is to be collected and stored so that data scientists can get the most out of it. However, forcing a collection and storage mandate of this kind may unintentionally introduce all kinds of other problems and inefficiencies, and requiring that all data be stored in one type of system can dramatically increase costs.
Drain the Data Lake
Another approach is the “data lake” concept. The theory behind a data lake is that all corporate data is stored in one massive repository, thereby eliminating data silos and providing a common access point so that analysts can unlock value from this potential treasure trove. This approach is also fraught with problems. Firstly, simply dumping all corporate data into a single repository without human intervention can result in a data swamp: a mass of inconsistent, corrupted data that rapidly becomes useless. Michael Stonebraker, the 2014 Turing Award winner, wrote an interesting article in which he explains some of the other perils of this approach. (He also referred to data lakes as “marketing bullshit” in a 2015 presentation at the HP Big Data Conference.) On a more practical level, implementing a data lake requires redesigning the entire corporate data infrastructure ($$) and then migrating all existing systems onto it ($$). It also dramatically increases the complexity of the IT infrastructure and creates a single point of failure (the data lake itself), which could be very destructive.
So what do you do about it?
I would argue that what is needed is a tool which can:
- Query data where it lives (MySQL, Oracle, Hadoop, MongoDB, etc.)
- Query data regardless of format (JSON, CSV, XML, log files, whatever)
- Merge multiple disparate data sources and types quickly and easily (see the sketch after this list)
- Use a common, easily taught language such as SQL
- Scale and execute ad-hoc queries rapidly
- Perform basic analysis on this data (summaries, aggregates, etc.)
- Visualize the results
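To make that wish list concrete, here is a rough sketch of the kind of single SQL statement such a tool ought to support. The source names (`mysql.sales.orders`, `files`), the file path, and the columns are invented purely for illustration; this is not the syntax of any particular product.

```sql
-- Hypothetical query: join a table living in MySQL with a raw JSON log file,
-- without exporting, loading, or transforming either source first.
-- All source names, paths, and columns are illustrative assumptions.
SELECT   o.region,
         COUNT(DISTINCT c.session_id) AS sessions,
         SUM(o.amount)                AS revenue
FROM     mysql.sales.orders           AS o
JOIN     files.`/logs/clicks.json`    AS c
         ON o.customer_id = c.customer_id
GROUP BY o.region
ORDER BY revenue DESC;
```

One statement touches two systems and two formats, covers the merge and aggregation items above, and uses nothing more exotic than ordinary SQL.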
Such a tool would eliminate the need for costly and ineffective “data lakes”, and would allow individual organizational units to maintain control of their data’s structure and use while still giving analysts access to that data to mine it for valuable insights. If only such a tool existed…
Well, it does, sort of. Anyone who knows me knows that I am a big fan of Apache Drill, which I believe is one of the most promising tools I have seen for conquering the ETL beast. Drill allows you to query self-describing data, in almost any common format, where it is, without having to define a schema first or really do anything at all. Drill uses standard ANSI SQL, which means the learning curve is small for analysts already versed in SQL. Mechanically, Drill is not a wrapper around MapReduce but a distributed query engine with an in-memory, columnar execution model, so its performance is very good on both big and small datasets. Drill is not a database replacement and cannot execute queries with millisecond response times: it’s still pretty quick, but if you are serving web pages, MySQL is probably still the way to go. The only product I’ve seen that is remotely comparable to Drill is another, equally little-known tool called Presto, which is being developed by Facebook and Teradata.
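To give a flavor of what this looks like in practice, here is a minimal sketch of a Drill query run against a directory of raw JSON logs through Drill’s `dfs` file-system storage plugin. The path and field names are assumptions for illustration only.

```sql
-- Apache Drill: aggregate a directory of raw JSON event logs in place.
-- No schema definition and no load step; `dfs` is Drill's file-system
-- storage plugin, and the path and field names are assumed for this example.
SELECT   e.event_type,
         COUNT(*)                  AS num_events,
         COUNT(DISTINCT e.user_id) AS unique_users
FROM     dfs.`/data/raw/events` AS e
GROUP BY e.event_type
ORDER BY num_events DESC;
```

A query like this can be run from Drill’s bundled sqlline shell or its web console, and the same syntax works whether the files sit on your laptop or in a Hadoop cluster.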
Drill is still a relatively young tool in the big data ecosystem (the first stable release only came out last July) and there is a lot of room for improvement, but I’m convinced that Drill and tools like it have enormous potential to greatly reduce ETL inefficiencies.
More than improving ETL
While Drill can dramatically improve the ETL process, the real advantage I see is that it can dramatically decrease the Time to Insight: the time from when a data scientist receives data to when they actually produce something of value. If data scientists are no longer wasting half of their time (or more) on work they don’t enjoy, it is only natural that their productivity will increase dramatically as well.