Skip to content

Month: May 2017

The Best Of Both Worlds: Joining Online And Local Datasets With Apache Drill

data.world is rapidly establishing itself as the premier site for data scientists and analysts to host and collaborate on datasets. I have been impressed with data.world’s growth and interested in starting to use the platform in my professional projects.  On data.world, datasets can be open and visible to the general public or they can be private, with visibility limited to select contributors. That is sufficient to guarantee the privacy of the data most of the time. However, in some cases, you may be explicitly prohibited from uploading data to the cloud.

Would it be possible to use data.world in a project even when part of the data must not live in the cloud? 

It didn’t take me long to answer that question. Fortunately, I also have been doing a meaningful amount of experimentation and development with Apache Drill over the last few years. What impresses me about Drill is its versatility and potential to dramatically increase analytic productivity, open up previously inaccessible data sources, query across data silos, and do so with the common language of ANSI SQL.

As I began experimenting with both, I couldn’t help but wonder if it might be possible to somehow combine the two.

Well, it turns out, it is…

Leave a Comment