Skip to content

The Best Of Both Worlds: Joining Online And Local Datasets With Apache Drill is rapidly establishing itself as the premier site for data scientists and analysts to host and collaborate on datasets. I have been impressed with’s growth and interested in starting to use the platform in my professional projects.  On, datasets can be open and visible to the general public or they can be private, with visibility limited to select contributors. That is sufficient to guarantee the privacy of the data most of the time. However, in some cases, you may be explicitly prohibited from uploading data to the cloud.

Would it be possible to use in a project even when part of the data must not live in the cloud? 

It didn’t take me long to answer that question. Fortunately, I also have been doing a meaningful amount of experimentation and development with Apache Drill over the last few years. What impresses me about Drill is its versatility and potential to dramatically increase analytic productivity, open up previously inaccessible data sources, query across data silos, and do so with the common language of ANSI SQL.

As I began experimenting with both, I couldn’t help but wonder if it might be possible to somehow combine the two.

Well, it turns out, it is…

How does it work?

It is possible to query almost any JDBC data source directly using Drill and thanks to the development team, there is a JDBC driver which is compatible with Drill.  What this means is that once you set up Drill with the JDBC driver, you can query and join any dataset that is hosted on with any dataset that Drill can access — remote or local.

A Quick Demo

To try this out, I uploaded a dataset to which is a listing of MAC Addresses and the vendor prefix so that you can identify the manufacturer of a device from its MAC address.  If you aren’t familiar with MAC addresses, they are a hardware identifier which is assigned to network interfaces. The IEEE assigns first six digits of each MAC address which makes it possible to identify the manufacturer from the MAC address.  You can download a list of the vendors and their prefix codes from the IEEE, but it is not in an easy to analyze format, so I wrote a parser to convert this data into a CSV file which I then uploaded to  Once this is uploaded to, you can execute simple queries against this data such as the one below:

SELECT companyName 
FROM dw.cgivre.`mac-address-manufacturers`.`20170426mac_address.csv/20170426mac_address` 
WHERE country='CN'

The query above returns the company name from that dataset where the country code is ‘CN’ or China.  However, while this is somewhat interesting, let’s look at another example in which we join local data with the data hosted on  For this example, I am going to query a PCAP file that is locally hosted, extract the MAC addresses from the file, and join them with the MAC Address data hosted on data.word. The query was as follows:

SELECT SUBSTRING( REGEXP_REPLACE( MACAddressSource, ':', '' ),1,6 ) AS MacAddress, 
FROM dfs.test.`test1.pcap` AS p
JOIN dw.cgivre.`mac-address-manufacturers`.`20170426mac_address.csv/20170426mac_address` AS dw ON dw.prefix = SUBSTRING( REGEXP_REPLACE( MACAddressSource, ':', '' ),1,6 )

The results are shown below:

So with absolutely zero data prep, I was able to extract fields from raw data and merge that with data hosted on

Setting up Drill

In order to query with your Drill instance, you will first need to create a storage plugin for  To do this, open Drill’s web interface and click on the storage link at the top. Create a new storage plugin by clicking on the link at the bottom. Once you get a blank window, enter the following text:

 "type": "jdbc",
 "driver": "",
 "url": "jdbc:data:world:sql:<username>:<dataset>",
 "username": "<username>",
 "password": "<API Key>",
 "enabled": false

You will need to get an API Key from which is available here once you have an account. Save this plugin as dw.

Querying from Drill

Once you’ve set up the storage plugin, all you really will have to do to query a dataset is modify the FROM clause of your query.  The driver is a little different than a regular Drill FROMclause in that it has four parts whereas a traditional Drill data source only has three.


Thus to query the example data mentioned in this article, you could use the following FROM clause:


As this demonstrates, Apache Drill has once again proven to be a flexible and powerful tool for joining data from disparate datasources. This time, it allowed me to bring the wealth of data on to a context where it can be securely used in conjunction with local and highly sensitive data that otherwise would have been siloed.

Share the joy

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *