I’ve decided that since I really like Apache Drill, I’m going to start a page listing all the publicly available User Defined Functions (UDF) that I can find. If you have one, please email me and I’ll put it up here. I’ve been working on some that I will put on github shortly as well.
This set of UDFs (https://github.com/k255/drill-fuzzy-search) enables you to execute simple similarity and distance searches. It supports:
The installation instructions are posted on the github page. I did have a little trouble getting this to work, but the developer was responsive and updated the pom.xml file so it builds now without any problems.
I’m going to write a blog post about this but here are an amazing set of UDFs that enable geo-spatial queries using Drill. (https://github.com/k255/drill-gis) Here’s the beauty of this set of UDFs… it actually already ships with Drill so if you’ve been using Drill, you already have it!
What is supported:
- geometries with initial native binary representation (WKB)
- functions to create geometries ST_Point(lon, lat), ST_GeomFromText(wkt_text)
- spatial functions ST_DWithin(geom, geom, distance), ST_Within(geom, geom), ST_Distance(geom, geom)
- function to convert geometry from binary to text ST_AsText(geom)
- reprojection of coordinates from one spatial reference system to another ST_Transform(geom, srcSRID, tgtSRID)
The developer is still adding functionality, and not all is included in the Drill distro, but still… it’s pretty impressive.
This UDF is featured on Dremio’s blog and demonstrates how to write an aggregate function. If you aren’t familiar with Pearson’s R, it is a measure of how correlated two data sets are. Unfortunately, Dremio doesn’t include a repo with working code, so you’ll have to know how to build a maven project for this to make it work. None the less, this could be a very useful function to use in Drill.
Simple Sentiment Analysis
Also featured on Demio’s blog, but here is a detailed tutorial of how to create a UDF which performs simple sentiment analysis. This tutorial is quite thorough in demonstrating how to write a fairly sophisticated UDF. Unfortunately, there also isn’t a repo for this tutorial, but this tutorial does walk you through the complete process, so it is a good starting place.
Simple Drill Functions
These seem more like demo functions, but this repository was written by Ted Dunning and effectively demonstrates the capabilities of Drill’s UDF interface. It includes zip() which roughly mirrors python’s zip() function as well as a mask() function which masks n characters in a string.
This is a function which has been often requested to be integrated in Drill. The contains() function implements a regex match and uses Java’s regex library to match strings. Example usage is:
SELECT * FROM cp.`employee.json` WHERE contains(first_name,'She.*');
There’s not much in the way of documentation but, really there doesn’t need to be. Here’s the link: https://github.com/parisni/drill-simple-contains. I hope the Drill committers consider integrating this function into Drill, as it is really needed.
This batch of functions contains a series of functions necessary for network analysis. Here is the link: https://github.com/cgivre/drill-network-functions. The functions are:
inet_aton(<IPv4>): This function converts an IPv4 address in dotted decimal notation into an integer. This is useful for sorting IP addresses, and reducing the amount of space that they take on disk.
inet_ntoa(<int>): This function returns an IP in dotted decimal notation given its integer representation.
is_private_ip(<IPv4>): Returns true if the IP address is private.
in_network( <IPv4>, <CIDR Block>): Retunrs true if the IPv4 address is in the CIDR Block
getAddressCount( <CIDR Block> ): Returns the number of IP addresses in a given CIDR Block
getBroadcastAddress( <CIDR Block> ): Returns the broadcast address in dotted decimal notation from a given CIDR block.
getNetmask( <CIDR Block> ): Returns the netmask for a given CIDR Block
getLowAddress( <CIDR Block> ): Returns the first IPv4 address in dotted decimal notation for a given CIDR Block
getHighAddress( <CIDR Block> ): Returns the last IPv4 address in dotted decimal notation for a given CIDR Block