Googlesheets (GS) is one of those data sources that I think most data scientists use and probably dread a little. Using GS is easy enough, but what if a client gives you data in GS? Or worse, what if they have a lot of data in GS and other data that isn’t? Personally, whenever I’ve encountered data in GS, I’ve usually just downloaded it as a CSV and worked on it from there. This works fine, but if you have to do something that requires you to pull this data programmatically? This is where it gets a lot harder. This article will serve as both a rant and tutorial for anyone who is seeking to integrate GoogleSheets into their products.
I decided that it would be worthwhile to write a connector (plugin) for Apache Drill to enable Drill to read and write Google Sheets. After all, after Excel, they are probably one of the most common ways people store tabular data. We’ve really wanted to integrate GoogleSheets into DataDistillr as well, so this seemed like a worthy project. You can see this in action here:
So where to start?
Aha! You say… Google provides an API to access GS documents! So what’s the problem? The problem is that Google has designed what is quite possibly one of the worst SDKs I have ever seen. It is a real tour de force of terrible design, poor documentation, inconsistent naming conventions, and general WTF.
To say that this was an SDK designed by a committee is giving too much credit to committees. It’s more like a committee who spoke one language, hired a second committee which spoke another language to retain development teams which would never speak with each other to build out the API in fragments.
As you’ll see in a minute, the design decisions are horrible, but this goes far beyond bad design. The documentation, when it exists, is often sparse, incomplete, incoherent or just plain wrong. This really could be used as an exemplar of how not to design SDKs. I remarked to my colleague James who did the code review on the Drill side, that you can tell when I developed the various Drill components as the comments get snarkier and snarkier.
Let’s begin on this tour of awfulness.
I’ll begin the tour of the GS API as a tutorial of how to write Drill plugins. The first step is to connect and obtain schema information and in order to do anything you must first authenticate.
Authenticating
Let’s start with authentication. What better place to start? To work with GS, you have to create a Sheets
object and authenticate with it. GS uses that evil of all evil authentication methods: OAuth 2.0. Now, it is possible to authenticate using service accounts, however, this method is often not acceptable to companies, and we’ve really wanted to have all Drill plugins implement user translation. User translation basically means that users authenticate with their own credentials but use the same configuration. So we’re stuck with OAuth. The tricky part is that we wanted to have Drill store the tokens and pass them to the Google objects.
If you aren’t familiar with OAuth, you should consider yourself fortunate. Basically, there are four tokens that you have to track, a clientID
, client_secret
, access_token
and refresh_token
. The important part to understand is who owns what tokens. The client tokens belong to the application and the access and refresh tokens belong to the user. Drill has its own architecture for obtaining the tokens, securely storing them and mapping them to the user who submitted the query. The problem is that GS does not provide you with any way, other than their own library for creating a Sheets
object and supplying credentials.
Here’s where the fun begins.
You would think that there is some way to create a Sheets
object simply by passing it some form of credentials, like say a Credentials
object? So the code might look something like this:
// Doesn't work... don't try this.
Sheets sheetsClient = new Sheets(my_creds);
That would make sense, so naturally there isn’t such a method. After much digging, I was able to write the method below which gets you an authenticated Sheets
object. Here is the actual code for obtaining the Sheets
object.
public static Sheets getSheetsService(GoogleSheetsStoragePluginConfig config,
DataStore<StoredCredential> dataStore,
String queryUser)
throws IOException, GeneralSecurityException {
Credential credential = GoogleSheetsUtils.authorize(config, dataStore, queryUser);
return new Sheets.Builder(
GoogleNetHttpTransport.newTrustedTransport(), GsonFactory.getDefaultInstance(), credential)
.setApplicationName("Drill")
.build();
}
That doesn’t look so bad right? The problem arises in the line where we are obtaining the new Credential
object. Let’s take a look there…
/**
* Creates an authorized {@link Credential} for use in GoogleSheets queries.
* @param config The {@link GoogleSheetsStoragePluginConfig} to be authorized
* @param dataStore A {@link DrillDataStore} containing the user's tokens
* @param queryUser The current query user's ID. This should be set to anonymous if user translation is disabled.
* @return A validated {@link Credential} object.
* @throws IOException If anything goes wrong
* @throws GeneralSecurityException If the credentials are invalid
*/
public static Credential authorize(GoogleSheetsStoragePluginConfig config,
DataStore<StoredCredential> dataStore,
String queryUser) throws IOException, GeneralSecurityException {
GoogleClientSecrets clientSecrets = config.getSecrets();
GoogleAuthorizationCodeFlow flow;
List<String> scopes = Collections.singletonList(SheetsScopes.SPREADSHEETS);
if (dataStore == null) {
logger.debug("Datastore is null");
throw UserException.connectionError()
.message("The DrillDataStore is null. This should not happen.")
.build(logger);
} else if (dataStore.getDataStoreFactory() == null) {
logger.debug("Datastore factory is null");
throw UserException.connectionError()
.message("The DrillDataStoreFactory is null. This should not happen.")
.build(logger);
}
flow = new GoogleAuthorizationCodeFlow.Builder
(GoogleNetHttpTransport.newTrustedTransport(), JSON_FACTORY, clientSecrets, scopes)
.setDataStoreFactory(dataStore.getDataStoreFactory())
.setAccessType("offline")
.build();
return loadCredential(queryUser, flow, dataStore);
}
This code originates from a tutorial that I found somewhere, but the fundamental problem is that you can’t just construct a Credentials
object. There is all this other debris that you need in order to do so. This method ends by calling yet another method called loadCredentials
which actually takes all the information and finally populates a Credentials
object. Before I explain the “why this is bad”, let me first discuss the situation around the DataStore
objects.
Storing Your Tokens
The root problem that I encountered was that I needed a way to store the user tokens. Both Drill and the GS integrations have ways of handling this, so the challenge was to get them to work together. My original thinking was to use GS’s pieces exclusively, however, as you’ll see in a second, that proved to be impossible. When I first started working on this, I dug around on stackoverflow, and other places to see if anyone else had done what I was looking to do. What I found were a lot of angry posts, but no answers. I dug around in the code and found what looked like the answer: the StoredCredential
object. This object seemed to be what I was looking for in that it held the ephemeral user tokens. The problem was that the Sheets
object needed a Credential
object and not a StoredCredential
. No problem I thought.. Surely there is some way to convert a StoredCredential
into a Credential
right? After all, it should just be a “stored” version of a credential! Problem solved!!
Nope
After many hours of dropping f-bombs, I discovered the fundamental problem: The Credential
and StoredCredential
objects have nothing to do with each other. You read that correctly. They are completely unrelated. The Credential
object stores the application credentials and the StoredCredential
stores the user credentials. As a fundamental principal of clean coding, you should be able to tell what an object does by its name. Google could have easily named these objects something like ApplicationCredentialStore
or anything so that it would be obvious what they were used for, but they didn’t.
Improvement #1: Use better class names
I wish I could say that my problems ended there, but they didn’t. Not even close. Unfortunately this discovery didn’t get me any closer to actually solving the authentication issues.
Objects, Object, Objects Everywhere!
To solve those issues, we have to look at three other classes, the CredentialStore
, the DataStore
and DataStoreFactory
. I know you’re thinking… GET ON WITH IT, so I’ll do that. Basically, GS has abstractions for credential storage. When you create the Credential
object, it needs to have some sort of data store associated with it so that the ephemeral user tokens can be updated as needed. Now, of those three which one would you guess is the right one to use? WRONG!! It isn’t the CredentialStore
. That has been deprecated. Despite its deprecation, it is still all over the Google SDK. You would think that when Google releases a major version of the SDK, they would remove all the deprecated classes, but nope.. They’re still there. What I find equally interesting is that several are marked both @Beta
and @Deprecated
.
Anyway, you have to use the DataStore
. Fortunately, this is an abstract class and comes in several flavors, depending on how you want to store these tokens. There is a FileDataStore, MemoryDataStore and others. I realized at this point that I had to implement a DataStore that would integrate with Drill’s classes for storing credentials of which there are two: A CredentialsProvider
and a TokenTable
.
If you are curious, you can look here at the code. Now it isn’t enough to just implement the DataStore
, you also have to have a DataStoreFactory
which creates the bloody thing. All this for what is essentially a HashMap.
Improvement #2: Make it easier to create objects and eliminate superfluous ones
To this day, I’m not sure what the point of the factory object is but it’s there now…
Ok.. to end this tale of woe, I had implemented the data stores and all seemed like it should work, but it wasn’t. What the GS SDK does is if it can’t find active credentials, it outputs a link to the authentication page in the CLI and listens for the results. I suppose this is a good thing, but I was supplying the credentials from Drill, so we didn’t need this. Ultimately, I tracked the problem to this bit of code in the AuthorizationCodeFlow
class in the Google SDK.
@SuppressWarnings("deprecation")
public Credential loadCredential(String userId) throws IOException {
// No requests need to be performed when userId is not specified.
if (isNullOrEmpty(userId)) {
return null;
}
if (credentialDataStore == null && credentialStore == null) {
return null;
}
Credential credential = newCredential(userId);
The problem lies in the if statement which checks to see if the credentialDataStore
and credentialStore
are null. Logically this makes sense, however there seemed to be an issue in populating the credentialDataStore thus, this method would always return null. (I did this a few months ago and am not remembering exactly what the issue was here). The other issue is that it checks to see if the credentialStore
is null. This is the deprecated class that you shouldn’t be using in the first place. I ended up having to essentially override this method (and a few others) to get this to work.
In conclusion, authentication was way harder than it needed to be, and solely because the Google SDK was just so overloaded with Stores, Factories and other BS objects that really don’t need to exist. This whole SDK needs some serious cleanup to remove all the deprecated, and other nonsense that really makes this look like a bunch of interns developed this as a summer project.
Accessing the Files
Now that we have authenticated, the next step would be to actually open the file that you want to query. Ideally, to correctly populate the schemas, you need a list of all available Google Sheets documents. When Google released the v4 version of the GS SDK, in their infinite wisdom, they decided to remove the functionality that allowed you to get a list of available files. (Note… for simplicity’s sake, I will be referring to the sheets document as a file. This file can contain one or more sheets which is the actual data).
It is possible to get this information but it is located in the Drive
SDK and requires restricted permissions to do so. This is not a deal breaker, but it requires the user to use the fileID string to access individual files. The issue here is that this string is a non-human readable hash code (or something like that) AND it also means you can’t list what documents the user has access to. What’s frustrating here, is that the v3 API used to let you do this. So:
Improvement #3: Let developers list what Sheets Documents are available.
Now that we are able to connect to GS, the next task is to actually get a peek at the data to understand the structure, and here too, we find Google’s pattern of horrible design.
Figuring out the Schemata
To write a Drill storage plugin the first real step is to figure out the schema of the data that you are querying. When you execute a query, Drill uses a cost-based query planner called Calcite to figure out what is the optimal query plan. Usually, the goal is to push down as much of the compute to the data source. A simplistic way of thinking about it is let’s say that you are running the query below:
SELECT col1, col2
FROM <data>
WHERE col3 < 5
LIMIT 100
Drill could simply send a request to the data source for all available data, then filter it and pick out the requested columns. However, that is not efficient. The better way is to request only the data that is actually needed. So Drill could ask the data source, “Hey… give me col1, col2, but only where col3 < 5. Oh… and I only want 100 of these records”. This is known as a pushdown. The basic goal of all connectors is to push as much as possible down to the downstream system. The idea of pushdowns may seem very simple and it is for simple queries. However, it gets really complicated really quickly.
For our purposes, all we really need to do is get a list of columns and their data types, and this is where we had a whole bunch more fun.
All About A1, And I’m Not Talking Steak Sauce
Many spreadsheet-type programs use a notation called A1 notation to access parts of the spreadsheet. Microsoft actually has a pretty good explanation of A1 notation here. As an example, if you wanted the to get the first three columns in their entirety you could write a range: A:C.
Since SQL uses column names instead of positions, we had to write some code that figured out what columns were present and what their relative positions are. Since GS does not allow users to give their columns names, we assume that column names are present in the first row. But here we run into the metadata problem again and that is that the Google SDK doesn’t present any obvious way as to how to get the data type of the column in question. Now, I get that GS columns can contain mixed data, but still… there should be something, and there isn’t. Which leads us to:
Improvement #4: Give us some Metadata about the Columns.
I’ll spare you the details of how we do that, but let’s just say that we successfully obtained the column metadata. As mentioned earlier, I wanted to implement projection pushdown which basically means that we are only going to ask GS for the columns contained in the query. If you look at the MS examples, you’ll see this: Range("A:A,C:C,F:F")
which basically means that you are asking for the first column, the third column and the sixth column in the data. Here’s where I made another painful and undocumented discovery: GS does not allow you to have multiple ranges in one Range call like that. The painful discovery is there are two completely different ways of obtaining data from a GS document, depending on whether you have a single or multiple range call. The code below demonstrates the two methods.
The goal here is to get the data into a 2D table in the form of a List<List<Object>>
so that Drill can load this into memory vectors. If you have one range, this is really easy as the Google SDK does this and returns a list of rows. However, if you have multiple ranges, you get a list of columns with a ton of other debris.
public static List<List<Object>> getDataFromRange(Sheets service, String sheetID, String range) throws IOException {
return service.spreadsheets().values().get(sheetID, range).execute().getValues();
}
public static List<List<Object>> getBatchData(Sheets service, String sheetID, List<String> ranges) throws IOException {
logger.debug("Getting ranges: {}", ranges);
BatchGet request = service.spreadsheets().values().batchGet(sheetID).setRanges(ranges);
List<ValueRange> response = request.execute().getValueRanges();
List<List<Object>> results = new ArrayList<>();
// In Google's infinite wisdom when designing this API, the results
// are returned in a completely different fashion than when projection is not
// pushed down to Google Sheets. Specifically, if you use the regular values() to retrieve
// values from a GoogleSheet, you get a List of rows. Whereas if you use the BatchGet,
// you get a list of columns, sort of. Except these columns are embedded in a bunch of
// other debris from which you must extract the actual data.
//
// It should be noted that the GoogleSheets API does not accept multiple ranges in the
// request, so it is necessary to use the batch request.
for (int rowIndex = 0; rowIndex < ((ArrayList<?>) response.get(0).get("values")).size(); rowIndex++) {
List<Object> row = new ArrayList<>();
for (int colIndex = 0; colIndex < response.size(); colIndex++) {
try {
Object value = ((ArrayList<?>) ((ArrayList<?>) response.get(colIndex).get("values")).get(rowIndex)).get(0);
row.add(value);
} catch (IndexOutOfBoundsException | NullPointerException e) {
row.add(null);
}
}
results.add(row);
}
return results;
}
As someone who has spent a lot of time coding in Python, this is just so awful. The fact that I had to write these two methods just to get data into a usable form was really bad. It makes me wonder
In the early days, the Python community wrote PEP 20, The Zen of Python which outlines a philosophy of Python design. I couldn’t help but think of one item from PEP 20 which is:
There should be one-- and preferably only one --obvious way to do it.PEP-20
With the GS SDK, there are often zero obvious ways to do what I would imagine are common tasks, such as Authenticate. There also are tons of functionality that seems to have no clear way how to integrate. One thing I wanted to implement was a sort pushdown. After reading through the docs, I found the SortRangeRequest
which looked like it was what I would need to do that. Unfortunately, the documentation was so sparse that I just reached the point of saying I’m not spending any more time on this. The same goes for filters.
Improvements 5: Provide One, and Only One Obvious Way for Common Tasks.
Bonus Improvement: Update the Docs so They Are Correct and Include Code Samples
The bonus improvement was perhaps the most frustrating. I can understand how an early stage startup isn’t going to have the most up to date documentation. However, for a huge company like Google, it is really pretty pathetic. In addition to omissions, I also found errors in the docs, code samples that don’t work and all sorts of other f-bomb producing goodies.
So What Happened?
Well, in the end, I did write a storage plugin for GoogleSheets that works! You can both read from and write to Google Sheets from Drill. Since DataDistillr uses Drill as our query engine, we will be integrating this into our product in the next few weeks. Ultimately, what this means for DataDistillr customers is that you will be able to directly access data from GoogleSheets, query and join it with other data sources, and also write data back to Google Sheets.