No More Coffee Breaks – Faster Hyperparameter Tuning in the Cloud

This post comes with a notebook to code along in – it works in your own environment. You can also launch it on the Coiled platform.

Try it on your own data, run it on any Dask cluster!

TL;DR

You’ve selected the best modeling approach – now it’s time to get the high score. What’s the best set of hyperparameters to use? Parallel hyperparameter optimization on a cluster saves you from long waits and gets you answers faster.

Working with the Wine Quality dataset, we’ll optimize an XGBoost model using scikit-learn Grid Search and Optuna. All of this on a cluster, directly from your own environment!

Pursuing better performance

Most machine learning algorithms come with a decent set of knobs and levers – hyperparameters – to adjust their training procedure. They come with defaults, and veteran data practitioners have their go-to settings.

Finding the best configuration holds the promise of significantly better performance – so how do we strike a balance between the resources spent on optimization and the performance gained?

In this post we’ll examine increasingly sophisticated approaches to solving that conundrum:

  • Grid Search – to infinity and beyond! Trying every combination.
  • Random Search – okay, I don’t have all day, let’s buy a fixed number of lottery tickets
  • Hyperband – let’s invest more in lottery tickets similar to those that are winning
  • Optuna – intelligent, algorithmic discovery of the hyperparameter space

Let’s find the best set of hyperparameters for an XGBoost model that predicts the quality of the wine based on 11 descriptive features [Cortez et al., 2009].

import pandas as pd
import numpy as np

url = '.../wine-quality/winequality-white.csv'
wine = pd.read_csv(url, delimiter=';')

# Features (the 11 physicochemical measurements) and target (the quality score)
X = wine.drop(columns='quality')
y = wine['quality']

from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import cross_val_score

estimator = XGBRegressor(objective='reg:squarederror')

# Negated so the output reads as a positive mean absolute error
-cross_val_score(
    estimator, X, y,
    cv=5, scoring='neg_mean_absolute_error'
).mean()

# Out: 0.6171626137196118

Out of the box, the model learns to predict wine quality on a scale of 1 to 10 with a mean absolute error of 0.617. Can we do better than that? How many options do we need to try? When should we stop looking?

Caveat: When NOT to optimize hyperparameters

The appropriate time for hyperparameter optimization is after we have selected one or more high-conviction models that work well on a smaller data sample. Finding the optimal set of hyperparameters for a half-baked solution is not a useful thing to do.

Framing the problem correctly, discovering useful features, and getting promising results on a modeling approach are good boxes to check before moving on to optimization.

Two resources worth checking out before scaling up are these excellent notes by Tom Augspurger, and `dabl`, a new Python library started by Andreas Mueller, which helps jumpstart work on ML problems!

Grid search: Infinite options, infinite time

Problems, and datasets, are unique. Large ensemble models have so many settings that there is no way of knowing whether an initial configuration is anywhere near the optimum. Trying really is the only way.

Trying means evaluating a very large number of parameter combinations.

If we had enough computers, we could try all the interesting options at the same time! 

This is a compute-bound problem: our data fits in RAM, but the processor keeps cranking through the various options. If you’ve ever done backtesting on financial time series data, yup, that’s the same type of situation.

Let’s launch this example grid search – starting with a list of parameter variants to combine and try:

params = {
    'max_depth': [3, 6, 8, 16],
    'min_child_weight': [3, 5, 10, 20, 30],
    'eta': [.3, .2, .1, .05, .01],
    'colsample_bytree': np.arange(.7, 1., .1),
    'sampling_method': ['uniform', 'gradient_based'],
    'booster': ['gbtree', 'dart'],
    'grow_policy': ['depthwise', 'lossguide']
}

Then we run the scikit-learn helper:

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=params,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    cv=5
)
grid_search.fit(X, y)

This is a good time to go away and make dinner, as we’ll be trying:

5 folds for each of 3200 candidates, totaling 16000 fits.
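Where does that number come from? A quick sanity check (a small sketch, reusing the `params` grid defined above) multiplies the number of options per hyperparameter – note that `np.arange(.7, 1., .1)` actually yields four values here because of floating-point rounding:

import numpy as np

# Candidate combinations = product of the number of options per hyperparameter
n_candidates = np.prod([len(options) for options in params.values()])
n_fits = n_candidates * 5  # times 5 cross-validation folds

print(n_candidates, n_fits)
# Out: 3200 16000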

That takes 35 minutes on my laptop (8-core CPU).

Not a great option, especially if I had more data, or a greater number of models to work on!

-grid_search.best_score_
# Out: 0.5683355409303652

Yay, better score!
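To see which configuration won, scikit-learn stores it on the fitted search object, along with a copy of the estimator refit on the full dataset:

# The hyperparameter combination behind the best cross-validation score
print(grid_search.best_params_)

# GridSearchCV refits the winning configuration on the full data by default
best_model = grid_search.best_estimator_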

Parallel Model Training in the Cloud with Dask

Dask.distributed comes to my aid – it integrates directly with scikit-learn through the joblib parallel backend and lets me send the work to a cluster.

There are many ways to set up a Dask cluster, one of them is Coiled. 

With Coiled, I can get a cluster of any scale in 2 minutes – at the moment I’m going to take advantage of the free beta limit of 100 cores and order 24 workers with 4 cores each. I’m going to do it directly from my Jupyter notebook.

import joblib
import coiled
from dask.distributed import Client
cluster = coiled.Cluster(n_workers=24, configuration='michal-mucha/xgb-action')
client = Client(cluster)

Next, I’ll use joblib – the parallel backend scikit-learn relies on – to create a context that scatters my data to all the cluster workers and has them fit the variety of configurations:

with joblib.parallel_backend("dask", scatter=[X, y]):
    grid_search.fit(X, y)

Fitting 5 folds for each of 3200 candidates, totalling 16000 fits
[Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 96 concurrent workers.

Checking in on the cluster dashboard, I can see the resources are as busy as they should be.

Coiled meticulously calculates (and estimates) cost – this exercise took just under 6 minutes and cost a total of $1.50. 

That cost is on Coiled. For you, it’s free!

Leading approaches to efficient hyperparameter optimization

When to stop the search?

As mentioned before, the number of combinations to try explodes very quickly as we add parameters with many possible options. Various approaches to the problem of optimization prescribe specific constraints, which rein in this exuberance. Here are some of them:

Random Search

For every hyperparameter, the data scientist provides either a distribution to sample random values from, or a list of values to sample from uniformly. What constrains the computation is the explicit number of allowed combinations to try. For example, we can order 100 random sets of values.
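As a rough sketch of what that looks like with scikit-learn’s `RandomizedSearchCV`, reusing the estimator from above – the distributions below are illustrative choices, not tuned recommendations:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# Distributions to sample from instead of a fixed grid
random_params = {
    "max_depth": randint(3, 17),            # integers 3..16
    "min_child_weight": randint(3, 31),     # integers 3..30
    "eta": uniform(0.01, 0.29),             # floats in [0.01, 0.30]
    "colsample_bytree": uniform(0.7, 0.3),  # floats in [0.7, 1.0]
    "booster": ["gbtree", "dart"],          # lists are sampled uniformly
}

random_search = RandomizedSearchCV(
    estimator=estimator,
    param_distributions=random_params,
    n_iter=100,  # the budget: 100 random combinations
    scoring="neg_mean_absolute_error",
    cv=5,
    n_jobs=-1,
)
random_search.fit(X, y)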

Hyperband

This novel algorithm represents the search as an “infinite-armed bandit”, referring to the casino analogy of “which is the best-paying slot machine to play”. It uses the same constraint as random search – a budget of iterations – but manages that budget much more deliberately: winning combinations get more of it, while losing combinations are abandoned early.

However, Hyperband supports a limited set of models, as it requires the estimator to expose a `partial_fit` method. XGBoost does not, so we won’t use it in this example.
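For completeness, here is a rough sketch of what Hyperband looks like with dask-ml’s `HyperbandSearchCV` and an estimator that does implement `partial_fit`, such as scikit-learn’s `SGDRegressor` – the parameter ranges are illustrative:

from dask_ml.model_selection import HyperbandSearchCV
from scipy.stats import loguniform
from sklearn.linear_model import SGDRegressor

# Hyperband needs partial_fit so it can cut off weak candidates early
# and reinvest the budget in the promising ones
sgd_params = {
    "alpha": loguniform(1e-6, 1e-1),
    "eta0": loguniform(1e-4, 1e-1),
}

hyperband = HyperbandSearchCV(SGDRegressor(), sgd_params, max_iter=81)
# hyperband.fit(X, y)  # requires a running Dask client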

Bayesian Optimization

A sequential approach to optimization aimed at functions that are expensive to evaluate, Bayesian Optimization seeks to suggest the most valuable next set of trials.

Representing the output metric – for example, model accuracy – with a probabilistic function allows efficient search guided by reducing uncertainty.

Here, as in Hyperband and Random Search, we control how many iterations are computed.

Optuna, a popular Python library, implements a parameter sampler that leverages this technique and works out of the box.

For a deeper dive, check out this wonderful introduction by Crissman Loomis, AI Engineer at Preferred Networks, the team leading the work on Optuna.

Optuna – Bayesian Optimization

Let’s give it a spin. A LocalCluster is enough – alternatively, try the hosted notebook on Coiled Cloud.
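The Optuna example below expects a Dask client to be up – called with no arguments, `Client()` starts a LocalCluster on the machine’s own cores:

from dask.distributed import Client

# Starts a LocalCluster under the hood and connects to it
client = Client()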

First, we need to define the objective function for Optuna to optimize. Within it, we will use the probabilistic sampler to get better suggestions for hyperparameter combinations:

def objective(trial):
    # Probabilistic sampler invocation on a parameter-by-parameter basis
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 16, log=True),
        "min_child_weight": trial.suggest_int("min_child_weight", 3, 30, log=True),
        "eta": trial.suggest_float("eta", 1e-8, 1.0, log=True),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.7, 1.0, log=True),
        "sampling_method ": trial.suggest_categorical(
            "sampling_method", ["uniform", "gradient_based"], log=True
        ),
        "booster": trial.suggest_categorical("booster", ["gbtree", "dart"]),
        "grow_policy": trial.suggest_categorical(
            "grow_policy", ["depthwise", "lossguide"]
        ),
    }
    
    # Initializing the model with the sampled hyperparameters
    estimator = XGBRegressor(objective="reg:squarederror", **params)
    
    score = cross_val_score(
        estimator, X, y, cv=5, scoring="neg_mean_absolute_error"
    ).mean()
    
    return score

Then, all that is left to do is to launch an Optuna study, using Dask as a parallel backend:

import dask_optuna
import joblib
import optuna

storage = dask_optuna.DaskStorage()

study = optuna.create_study(direction="maximize", storage=storage)
with joblib.parallel_backend("dask"):
    study.optimize(objective, n_trials=100, n_jobs=-1)

This guided approach produces a result very near to the optimal outcome from Grid Search, within just 100 trials.

-study.best_value
# Out: 0.569361444674037

Best trial marked in red.

The execution took just 1 minute 13 seconds on a 16-core local cluster on my laptop, illustrating how a state-of-the-art approach can save enormous amounts of computation.
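Optuna stores the winning combination in `study.best_params`, so we can inspect it and refit a final model on the full dataset:

# Hyperparameters of the best trial
print(study.best_params)

# Refit a final model with the winning configuration
best_model = XGBRegressor(objective="reg:squarederror", **study.best_params)
best_model.fit(X, y)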

When dealing with bigger datasets, or more expensive objective functions, the best approach is to combine a great algorithm with scalable compute.

For code examples and a fully-configured environment, you can try large-scale, distributed XGBoost hyperparameter optimization with Optuna and Dask in this notebook, also available in our public GitHub repository.

For larger datasets, try the Higgs dataset, composed of 11 million simulated particle collisions, which you can access in our S3 bucket at s3://coiled-data/higgs/.
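To pull that data onto a cluster, Dask DataFrame can read it straight from S3. The snippet below assumes the bucket is publicly readable and stores the files as Parquet – adjust the reader if the format differs:

import dask.dataframe as dd

# Assumption: anonymous access and Parquet files; switch to dd.read_csv
# if the data is stored as CSV instead
higgs = dd.read_parquet(
    "s3://coiled-data/higgs/",
    storage_options={"anon": True},
)
higgs.head()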

Scoreboard: Grid Search vs Bayesian Optimization

Approach          Best Score (MAE)    Number of trials
Starting value    0.61716             1
Grid Search       0.56833             16,000
Optuna            0.56936             100

What we could have done better

  • I’ve not done any work on framing the problem or error analysis. The model was trained on a regression task – would making it a classification task allow the model to learn better?
  • The grid search comparison is a bit unfair. By hand-picking the parameters and ranges to try, I could have reached a good result with far fewer fits – but then I would have been doing Optuna’s work myself!
  • Testing the approaches on more models, especially ones that support Hyperband, would make for a more interesting comparison.
  • After demonstrating how well Optuna handles hyperparameter optimization, a natural step would be to leverage a cluster to run the experiment on a much larger dataset.

Dask and Optuna for the win

A combination of Dask and Optuna makes hyperparameter training on big data much more approachable.

Going distributed is a great option to have when solving CPU-bound problems such as hyperparameter optimization – the other option is to wait longer!

Applying a smarter approach is a good remedy, but when working with larger datasets and models, the ability to scale is irreplaceable.

What is really special about prototyping Dask code is that talking to a local cluster feels the same as talking to an enormous remote cluster. We can make sure all the code works before commissioning a fleet of machines.

With a Coiled cluster, you have the choice to burst to the cloud directly from your coding environment. The code you run is the same code that works on your local machine. 

You can hire as many CPUs as you need and release them as soon as you’re done, keeping cost under control while enjoying a meaningful power boost.

Next steps

Some exciting ways to take this work further:

  • Represent the task as a classification problem.
  • Use a different machine learning model and compare Hyperband with Optuna.
  • Take the winning approach from a Kaggle competition such as this churn prediction challenge, and try to improve the score by tuning hyperparameters further.

Get the most out of your data science and ML

Thanks for reading! If you’re interested in taking Coiled Cloud for a spin, which provides hosted Dask clusters, docker-less managed software, and zero-click deployments, you can do so for free today when you click below.

Dataset citation:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
