Applied ML · intermediate

Hyperparameter Tuning

“Automating the art of finding the right knobs to turn”

Grid Search (exhaustive), Random Search (surprisingly effective), Bayesian Optimisation (TPE/GP-based sequential search), Successive Halving, and Optuna — with interactive accuracy heatmap showing C × max_depth search space.

40 min

9 diagrams

7 Concepts Covered

Prerequisites

→ Model Evaluation

→ Gradient Boosting

Concepts Covered

GridSearchCV, RandomizedSearchCV, Bayesian Optimisation, Optuna, Successive Halving, CV Score, Overfitting to Validation

∑ Key Formulas

Grid Search

Exhaustive search over all combinations in the predefined grid

Successive Halving

Progressively eliminate poor candidates, allocating more resources to promising ones

Expected Improvement

Bayesian Optimisation acquisition function — trades exploration vs exploitation
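The three entries above are described only in words. One common way to write them down is sketched here; the symbols are assumptions introduced for illustration: n_i is the number of values tried for hyperparameter i, η is the halving factor, n_k and r_k are the candidates and per-candidate resources in round k, f* is the best score observed so far, and μ(θ), σ(θ) are the surrogate's predicted mean and standard deviation. The closed form for EI holds when the surrogate's prediction at θ is Gaussian.

```latex
\begin{align*}
% Grid Search: number of configurations for d hyperparameters
|G| &= \prod_{i=1}^{d} n_i \\[4pt]
% Successive Halving: fewer candidates, more resources each round
n_{k+1} &\approx \frac{n_k}{\eta}, \qquad r_{k+1} = \eta \, r_k \\[4pt]
% Expected Improvement (maximisation) and its Gaussian closed form
\mathrm{EI}(\theta) &= \mathbb{E}\bigl[\max\bigl(f(\theta) - f^{*},\, 0\bigr)\bigr]
  = \bigl(\mu(\theta) - f^{*}\bigr)\,\Phi(z) + \sigma(\theta)\,\phi(z),
  \qquad z = \frac{\mu(\theta) - f^{*}}{\sigma(\theta)}
\end{align*}
```

Here Φ and φ are the standard normal CDF and PDF. The first EI term rewards a high predicted score (exploitation) and the second rewards high uncertainty (exploration).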

🎯

Why Hyperparameters Matter

motivation

A random forest with max_depth=5 might score 0.72 AUC. The same algorithm with max_depth=12, min_samples_leaf=3, max_features='sqrt' scores 0.89 AUC. That 17-point gap is pure hyperparameter tuning — the algorithm didn't change, the data didn't change. Hyperparameters are parameters that are not learned from data; they control the learning process itself. Choosing them well is often the difference between a mediocre model and a production-ready one.

The learning rate is often the single most important hyperparameter in gradient-based models. Set too high, training diverges; set too low, training converges slowly or stalls in a poor local minimum. Tune it first.
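A minimal sketch of the learning-rate effect, assuming the simplest possible setting: plain gradient descent on f(x) = x². The function name and the three learning rates are illustrative choices, not values from any real model.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimise f(x) = x**2 with plain gradient descent; return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x       # derivative of x**2
        x = x - lr * grad    # standard update: each step multiplies x by (1 - 2*lr)
    return x

too_high = gradient_descent(lr=1.1)     # factor -1.2 per step: |x| grows, training diverges
too_low = gradient_descent(lr=0.001)    # factor 0.998 per step: barely moves in 50 steps
reasonable = gradient_descent(lr=0.1)   # factor 0.8 per step: converges towards 0

for name, x in [("too high", too_high), ("too low", too_low), ("reasonable", reasonable)]:
    print(f"{name:>10}: |x| after 50 steps = {abs(x):.4g}")
```

On this convex toy problem a low learning rate is only slow; the local-minimum risk appears on non-convex losses.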

⚖️

Grid vs Random vs Bayesian

comparison

Grid Search evaluates every combination in the Cartesian product of parameter values. It is exhaustive but exponentially expensive: 10 parameters with 5 values each give 5¹⁰ ≈ 9.8M configurations. Random Search samples n_iter random combinations and is surprisingly effective because most hyperparameter spaces have only a few dimensions that truly matter; random sampling tries many more distinct values along those dimensions than a grid does. Bayesian Optimisation maintains a probabilistic model of the objective surface (a Gaussian Process or a Tree-structured Parzen Estimator) and sequentially suggests configurations that maximize an acquisition function such as expected improvement, so it learns from previous evaluations and focuses on promising regions.

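With n_iter=60, Random Search has roughly a 95% chance of sampling at least one configuration from the best 5% of the search space (1 − 0.95⁶⁰ ≈ 0.95), which is why it often matches a Grid Search that uses several times more evaluations. Bayesian Optimisation tends to beat both when each evaluation is expensive (e.g., training a large neural net).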
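The arithmetic behind the comparison can be checked directly. This sketch assumes random draws are independent and uniform over the search space, and asks how likely it is that at least one of n_iter draws lands in the best 5% of configurations. The helper names are hypothetical.

```python
import math

def grid_size(values_per_param):
    """Number of configurations in a full grid (Cartesian product)."""
    return math.prod(values_per_param)

def prob_hit_top_fraction(n_iter, top_fraction=0.05):
    """P(at least one of n_iter independent uniform draws is in the top fraction)."""
    return 1.0 - (1.0 - top_fraction) ** n_iter

n_grid = grid_size([5] * 10)        # 10 hyperparameters, 5 values each
p60 = prob_hit_top_fraction(60)     # 60 random draws, target = best 5%

print(f"Full grid: {n_grid:,} configurations")
print(f"P(hit top 5% with 60 random draws) = {p60:.3f}")
```

The probability depends only on n_iter and the target fraction, not on the number of hyperparameters, which is why random search scales so much better than a grid.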

⚙️

Bayesian Optimisation Loop

algorithm

Fit a surrogate model (e.g. a Gaussian Process) to the previous (θ, score) observations

Use an acquisition function (Expected Improvement, UCB) to select the next θ

EI is large where the predicted score is high or where the surrogate is uncertain, so it balances exploitation and exploration

Evaluate the actual objective: train model with θ, compute CV score

Add new observation to dataset, refit surrogate

Repeat until budget exhausted — return best θ found
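The loop above can be sketched end to end on a toy problem. Everything here is an illustrative assumption: a one-dimensional θ, a cheap stand-in objective in place of "train the model and compute the CV score", a hand-rolled Gaussian Process with a fixed RBF kernel, and a fixed grid of candidate points. A real project would use a library such as Optuna rather than this code.

```python
import math
import numpy as np

def objective(theta):
    # Hypothetical stand-in for "train with theta, return CV score"; best at theta = 0.3.
    return -(theta - 0.3) ** 2

def rbf_kernel(a, b, length_scale=0.2):
    # Squared-exponential kernel between two 1-D arrays of points.
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    # Posterior mean and std of a GP with unit prior variance, centred on the mean of y_obs.
    y_mean = y_obs.mean()
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_new)
    mean = K_s.T @ np.linalg.solve(K, y_obs - y_mean) + y_mean
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mean, std, best):
    # Closed-form EI for maximisation under a Gaussian posterior.
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf

candidates = np.linspace(0.0, 1.0, 101)           # search space for theta
x_obs = np.array([0.0, 1.0])                      # two initial observations
y_obs = np.array([objective(x) for x in x_obs])
initial_best = y_obs.max()

for _ in range(8):                                # budget: 8 further evaluations
    mean, std = gp_posterior(x_obs, y_obs, candidates)       # fit surrogate
    ei = expected_improvement(mean, std, y_obs.max())        # acquisition function
    theta_next = candidates[np.argmax(ei)]                   # select next theta
    x_obs = np.append(x_obs, theta_next)                     # evaluate and record
    y_obs = np.append(y_obs, objective(theta_next))

best_theta = x_obs[np.argmax(y_obs)]
print(f"best theta = {best_theta:.2f}, best score = {y_obs.max():.4f}")
```

The surrogate is refit from scratch on every iteration, which is affordable because the expensive part in practice is the objective, not the Gaussian Process.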

🔬

Halving Search: Speed Without Sacrifice

deepdive

HalvingGridSearchCV and HalvingRandomSearchCV implement successive halving: start with all candidates on minimal resources (few training samples or few estimators), keep the top 1/η of them, multiply the resources per candidate by η, and repeat. In scikit-learn η is the factor parameter (default 3); the example here uses η = 2. A grid of 1024 candidates run for 4 halving rounds costs 1024×1 + 512×2 + 256×4 + 128×8 = 4096 resource units, whereas standard GridSearchCV trains all 1024 candidates at the full budget. For large grids the saving can reach one to two orders of magnitude, usually with little loss in quality.
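The cost arithmetic can be reproduced with a small schedule calculator. This is a sketch of the bookkeeping only, not of scikit-learn's internals: the function name is hypothetical, resources are counted in abstract units, and the halving factor is fixed at 2 to match the 1024-candidate example.

```python
def halving_schedule(n_candidates, factor=2, min_resources=1, n_rounds=None):
    """Return [(candidates, resources_per_candidate), ...] for each round."""
    schedule = []
    n, r = n_candidates, min_resources
    while n_rounds is None or len(schedule) < n_rounds:
        schedule.append((n, r))
        if n == 1:                   # a single survivor: nothing left to eliminate
            break
        n = max(1, n // factor)      # keep the top 1/factor of candidates
        r = r * factor               # give each survivor factor times more resources
    return schedule

four_rounds = halving_schedule(1024, n_rounds=4)
cost_four = sum(n * r for n, r in four_rounds)

to_the_end = halving_schedule(1024)             # run until one candidate remains
cost_full_halving = sum(n * r for n, r in to_the_end)
final_budget = to_the_end[-1][1]                # resources given to the last survivor
cost_plain_grid = 1024 * final_budget           # every candidate at that full budget

print(four_rounds, cost_four)
print(f"halving: {cost_full_halving}, plain grid: {cost_plain_grid}, "
      f"ratio: {cost_plain_grid / cost_full_halving:.0f}x")
```

Each round costs the same 1024 units because the candidate count halves while the per-candidate budget doubles, so the total grows only with the number of rounds.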

For neural networks, use Keras Tuner or Optuna rather than sklearn's search — they support asynchronous parallel trials, early stopping integration, and neural-specific search spaces.

</>

All Three Methods in scikit-learn

code

from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
# Halving search is experimental: the enabling import must come first.
# (Imported for the halving search described above; not used in the three searches below.)
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from scipy.stats import uniform, randint
import optuna  # Bayesian optimisation (TPE sampler by default)

# ── Sample data ────────────────────────────────────────────────────────
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ── Parameter space ────────────────────────────────────────────────
# 3 × 4 × 4 × 3 × 3 = 432 combinations, each fitted 5 times under 5-fold CV
param_grid = {
    'n_estimators':     [100, 200, 400],
    'max_depth':        [3, 5, 7, 9],
    'learning_rate':    [0.01, 0.05, 0.1, 0.2],
    'subsample':        [0.7, 0.8, 1.0],
    'min_samples_leaf': [1, 3, 5],
}

# ── 1. Grid Search (exhaustive, expensive) ─────────────────────────
gs = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid,
                  cv=5, scoring='roc_auc', n_jobs=-1)
gs.fit(X_train, y_train)
print(f"Grid best: {gs.best_score_:.4f}  {gs.best_params_}")

# ── 2. Random Search (fast, almost-as-good) ────────────────────────
# scipy's uniform(loc, scale) samples from [loc, loc + scale];
# randint(low, high) excludes high
param_dist = {
    'n_estimators':     randint(50, 500),
    'max_depth':        randint(2, 12),
    'learning_rate':    uniform(0.005, 0.3),   # [0.005, 0.305]
    'subsample':        uniform(0.6, 0.4),     # [0.6, 1.0]
}
rs = RandomizedSearchCV(GradientBoostingClassifier(random_state=42), param_dist,
                        n_iter=60, cv=5, scoring='roc_auc',
                        n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
print(f"Random best: {rs.best_score_:.4f}  {rs.best_params_}")

# ── 3. Optuna (Bayesian via TPE, sample-efficient) ─────────────────
def objective(trial):
    params = {
        'n_estimators':   trial.suggest_int('n_estimators', 50, 500),
        'max_depth':      trial.suggest_int('max_depth', 2, 12),
        # log=True samples the learning rate uniformly on a log scale
        'learning_rate':  trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'subsample':      trial.suggest_float('subsample', 0.5, 1.0),
    }
    model = GradientBoostingClassifier(random_state=42, **params)
    # Same 5-fold CV as above so the three best scores are comparable
    return cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize',
                            sampler=optuna.samplers.TPESampler(seed=42))
# n_jobs=4 runs trials in threads, so results are not exactly reproducible
study.optimize(objective, n_trials=100, n_jobs=4)
print(f"Optuna best: {study.best_value:.4f}  {study.best_params}")

# ── Final check on the untouched test set ──────────────────────────
# rs was refit on all of X_train with its best parameters (refit=True)
test_auc = roc_auc_score(y_test, rs.predict_proba(X_test)[:, 1])
print(f"Random-search model, held-out test AUC: {test_auc:.4f}")
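The halving search discussed earlier can be run in the same style. This is a small sketch under stated assumptions: the number of boosting stages (n_estimators) is used as the resource, the halving factor is 2, and the resource bounds of 25 and 200 are arbitrary illustrative values. Because n_estimators is the resource, it must not appear in the parameter distributions.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401  (must come first)
from sklearn.model_selection import HalvingRandomSearchCV, train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from scipy.stats import randint, uniform

X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

param_dist = {
    'max_depth':     randint(2, 8),          # integers 2..7
    'learning_rate': uniform(0.01, 0.29),    # [0.01, 0.30]
    'subsample':     uniform(0.6, 0.4),      # [0.6, 1.0]
}

hs = HalvingRandomSearchCV(
    GradientBoostingClassifier(random_state=42), param_dist,
    resource='n_estimators',    # the budget that grows each round
    min_resources=25,           # first round: 25 boosting stages per candidate
    max_resources=200,          # final round: at most 200 stages
    factor=2,                   # keep the top half, double the budget
    cv=3, scoring='roc_auc', random_state=42, n_jobs=-1)
hs.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, hs.predict_proba(X_test)[:, 1])
print(f"Halving best CV: {hs.best_score_:.4f}  {hs.best_params_}")
print(f"Held-out test AUC: {test_auc:.4f}")
```

Using n_estimators as the resource suits boosting models because weak configurations tend to reveal themselves after a few stages; for other estimators the default resource, the number of training samples, is the usual choice.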

⚠️

Hyperparameter Tuning Pitfalls

pitfall

First: tuning on the test set inflates performance estimates, so tune using only cross-validation on the training data. Second: nested cross-validation is needed for an unbiased estimate when both model selection and hyperparameter tuning are applied; the inner loop selects hyperparameters and the outer loop estimates generalization error. Third: the "winner's curse". With 1000 random configurations, the best one will look better than it really is by random chance alone, so use a holdout set to verify the best configuration. Fourth: don't tune everything simultaneously; fix the learning rate first, then regularization, then architecture.
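Nested cross-validation, the second point above, is short to write in scikit-learn: pass the search object itself to cross_val_score. This is a minimal sketch; the dataset, the random forest, and the tiny grid are illustrative assumptions chosen to keep the run fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # picks hyperparameters
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # estimates generalization

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={'max_depth': [3, 6, None], 'min_samples_leaf': [1, 3]},
    cv=inner_cv, scoring='roc_auc')

# Each outer fold reruns the whole search on its own training part,
# then scores the refit winner on data the search never saw.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring='roc_auc')
print(f"Nested CV AUC: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
```

The nested score describes the whole tuning procedure rather than one fixed configuration, so different outer folds may select different hyperparameters.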

Overfitting to the validation set is real. With enough hyperparameter trials, you will find a configuration that accidentally scores well on your CV folds but generalizes poorly. Always do a final evaluation on a truly held-out test set.
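A small simulation makes the selection effect concrete. It assumes an extreme case chosen for clarity: all 1000 configurations have the same true score, and each CV estimate is that score plus independent Gaussian noise. The constants are hypothetical.

```python
import random

random.seed(0)

TRUE_SCORE = 0.80    # assumed true AUC, identical for every configuration
NOISE_STD = 0.02     # assumed noise of a single CV estimate
N_CONFIGS = 1000

# Noisy CV estimate for each configuration
cv_scores = [random.gauss(TRUE_SCORE, NOISE_STD) for _ in range(N_CONFIGS)]

best_cv = max(cv_scores)              # what the search reports as its best score
optimism = best_cv - TRUE_SCORE       # gap created purely by picking the maximum

# A fresh, independent evaluation of the "winner" is centred on the true score again
holdout_score = random.gauss(TRUE_SCORE, NOISE_STD)

print(f"best CV score:  {best_cv:.4f}")
print(f"true score:     {TRUE_SCORE:.4f}  (optimism: {optimism:+.4f})")
print(f"holdout score:  {holdout_score:.4f}")
```

No configuration is actually better than another here, yet the reported best score sits well above the truth; more trials widen that gap, which is why the final evaluation has to use data that played no part in the search.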
