Hyperparameter Tuning
"Automating the art of finding the right knobs to turn"
Grid Search (exhaustive), Random Search (surprisingly effective), Bayesian Optimisation (TPE/GP-based sequential search), Successive Halving, and Optuna, with an interactive accuracy heatmap showing the C × max_depth search space.
Key Formulas
Grid Search
Exhaustive search over all combinations in the predefined grid
Successive Halving
Progressively eliminate poor candidates, allocating more resources to promising ones
Expected Improvement
Bayesian Optimisation acquisition function that trades off exploration against exploitation
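In standard notation, these three ideas are commonly written as below. This is a sketch under stated assumptions: the symbols are chosen here for illustration (d hyperparameters with n_i candidate values each, k cross-validation folds, halving factor η, surrogate mean μ and standard deviation σ, best score so far f*), and EI is given in its usual closed form for a Gaussian surrogate when maximising.

```latex
% Grid search: number of model fits for d hyperparameters and k CV folds
N_{\text{fits}} = k \prod_{i=1}^{d} n_i

% Successive halving with factor \eta:
% candidates n_r and per-candidate resources r_r in round r
n_{r+1} = \left\lfloor n_r / \eta \right\rfloor, \qquad r_{r+1} = \eta \, r_r

% Expected Improvement (maximisation) under a Gaussian surrogate
\mathrm{EI}(\theta) = \mathbb{E}\left[\max\bigl(f(\theta) - f^{*},\, 0\bigr)\right]
  = \bigl(\mu(\theta) - f^{*}\bigr)\,\Phi(z) + \sigma(\theta)\,\phi(z),
\qquad z = \frac{\mu(\theta) - f^{*}}{\sigma(\theta)}
```

Here Φ and φ are the standard normal CDF and PDF. The first EI term rewards a high predicted mean (exploitation) and the second rewards high uncertainty (exploration).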
Why Hyperparameters Matter
A random forest with max_depth=5 might score 0.72 AUC. The same algorithm with max_depth=12, min_samples_leaf=3, max_features='sqrt' scores 0.89 AUC. That 17-point gap is pure hyperparameter tuning: the algorithm didn't change, the data didn't change. Hyperparameters are parameters that are not learned from data; they control the learning process itself. Choosing them well is often the difference between a mediocre model and a production-ready one.
The learning rate is the single most important hyperparameter in most gradient-based models. Too high = divergence. Too low = slow convergence or local minima. Always tune it first.
Grid vs Random vs Bayesian
Grid Search evaluates every combination in the Cartesian product of parameter values: correct but exponentially expensive (10 parameters × 5 values each = 5¹⁰ ≈ 10M evaluations). Random Search samples n_iter random combinations. It is surprisingly effective because most hyperparameter spaces have only a few dimensions that truly matter, and random sampling tries more distinct values along those dimensions than a grid does. Bayesian Optimization maintains a probabilistic model of the objective surface (a Gaussian Process or Tree-structured Parzen Estimator) and sequentially suggests configurations that maximize expected improvement; it learns from previous evaluations and focuses on promising regions.
Random Search with n_iter=60 often matches or outperforms a Grid Search that uses 5× more evaluations. Bayesian Optimization tends to outperform both when evaluations are expensive (e.g., training a large neural net).
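The arithmetic behind these claims can be checked in a few lines. This is a minimal sketch, and the helper names `grid_size` and `prob_hit_top_fraction` are made up for illustration. The coverage estimate assumes each random trial is drawn independently and uniformly, and that the "good" region occupies a fixed fraction (here 5%) of the search space.

```python
def grid_size(values_per_param):
    """Number of combinations in a full grid (Cartesian product)."""
    size = 1
    for n_values in values_per_param:
        size *= n_values
    return size


def prob_hit_top_fraction(n_trials, top_fraction=0.05):
    """Chance that at least one of n_trials independent uniform samples
    lands in the best `top_fraction` of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_trials


# 10 hyperparameters with 5 candidate values each
big_grid = grid_size([5] * 10)

# Probability that 60 random trials include a top-5% configuration
coverage_60 = prob_hit_top_fraction(60)

print(f"Grid combinations: {big_grid:,}")
print(f"P(top 5% found in 60 random trials): {coverage_60:.3f}")
```

Under those assumptions, 60 random trials find a top-5% configuration about 95% of the time, regardless of how many hyperparameters there are. That is one common justification for n_iter=60 as a starting budget.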
Bayesian Optimisation Loop
Fit a surrogate model (Gaussian Process) to previous (θ, score) observations
Use acquisition function (Expected Improvement, UCB) to select next θ
EI: explore where uncertainty is high OR where expected gain is high
Evaluate the actual objective: train model with θ, compute CV score
Add new observation to dataset, refit surrogate
Repeat until budget exhausted, then return the best θ found
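The loop above can be sketched end to end on a one-dimensional toy problem. This is an illustrative sketch only, not how Optuna or any other library implements it. The Gaussian Process is a bare NumPy version with a fixed RBF kernel, `toy_objective` is a hypothetical stand-in for "train a model with θ and return its CV score", and the length scale, jitter, and exploration margin `xi` are arbitrary assumptions.

```python
import math
import numpy as np


def rbf_kernel(a, b, length_scale=0.15):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and std of a unit-variance GP fitted to centred targets."""
    y_mean = y_obs.mean()
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_new)
    mu = y_mean + K_s.T @ np.linalg.solve(K, y_obs - y_mean)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mu, sigma, best, xi=0.01):
    """EI for maximisation: large where the mean is high or uncertainty is high."""
    gain = mu - best - xi
    z = gain / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return gain * cdf + sigma * pdf


def toy_objective(theta):
    """Hypothetical 'CV score' with its maximum (0.9) at theta = 0.3."""
    return 0.9 - (theta - 0.3) ** 2


def bayes_opt(objective, n_init=3, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    candidates = np.linspace(0.0, 1.0, 201)       # search space for theta
    x_obs = rng.uniform(0.0, 1.0, size=n_init)    # a few random starting points
    y_obs = np.array([objective(x) for x in x_obs])
    for _ in range(n_iter):
        mu, sigma = gp_posterior(x_obs, y_obs, candidates)   # fit surrogate
        ei = expected_improvement(mu, sigma, y_obs.max())    # acquisition
        theta_next = candidates[np.argmax(ei)]               # select next theta
        x_obs = np.append(x_obs, theta_next)                 # evaluate and
        y_obs = np.append(y_obs, objective(theta_next))      # add observation
    best = np.argmax(y_obs)
    return x_obs[best], y_obs[best]


best_theta, best_score = bayes_opt(toy_objective)
print(f"best theta = {best_theta:.3f}, score = {best_score:.4f}")
```

In practice the objective is expensive and the search space has several dimensions, which is where a maintained library is the better choice; the sketch only shows how the surrogate and the acquisition function interact.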
Halving Search: Speed Without Sacrifice
HalvingGridSearchCV and HalvingRandomSearchCV implement successive halving: start with all candidates but minimal resources (few training samples or estimators), keep the top 1/η fraction, multiply the resources by η, and repeat. With η = 2 (scikit-learn's default is factor=3), a grid of 1024 candidates run for 4 halving rounds costs 1024×1 + 512×2 + 256×4 + 128×8 = 4096 resource units, versus 1024 × the full resource budget for standard GridSearchCV. For large grids this can give a 10–100× speedup, usually with little loss in quality.
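The resource arithmetic above is easy to reproduce. The helper below is a hypothetical illustration, not part of scikit-learn; it assumes the candidate count is divided by the factor and the per-candidate resources are multiplied by the same factor each round.

```python
def halving_schedule(n_candidates, n_rounds, factor=2, min_resources=1):
    """Return a list of (candidates, resources_per_candidate) per round."""
    schedule = []
    resources = min_resources
    for _ in range(n_rounds):
        schedule.append((n_candidates, resources))
        n_candidates = max(1, n_candidates // factor)  # keep the top 1/factor
        resources *= factor                            # give survivors more
    return schedule


def total_cost(schedule):
    """Total resource units spent across all rounds."""
    return sum(n * r for n, r in schedule)


# The example from the text: 1024 candidates, 4 rounds, factor 2
four_rounds = halving_schedule(1024, 4)
cost_four = total_cost(four_rounds)

# Running until a single candidate survives takes 11 rounds
full_run = halving_schedule(1024, 11)
cost_full = total_cost(full_run)
max_resources = full_run[-1][1]          # resources given to the last survivor
cost_plain_grid = 1024 * max_resources   # every candidate at full resources

print(four_rounds, cost_four)
print(f"halving: {cost_full}, plain grid: {cost_plain_grid}, "
      f"speedup: {cost_plain_grid / cost_full:.0f}x")
```

Run to a single survivor, each round costs the same 1024 units, so the total grows with the number of rounds rather than with candidates × full budget. In this idealised count that is roughly a 93× saving, which is where speedups in the 10–100× range come from.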
For neural networks, use Keras Tuner or Optuna rather than sklearn's search: they support asynchronous parallel trials, early stopping integration, and neural-specific search spaces.
All Three Methods in scikit-learn
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from scipy.stats import uniform, randint
import optuna  # for Bayesian optimisation (TPE sampler by default)

# ---- Sample data ----
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ---- Parameter space (3 x 4 x 4 x 3 x 3 = 432 combinations) ----
param_grid = {
    'n_estimators': [100, 200, 400],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.7, 0.8, 1.0],
    'min_samples_leaf': [1, 3, 5],
}

# ---- 1. Grid Search (exhaustive, expensive: 432 x 5 folds = 2160 fits) ----
gs = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid,
                  cv=5, scoring='roc_auc', n_jobs=-1)
gs.fit(X_train, y_train)
print(f"Grid best: {gs.best_score_:.4f} {gs.best_params_}")

# ---- 2. Random Search (fast, almost as good) ----
# scipy's uniform(loc, scale) samples from [loc, loc + scale]
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(2, 12),
    'learning_rate': uniform(0.005, 0.3),   # [0.005, 0.305]
    'subsample': uniform(0.6, 0.4),         # [0.6, 1.0]
}
rs = RandomizedSearchCV(GradientBoostingClassifier(random_state=42), param_dist,
                        n_iter=60, cv=5, scoring='roc_auc',
                        n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
print(f"Random best: {rs.best_score_:.4f} {rs.best_params_}")

# ---- 3. Optuna (Bayesian, best quality) ----
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 12),
        # log=True samples the learning rate on a log scale
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    return cross_val_score(model, X_train, y_train,
                           cv=3, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, n_jobs=4)
print(f"Optuna best: {study.best_value:.4f} {study.best_params}")

# ---- Final check: refit the winner and score it once on the held-out test set ----
best_model = GradientBoostingClassifier(**study.best_params, random_state=42)
best_model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(f"Held-out test AUC: {test_auc:.4f}")
Hyperparameter Tuning Pitfalls
First: tuning on the test set inflates performance estimates, so always tune using only cross-validation on training data. Second: nested cross-validation is needed for an unbiased estimate when hyperparameter tuning is part of model selection; the outer loop estimates generalization error, the inner loop selects hyperparameters. Third: the 'winner's curse'. With 1000 random configurations, the best CV score is optimistic by chance alone, so use a holdout set to verify the best configuration. Fourth: don't tune everything simultaneously; fix the learning rate first, then regularization, then architecture.
Overfitting to the validation set is real. With enough hyperparameter trials, you will find a configuration that accidentally scores well on your CV folds but generalizes poorly. Always do a final evaluation on a truly held-out test set.
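Nested cross-validation, as described above, can be sketched by passing a search object to cross_val_score. This is a minimal example under assumptions: the synthetic dataset, the random forest, the small parameter grid, and the fold counts are all chosen for illustration and speed, not taken from a real project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data (assumption: stands in for a real training set)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Inner loop selects hyperparameters; outer loop estimates generalization error
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

param_grid = {"max_depth": [3, 6, None], "min_samples_leaf": [1, 3]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid,
    cv=inner_cv,
    scoring="roc_auc",
)

# Each outer fold reruns the whole search on its own training split,
# so the outer test split never influences hyperparameter selection.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"Nested CV AUC: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

The nested score estimates how well the whole tuning procedure generalizes, not one particular configuration. To get the final model, refit the search on all of the training data afterwards.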