Grid Search Hyperparameter Tuning

This example demonstrates grid search for finding optimal regularization hyperparameters using cross-validation with Ridge, Lasso, and ElasticNet regression.

Overview

Grid search is a systematic way to find the best hyperparameters by:

  1. Defining a grid of candidate values
  2. Evaluating each combination using cross-validation
  3. Selecting parameters that maximize CV score
  4. Retraining the final model with optimal parameters

Running the Example

cargo run --example grid_search_tuning

Key Concepts

Problem: Default hyperparameters are rarely optimal for your specific dataset

Solution: Systematically search the parameter space to find the best values

Benefits:

  • Automated hyperparameter optimization
  • Cross-validation guards against overfitting to a single validation split
  • Reproducible model selection
  • Better generalization performance

Grid Search Process

  1. Define parameter grid: Range of values to try
  2. K-Fold CV: Split training data into K folds
  3. Evaluate: Train model on K-1 folds, validate on remaining fold
  4. Average scores: Mean performance across all K folds
  5. Select best: Parameters with highest CV score
  6. Final model: Retrain on all training data with best parameters
  7. Test: Evaluate on held-out test set
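
These steps map directly onto the API used in this example (see Implementation Details below). A minimal end-to-end sketch, assuming x_train, y_train, x_test, and y_test already exist from a train/test split (the Ridge import path is an assumption here):

use aprender::model_selection::{grid_search_alpha, KFold};
// NOTE: the Ridge import path below is assumed for this sketch.
use aprender::linear_model::Ridge;

// 1. Define the parameter grid (log scale).
let alphas = vec![0.001, 0.01, 0.1, 1.0, 10.0, 100.0];

// 2. Set up K-fold cross-validation on the training data.
let kfold = KFold::new(5).with_random_state(42);

// 3-5. Evaluate each alpha with CV and keep the best one.
let result = grid_search_alpha("ridge", &alphas, &x_train, &y_train, &kfold, None).unwrap();

// 6. Retrain on the full training set with the best alpha.
let mut model = Ridge::new(result.best_alpha);
model.fit(&x_train, &y_train).unwrap();

// 7. Final evaluation on the held-out test set.
println!("Best α: {}, test R²: {:.4}", result.best_alpha, model.score(&x_test, &y_test));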

Examples Demonstrated

Example 1: Ridge Regression Alpha Tuning

Shows grid search for Ridge regression regularization strength (alpha):

Alpha Grid: [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

Cross-Validation Scores:
  α=0.001   → R²=0.9510
  α=0.010   → R²=0.9510
  α=0.100   → R²=0.9510  ← Best
  α=1.000   → R²=0.9508
  α=10.000  → R²=0.9428
  α=100.000 → R²=0.8920

Best Parameters: α=0.100, CV Score=0.9510
Test Performance: R²=0.9626

Observation: Performance degrades with very large alpha (underfitting).
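
The per-alpha table above can be reproduced from the GridSearchResult returned by grid_search_alpha, which records every alpha tried and its CV score (see Implementation Details). A minimal sketch:

// `result` is the GridSearchResult returned by grid_search_alpha for the Ridge search.
for (alpha, score) in result.alphas.iter().zip(result.scores.iter()) {
    let marker = if *alpha == result.best_alpha { "  ← Best" } else { "" };
    println!("α={:<8.3} → R²={:.4}{}", alpha, score, marker);
}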

Example 2: Lasso Regression Alpha Tuning

Demonstrates grid search for Lasso with feature selection:

Alpha Grid: [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]

Best Parameters: α=1.0000
Test Performance: R²=0.9628
Non-zero coefficients: 5/5 (no coefficients were driven to zero at this alpha on this dataset)

Key Feature: Lasso performs automatic feature selection by driving some coefficients to exactly zero.
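
A rough sketch of how you might check this after tuning; both Lasso::new and the coefficients() accessor are hypothetical names used for illustration and may not match aprender's actual API:

// Hypothetical sketch: count surviving features after fitting Lasso at the tuned alpha.
// `Lasso::new` is assumed to mirror `Ridge::new`; `coefficients()` is a hypothetical accessor.
let mut lasso = Lasso::new(result.best_alpha);
lasso.fit(&x_train, &y_train).unwrap();

let coefs = lasso.coefficients();
let nonzero = coefs.iter().filter(|c| c.abs() > 1e-8).count();
println!("Non-zero coefficients: {}/{}", nonzero, coefs.len());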

Alpha guidelines:

  • Too small: Overfitting (no regularization)
  • Optimal: Balance between fit and complexity
  • Too large: Underfitting (excessive regularization)

Example 3: ElasticNet with L1 Ratio Tuning

Shows 2D grid search over both alpha and l1_ratio:

Searching over:
  α: [0.001, 0.01, 0.1, 1.0, 10.0]
  l1_ratio: [0.25, 0.5, 0.75]

Best Parameters:
  α=1.000, l1_ratio=0.75
  CV Score: 0.9511

l1_ratio Parameter:

  • 0.0: Pure Ridge (L2 only)
  • 0.5: Equal mix of Lasso and Ridge
  • 1.0: Pure Lasso (L1 only)

When to use ElasticNet:

  • Many correlated features (Ridge component)
  • Want feature selection (Lasso component)
  • Best of both regularization types
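
A sketch of the 2D search, running one alpha sweep per l1_ratio via grid_search_alpha; the "elastic_net" model-type string is an assumption for illustration:

let alphas = vec![0.001, 0.01, 0.1, 1.0, 10.0];
let l1_ratios = vec![0.25_f32, 0.5, 0.75];
let kfold = KFold::new(5).with_random_state(42);

// Track the best (score, alpha, l1_ratio) seen across the 2D grid.
let mut best = (f32::NEG_INFINITY, 0.0_f32, 0.0_f32);
for &l1_ratio in &l1_ratios {
    // One 1D alpha search per l1_ratio; "elastic_net" is an assumed model-type string.
    let result = grid_search_alpha("elastic_net", &alphas, &x_train, &y_train, &kfold, Some(l1_ratio)).unwrap();
    if result.best_score > best.0 {
        best = (result.best_score, result.best_alpha, l1_ratio);
    }
}
println!("Best: α={}, l1_ratio={}, CV score={:.4}", best.1, best.2, best.0);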

Example 4: Visualizing Alpha vs Score

Compares Ridge and Lasso performance curves:

     Alpha      Ridge R²      Lasso R²
----------------------------------------
    0.0001        0.9510        0.9510
    0.0010        0.9510        0.9510
    0.0100        0.9510        0.9510
    0.1000        0.9510        0.9510
    1.0000        0.9508        0.9511
   10.0000        0.9428        0.9480
  100.0000        0.8920        0.8998

Observations:

  • Plateau region: Performance stable across small alphas
  • Ridge: Gradual degradation with large alpha
  • Lasso: Sharper drop after optimal point
  • Both: Performance collapses with excessive regularization
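
The comparison table can be produced by running the same alpha grid through grid_search_alpha once per model; the "lasso" model-type string is assumed here by analogy with "ridge":

let alphas = vec![0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0];
let kfold = KFold::new(5).with_random_state(42);

let ridge = grid_search_alpha("ridge", &alphas, &x_train, &y_train, &kfold, None).unwrap();
let lasso = grid_search_alpha("lasso", &alphas, &x_train, &y_train, &kfold, None).unwrap();

// `scores` is parallel to `alphas`, so the two results can be printed side by side.
println!("{:>10} {:>12} {:>12}", "Alpha", "Ridge R²", "Lasso R²");
for i in 0..alphas.len() {
    println!("{:>10.4} {:>12.4} {:>12.4}", alphas[i], ridge.scores[i], lasso.scores[i]);
}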

Example 5: Default vs Optimized Comparison

Demonstrates value of hyperparameter tuning:

Ridge Regression Comparison:

Default (α=1.0):
  Test R²: 0.9628

Grid Search Optimized (α=0.100):
  CV R²:   0.9510
  Test R²: 0.9626

→ Similar performance in this case: grid search confirms that the default alpha is already near-optimal for this dataset

Interpretation:

  • When default is good: Data well-suited to default parameters
  • When improvement significant: Dataset-specific tuning helps
  • Always worth checking: Small cost, potentially large benefit
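
A minimal sketch of this comparison, fitting Ridge at the default α=1.0 and at the grid-search optimum, then scoring both on the same held-out test set:

// Baseline: Ridge with the default regularization strength.
let mut default_model = Ridge::new(1.0);
default_model.fit(&x_train, &y_train).unwrap();

// Tuned: Ridge with the alpha chosen by grid search (`result` from grid_search_alpha).
let mut tuned_model = Ridge::new(result.best_alpha);
tuned_model.fit(&x_train, &y_train).unwrap();

println!("Default   (α=1.0)  test R² = {:.4}", default_model.score(&x_test, &y_test));
println!("Optimized (α={})   test R² = {:.4}", result.best_alpha, tuned_model.score(&x_test, &y_test));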

Implementation Details

Using grid_search_alpha()

use aprender::model_selection::{grid_search_alpha, KFold};

// Define parameter grid
let alphas = vec![0.001, 0.01, 0.1, 1.0, 10.0];

// Setup cross-validation
let kfold = KFold::new(5).with_random_state(42);

// Run grid search
let result = grid_search_alpha(
    "ridge",        // Model type
    &alphas,        // Parameter grid
    &x_train,       // Training features
    &y_train,       // Training targets
    &kfold,         // CV strategy
    None,           // l1_ratio (ElasticNet only)
).unwrap();

// Get best parameters
println!("Best alpha: {}", result.best_alpha);
println!("Best CV score: {}", result.best_score);

// Train final model
let mut model = Ridge::new(result.best_alpha);
model.fit(&x_train, &y_train).unwrap();

GridSearchResult Structure

pub struct GridSearchResult {
    pub best_alpha: f32,       // Optimal alpha value
    pub best_score: f32,       // Best CV score
    pub alphas: Vec<f32>,      // All alphas tried
    pub scores: Vec<f32>,      // Corresponding scores
}

Methods:

  • best_index(): Index of best alpha in grid
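
For example, best_index() can be used to look up the winning entry in the parallel alphas and scores vectors (assuming it returns an index into them):

let i = result.best_index();
println!("Best α {} scored R²={:.4}", result.alphas[i], result.scores[i]);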

Best Practices

1. Define Appropriate Grid

// ✅ Good: Log-scale grid
let alphas = vec![0.001, 0.01, 0.1, 1.0, 10.0, 100.0];

// ❌ Bad: Linear grid missing optimal region
let alphas = vec![1.0, 2.0, 3.0, 4.0, 5.0];

Guideline: Use log-scale for regularization parameters.

2. Sufficient K-Folds

// ✅ Good: 5-10 folds typical
let kfold = KFold::new(5).with_random_state(42);

// ❌ Bad: Too few folds (unreliable estimates)
let kfold = KFold::new(2);

3. Evaluate on Test Set

// ✅ Correct workflow
let (x_train, x_test, y_train, y_test) = train_test_split(...);
let result = grid_search_alpha(..., &x_train, &y_train, ...);
let mut model = Ridge::new(result.best_alpha);
model.fit(&x_train, &y_train).unwrap();
let test_score = model.score(&x_test, &y_test); // Final evaluation

// ❌ Incorrect: Using CV score as final metric
println!("Final performance: {}", result.best_score); // Wrong!

4. Use Random State for Reproducibility

let kfold = KFold::new(5).with_random_state(42);
// Same results every run

Choosing Alpha Ranges

Ridge Regression

  • Start: [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
  • Refine: Zoom in on best region
  • Typical optimal: 0.1 - 10.0

Lasso Regression

  • Start: [0.0001, 0.001, 0.01, 0.1, 1.0]
  • Note: Usually needs smaller alphas than Ridge
  • Typical optimal: 0.001 - 1.0

ElasticNet

  • Alpha: Same as Ridge/Lasso
  • L1 ratio: [0.1, 0.3, 0.5, 0.7, 0.9] or [0.25, 0.5, 0.75]
  • Tip: Start with 3-5 l1_ratio values

Common Pitfalls

  1. Fitting grid search on all data: Always split train/test first
  2. Too fine grid: Computationally expensive, minimal benefit
  3. Ignoring CV variance: High variance suggests unstable model
  4. Overfitting to CV: Test set still needed for final validation
  5. Wrong scale: Linear grid misses optimal regions

Computational Cost

Formula: cost = n_alphas × n_folds × cost_per_fit

Example:

  • 6 alphas
  • 5 folds
  • Total fits: 6 × 5 = 30

Optimization:

  • Start with coarse grid
  • Refine around best region
  • Use fewer folds for very large datasets
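
A sketch of the coarse-then-fine strategy, refining a log-scale grid around the best coarse alpha (the refinement multipliers are an illustrative choice):

let kfold = KFold::new(5).with_random_state(42);

// Pass 1: coarse log-scale grid (6 alphas × 5 folds = 30 fits).
let coarse = vec![0.001, 0.01, 0.1, 1.0, 10.0, 100.0];
let first = grid_search_alpha("ridge", &coarse, &x_train, &y_train, &kfold, None).unwrap();

// Pass 2: finer grid centered on the coarse winner (illustrative multipliers).
let fine: Vec<f32> = [0.25, 0.5, 1.0, 2.0, 4.0]
    .iter()
    .map(|m| first.best_alpha * m)
    .collect();
let second = grid_search_alpha("ridge", &fine, &x_train, &y_train, &kfold, None).unwrap();

println!("Refined α: {}, CV score: {:.4}", second.best_alpha, second.best_score);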

Key Takeaways

  1. Grid search automates hyperparameter optimization
  2. Cross-validation estimates generalization performance without touching the test set
  3. Log-scale grids work best for regularization parameters
  4. Ridge degrades gradually with large alpha; Lasso is more sensitive to the choice of alpha
  5. ElasticNet offers 2D tuning flexibility
  6. Always validate final model on held-out test set
  7. Reproducibility: Use random_state for consistent results
  8. Computational cost scales with grid size and K-folds