Regularization Theory

Chapter Status: ✅ 100% Working (All examples verified)

Status              Count   Examples
✅ Working          3       Ridge, Lasso, ElasticNet verified
⏳ In Progress      0       -
⬜ Not Implemented  0       -

Last tested: 2025-11-19
Aprender version: 0.3.0
Test file: src/linear_model/mod.rs tests


Overview

Regularization prevents overfitting by adding a penalty for model complexity. Instead of just minimizing prediction error, we balance error against coefficient magnitude.

Key Techniques:

  • Ridge (L2): Shrinks all coefficients smoothly
  • Lasso (L1): Produces sparse models (some coefficients = 0)
  • ElasticNet: Combines L1 and L2 (best of both)

Why This Matters: "With great flexibility comes great responsibility." Complex models can memorize noise. Regularization keeps models honest by penalizing complexity.


Mathematical Foundation

The Regularization Principle

Ordinary Least Squares (OLS):

minimize: ||y - Xβ||²

Regularized Regression:

minimize: ||y - Xβ||² + penalty(β)

The penalty term controls model complexity. Different penalties → different behaviors.
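
To make the template concrete, here is a small plain-Rust sketch (independent of aprender, with illustrative numbers) that evaluates loss + penalty for the same residuals and coefficients under two different penalties:

// Shared template: loss(β) + penalty(β). Swapping the penalty closure is the
// only difference between Ridge, Lasso, and ElasticNet.
fn objective(residuals: &[f64], beta: &[f64], penalty: impl Fn(&[f64]) -> f64) -> f64 {
    let rss: f64 = residuals.iter().map(|r| r * r).sum(); // ||y - Xβ||²
    rss + penalty(beta)
}

fn main() {
    let residuals = [0.5, -0.3, 0.2];
    let beta = [2.0, -1.0];
    let alpha = 1.0;
    let ridge = |b: &[f64]| alpha * b.iter().map(|v| v * v).sum::<f64>();
    let lasso = |b: &[f64]| alpha * b.iter().map(|v| v.abs()).sum::<f64>();
    println!("Ridge objective: {:.2}", objective(&residuals, &beta, ridge)); // 0.38 + 5.0
    println!("Lasso objective: {:.2}", objective(&residuals, &beta, lasso)); // 0.38 + 3.0
}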

Ridge Regression (L2 Regularization)

Objective Function:

minimize: ||y - Xβ||² + α||β||²₂

where:
||β||²₂ = β₁² + β₂² + ... + βₚ²  (sum of squared coefficients)
α ≥ 0 (regularization strength)

Closed-Form Solution:

β_ridge = (X^T X + αI)^(-1) X^T y

Key Properties:

  • Shrinkage: All coefficients shrink toward zero (but never reach exactly zero)
  • Stability: Adding αI to diagonal makes matrix invertible even when X^T X is singular
  • Smooth: Differentiable everywhere (good for gradient descent)
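
For intuition, the closed form collapses to a single division in the one-feature, no-intercept case: β = Σxᵢyᵢ / (Σxᵢ² + α). A minimal plain-Rust sketch (independent of aprender) shows the shrinkage as α grows:

// Scalar ridge solution for one feature, no intercept: β = (xᵀy) / (xᵀx + α)
fn ridge_1d(x: &[f64], y: &[f64], alpha: f64) -> f64 {
    let xty: f64 = x.iter().zip(y).map(|(xi, yi)| xi * yi).sum();
    let xtx: f64 = x.iter().map(|xi| xi * xi).sum();
    xty / (xtx + alpha)
}

fn main() {
    let x = [1.0, 2.0, 3.0, 4.0];
    let y = [2.0, 4.0, 6.0, 8.0]; // exact relationship y = 2x
    for alpha in [0.0, 1.0, 10.0, 100.0] {
        // α = 0 recovers OLS (β = 2.0); larger α shrinks β toward 0
        println!("alpha = {alpha:>5}: beta = {:.3}", ridge_1d(&x, &y, alpha));
    }
}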

Lasso Regression (L1 Regularization)

Objective Function:

minimize: ||y - Xβ||² + α||β||₁

where:
||β||₁ = |β₁| + |β₂| + ... + |βₚ|  (sum of absolute values)

No Closed-Form Solution: Requires iterative optimization (coordinate descent)

Key Properties:

  • Sparsity: Forces some coefficients to exactly zero (feature selection)
  • Non-differentiable: At β = 0, requires special optimization
  • Variable selection: Automatically selects important features
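
Coordinate descent handles the non-differentiable point with the soft-thresholding operator S(z, γ) = sign(z) · max(|z| - γ, 0); it is exactly the max(|z| - γ, 0) part that pins small coefficients to zero. A minimal plain-Rust sketch of the operator (not aprender's internal code):

// Soft-thresholding: S(z, γ) = sign(z) · max(|z| - γ, 0)
fn soft_threshold(z: f64, gamma: f64) -> f64 {
    if z > gamma {
        z - gamma
    } else if z < -gamma {
        z + gamma
    } else {
        0.0 // anything inside [-γ, γ] is set exactly to zero: this is the sparsity
    }
}

fn main() {
    for z in [-3.0, -0.5, 0.0, 0.4, 2.5] {
        println!("S({z:>4}, 1.0) = {:.1}", soft_threshold(z, 1.0));
    }
}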

ElasticNet (L1 + L2)

Objective Function:

minimize: ||y - Xβ||² + α[ρ||β||₁ + (1-ρ)||β||²₂]

where:
ρ ∈ [0, 1] (L1 ratio)
ρ = 1 → Pure Lasso
ρ = 0 → Pure Ridge

Key Properties:

  • Best of both: Sparsity (L1) + stability (L2)
  • Grouped selection: Tends to select/drop correlated features together
  • Two hyperparameters: α (overall strength), ρ (L1/L2 mix)
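
A small plain-Rust sketch of the penalty term alone (matching the objective above, with illustrative values) shows how ρ interpolates between the two norms:

// penalty(β) = α · (ρ·||β||₁ + (1-ρ)·||β||²₂)
fn elastic_net_penalty(beta: &[f64], alpha: f64, rho: f64) -> f64 {
    let l1: f64 = beta.iter().map(|b| b.abs()).sum();
    let l2_sq: f64 = beta.iter().map(|b| b * b).sum();
    alpha * (rho * l1 + (1.0 - rho) * l2_sq)
}

fn main() {
    let beta = [1.5, -0.5, 0.0];
    println!("pure Lasso (ρ = 1.0): {:.2}", elastic_net_penalty(&beta, 1.0, 1.0)); // ||β||₁ = 2.0
    println!("pure Ridge (ρ = 0.0): {:.2}", elastic_net_penalty(&beta, 1.0, 0.0)); // ||β||²₂ = 2.5
    println!("balanced   (ρ = 0.5): {:.2}", elastic_net_penalty(&beta, 1.0, 0.5)); // 0.5·2.0 + 0.5·2.5 = 2.25
}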

Implementation in Aprender

Example 1: Ridge Regression

use aprender::linear_model::Ridge;
use aprender::primitives::{Matrix, Vector};
use aprender::traits::Estimator;

// Training data
let x = Matrix::from_vec(5, 2, vec![
    1.0, 1.0,
    2.0, 1.0,
    3.0, 2.0,
    4.0, 3.0,
    5.0, 4.0,
]).unwrap();
let y = Vector::from_vec(vec![2.0, 3.0, 4.0, 5.0, 6.0]);

// Ridge with α = 1.0
let mut model = Ridge::new(1.0);
model.fit(&x, &y).unwrap();

let predictions = model.predict(&x);
let r2 = model.score(&x, &y);
println!("R² = {:.3}", r2); // e.g., 0.985

// Coefficients are shrunk compared to OLS
let coef = model.coefficients();
println!("Coefficients: {:?}", coef); // Smaller than OLS

Test Reference: src/linear_model/mod.rs::tests::test_ridge_simple_regression
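
To see the shrinkage property directly, the same data can be refit with increasing α. This sketch reuses the Ridge API from Example 1; the exact coefficient values depend on the solver, so none are shown:

use aprender::linear_model::Ridge;
use aprender::primitives::{Matrix, Vector};
use aprender::traits::Estimator;

let x = Matrix::from_vec(5, 2, vec![
    1.0, 1.0,
    2.0, 1.0,
    3.0, 2.0,
    4.0, 3.0,
    5.0, 4.0,
]).unwrap();
let y = Vector::from_vec(vec![2.0, 3.0, 4.0, 5.0, 6.0]);

// Refit with increasing regularization strength: coefficients shrink toward 0
for alpha in [0.1, 1.0, 10.0, 100.0] {
    let mut model = Ridge::new(alpha);
    model.fit(&x, &y).unwrap();
    println!("alpha = {alpha}: coefficients = {:?}", model.coefficients());
}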

Example 2: Lasso Regression (Sparsity)

use aprender::linear_model::Lasso;

// Five samples, three features: column 1 carries the signal,
// column 2 is weak, column 3 is noise
let x = Matrix::from_vec(5, 3, vec![
    1.0, 0.1, 0.01,
    2.0, 0.2, 0.02,
    3.0, 0.1, 0.03,
    4.0, 0.3, 0.01,
    5.0, 0.2, 0.02,
]).unwrap();
let y = Vector::from_vec(vec![1.1, 2.2, 3.1, 4.3, 5.2]);

// Lasso with high α produces sparse model
let mut model = Lasso::new(0.5)
    .with_max_iter(1000)
    .with_tol(1e-4);

model.fit(&x, &y).unwrap();

let coef = model.coefficients();
// Some coefficients will be exactly 0.0 (sparsity!)
println!("Coefficients: {:?}", coef);
// e.g., [1.05, 0.0, 0.0] - only first feature selected

Test Reference: src/linear_model/mod.rs::tests::test_lasso_produces_sparsity

Example 3: ElasticNet (Combined)

use aprender::linear_model::ElasticNet;

let x = Matrix::from_vec(4, 2, vec![
    1.0, 2.0,
    2.0, 3.0,
    3.0, 4.0,
    4.0, 5.0,
]).unwrap();
let y = Vector::from_vec(vec![3.0, 5.0, 7.0, 9.0]);

// ElasticNet with α=1.0, l1_ratio=0.5 (50% L1, 50% L2)
let mut model = ElasticNet::new(1.0, 0.5)
    .with_max_iter(1000)
    .with_tol(1e-4);

model.fit(&x, &y).unwrap();

// Gets benefits of both: some sparsity + stability
let r2 = model.score(&x, &y);
println!("R² = {:.3}", r2);

Test Reference: src/linear_model/mod.rs::tests::test_elastic_net_simple


Choosing the Right Regularization

Decision Guide

Do you need feature selection (interpretability)?
├─ YES → Lasso or ElasticNet (L1 component)
└─ NO → Ridge (simpler, faster)

Are features highly correlated?
├─ YES → ElasticNet (avoids arbitrary selection)
└─ NO → Lasso (cleaner sparsity)

Is the problem well-conditioned?
├─ YES → All methods work
└─ NO (p > n, multicollinearity) → Ridge (always stable)

Do you want maximum simplicity?
├─ YES → Ridge (closed-form, one hyperparameter)
└─ NO → ElasticNet (two hyperparameters, more flexible)

Comparison Table

Method      Penalty    Sparsity  Stability  Speed               Use Case
Ridge       L2 (β²)    No        High       Fast (closed-form)  Multicollinearity, p > n
Lasso       L1 (|β|)   Yes       Moderate   Slower (iterative)  Feature selection
ElasticNet  L1 + L2    Yes       High       Slower (iterative)  Correlated features + selection

Hyperparameter Selection

The α Parameter

  • Too small (α → 0): No regularization → overfitting
  • Too large (α → ∞): Over-regularization → underfitting (all β → 0)
  • Just right: Balances the bias-variance trade-off

Finding optimal α: Use cross-validation (see Cross-Validation Theory)

// Typical workflow (pseudocode; cross_validate stands in for your cross-validation routine)
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0] {
    let mut model = Ridge::new(alpha);
    let cv_score = cross_validate(&mut model, &x, &y, 5); // 5-fold CV
    // Select the alpha with the best cv_score
}

The l1_ratio Parameter (ElasticNet)

  • l1_ratio = 1.0: Pure Lasso (maximum sparsity)
  • l1_ratio = 0.0: Pure Ridge (maximum stability)
  • l1_ratio = 0.5: Balanced (common choice)

Grid Search: Try multiple (α, l1_ratio) pairs, select best via CV
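
A minimal grid-search sketch using a single held-out validation split and the ElasticNet API from Example 3 (the toy data is included only to keep the snippet self-contained; in practice, replace the single split with k-fold cross-validation as described in Cross-Validation Theory):

use aprender::linear_model::ElasticNet;
use aprender::primitives::{Matrix, Vector};
use aprender::traits::Estimator;

// Toy split: y = x₁ + x₂ on both halves (illustrative only)
let x_train = Matrix::from_vec(4, 2, vec![
    1.0, 2.0,
    2.0, 3.0,
    3.0, 4.0,
    4.0, 5.0,
]).unwrap();
let y_train = Vector::from_vec(vec![3.0, 5.0, 7.0, 9.0]);
let x_val = Matrix::from_vec(2, 2, vec![
    5.0, 6.0,
    6.0, 7.0,
]).unwrap();
let y_val = Vector::from_vec(vec![11.0, 13.0]);

// Try every (α, l1_ratio) pair and keep the one with the best validation R²
let mut best_score = None;
let mut best_params = (0.0, 0.0);
for &alpha in &[0.01, 0.1, 1.0, 10.0] {
    for &l1_ratio in &[0.1, 0.5, 0.9] {
        let mut model = ElasticNet::new(alpha, l1_ratio)
            .with_max_iter(1000)
            .with_tol(1e-4);
        model.fit(&x_train, &y_train).unwrap();
        let score = model.score(&x_val, &y_val);
        if best_score.map_or(true, |s| score > s) {
            best_score = Some(score);
            best_params = (alpha, l1_ratio);
        }
    }
}
println!("best (alpha, l1_ratio) = {:?}, validation R² = {:?}", best_params, best_score);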


Practical Considerations

Feature Scaling is CRITICAL

Problem: Ridge and Lasso penalize coefficients by magnitude

  • Features on different scales → unequal penalization
  • Large-scale feature gets less penalty than small-scale feature

Solution: Always standardize features before regularization

use aprender::preprocessing::StandardScaler;

let mut scaler = StandardScaler::new();
scaler.fit(&x_train);
let x_train_scaled = scaler.transform(&x_train);
let x_test_scaled = scaler.transform(&x_test);

// Now fit regularized model on scaled data
let mut model = Ridge::new(1.0);
model.fit(&x_train_scaled, &y_train).unwrap();

Intercept Not Regularized

Neither Ridge nor Lasso penalizes the intercept. Why?

  • The intercept captures the baseline level of the target (with centered features it equals the mean of y)
  • Penalizing it would bias predictions
  • Implementation: Set fit_intercept=true (default)
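
In the notation used earlier, the penalty sums only over the slope coefficients; the intercept β₀ sits outside it:

minimize: ||y - β₀·1 - Xβ||² + α·penalty(β₁, ..., βₚ)    (β₀ is not penalized)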

Multicollinearity

Problem: When features are highly correlated, OLS becomes unstable
Ridge Solution: Adding αI to X^T X guarantees invertibility
Lasso Behavior: Arbitrarily picks one feature from a correlated group
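
A small sketch of the stability claim, reusing the Ridge API from Example 1: the second column below is an exact copy of the first, so X^T X is singular and OLS has no unique solution, while Ridge still fits because αI is added to the diagonal.

use aprender::linear_model::Ridge;
use aprender::primitives::{Matrix, Vector};
use aprender::traits::Estimator;

// Two perfectly correlated features: column 2 duplicates column 1,
// so X^T X is singular and OLS is not uniquely defined
let x = Matrix::from_vec(4, 2, vec![
    1.0, 1.0,
    2.0, 2.0,
    3.0, 3.0,
    4.0, 4.0,
]).unwrap();
let y = Vector::from_vec(vec![2.0, 4.0, 6.0, 8.0]);

// (X^T X + αI) is invertible for any α > 0, so the fit succeeds;
// the L2 penalty tends to spread weight across the duplicated columns
let mut model = Ridge::new(1.0);
model.fit(&x, &y).unwrap();
println!("Coefficients: {:?}", model.coefficients());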


Verification Through Tests

Regularization models have comprehensive test coverage:

Ridge Tests (14 tests):

  • Closed-form solution correctness
  • Coefficients shrink as α increases
  • α = 0 recovers OLS
  • Multivariate regression
  • Save/load serialization

Lasso Tests (12 tests):

  • Sparsity property (some coefficients = 0)
  • Coordinate descent convergence
  • Soft-thresholding operator
  • High α → all coefficients → 0

ElasticNet Tests (15 tests):

  • l1_ratio = 1.0 behaves like Lasso
  • l1_ratio = 0.0 behaves like Ridge
  • Mixed penalty balances sparsity and stability

Test Reference: src/linear_model/mod.rs tests


Real-World Application

When Ridge Outperforms OLS

Scenario: Predicting house prices with 20 correlated features (size, bedrooms, bathrooms, etc.)

OLS Problem: High variance estimates, unstable predictions
Ridge Solution: Shrinks correlated coefficients, reduces variance

Result: Lower test error despite higher bias

When Lasso Enables Interpretation

Scenario: Medical diagnosis with 1000 genetic markers, only ~10 relevant

Lasso Benefit: Selects sparse subset (e.g., 12 markers), rest → 0
Business Value: Cheaper tests (measure only 12 markers), interpretable model


Further Reading

Peer-Reviewed Papers

Tibshirani (1996) - Regression Shrinkage and Selection via the Lasso

  • Relevance: Original Lasso paper introducing L1 regularization
  • Link: JSTOR (publicly accessible)
  • Key Contribution: Proves Lasso produces sparse solutions
  • Applied in: src/linear_model/mod.rs Lasso implementation

Zou & Hastie (2005) - Regularization and Variable Selection via the Elastic Net

  • Relevance: Introduces ElasticNet combining L1 and L2
  • Link: JSTOR (publicly accessible)
  • Key Contribution: Solves Lasso's limitations with correlated features
  • Applied in: src/linear_model/mod.rs ElasticNet implementation

Summary

What You Learned:

  • ✅ Regularization = loss + penalty (bias-variance trade-off)
  • ✅ Ridge (L2): Shrinks all coefficients, closed-form, stable
  • ✅ Lasso (L1): Produces sparsity, feature selection, iterative
  • ✅ ElasticNet: Combines L1 + L2, best of both worlds
  • ✅ Feature scaling is MANDATORY for regularization
  • ✅ Hyperparameter tuning via cross-validation

Verification Guarantee: All regularization methods extensively tested (40+ tests) in src/linear_model/mod.rs. Tests verify mathematical properties (sparsity, shrinkage, equivalence).

Quick Reference:

  • Multicollinearity: Ridge
  • Feature selection: Lasso
  • Correlated features + selection: ElasticNet
  • Speed: Ridge (fastest)

Key Equation:

Ridge:      β = (X^T X + αI)^(-1) X^T y
Lasso:      minimize ||y - Xβ||² + α||β||₁
ElasticNet: minimize ||y - Xβ||² + α[ρ||β||₁ + (1-ρ)||β||²₂]

Next Chapter: Logistic Regression Theory

Previous Chapter: Linear Regression Theory