Regularization Theory
Chapter Status: ✅ 100% Working (All examples verified)
| Status | Count | Examples |
|---|---|---|
| ✅ Working | 3 | Ridge, Lasso, ElasticNet verified |
| ⏳ In Progress | 0 | - |
| ⬜ Not Implemented | 0 | - |
Last tested: 2025-11-19 · Aprender version: 0.3.0 · Test file: src/linear_model/mod.rs tests
Overview
Regularization prevents overfitting by adding a penalty for model complexity. Instead of just minimizing prediction error, we balance error against coefficient magnitude.
Key Techniques:
- Ridge (L2): Shrinks all coefficients smoothly
- Lasso (L1): Produces sparse models (some coefficients = 0)
- ElasticNet: Combines L1 and L2 (best of both)
Why This Matters: "With great flexibility comes great responsibility." Complex models can memorize noise. Regularization keeps models honest by penalizing complexity.
Mathematical Foundation
The Regularization Principle
Ordinary Least Squares (OLS):
minimize: ||y - Xβ||²
Regularized Regression:
minimize: ||y - Xβ||² + penalty(β)
The penalty term controls model complexity. Different penalties → different behaviors.
Ridge Regression (L2 Regularization)
Objective Function:
minimize: ||y - Xβ||² + α||β||²₂
where:
||β||²₂ = β₁² + β₂² + ... + βₚ² (sum of squared coefficients)
α ≥ 0 (regularization strength)
Closed-Form Solution:
β_ridge = (X^T X + αI)^(-1) X^T y
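This closed form follows from setting the gradient of the ridge objective to zero; a short derivation in the notation above:
∇β [ ||y - Xβ||² + α||β||²₂ ] = -2X^T(y - Xβ) + 2αβ = 0
⟹ (X^T X + αI)β = X^T y
⟹ β_ridge = (X^T X + αI)^(-1) X^T y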
Key Properties:
- Shrinkage: All coefficients shrink toward zero (but never reach exactly zero)
- Stability: Adding αI to diagonal makes matrix invertible even when X^T X is singular
- Smooth: Differentiable everywhere (good for gradient descent)
Lasso Regression (L1 Regularization)
Objective Function:
minimize: ||y - Xβ||² + α||β||₁
where:
||β||₁ = |β₁| + |β₂| + ... + |βₚ| (sum of absolute values)
No Closed-Form Solution: Requires iterative optimization (coordinate descent)
Key Properties:
- Sparsity: Forces some coefficients to exactly zero (feature selection)
- Non-differentiable: At β = 0, requires special optimization (see the soft-thresholding sketch after this list)
- Variable selection: Automatically selects important features
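Coordinate descent handles the non-differentiable L1 penalty through the soft-thresholding operator, which snaps small coefficients to exactly zero. Below is a minimal standalone sketch of that operator for illustration only; it is not the aprender internals.

```rust
/// Soft-thresholding: S(z, gamma) = sign(z) * max(|z| - gamma, 0).
/// Coordinate descent applies this to each coefficient update;
/// values inside [-gamma, gamma] become exactly 0.0, which is the source of sparsity.
fn soft_threshold(z: f64, gamma: f64) -> f64 {
    if z > gamma {
        z - gamma
    } else if z < -gamma {
        z + gamma
    } else {
        0.0
    }
}

fn main() {
    assert_eq!(soft_threshold(0.3, 0.5), 0.0);   // small coefficient -> exactly zero
    assert_eq!(soft_threshold(1.5, 0.5), 1.0);   // large coefficient -> shrunk by gamma
    assert_eq!(soft_threshold(-1.5, 0.5), -1.0); // sign is preserved
}
```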
ElasticNet (L1 + L2)
Objective Function:
minimize: ||y - Xβ||² + α[ρ||β||₁ + (1-ρ)||β||²₂]
where:
ρ ∈ [0, 1] (L1 ratio)
ρ = 1 → Pure Lasso
ρ = 0 → Pure Ridge
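Setting ρ = 1 or ρ = 0 recovers the pure penalties. The small helper below makes the mixing explicit; it is only an illustration of the penalty term above, not an aprender API.

```rust
/// Elastic net penalty: alpha * (rho * ||beta||_1 + (1 - rho) * ||beta||_2^2).
fn elastic_net_penalty(beta: &[f64], alpha: f64, rho: f64) -> f64 {
    let l1: f64 = beta.iter().map(|b| b.abs()).sum();
    let l2_sq: f64 = beta.iter().map(|b| b * b).sum();
    alpha * (rho * l1 + (1.0 - rho) * l2_sq)
}

fn main() {
    let beta = [0.5, -2.0, 0.0];
    println!("rho = 1.0 (pure Lasso): {}", elastic_net_penalty(&beta, 1.0, 1.0));
    println!("rho = 0.0 (pure Ridge): {}", elastic_net_penalty(&beta, 1.0, 0.0));
    println!("rho = 0.5 (balanced):   {}", elastic_net_penalty(&beta, 1.0, 0.5));
}
```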
Key Properties:
- Best of both: Sparsity (L1) + stability (L2)
- Grouped selection: Tends to select/drop correlated features together
- Two hyperparameters: α (overall strength), ρ (L1/L2 mix)
Implementation in Aprender
Example 1: Ridge Regression
```rust
use aprender::linear_model::Ridge;
use aprender::primitives::{Matrix, Vector};
use aprender::traits::Estimator;

// Training data
let x = Matrix::from_vec(5, 2, vec![
    1.0, 1.0,
    2.0, 1.0,
    3.0, 2.0,
    4.0, 3.0,
    5.0, 4.0,
]).unwrap();
let y = Vector::from_vec(vec![2.0, 3.0, 4.0, 5.0, 6.0]);

// Ridge with α = 1.0
let mut model = Ridge::new(1.0);
model.fit(&x, &y).unwrap();

let predictions = model.predict(&x);
let r2 = model.score(&x, &y);
println!("R² = {:.3}", r2); // e.g., 0.985

// Coefficients are shrunk compared to OLS
let coef = model.coefficients();
println!("Coefficients: {:?}", coef); // Smaller than OLS
```
Test Reference: src/linear_model/mod.rs::tests::test_ridge_simple_regression
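To see the shrinkage property directly, refit the same data with increasing α and watch the coefficients move toward zero. The sketch below reuses Example 1's data with illustrative α values.

```rust
use aprender::linear_model::Ridge;
use aprender::primitives::{Matrix, Vector};
use aprender::traits::Estimator;

fn main() {
    let x = Matrix::from_vec(5, 2, vec![
        1.0, 1.0,
        2.0, 1.0,
        3.0, 2.0,
        4.0, 3.0,
        5.0, 4.0,
    ]).unwrap();
    let y = Vector::from_vec(vec![2.0, 3.0, 4.0, 5.0, 6.0]);

    // Larger alpha -> stronger penalty -> smaller coefficients (alpha = 0 recovers OLS).
    for alpha in [0.0, 0.1, 1.0, 10.0] {
        let mut model = Ridge::new(alpha);
        model.fit(&x, &y).unwrap();
        println!("alpha = {}: coefficients = {:?}", alpha, model.coefficients());
    }
}
```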
Example 2: Lasso Regression (Sparsity)
```rust
use aprender::linear_model::Lasso;
use aprender::primitives::{Matrix, Vector};
use aprender::traits::Estimator;

// Three features (columns): the first is informative, the second is weak, the third is noise
let x = Matrix::from_vec(5, 3, vec![
    1.0, 0.1, 0.01,
    2.0, 0.2, 0.02,
    3.0, 0.1, 0.03,
    4.0, 0.3, 0.01,
    5.0, 0.2, 0.02,
]).unwrap();
let y = Vector::from_vec(vec![1.1, 2.2, 3.1, 4.3, 5.2]);

// Lasso with high α produces a sparse model
let mut model = Lasso::new(0.5)
    .with_max_iter(1000)
    .with_tol(1e-4);
model.fit(&x, &y).unwrap();

let coef = model.coefficients();
// Some coefficients will be exactly 0.0 (sparsity!)
println!("Coefficients: {:?}", coef);
// e.g., [1.05, 0.0, 0.0] - only first feature selected
```
Test Reference: src/linear_model/mod.rs::tests::test_lasso_produces_sparsity
Example 3: ElasticNet (Combined)
```rust
use aprender::linear_model::ElasticNet;
use aprender::primitives::{Matrix, Vector};
use aprender::traits::Estimator;

let x = Matrix::from_vec(4, 2, vec![
    1.0, 2.0,
    2.0, 3.0,
    3.0, 4.0,
    4.0, 5.0,
]).unwrap();
let y = Vector::from_vec(vec![3.0, 5.0, 7.0, 9.0]);

// ElasticNet with α = 1.0, l1_ratio = 0.5 (50% L1, 50% L2)
let mut model = ElasticNet::new(1.0, 0.5)
    .with_max_iter(1000)
    .with_tol(1e-4);
model.fit(&x, &y).unwrap();

// Gets benefits of both: some sparsity + stability
let r2 = model.score(&x, &y);
println!("R² = {:.3}", r2);
```
Test Reference: src/linear_model/mod.rs::tests::test_elastic_net_simple
Choosing the Right Regularization
Decision Guide
Do you need feature selection (interpretability)?
├─ YES → Lasso or ElasticNet (L1 component)
└─ NO → Ridge (simpler, faster)
Are features highly correlated?
├─ YES → ElasticNet (avoids arbitrary selection)
└─ NO → Lasso (cleaner sparsity)
Is the problem well-conditioned?
├─ YES → All methods work
└─ NO (p > n, multicollinearity) → Ridge (always stable)
Do you want maximum simplicity?
├─ YES → Ridge (closed-form, one hyperparameter)
└─ NO → ElasticNet (two hyperparameters, more flexible)
Comparison Table
| Method | Penalty | Sparsity | Stability | Speed | Use Case |
|---|---|---|---|---|---|
| Ridge | L2 (‖β‖²₂) | No | High | Fast (closed-form) | Multicollinearity, p > n |
| Lasso | L1 (‖β‖₁) | Yes | Lower with correlated features | Slower (iterative) | Feature selection |
| ElasticNet | L1 + L2 | Yes | High | Slower | Correlated features + selection |
Hyperparameter Selection
The α Parameter
- Too small (α → 0): No regularization → overfitting
- Too large (α → ∞): Over-regularization → underfitting (all β → 0)
- Just right: Balances the bias-variance trade-off
Finding optimal α: Use cross-validation (see Cross-Validation Theory)
```text
// Typical workflow (pseudocode)
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0] {
    model = Ridge::new(alpha);
    cv_score = cross_validate(model, x, y, k=5);
    // Select alpha with best cv_score
}
```
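A concrete version of this loop needs no dedicated CV helper: hold out a validation set and score each candidate α with the Ridge API from Example 1. The sketch below uses a hard-coded split and made-up data purely for illustration; in practice, split randomly or use k-fold cross-validation.

```rust
use aprender::linear_model::Ridge;
use aprender::primitives::{Matrix, Vector};
use aprender::traits::Estimator;

fn main() {
    // Hard-coded train/validation split, for illustration only.
    let x_train = Matrix::from_vec(4, 1, vec![1.0, 2.0, 3.0, 4.0]).unwrap();
    let y_train = Vector::from_vec(vec![2.1, 3.9, 6.2, 7.8]);
    let x_val = Matrix::from_vec(2, 1, vec![5.0, 6.0]).unwrap();
    let y_val = Vector::from_vec(vec![10.1, 11.9]);

    for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0] {
        let mut model = Ridge::new(alpha);
        model.fit(&x_train, &y_train).unwrap();
        // Pick the alpha with the highest validation R².
        println!("alpha = {}: validation R² = {:.3}", alpha, model.score(&x_val, &y_val));
    }
}
```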
The l1_ratio Parameter (ElasticNet)
- l1_ratio = 1.0: Pure Lasso (maximum sparsity)
- l1_ratio = 0.0: Pure Ridge (maximum stability)
- l1_ratio = 0.5: Balanced (common choice)
Grid Search: Try multiple (α, l1_ratio) pairs, select best via CV
Practical Considerations
Feature Scaling is CRITICAL
Problem: Ridge and Lasso penalize coefficients by magnitude
- Features on different scales → unequal penalization
- Large-scale feature gets less penalty than small-scale feature
Solution: Always standardize features before regularization
```rust
use aprender::preprocessing::StandardScaler;

let mut scaler = StandardScaler::new();
scaler.fit(&x_train);
let x_train_scaled = scaler.transform(&x_train);
let x_test_scaled = scaler.transform(&x_test);

// Now fit the regularized model on scaled data
let mut model = Ridge::new(1.0);
model.fit(&x_train_scaled, &y_train).unwrap();
```
Intercept Not Regularized
Neither Ridge nor Lasso penalizes the intercept. Why?
- Intercept represents overall mean of target
- Penalizing it would bias predictions
- Implementation: Set fit_intercept=true (default)
Multicollinearity
- Problem: When features are highly correlated, OLS estimates become unstable
- Ridge Solution: Adding αI to X^T X guarantees invertibility (each eigenvalue of X^T X + αI is at least α > 0)
- Lasso Behavior: Arbitrarily picks one feature from a correlated group
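A tiny numeric illustration of why the αI term helps: with two almost perfectly correlated (standardized) features, X^T X is nearly singular, but adding even a small α makes it comfortably invertible. Plain Rust, no aprender APIs.

```rust
fn main() {
    // Gram matrix X^T X for two highly correlated, standardized features:
    // diagonal entries a, off-diagonal entries b.
    let (a, b) = (1.0_f64, 0.99_f64);
    let det = a * a - b * b;
    println!("det(X^T X)           = {:.4}", det); // ~0.02: nearly singular

    let alpha = 0.1;
    let det_ridge = (a + alpha) * (a + alpha) - b * b;
    println!("det(X^T X + alpha I) = {:.4}", det_ridge); // ~0.23: well conditioned
}
```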
Verification Through Tests
Regularization models have comprehensive test coverage:
Ridge Tests (14 tests):
- Closed-form solution correctness
- Coefficients shrink as α increases
- α = 0 recovers OLS
- Multivariate regression
- Save/load serialization
Lasso Tests (12 tests):
- Sparsity property (some coefficients = 0)
- Coordinate descent convergence
- Soft-thresholding operator
- High α → all coefficients → 0
ElasticNet Tests (15 tests):
- l1_ratio = 1.0 behaves like Lasso
- l1_ratio = 0.0 behaves like Ridge
- Mixed penalty balances sparsity and stability
Test Reference: src/linear_model/mod.rs tests
Real-World Application
When Ridge Outperforms OLS
Scenario: Predicting house prices with 20 correlated features (size, bedrooms, bathrooms, etc.)
- OLS Problem: High-variance estimates, unstable predictions
- Ridge Solution: Shrinks correlated coefficients, reduces variance
Result: Lower test error despite higher bias
When Lasso Enables Interpretation
Scenario: Medical diagnosis with 1000 genetic markers, only ~10 relevant
- Lasso Benefit: Selects a sparse subset (e.g., 12 markers); the rest → 0
- Business Value: Cheaper tests (measure only 12 markers), interpretable model
Further Reading
Peer-Reviewed Papers
Tibshirani (1996) - Regression Shrinkage and Selection via the Lasso
- Relevance: Original Lasso paper introducing L1 regularization
- Link: JSTOR (publicly accessible)
- Key Contribution: Proves Lasso produces sparse solutions
- Applied in: src/linear_model/mod.rs Lasso implementation
Zou & Hastie (2005) - Regularization and Variable Selection via the Elastic Net
- Relevance: Introduces ElasticNet combining L1 and L2
- Link: JSTOR (publicly accessible)
- Key Contribution: Solves Lasso's limitations with correlated features
- Applied in: src/linear_model/mod.rs ElasticNet implementation
Related Chapters
- Linear Regression Theory - OLS foundation
- Cross-Validation Theory - Hyperparameter tuning
- Feature Scaling Theory - CRITICAL for regularization
- Regression Metrics Theory - Evaluating regularized models
Summary
What You Learned:
- ✅ Regularization = loss + penalty (bias-variance trade-off)
- ✅ Ridge (L2): Shrinks all coefficients, closed-form, stable
- ✅ Lasso (L1): Produces sparsity, feature selection, iterative
- ✅ ElasticNet: Combines L1 + L2, best of both worlds
- ✅ Feature scaling is MANDATORY for regularization
- ✅ Hyperparameter tuning via cross-validation
Verification Guarantee: All regularization methods extensively tested (40+ tests) in src/linear_model/mod.rs. Tests verify mathematical properties (sparsity, shrinkage, equivalence).
Quick Reference:
- Multicollinearity: Ridge
- Feature selection: Lasso
- Correlated features + selection: ElasticNet
- Speed: Ridge (fastest)
Key Equation:
Ridge: β = (X^T X + αI)^(-1) X^T y
Lasso: minimize ||y - Xβ||² + α||β||₁
ElasticNet: minimize ||y - Xβ||² + α[ρ||β||₁ + (1-ρ)||β||²₂]
Next Chapter: Logistic Regression Theory
Previous Chapter: Linear Regression Theory