Advanced Optimizers Theory
Modern optimizers go beyond vanilla gradient descent by adapting learning rates, incorporating momentum, and using gradient statistics to achieve faster and more stable convergence. This chapter covers state-of-the-art optimization algorithms used in deep learning and machine learning.
Why Advanced Optimizers?
Standard SGD with momentum works well but has limitations:
- Fixed learning rate: Same η for all parameters
  - Problem: Different parameters may need different learning rates
  - Example: Rare features need larger updates than frequent ones
- Manual tuning required: Finding optimal η is time-consuming
  - Grid search is expensive
  - Different datasets need different learning rates
- Slow convergence: Without careful tuning, training can be slow
  - Especially in non-convex landscapes
  - High-dimensional parameter spaces
Solution: Adaptive optimizers that automatically adjust learning rates per parameter.
Optimizer Comparison Table
| Optimizer | Key Feature | Best For | Pros | Cons |
|---|---|---|---|---|
| SGD + Momentum | Velocity accumulation | General purpose | Simple, well-understood | Requires manual tuning |
| AdaGrad | Per-parameter lr | Sparse gradients | Adapts to data | lr decays too aggressively |
| RMSprop | Exponential moving average | RNNs, non-stationary | Fixes AdaGrad decay | No bias correction |
| Adam | Momentum + RMSprop | Deep learning (default) | Fast, robust | Can overfit on small data |
| AdamW | Adam + decoupled weight decay | Transformers | Better generalization | Slightly slower |
| Nadam | Adam + Nesterov momentum | Computer vision | Faster convergence | More complex |
AdaGrad: Adaptive Gradient Algorithm
Key idea: Accumulate squared gradients and divide learning rate by their square root, giving smaller updates to frequently updated parameters.
Algorithm
Initialize:
θ₀ = initial parameters
G₀ = 0 (accumulated squared gradients)
η = learning rate (typically 0.01)
ε = 1e-8 (numerical stability)
For t = 1, 2, ...
g_t = ∇L(θ_{t-1}) // Compute gradient
G_t = G_{t-1} + g_t ⊙ g_t // Accumulate squared gradients
θ_t = θ_{t-1} - η / √(G_t + ε) ⊙ g_t // Adaptive update
Where ⊙ denotes element-wise multiplication.
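As a minimal sketch (assuming plain f32 slices, not aprender's API — AdaGrad is not implemented there as of v0.5.0), one AdaGrad step can be written as:

```rust
/// One AdaGrad step over flat parameter/gradient slices.
/// `accum` is G, the running sum of squared gradients.
fn adagrad_step(params: &mut [f32], grads: &[f32], accum: &mut [f32], lr: f32, eps: f32) {
    for i in 0..params.len() {
        accum[i] += grads[i] * grads[i];                      // G_t = G_{t-1} + g_t ⊙ g_t
        params[i] -= lr / (accum[i] + eps).sqrt() * grads[i]; // θ_t = θ_{t-1} - η/√(G_t+ε) ⊙ g_t
    }
}
```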
How It Works
Per-parameter learning rate:
η_i(t) = η / √( Σ_{s=1..t} g_{i,s}² + ε )
- Frequently updated parameters → large accumulated gradient → small effective η
- Infrequently updated parameters → small accumulated gradient → large effective η
Example
Consider two parameters with gradients:
Parameter θ₁: Gradients = [10, 10, 10, 10] (frequent updates)
Parameter θ₂: Gradients = [1, 0, 0, 1] (sparse updates)
After 4 iterations (η = 0.1):
θ₁: G = 10² + 10² + 10² + 10² = 400
Effective η₁ = 0.1 / √400 = 0.1 / 20 = 0.005 (small)
θ₂: G = 1² + 0² + 0² + 1² = 2
Effective η₂ = 0.1 / √2 = 0.1 / 1.41 ≈ 0.071 (large)
Result: θ₂'s effective learning rate is ~14× larger, even though its gradients are smaller!
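The arithmetic can be verified with a few lines of plain Rust (self-contained; no library assumptions):

```rust
fn main() {
    let eta = 0.1_f32;
    let g1_sum: f32 = [10.0_f32, 10.0, 10.0, 10.0].iter().map(|g| g * g).sum(); // 400
    let g2_sum: f32 = [1.0_f32, 0.0, 0.0, 1.0].iter().map(|g| g * g).sum();     // 2
    println!("effective η₁ = {:.4}", eta / g1_sum.sqrt()); // ≈ 0.0050
    println!("effective η₂ = {:.4}", eta / g2_sum.sqrt()); // ≈ 0.0707
}
```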
Advantages
- Automatic learning rate adaptation: No manual tuning per parameter
- Great for sparse data: NLP, recommender systems
- Handles different scales: Features with different ranges
Disadvantages
- Learning rate decay: Accumulation never decreases
  - Eventually η → 0, stopping learning
  - Problem for deep learning (many iterations)
- Requires careful initialization: Poor initial η can hurt performance
When to Use
- Sparse gradients: NLP (word embeddings), recommender systems
- Convex optimization: Guaranteed convergence for convex functions
- Short training: If iteration count is small
Not recommended for: Deep neural networks (use RMSprop or Adam instead)
RMSprop: Root Mean Square Propagation
Key idea: Fix AdaGrad's aggressive learning rate decay by using exponential moving average instead of sum.
Algorithm
Initialize:
θ₀ = initial parameters
v₀ = 0 (moving average of squared gradients)
η = learning rate (typically 0.001)
β = decay rate (typically 0.9)
ε = 1e-8
For t = 1, 2, ...
g_t = ∇L(θ_{t-1}) // Compute gradient
v_t = β·v_{t-1} + (1-β)·(g_t ⊙ g_t) // Exponential moving average
θ_t = θ_{t-1} - η / √(v_t + ε) ⊙ g_t // Adaptive update
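A minimal sketch of one RMSprop step over flat f32 slices (illustration only; RMSprop is not in aprender as of v0.5.0):

```rust
/// One RMSprop step. `avg_sq` is v, the exponential moving average of squared gradients.
fn rmsprop_step(
    params: &mut [f32], grads: &[f32], avg_sq: &mut [f32],
    lr: f32, beta: f32, eps: f32,
) {
    for i in 0..params.len() {
        avg_sq[i] = beta * avg_sq[i] + (1.0 - beta) * grads[i] * grads[i]; // EMA of g²
        params[i] -= lr / (avg_sq[i] + eps).sqrt() * grads[i];             // adaptive update
    }
}
```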
Key Difference from AdaGrad
AdaGrad: G_t = G_{t-1} + g_t² (sum, always increasing)
RMSprop: v_t = β·v_{t-1} + (1-β)·g_t² (exponential moving average)
The exponential moving average forgets old gradients, allowing the effective learning rate to increase again if recent gradients are small.
Effect of Decay Rate β
β = 0.9 (typical):
- Averages over ~10 iterations
- Balance between stability and adaptivity
β = 0.99:
- Averages over ~100 iterations
- More stable, slower adaptation
β = 0.5:
- Averages over ~2 iterations
- Fast adaptation, more noise
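These windows come from the exponential weights: a gradient from k steps ago carries weight (1-β)·β^k, which falls to about 1/e of its initial value after roughly 1/(1-β) steps (β = 0.9 → ~10, β = 0.99 → ~100, β = 0.5 → ~2).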
Advantages
- No learning rate decay problem: Can train indefinitely
- Works well for RNNs: Handles non-stationary problems
- Less sensitive to initialization: Compared to AdaGrad
Disadvantages
- No bias correction: Early iterations biased toward 0
- Still requires tuning: η and β hyperparameters
When to Use
- RNNs and LSTMs: Originally designed for this
- Non-stationary problems: Changing data distributions
- Deep learning: Better than AdaGrad for many epochs
Adam: Adaptive Moment Estimation
The most popular optimizer in modern deep learning. Combines the best ideas from momentum and RMSprop.
Core Concept
Adam maintains two moving averages:
- First moment (m): Exponential moving average of gradients (momentum)
- Second moment (v): Exponential moving average of squared gradients (RMSprop)
Algorithm
Initialize:
θ₀ = initial parameters
m₀ = 0 (first moment: mean gradient)
v₀ = 0 (second moment: uncentered variance)
η = 0.001 (learning rate)
β₁ = 0.9 (exponential decay for first moment)
β₂ = 0.999 (exponential decay for second moment)
ε = 1e-8
For t = 1, 2, ...
g_t = ∇L(θ_{t-1}) // Gradient
m_t = β₁·m_{t-1} + (1-β₁)·g_t // Update first moment
v_t = β₂·v_{t-1} + (1-β₂)·(g_t ⊙ g_t) // Update second moment
m̂_t = m_t / (1 - β₁^t) // Bias correction for m
v̂_t = v_t / (1 - β₂^t) // Bias correction for v
θ_t = θ_{t-1} - η · m̂_t / (√v̂_t + ε) // Parameter update
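The same update, written out as a standalone Rust sketch over flat f32 slices (the aprender API is shown below under "aprender Implementation"; this version just mirrors the pseudocode):

```rust
/// One Adam step. `m`/`v` are the first/second moment buffers;
/// `t` is the 1-based iteration count used for bias correction.
fn adam_step(
    params: &mut [f32], grads: &[f32], m: &mut [f32], v: &mut [f32],
    t: i32, lr: f32, beta1: f32, beta2: f32, eps: f32,
) {
    for i in 0..params.len() {
        m[i] = beta1 * m[i] + (1.0 - beta1) * grads[i];            // first moment
        v[i] = beta2 * v[i] + (1.0 - beta2) * grads[i] * grads[i]; // second moment
        let m_hat = m[i] / (1.0 - beta1.powi(t));                  // bias-corrected m
        let v_hat = v[i] / (1.0 - beta2.powi(t));                  // bias-corrected v
        params[i] -= lr * m_hat / (v_hat.sqrt() + eps);            // parameter update
    }
}
```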
Why Bias Correction?
Initially, m and v are zero. Exponential moving averages are biased toward zero at the start.
Example (β₁ = 0.9, g₁ = 1.0):
Without bias correction:
m₁ = 0.9 × 0 + 0.1 × 1.0 = 0.1 (underestimates true mean of 1.0)
With bias correction:
m̂₁ = 0.1 / (1 - 0.9¹) = 0.1 / 0.1 = 1.0 (correct!)
The correction factor 1 - β^t approaches 1 as t increases, so correction matters most early in training.
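A few lines of plain Rust make the effect visible: with a constant gradient of 1.0, the raw moving average crawls up from zero while the bias-corrected estimate is exact from the first step.

```rust
fn main() {
    let beta1 = 0.9_f32;
    let mut m = 0.0_f32;
    for t in 1..=5 {
        m = beta1 * m + (1.0 - beta1) * 1.0;   // constant gradient g_t = 1.0
        let m_hat = m / (1.0 - beta1.powi(t)); // bias-corrected estimate
        println!("t = {}: m = {:.3}, m_hat = {:.3}", t, m, m_hat);
    }
}
```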
Hyperparameters
Default values (from paper, work well in practice):
- η = 0.001: Learning rate (most important to tune)
- β₁ = 0.9: First moment decay (rarely changed)
- β₂ = 0.999: Second moment decay (rarely changed)
- ε = 1e-8: Numerical stability
Tuning guidelines:
- Start with defaults
- If unstable: reduce η to 0.0001
- If slow: increase η to 0.01
- Adjust β₁ for more/less momentum (rarely needed)
aprender Implementation
use aprender::optim::{Adam, Optimizer};
// Create Adam optimizer with default hyperparameters
let mut adam = Adam::new(0.001) // learning rate
.with_beta1(0.9) // optional: momentum coefficient
.with_beta2(0.999) // optional: RMSprop coefficient
.with_epsilon(1e-8); // optional: numerical stability
// Training loop (data_loader, loss_fn, and compute_gradients are assumed to be defined elsewhere)
for epoch in 0..num_epochs {
for batch in data_loader {
// Forward pass
let predictions = model.predict(&batch.x);
let loss = loss_fn(predictions, batch.y);
// Compute gradients
let grads = compute_gradients(&model, &batch);
// Adam step (handles momentum + adaptive lr internally)
adam.step(&mut model.params, &grads);
}
}
Key methods:
- Adam::new(η): Create with learning rate
- with_beta1(β₁), with_beta2(β₂): Set moment decay rates
- step(&mut params, &grads): Perform one update step
- reset(): Reset moment buffers (for multiple training runs)
Advantages
- Robust: Works well with default hyperparameters
- Fast convergence: Combines momentum + adaptive lr
- Memory efficient: Only 2x parameter memory (m and v)
- Well-studied: Extensive empirical validation
Disadvantages
- Can overfit: On small datasets or with insufficient regularization
- Generalization: Sometimes SGD with momentum generalizes better
- Memory overhead: 2x parameter count
When to Use
- Default choice: For most deep learning problems
- Fast prototyping: Converges quickly, minimal tuning
- Large-scale training: Handles high-dimensional problems well
When to avoid:
- Very small datasets (<1000 samples): Try SGD + momentum
- Need best generalization: Consider SGD with learning rate schedule
AdamW: Adam with Decoupled Weight Decay
Problem with Adam: Weight decay (L2 regularization) interacts badly with adaptive learning rates.
Solution: Decouple weight decay from gradient-based optimization.
Standard Adam with Weight Decay (Wrong)
g_t = ∇L(θ_{t-1}) + λ·θ_{t-1} // Add L2 penalty to gradient
// ... normal Adam update with modified gradient
Problem: The weight-decay term gets divided by the second-moment estimate along with the rest of the gradient, weakening regularization for parameters with large gradient histories.
AdamW (Correct)
// Normal Adam update (no λ in gradient)
m_t = β₁·m_{t-1} + (1-β₁)·g_t
v_t = β₂·v_{t-1} + (1-β₂)·(g_t ⊙ g_t)
m̂_t = m_t / (1 - β₁^t)
v̂_t = v_t / (1 - β₂^t)
// Separate weight decay step
θ_t = θ_{t-1} - η · (m̂_t / (√v̂_t + ε) + λ·θ_{t-1})
Weight decay applied directly to parameters, not through adaptive learning rates.
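A sketch of the decoupled update (AdamW is not in aprender as of v0.5.0; this just mirrors the equations above). The only change from plain Adam is that the λ·θ term is added outside the adaptive scaling:

```rust
/// One AdamW step: identical to Adam except weight decay is applied
/// directly to the parameter rather than folded into the gradient.
fn adamw_step(
    params: &mut [f32], grads: &[f32], m: &mut [f32], v: &mut [f32],
    t: i32, lr: f32, beta1: f32, beta2: f32, eps: f32, weight_decay: f32,
) {
    for i in 0..params.len() {
        m[i] = beta1 * m[i] + (1.0 - beta1) * grads[i];
        v[i] = beta2 * v[i] + (1.0 - beta2) * grads[i] * grads[i];
        let m_hat = m[i] / (1.0 - beta1.powi(t));
        let v_hat = v[i] / (1.0 - beta2.powi(t));
        // Decoupled weight decay: λ·θ is not divided by √v̂, so the
        // regularization strength is the same for every parameter.
        let update = m_hat / (v_hat.sqrt() + eps) + weight_decay * params[i];
        params[i] -= lr * update;
    }
}
```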
When to Use
- Transformers: Essential for BERT, GPT models
- Large models: Better generalization on big networks
- Transfer learning: Fine-tuning pre-trained models
Hyperparameters:
- Same as Adam, plus:
- λ = 0.01: Weight decay coefficient (typical)
Optimizer Selection Guide
Decision Tree
Start
│
├─ Need fast prototyping?
│ └─ YES → Adam (default: η=0.001)
│
├─ Training RNN/LSTM?
│ └─ YES → RMSprop (default: η=0.001, β=0.9)
│
├─ Working with transformers?
│ └─ YES → AdamW (η=0.001, λ=0.01)
│
├─ Sparse gradients (NLP embeddings)?
│ └─ YES → AdaGrad (η=0.01)
│
├─ Need best generalization?
│ └─ YES → SGD + momentum (η=0.1, β=0.9) + lr schedule
│
└─ Small dataset (<1000 samples)?
└─ YES → SGD + momentum (less overfitting)
Practical Recommendations
| Task | Recommended Optimizer | Learning Rate | Notes |
|---|---|---|---|
| Image classification (CNN) | Adam or SGD+momentum | 0.001 (Adam), 0.1 (SGD) | SGD often better final accuracy |
| NLP (word embeddings) | AdaGrad or Adam | 0.01 (AdaGrad), 0.001 (Adam) | AdaGrad for sparse features |
| RNN/LSTM | RMSprop or Adam | 0.001 | RMSprop traditional choice |
| Transformers | AdamW | 0.0001-0.001 | Essential for BERT, GPT |
| Small dataset | SGD + momentum | 0.01-0.1 | Less prone to overfitting |
| Reinforcement learning | Adam or RMSprop | 0.0001-0.001 | Non-stationary problem |
Learning Rate Schedules
Even with adaptive optimizers, learning rate schedules improve performance.
1. Step Decay
Reduce η by a fixed factor every K epochs:
let initial_lr = 0.001;
let decay_factor = 0.1;
let decay_epochs = 30;
for epoch in 0..num_epochs {
let lr = initial_lr * decay_factor.powi((epoch / decay_epochs) as i32);
let mut adam = Adam::new(lr); // note: re-creating the optimizer each epoch resets its moment buffers
// ... training
}
2. Exponential Decay
Smooth exponential reduction:
let initial_lr = 0.001;
let decay_rate = 0.96;
for epoch in 0..num_epochs {
let lr = initial_lr * decay_rate.powi(epoch as i32);
let mut adam = Adam::new(lr);
// ... training
}
3. Cosine Annealing
Smooth reduction following cosine curve:
use std::f32::consts::PI;
let lr_max = 0.001;
let lr_min = 0.00001;
let t_max = 100; // annealing period: epochs per cosine cycle
for epoch in 0..num_epochs {
    let lr = lr_min + 0.5 * (lr_max - lr_min)
        * (1.0 + f32::cos(PI * (epoch as f32) / (t_max as f32)));
let mut adam = Adam::new(lr);
// ... training
}
4. Warm-up + Decay
Start small, increase, then decay (used in transformers):
fn learning_rate_schedule(step: usize, d_model: usize, warmup_steps: usize) -> f32 {
let d_model = d_model as f32;
let step = step as f32;
let warmup_steps = warmup_steps as f32;
let arg1 = 1.0 / step.sqrt();
let arg2 = step * warmup_steps.powf(-1.5);
d_model.powf(-0.5) * arg1.min(arg2)
}
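For example, printing the schedule at a few steps shows the warm-up followed by inverse-square-root decay. The settings d_model = 512 and warmup_steps = 4000 are common transformer defaults, used here purely for illustration:

```rust
fn main() {
    for step in [1, 100, 1_000, 4_000, 10_000] {
        let lr = learning_rate_schedule(step, 512, 4_000);
        println!("step {:>6}: lr = {:.6}", step, lr);
    }
}
```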
Comparison: SGD vs Adam
When SGD + Momentum is Better
Advantages:
- Better final generalization (lower test error)
- Flatter minima (more robust to perturbations)
- Less memory (only a velocity buffer, vs. Adam's two moment buffers)
Requirements:
- Careful learning rate tuning
- Learning rate schedule essential
- More training time may be needed
When Adam is Better
Advantages:
- Faster initial convergence
- Minimal hyperparameter tuning
- Works across many problem types
Trade-offs:
- Can overfit more easily
- May find sharper minima
- Slightly worse generalization on some tasks
Empirical Rule
Adam for:
- Fast prototyping and experimentation
- Baseline models
- Large-scale problems (many parameters)
SGD + momentum for:
- Final production models (after tuning)
- When computational budget allows careful tuning
- Small to medium datasets
Debugging Optimizer Issues
Loss Not Decreasing
Possible causes:
- Learning rate too small
- Fix: Increase η by 10x
- Vanishing gradients
- Fix: Check gradient norms, adjust architecture
- Bug in gradient computation
- Fix: Use gradient checking
Loss Exploding (NaN)
Possible causes:
- Learning rate too large
- Fix: Reduce η by 10x
- Gradient explosion
- Fix: Gradient clipping (see the sketch below), better initialization
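Gradient clipping can sit in front of any optimizer step; a minimal sketch of global-norm clipping over a flat gradient slice (an illustration, not an aprender API):

```rust
/// Scale the whole gradient vector down if its L2 norm exceeds `max_norm`.
fn clip_grad_norm(grads: &mut [f32], max_norm: f32) {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
}
```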
Slow Convergence
Possible causes:
- Poor learning rate
- Fix: Try different optimizer (Adam if using SGD)
- No momentum
- Fix: Add momentum (β=0.9)
- Suboptimal batch size
- Fix: Try 32, 64, 128
Overfitting
Possible causes:
- Optimizer too aggressive (Adam on small data)
- Fix: Switch to SGD + momentum
- No regularization
- Fix: Add weight decay (AdamW)
aprender Optimizer Example
use aprender::optim::{Adam, SGD, Optimizer};
use aprender::linear_model::LogisticRegression;
use aprender::prelude::*;
// Example: Comparing Adam vs SGD
fn compare_optimizers(x_train: &Matrix<f32>, y_train: &Vector<i32>) {
// Optimizer 1: Adam (fast convergence)
let mut model_adam = LogisticRegression::new();
let mut adam = Adam::new(0.001);
println!("Training with Adam...");
for epoch in 0..50 {
let loss = train_epoch(&mut model_adam, x_train, y_train, &mut adam);
if epoch % 10 == 0 {
println!(" Epoch {}: Loss = {:.4}", epoch, loss);
}
}
// Optimizer 2: SGD + momentum (better generalization)
let mut model_sgd = LogisticRegression::new();
let mut sgd = SGD::new(0.1).with_momentum(0.9);
println!("\nTraining with SGD + Momentum...");
for epoch in 0..50 {
let loss = train_epoch(&mut model_sgd, x_train, y_train, &mut sgd);
if epoch % 10 == 0 {
println!(" Epoch {}: Loss = {:.4}", epoch, loss);
}
}
}
fn train_epoch<O: Optimizer>(
model: &mut LogisticRegression,
x: &Matrix<f32>,
y: &Vector<i32>,
optimizer: &mut O,
) -> f32 {
// Compute loss and gradients (compute_cross_entropy_loss and
// compute_gradients are assumed helper functions defined elsewhere)
let predictions = model.predict_proba(x);
let loss = compute_cross_entropy_loss(&predictions, y);
let grads = compute_gradients(model, x, y);
// Update parameters
optimizer.step(&mut model.coefficients_mut(), &grads);
loss
}
Further Reading
Seminal Papers:
- Adam: Kingma & Ba (2015). "Adam: A Method for Stochastic Optimization"
- AdamW: Loshchilov & Hutter (2019). "Decoupled Weight Decay Regularization"
- RMSprop: Tieleman & Hinton (2012). Coursera "Neural Networks for Machine Learning", Lecture 6.5 (unpublished)
- AdaGrad: Duchi et al. (2011). "Adaptive Subgradient Methods"
Practical Guides:
- Ruder, S. (2016). "An overview of gradient descent optimization algorithms"
- CS231n Stanford: Optimization notes
Related Chapters
- Gradient Descent Theory - Foundation for all optimizers
- Optimizer Demo - Visual comparison of SGD and Adam
- Regularized Regression - Coordinate descent alternative
Summary
| Optimizer | Core Innovation | When to Use | aprender Support |
|---|---|---|---|
| AdaGrad | Per-parameter learning rates | Sparse gradients, convex problems | Not yet (v0.5.0) |
| RMSprop | Exponential moving average of squared gradients | RNNs, non-stationary | Not yet (v0.5.0) |
| Adam | Momentum + RMSprop + bias correction | Default choice, deep learning | ✅ Implemented |
| AdamW | Adam + decoupled weight decay | Transformers, large models | Not yet (v0.5.0) |
Key Takeaways:
- Adam is the default for most deep learning: fast, robust, minimal tuning
- SGD + momentum often achieves better final accuracy with proper tuning
- Learning rate schedules improve all optimizers
- AdamW essential for training transformers
- Monitor convergence: Loss curves reveal optimizer issues
Modern optimizers dramatically accelerate machine learning by adapting learning rates automatically. Understanding their trade-offs enables choosing the right tool for each problem.