Advanced Optimizers Theory

Modern optimizers go beyond vanilla gradient descent by adapting learning rates, incorporating momentum, and using gradient statistics to achieve faster and more stable convergence. This chapter covers state-of-the-art optimization algorithms used in deep learning and machine learning.

Why Advanced Optimizers?

Standard SGD with momentum works well but has limitations:

  1. Fixed learning rate: Same η for all parameters

    • Problem: Different parameters may need different learning rates
    • Example: Rare features need larger updates than frequent ones
  2. Manual tuning required: Finding optimal η is time-consuming

    • Grid search expensive
    • Different datasets need different learning rates
  3. Slow convergence: Without careful tuning, training can be slow

    • Especially in non-convex landscapes
    • High-dimensional parameter spaces

Solution: Adaptive optimizers that automatically adjust learning rates per parameter.

Optimizer Comparison Table

| Optimizer | Key Feature | Best For | Pros | Cons |
|-----------|-------------|----------|------|------|
| SGD + Momentum | Velocity accumulation | General purpose | Simple, well-understood | Requires manual tuning |
| AdaGrad | Per-parameter lr | Sparse gradients | Adapts to data | lr decays too aggressively |
| RMSprop | Exponential moving average | RNNs, non-stationary | Fixes AdaGrad decay | No bias correction |
| Adam | Momentum + RMSprop | Deep learning (default) | Fast, robust | Can overfit on small data |
| AdamW | Adam + decoupled weight decay | Transformers | Better generalization | Slightly slower |
| Nadam | Adam + Nesterov momentum | Computer vision | Faster convergence | More complex |

AdaGrad: Adaptive Gradient Algorithm

Key idea: Accumulate squared gradients and divide learning rate by their square root, giving smaller updates to frequently updated parameters.

Algorithm

Initialize:
  θ₀ = initial parameters
  G₀ = 0  (accumulated squared gradients)
  η = learning rate (typically 0.01)
  ε = 1e-8 (numerical stability)

For t = 1, 2, ...
  g_t = ∇L(θ_{t-1})             // Compute gradient
  G_t = G_{t-1} + g_t ⊙ g_t      // Accumulate squared gradients
  θ_t = θ_{t-1} - η / √(G_t + ε) ⊙ g_t  // Adaptive update

Where ⊙ denotes element-wise multiplication.

How It Works

Per-parameter learning rate:

η_i(t) = η / √( Σ_{s=1..t} (g_{i,s})² + ε )
  • Frequently updated parameters → large accumulated gradient → small effective η
  • Infrequently updated parameters → small accumulated gradient → large effective η

Example

Consider two parameters with gradients:

Parameter θ₁: Gradients = [10, 10, 10, 10]  (frequent updates)
Parameter θ₂: Gradients = [1, 0, 0, 1]      (sparse updates)

After 4 iterations (η = 0.1):

θ₁: G = 10² + 10² + 10² + 10² = 400
    Effective η₁ = 0.1 / √400 = 0.1 / 20 = 0.005  (small)

θ₂: G = 1² + 0² + 0² + 1² = 2
    Effective η₂ = 0.1 / √2 = 0.1 / 1.41 ≈ 0.071  (large)

Result: θ₂'s effective learning rate is ~14x larger than θ₁'s, even though its gradients are smaller!
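
To make the update rule concrete, here is a minimal from-scratch sketch of AdaGrad over plain f32 slices, reproducing the gradients above. It is not an aprender API (the summary table notes AdaGrad is not yet implemented there); all names are illustrative.

/// One AdaGrad update step over flat parameter/gradient slices.
/// `accum` holds the running sum of squared gradients (G_t), same length as `params`.
fn adagrad_step(params: &mut [f32], grads: &[f32], accum: &mut [f32], lr: f32, eps: f32) {
    for i in 0..params.len() {
        accum[i] += grads[i] * grads[i];                       // G_t = G_{t-1} + g_t ⊙ g_t
        params[i] -= lr / (accum[i] + eps).sqrt() * grads[i];  // θ_t = θ_{t-1} - η/√(G_t+ε) ⊙ g_t
    }
}

fn main() {
    // Reproduce the worked example: θ₂'s sparse gradients keep its effective lr large.
    let mut params = [0.0_f32, 0.0];
    let mut accum = [0.0_f32, 0.0];
    let gradient_steps = [[10.0_f32, 1.0], [10.0, 0.0], [10.0, 0.0], [10.0, 1.0]];
    for g in &gradient_steps {
        adagrad_step(&mut params, g, &mut accum, 0.1, 1e-8);
    }
    println!("accumulated G = {:?}", accum); // ≈ [400.0, 2.0]
}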

Advantages

  1. Automatic learning rate adaptation: No manual tuning per parameter
  2. Great for sparse data: NLP, recommender systems
  3. Handles different scales: Features with different ranges

Disadvantages

  1. Learning rate decay: Accumulation never decreases

    • Eventually η → 0, stopping learning
    • Problem for deep learning (many iterations)
  2. Requires careful initialization: Poor initial η can hurt performance

When to Use

  • Sparse gradients: NLP (word embeddings), recommender systems
  • Convex optimization: Guaranteed convergence for convex functions
  • Short training: If iteration count is small

Not recommended for: Deep neural networks (use RMSprop or Adam instead)

RMSprop: Root Mean Square Propagation

Key idea: Fix AdaGrad's aggressive learning rate decay by using exponential moving average instead of sum.

Algorithm

Initialize:
  θ₀ = initial parameters
  v₀ = 0  (moving average of squared gradients)
  η = learning rate (typically 0.001)
  β = decay rate (typically 0.9)
  ε = 1e-8

For t = 1, 2, ...
  g_t = ∇L(θ_{t-1})                    // Compute gradient
  v_t = β·v_{t-1} + (1-β)·(g_t ⊙ g_t)  // Exponential moving average
  θ_t = θ_{t-1} - η / √(v_t + ε) ⊙ g_t  // Adaptive update

Key Difference from AdaGrad

AdaGrad: G_t = G_{t-1} + g_t²            (sum, always increasing)
RMSprop: v_t = β·v_{t-1} + (1-β)·g_t²    (exponential moving average)

The exponential moving average forgets old gradients, allowing learning rate to increase again if recent gradients are small.
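
In code, the change is a one-line difference in the accumulator update. Below is a minimal from-scratch sketch of one RMSprop step on plain f32 slices (not an aprender API; names are illustrative).

/// One RMSprop update step. `sq_avg` is the exponential moving average of squared gradients (v_t).
fn rmsprop_step(params: &mut [f32], grads: &[f32], sq_avg: &mut [f32], lr: f32, beta: f32, eps: f32) {
    for i in 0..params.len() {
        // v_t = β·v_{t-1} + (1-β)·g_t² — old gradients decay instead of accumulating forever
        sq_avg[i] = beta * sq_avg[i] + (1.0 - beta) * grads[i] * grads[i];
        params[i] -= lr / (sq_avg[i] + eps).sqrt() * grads[i];
    }
}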

Effect of Decay Rate β

β = 0.9 (typical):
  - Averages over ~10 iterations
  - Balance between stability and adaptivity

β = 0.99:
  - Averages over ~100 iterations
  - More stable, slower adaptation

β = 0.5:
  - Averages over ~2 iterations
  - Fast adaptation, more noise

Rule of thumb: the effective averaging window is roughly 1/(1-β) iterations, which is where the ~10, ~100, and ~2 figures above come from.

Advantages

  1. No learning rate decay problem: Can train indefinitely
  2. Works well for RNNs: Handles non-stationary problems
  3. Less sensitive to initialization: Compared to AdaGrad

Disadvantages

  1. No bias correction: Early iterations biased toward 0
  2. Still requires tuning: η and β hyperparameters

When to Use

  • RNNs and LSTMs: Originally designed for this
  • Non-stationary problems: Changing data distributions
  • Deep learning: Better than AdaGrad for many epochs

Adam: Adaptive Moment Estimation

The most popular optimizer in modern deep learning. Combines the best ideas from momentum and RMSprop.

Core Concept

Adam maintains two moving averages:

  1. First moment (m): Exponential moving average of gradients (momentum)
  2. Second moment (v): Exponential moving average of squared gradients (RMSprop)

Algorithm

Initialize:
  θ₀ = initial parameters
  m₀ = 0  (first moment: mean gradient)
  v₀ = 0  (second moment: uncentered variance)
  η = 0.001  (learning rate)
  β₁ = 0.9   (exponential decay for first moment)
  β₂ = 0.999 (exponential decay for second moment)
  ε = 1e-8

For t = 1, 2, ...
  g_t = ∇L(θ_{t-1})                     // Gradient

  m_t = β₁·m_{t-1} + (1-β₁)·g_t         // Update first moment
  v_t = β₂·v_{t-1} + (1-β₂)·(g_t ⊙ g_t)  // Update second moment

  m̂_t = m_t / (1 - β₁^t)                // Bias correction for m
  v̂_t = v_t / (1 - β₂^t)                // Bias correction for v

  θ_t = θ_{t-1} - η · m̂_t / (√v̂_t + ε)  // Parameter update

Why Bias Correction?

Initially, m and v are zero. Exponential moving averages are biased toward zero at the start.

Example (β₁ = 0.9, g₁ = 1.0):

Without bias correction:
  m₁ = 0.9 × 0 + 0.1 × 1.0 = 0.1  (underestimates true mean of 1.0)

With bias correction:
  m̂₁ = 0.1 / (1 - 0.9¹) = 0.1 / 0.1 = 1.0  (correct!)

The correction factor 1 - β^t approaches 1 as t increases, so correction matters most early in training.
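
Putting the two moments and the bias correction together, here is a minimal from-scratch sketch of a single Adam step on plain f32 slices. It mirrors the pseudocode above; the next section shows the aprender API instead. All names are illustrative.

/// One Adam step. `m` and `v` are the first/second moment buffers; `t` is the 1-based step count.
fn adam_step(
    params: &mut [f32], grads: &[f32],
    m: &mut [f32], v: &mut [f32], t: i32,
    lr: f32, beta1: f32, beta2: f32, eps: f32,
) {
    for i in 0..params.len() {
        m[i] = beta1 * m[i] + (1.0 - beta1) * grads[i];             // first moment
        v[i] = beta2 * v[i] + (1.0 - beta2) * grads[i] * grads[i];  // second moment
        let m_hat = m[i] / (1.0 - beta1.powi(t));                   // bias correction
        let v_hat = v[i] / (1.0 - beta2.powi(t));
        params[i] -= lr * m_hat / (v_hat.sqrt() + eps);
    }
}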

Hyperparameters

Default values (from the original paper; they work well in practice):

  • η = 0.001: Learning rate (most important to tune)
  • β₁ = 0.9: First moment decay (rarely changed)
  • β₂ = 0.999: Second moment decay (rarely changed)
  • ε = 1e-8: Numerical stability

Tuning guidelines:

  1. Start with defaults
  2. If unstable: reduce η to 0.0001
  3. If slow: increase η to 0.01
  4. Adjust β₁ for more/less momentum (rarely needed)

aprender Implementation

use aprender::optim::{Adam, Optimizer};

// Create Adam optimizer with default hyperparameters
let mut adam = Adam::new(0.001)  // learning rate
    .with_beta1(0.9)             // optional: momentum coefficient
    .with_beta2(0.999)           // optional: RMSprop coefficient
    .with_epsilon(1e-8);         // optional: numerical stability

// Training loop (`model`, `data_loader`, `loss_fn`, and `compute_gradients`
// below are placeholders for your own model and data pipeline)
for epoch in 0..num_epochs {
    for batch in data_loader {
        // Forward pass
        let predictions = model.predict(&batch.x);
        let loss = loss_fn(predictions, batch.y);

        // Compute gradients
        let grads = compute_gradients(&model, &batch);

        // Adam step (handles momentum + adaptive lr internally)
        adam.step(&mut model.params, &grads);
    }
}

Key methods:

  • Adam::new(η): Create with learning rate
  • with_beta1(β₁), with_beta2(β₂): Set moment decay rates
  • step(&mut params, &grads): Perform one update step
  • reset(): Reset moment buffers (for multiple training runs)

Advantages

  1. Robust: Works well with default hyperparameters
  2. Fast convergence: Combines momentum + adaptive lr
  3. Memory efficient: Only 2x parameter memory (m and v)
  4. Well-studied: Extensive empirical validation

Disadvantages

  1. Can overfit: On small datasets or with insufficient regularization
  2. Generalization: Sometimes SGD with momentum generalizes better
  3. Memory overhead: stores m and v (2x extra parameter memory, vs. 1x for SGD with momentum)

When to Use

  • Default choice: For most deep learning problems
  • Fast prototyping: Converges quickly, minimal tuning
  • Large-scale training: Handles high-dimensional problems well

When to avoid:

  • Very small datasets (<1000 samples): Try SGD + momentum
  • Need best generalization: Consider SGD with learning rate schedule

AdamW: Adam with Decoupled Weight Decay

Problem with Adam: Weight decay (L2 regularization) interacts badly with adaptive learning rates.

Solution: Decouple weight decay from gradient-based optimization.

Standard Adam with Weight Decay (Wrong)

g_t = ∇L(θ_{t-1}) + λ·θ_{t-1}  // Add L2 penalty to gradient
// ... normal Adam update with modified gradient

Problem: the weight-decay term is scaled by 1/√v̂ along with the gradient, so parameters with a large gradient history receive weaker regularization than intended.

AdamW (Correct)

// Normal Adam update (no λ in gradient)
m_t = β₁·m_{t-1} + (1-β₁)·g_t
v_t = β₂·v_{t-1} + (1-β₂)·(g_t ⊙ g_t)
m̂_t = m_t / (1 - β₁^t)
v̂_t = v_t / (1 - β₂^t)

// Separate weight decay step
θ_t = θ_{t-1} - η · (m̂_t / (√v̂_t + ε) + λ·θ_{t-1})

Weight decay is applied directly to the parameters rather than being scaled by the adaptive learning rate.
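
As a from-scratch sketch (the summary table notes AdamW is not yet in aprender), the only change relative to the Adam step shown earlier is the decoupled decay term added inside the update; names are illustrative.

/// One AdamW step: identical to Adam except weight decay is applied directly to θ,
/// outside the adaptive scaling by √v̂.
fn adamw_step(
    params: &mut [f32], grads: &[f32],
    m: &mut [f32], v: &mut [f32], t: i32,
    lr: f32, beta1: f32, beta2: f32, eps: f32, weight_decay: f32,
) {
    for i in 0..params.len() {
        m[i] = beta1 * m[i] + (1.0 - beta1) * grads[i];
        v[i] = beta2 * v[i] + (1.0 - beta2) * grads[i] * grads[i];
        let m_hat = m[i] / (1.0 - beta1.powi(t));
        let v_hat = v[i] / (1.0 - beta2.powi(t));
        // θ_t = θ_{t-1} - η·(m̂/(√v̂+ε) + λ·θ_{t-1})
        params[i] -= lr * (m_hat / (v_hat.sqrt() + eps) + weight_decay * params[i]);
    }
}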

When to Use

  • Transformers: Essential for BERT, GPT models
  • Large models: Better generalization on big networks
  • Transfer learning: Fine-tuning pre-trained models

Hyperparameters:

  • Same as Adam, plus:
  • λ = 0.01: Weight decay coefficient (typical)

Optimizer Selection Guide

Decision Tree

Start
  │
  ├─ Need fast prototyping?
  │    └─ YES → Adam (default: η=0.001)
  │
  ├─ Training RNN/LSTM?
  │    └─ YES → RMSprop (default: η=0.001, β=0.9)
  │
  ├─ Working with transformers?
  │    └─ YES → AdamW (η=0.001, λ=0.01)
  │
  ├─ Sparse gradients (NLP embeddings)?
  │    └─ YES → AdaGrad (η=0.01)
  │
  ├─ Need best generalization?
  │    └─ YES → SGD + momentum (η=0.1, β=0.9) + lr schedule
  │
  └─ Small dataset (<1000 samples)?
       └─ YES → SGD + momentum (less overfitting)

Practical Recommendations

| Task | Recommended Optimizer | Learning Rate | Notes |
|------|-----------------------|---------------|-------|
| Image classification (CNN) | Adam or SGD+momentum | 0.001 (Adam), 0.1 (SGD) | SGD often better final accuracy |
| NLP (word embeddings) | AdaGrad or Adam | 0.01 (AdaGrad), 0.001 (Adam) | AdaGrad for sparse features |
| RNN/LSTM | RMSprop or Adam | 0.001 | RMSprop traditional choice |
| Transformers | AdamW | 0.0001-0.001 | Essential for BERT, GPT |
| Small dataset | SGD + momentum | 0.01-0.1 | Less prone to overfitting |
| Reinforcement learning | Adam or RMSprop | 0.0001-0.001 | Non-stationary problem |

Learning Rate Schedules

Even with adaptive optimizers, learning rate schedules improve performance.

1. Step Decay

Reduce η by factor every K epochs:

let initial_lr = 0.001;
let decay_factor = 0.1;
let decay_epochs = 30;

for epoch in 0..num_epochs {
    let lr = initial_lr * decay_factor.powi((epoch / decay_epochs) as i32);
    // Note: constructing a fresh Adam each epoch (as in this sketch) also resets its moment
    // buffers; in practice, keep one optimizer and update its learning rate if the API allows.
    let mut adam = Adam::new(lr);
    // ... training
}

2. Exponential Decay

Smooth exponential reduction:

let initial_lr = 0.001;
let decay_rate = 0.96;

for epoch in 0..num_epochs {
    let lr = initial_lr * decay_rate.powi(epoch as i32);
    let mut adam = Adam::new(lr);
    // ... training
}

3. Cosine Annealing

Smooth reduction following cosine curve:

use std::f32::consts::PI;

let lr_max = 0.001;
let lr_min = 0.00001;
let T_max = 100; // schedule length: lr goes from lr_max to lr_min over T_max epochs

for epoch in 0..num_epochs {
    let lr = lr_min + 0.5 * (lr_max - lr_min) *
             (1.0 + f32::cos(PI * (epoch as f32) / (T_max as f32)));
    let mut adam = Adam::new(lr);
    // ... training
}

4. Warm-up + Decay

Start small, increase, then decay (used in transformers):

fn learning_rate_schedule(step: usize, d_model: usize, warmup_steps: usize) -> f32 {
    // Noam-style transformer schedule; call with step >= 1 (step = 0 would divide by zero below).
    let d_model = d_model as f32;
    let step = step as f32;
    let warmup_steps = warmup_steps as f32;

    let arg1 = 1.0 / step.sqrt();
    let arg2 = step * warmup_steps.powf(-1.5);

    d_model.powf(-0.5) * arg1.min(arg2)
}
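
As a usage sketch, the values below (d_model = 512, warmup_steps = 4000) are illustrative rather than prescribed by this chapter; the learning rate rises roughly linearly during warm-up and then decays as 1/√step.

fn main() {
    // Print the schedule at a few representative steps.
    for step in [1_usize, 1000, 4000, 10_000, 100_000] {
        let lr = learning_rate_schedule(step, 512, 4000);
        println!("step {:>6}: lr ≈ {:.6}", step, lr);
    }
}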

Comparison: SGD vs Adam

When SGD + Momentum is Better

Advantages:

  • Better final generalization (lower test error)
  • Flatter minima (more robust to perturbations)
  • Less memory (no moment estimates)

Requirements:

  • Careful learning rate tuning
  • Learning rate schedule essential
  • More training time may be needed

When Adam is Better

Advantages:

  • Faster initial convergence
  • Minimal hyperparameter tuning
  • Works across many problem types

Trade-offs:

  • Can overfit more easily
  • May find sharper minima
  • Slightly worse generalization on some tasks

Empirical Rule

Adam for:

  • Fast prototyping and experimentation
  • Baseline models
  • Large-scale problems (many parameters)

SGD + momentum for:

  • Final production models (after tuning)
  • When computational budget allows careful tuning
  • Small to medium datasets

Debugging Optimizer Issues

Loss Not Decreasing

Possible causes:

  1. Learning rate too small
    • Fix: Increase η by 10x
  2. Vanishing gradients
    • Fix: Check gradient norms, adjust architecture
  3. Bug in gradient computation
    • Fix: Use gradient checking (see the sketch after this list)
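
A simple way to catch gradient bugs is a central finite-difference check: perturb each parameter slightly, re-evaluate the loss, and compare the numerical slope with the analytic gradient. A minimal sketch, assuming the loss can be evaluated as a closure over a flat parameter vector (not an aprender API; names are illustrative):

/// Compare analytic gradients against central finite differences and report large mismatches.
fn gradient_check<F: Fn(&[f32]) -> f32>(params: &[f32], analytic_grads: &[f32], loss_fn: F) {
    let h = 1e-3_f32;
    let mut p = params.to_vec();
    for i in 0..p.len() {
        let orig = p[i];
        p[i] = orig + h;
        let loss_plus = loss_fn(p.as_slice());
        p[i] = orig - h;
        let loss_minus = loss_fn(p.as_slice());
        p[i] = orig;
        let numeric = (loss_plus - loss_minus) / (2.0 * h);
        let rel_err = (numeric - analytic_grads[i]).abs()
            / (numeric.abs() + analytic_grads[i].abs() + 1e-8);
        if rel_err > 1e-2 {
            println!("param {}: analytic {} vs numeric {} (rel err {:.3})",
                     i, analytic_grads[i], numeric, rel_err);
        }
    }
}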

Loss Exploding (NaN)

Possible causes:

  1. Learning rate too large
    • Fix: Reduce η by 10x
  2. Gradient explosion
    • Fix: Gradient clipping (sketched after this list), better initialization
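
Gradient clipping by global norm is straightforward to add just before the optimizer step. A minimal sketch (not an aprender API; names are illustrative):

/// Scale `grads` in place so their global L2 norm does not exceed `max_norm`.
fn clip_grad_norm(grads: &mut [f32], max_norm: f32) {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / (norm + 1e-8);
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
}

In a training loop, call clip_grad_norm(&mut grads, max_norm) after computing gradients and before the optimizer step; max_norm = 1.0 is a common starting point.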

Slow Convergence

Possible causes:

  1. Poor learning rate
    • Fix: Try different optimizer (Adam if using SGD)
  2. No momentum
    • Fix: Add momentum (β=0.9)
  3. Suboptimal batch size
    • Fix: Try 32, 64, 128

Overfitting

Possible causes:

  1. Optimizer too aggressive (Adam on small data)
    • Fix: Switch to SGD + momentum
  2. No regularization
    • Fix: Add weight decay (AdamW)

aprender Optimizer Example

use aprender::optim::{Adam, SGD, Optimizer};
use aprender::linear_model::LogisticRegression;
use aprender::prelude::*;

// Example: Comparing Adam vs SGD
fn compare_optimizers(x_train: &Matrix<f32>, y_train: &Vector<i32>) {
    // Optimizer 1: Adam (fast convergence)
    let mut model_adam = LogisticRegression::new();
    let mut adam = Adam::new(0.001);

    println!("Training with Adam...");
    for epoch in 0..50 {
        let loss = train_epoch(&mut model_adam, x_train, y_train, &mut adam);
        if epoch % 10 == 0 {
            println!("  Epoch {}: Loss = {:.4}", epoch, loss);
        }
    }

    // Optimizer 2: SGD + momentum (better generalization)
    let mut model_sgd = LogisticRegression::new();
    let mut sgd = SGD::new(0.1).with_momentum(0.9);

    println!("\nTraining with SGD + Momentum...");
    for epoch in 0..50 {
        let loss = train_epoch(&mut model_sgd, x_train, y_train, &mut sgd);
        if epoch % 10 == 0 {
            println!("  Epoch {}: Loss = {:.4}", epoch, loss);
        }
    }
}

fn train_epoch<O: Optimizer>(
    model: &mut LogisticRegression,
    x: &Matrix<f32>,
    y: &Vector<i32>,
    optimizer: &mut O,
) -> f32 {
    // Compute loss and gradients (compute_cross_entropy_loss and compute_gradients
    // are placeholders for your own loss and gradient code)
    let predictions = model.predict_proba(x);
    let loss = compute_cross_entropy_loss(&predictions, y);
    let grads = compute_gradients(model, x, y);

    // Update parameters
    optimizer.step(&mut model.coefficients_mut(), &grads);

    loss
}

Further Reading

Seminal Papers:

  • Adam: Kingma & Ba (2015). "Adam: A Method for Stochastic Optimization"
  • AdamW: Loshchilov & Hutter (2019). "Decoupled Weight Decay Regularization"
  • RMSprop: Tieleman & Hinton (2012). Lecture 6.5, Coursera course "Neural Networks for Machine Learning" (unpublished)
  • AdaGrad: Duchi et al. (2011). "Adaptive Subgradient Methods"

Practical Guides:

  • Ruder, S. (2016). "An overview of gradient descent optimization algorithms"
  • CS231n Stanford: Optimization notes

Summary

| Optimizer | Core Innovation | When to Use | aprender Support |
|-----------|-----------------|-------------|------------------|
| AdaGrad | Per-parameter learning rates | Sparse gradients, convex problems | Not yet (v0.5.0) |
| RMSprop | Exponential moving average of squared gradients | RNNs, non-stationary | Not yet (v0.5.0) |
| Adam | Momentum + RMSprop + bias correction | Default choice, deep learning | ✅ Implemented |
| AdamW | Adam + decoupled weight decay | Transformers, large models | Not yet (v0.5.0) |

Key Takeaways:

  1. Adam is the default for most deep learning: fast, robust, minimal tuning
  2. SGD + momentum often achieves better final accuracy with proper tuning
  3. Learning rate schedules improve all optimizers
  4. AdamW essential for training transformers
  5. Monitor convergence: Loss curves reveal optimizer issues

Modern optimizers dramatically accelerate machine learning by adapting learning rates automatically. Understanding their trade-offs enables choosing the right tool for each problem.