Gradient Descent Theory

Gradient descent is the fundamental optimization algorithm used to train machine learning models. It iteratively adjusts model parameters to minimize a loss function by following the direction of steepest descent.

Mathematical Foundation

The Core Idea

Given a differentiable loss function L(θ) where θ represents model parameters, gradient descent finds parameters that minimize the loss:

θ* = argmin_θ L(θ)

The algorithm works by repeatedly taking steps proportional to the negative gradient of the loss function:

θ(t+1) = θ(t) - η ∇L(θ(t))

Where:

  • θ(t): Parameters at iteration t
  • η: Learning rate (step size)
  • ∇L(θ(t)): Gradient of loss with respect to parameters
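
As a minimal sketch of this update rule on plain slices (no aprender types assumed):

// One vanilla gradient descent step: θ ← θ - η ∇L(θ).
fn gd_step(theta: &mut [f32], grad: &[f32], eta: f32) {
    for (t, g) in theta.iter_mut().zip(grad.iter()) {
        *t -= eta * g;
    }
}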

Why the Negative Gradient?

The gradient ∇L(θ) points in the direction of steepest ascent (maximum increase in loss). Stepping along the negative gradient therefore follows the direction of steepest descent, which locally decreases the loss as fast as possible.

Intuition: Imagine standing on a mountain in thick fog. You can feel the slope beneath your feet but can't see the valley. Gradient descent is like repeatedly taking a step in the direction that slopes most steeply downward.

Variants of Gradient Descent

1. Batch Gradient Descent (BGD)

Computes the gradient using the entire training dataset:

∇L(θ) = (1/N) Σ_{i=1..N} ∇L_i(θ)

Advantages:

  • Stable convergence (smooth trajectory)
  • Guaranteed to converge to the global minimum on convex functions (given a suitable learning rate)
  • Theoretical guarantees

Disadvantages:

  • Slow for large datasets (must process all samples)
  • Memory intensive
  • May converge to poor local minima (non-convex functions)

When to use: Small datasets (N < 10,000), convex optimization problems
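
To make the full-batch gradient concrete, here is a sketch for a 1-D linear model y ≈ w·x under MSE, using ∇_w (1/N) Σ (w·x_i - y_i)² = (2/N) Σ (w·x_i - y_i) x_i (a hand-rolled helper, not aprender API):

// Full-batch MSE gradient for y ≈ w * x, averaged over all N samples.
fn batch_gradient(w: f32, xs: &[f32], ys: &[f32]) -> f32 {
    let n = xs.len() as f32;
    xs.iter()
        .zip(ys)
        .map(|(&x, &y)| 2.0 * (w * x - y) * x)
        .sum::<f32>()
        / n
}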

2. Stochastic Gradient Descent (SGD)

Computes gradient using a single random sample at each iteration:

∇L(θ) ≈ ∇L_i(θ)    where i ~ Uniform(1..N)

Advantages:

  • Fast updates (one sample per iteration)
  • Can escape shallow local minima (noise helps exploration)
  • Memory efficient
  • Online learning capable

Disadvantages:

  • Noisy convergence (zig-zagging trajectory)
  • May not converge exactly to minimum
  • Requires a decaying learning rate to settle at the minimum

When to use: Large datasets, online learning, non-convex optimization
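
A bare-bones single-sample loop, sketched without the library (grad_of_sample is a hypothetical per-sample gradient helper; index sampling uses the rand crate):

use rand::Rng;

let mut rng = rand::thread_rng();
for _step in 0..num_steps {
    let i = rng.gen_range(0..n_samples);        // i ~ Uniform over samples
    let g = grad_of_sample(&params, &data[i]);  // ∇L_i(θ), hypothetical helper
    for (p, gi) in params.iter_mut().zip(&g) {
        *p -= eta * gi;                         // θ ← θ - η ∇L_i(θ)
    }
}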

aprender implementation:

use aprender::optim::SGD;

let mut optimizer = SGD::new(0.01) // learning rate = 0.01
    .with_momentum(0.9);           // momentum coefficient

// In training loop:
let gradients = compute_gradients(&params, &data);
optimizer.step(&mut params, &gradients);

3. Mini-Batch Gradient Descent

Computes gradient using a small batch of samples (typically 32-256):

∇L(θ) ≈ (1/B) Σ_{i∈batch} ∇L_i(θ)    where |batch| = B << N

Advantages:

  • Balance between BGD stability and SGD speed
  • Vectorized operations (GPU/SIMD acceleration)
  • Reduced variance compared to SGD
  • Memory efficient

Disadvantages:

  • Batch size is a hyperparameter to tune
  • Still has some noise

When to use: Default choice for most ML problems, deep learning

Batch size guidelines:

  • Small batches (32-64): Better generalization, more noise
  • Large batches (128-512): Faster convergence, more stable
  • Powers of 2: Better hardware utilization
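
One epoch of mini-batch iteration might look like this sketch (compute_batch_gradients is a hypothetical helper; optimizer and params are as in the SGD snippet above; shuffling uses the rand crate):

use rand::seq::SliceRandom;

let batch_size = 32;
let mut indices: Vec<usize> = (0..n_samples).collect();
indices.shuffle(&mut rand::thread_rng());  // new sample order each epoch

for batch in indices.chunks(batch_size) {
    // Gradient averaged over this batch only, then one update step.
    let grads = compute_batch_gradients(&params, &data, batch);
    optimizer.step(&mut params, &grads);
}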

The Learning Rate

The learning rate η is the most critical hyperparameter in gradient descent.

Too Small Learning Rate

η = 0.001 (very small)

Loss over time:
1000 ┤
 900 ┤
 800 ┤
 700 ┤●
 600 ┤ ●
 500 ┤  ●
 400 ┤   ●●●●●●●●●●●●●●●  ← Slow convergence
     └─────────────────────→
        Iterations (10,000)

Problem: Training is very slow and may not converge within the time budget.

Too Large Learning Rate

η = 1.0 (very large)

Loss over time:
1000 ┤    ●
 900 ┤   ● ●
 800 ┤  ●   ●
 700 ┤ ●     ●  ← Oscillation
 600 ┤●       ●
     └──────────→
      Iterations

Problem: The loss oscillates or diverges and never settles at the minimum.

Optimal Learning Rate

η = 0.1 (just right)

Loss over time:
1000 ┤●
 800 ┤ ●●
 600 ┤    ●●●
 400 ┤       ●●●●  ← Smooth, fast convergence
 200 ┤           ●●●●
     └───────────────→
         Iterations

Guidelines for choosing η:

  1. Start with η = 0.1 and adjust by factors of 10
  2. Use learning rate schedules (decay over time)
  3. Monitor loss: if exploding → reduce η; if stagnating → increase η
  4. Try adaptive methods (Adam, RMSprop) that auto-tune η

Convergence Analysis

Convex Functions

For convex loss functions (e.g., linear regression with MSE), gradient descent with fixed learning rate converges to the global minimum:

L(θ(t)) - L(θ*) ≤ C / t

Where C is a constant depending on the initial distance to θ* and the smoothness of L. The gap to the optimal loss decreases as 1/t.

Convergence rate: O(1/t) for fixed learning rate

Non-Convex Functions

For non-convex functions (e.g., neural networks), gradient descent may converge to:

  • Local minimum
  • Saddle point
  • Plateau region

No guarantees of finding the global minimum, but SGD's noise helps escape poor local minima.

Stopping Criteria

When to stop iterating?

  1. Gradient magnitude: Stop when ||∇L(θ)|| < ε
    • ε = 1e-4 is a typical threshold (see the norm-check sketch after the example below)
  2. Loss change: Stop when |L(t) - L(t-1)| < ε
    • Measures improvement per iteration
  3. Maximum iterations: Stop after T iterations
    • Prevents infinite loops
  4. Validation loss: Stop when validation loss stops improving
    • Prevents overfitting (early stopping)

aprender example:

// SGD with convergence monitoring
let mut optimizer = SGD::new(0.01);
let mut prev_loss = f32::INFINITY;
let tolerance = 1e-4;

for epoch in 0..max_epochs {
    let loss = compute_loss(&model, &data);

    // Check convergence
    if (prev_loss - loss).abs() < tolerance {
        println!("Converged at epoch {}", epoch);
        break;
    }

    let gradients = compute_gradients(&model, &data);
    optimizer.step(&mut model.params, &gradients);
    prev_loss = loss;
}
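
The gradient-magnitude criterion (1) can be checked in the same loop; a small sketch:

// L2 norm of the gradient, for the ||∇L(θ)|| < ε stopping test.
fn grad_norm(grads: &[f32]) -> f32 {
    grads.iter().map(|g| g * g).sum::<f32>().sqrt()
}

// Inside the training loop, after computing `gradients`:
if grad_norm(&gradients) < 1e-4 {
    println!("Gradient norm below tolerance at epoch {}", epoch);
    break;
}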

Common Pitfalls and Solutions

1. Exploding Gradients

Problem: Gradients become very large, causing parameters to explode.

Symptoms:

  • Loss becomes NaN or infinity
  • Parameters grow to extreme values
  • Occurs in deep networks or RNNs

Solutions:

  • Reduce learning rate
  • Gradient clipping: rescale g so that ||g|| ≤ threshold (see the sketch after this list)
  • Use batch normalization
  • Better initialization (Xavier, He)
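
A sketch of norm-based clipping (element-wise clamping is a simpler alternative):

// Rescale `grad` in place so its L2 norm is at most `max_norm`.
fn clip_by_norm(grad: &mut [f32], max_norm: f32) {
    let norm = grad.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grad.iter_mut() {
            *g *= scale;
        }
    }
}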

2. Vanishing Gradients

Problem: Gradients become very small, preventing parameter updates.

Symptoms:

  • Loss stops decreasing but hasn't converged
  • Parameters barely change
  • Occurs in very deep networks

Solutions:

  • Use ReLU activation (instead of sigmoid/tanh)
  • Skip connections (ResNet architecture)
  • Batch normalization
  • Better initialization

3. Learning Rate Decay

Strategy: Start with large learning rate, gradually decrease it.

Common schedules:

// 1. Step decay: Reduce by factor every K epochs
η(t) = η₀ × 0.1^(floor(t / K))

// 2. Exponential decay: Smooth reduction
η(t) = η₀ × e^(-λt)

// 3. 1/t decay: Theoretical convergence guarantee
η(t) = η₀ / (1 + λt)

// 4. Cosine annealing: Cyclical with restarts
η(t) = η_min + 0.5(η_max - η_min)(1 + cos(πt/T))
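
The same four schedules written as plain Rust functions of the epoch t (parameter names mirror the formulas above):

fn step_decay(eta0: f32, k: u32, t: u32) -> f32 {
    eta0 * 0.1_f32.powi((t / k) as i32)  // floor(t / K) via integer division
}

fn exponential_decay(eta0: f32, lambda: f32, t: u32) -> f32 {
    eta0 * (-lambda * t as f32).exp()
}

fn inverse_time_decay(eta0: f32, lambda: f32, t: u32) -> f32 {
    eta0 / (1.0 + lambda * t as f32)
}

fn cosine_annealing(eta_min: f32, eta_max: f32, t: u32, t_max: u32) -> f32 {
    eta_min
        + 0.5 * (eta_max - eta_min)
            * (1.0 + (std::f32::consts::PI * t as f32 / t_max as f32).cos())
}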

aprender pattern (manual implementation):

let initial_lr: f32 = 0.1;
let decay_rate: f32 = 0.95;

for epoch in 0..num_epochs {
    // Exponential decay: lr = initial_lr * decay_rate^epoch
    let lr = initial_lr * decay_rate.powi(epoch as i32);
    // Note: rebuilding the optimizer each epoch discards any momentum
    // state, so this pattern suits plain SGD without momentum.
    let mut optimizer = SGD::new(lr);

    // Training step
    optimizer.step(&mut params, &gradients);
}

4. Saddle Points

Problem: Gradient is zero but point is not a minimum.

Surface shape at saddle point:
    ╱╲    (upward in one direction)
   ╱  ╲
  ╱    ╲
 ╱______╲  (downward in another)

Solutions:

  • Add momentum (helps escape saddle points)
  • Use SGD noise to explore
  • Second-order methods (Newton, L-BFGS)

Momentum Enhancement

Standard SGD can be slow in regions with:

  • High curvature (steep in some directions, flat in others)
  • Noisy gradients

Momentum accelerates convergence by accumulating past gradients:

v(t) = βv(t-1) + ∇L(θ(t))      // Velocity accumulation
θ(t+1) = θ(t) - η v(t)          // Update with velocity

Where β ∈ [0, 1] is the momentum coefficient (typically 0.9).

Effect:

  • Smooths out noisy gradients
  • Accelerates in consistent directions
  • Dampens oscillations

Analogy: A ball rolling down a hill builds momentum and doesn't stop at small bumps.
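
Written out by hand, the two update lines translate directly (a sketch; the library applies the same updates inside step() when momentum is set):

// One SGD-with-momentum step: v ← βv + ∇L(θ); θ ← θ - ηv.
fn momentum_step(theta: &mut [f32], velocity: &mut [f32], grad: &[f32], eta: f32, beta: f32) {
    for i in 0..theta.len() {
        velocity[i] = beta * velocity[i] + grad[i];  // accumulate velocity
        theta[i] -= eta * velocity[i];               // update with velocity
    }
}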

aprender implementation:

let mut optimizer = SGD::new(0.01)
    .with_momentum(0.9);  // β = 0.9

// Momentum is applied automatically in step()
optimizer.step(&mut params, &gradients);

Practical Guidelines

Choosing Gradient Descent Variant

Dataset Size     Recommendation            Reason
N < 1,000        Batch GD                  Fast enough, stable convergence
N = 1K-100K      Mini-batch GD (32-128)    Good balance
N > 100K         Mini-batch GD (128-512)   Leverage vectorization
Streaming data   SGD                       Online learning required

Hyperparameter Tuning Checklist

  1. Learning rate η:
    • Start: 0.1
    • Grid search: [0.001, 0.01, 0.1, 1.0] (see the sketch after this list)
    • Use a learning rate finder
  2. Momentum β:
    • Default: 0.9
    • Range: [0.5, 0.9, 0.99]
  3. Batch size B:
    • Default: 32 or 64
    • Range: [16, 32, 64, 128, 256]
    • Powers of 2 for hardware efficiency
  4. Learning rate schedule:
    • Option 1: Fixed (simple baseline)
    • Option 2: Step decay every 10-30 epochs
    • Option 3: Cosine annealing (state-of-the-art)
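
A minimal grid search over η and β, as referenced in item 1 (train_and_evaluate stands in for your own training-plus-validation routine):

let learning_rates = [0.001_f32, 0.01, 0.1, 1.0];
let momenta = [0.5_f32, 0.9, 0.99];
let mut best = (f32::INFINITY, 0.0, 0.0);

for &lr in &learning_rates {
    for &beta in &momenta {
        let val_loss = train_and_evaluate(lr, beta);  // hypothetical helper
        if val_loss < best.0 {
            best = (val_loss, lr, beta);
        }
    }
}
println!("Best: lr = {}, momentum = {}, val loss = {}", best.1, best.2, best.0);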

Debugging Convergence Issues

Loss increasing: Learning rate too large → Reduce η by 10x

Loss stagnating: Learning rate too small or stuck in local minimum → Increase η by 2x or add momentum

Loss NaN: Exploding gradients → Reduce η, clip gradients, check data preprocessing

Slow convergence: Poor learning rate or no momentum → Use adaptive optimizer (Adam), add momentum

Connection to aprender

The aprender::optim::SGD implementation provides:

use aprender::optim::{SGD, Optimizer};

// Create SGD optimizer
let mut sgd = SGD::new(learning_rate)
    .with_momentum(momentum_coef);

// In training loop:
for epoch in 0..num_epochs {
    for batch in data_loader {
        // 1. Forward pass
        let predictions = model.predict(&batch.x);

        // 2. Compute loss and gradients
        let loss = loss_fn(predictions, batch.y);
        let grads = compute_gradients(&model, &batch);

        // 3. Update parameters using gradient descent
        sgd.step(&mut model.params, &grads);
    }
}

Key methods:

  • SGD::new(η): Create optimizer with learning rate
  • with_momentum(β): Add momentum coefficient
  • step(&mut params, &grads): Perform one gradient descent step
  • reset(): Reset momentum buffers

Further Reading

  • Theory: Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent"
  • Momentum: Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods"
  • Adaptive methods: See Advanced Optimizers Theory

Summary

Concept           Key Takeaway
Core algorithm    θ(t+1) = θ(t) - η ∇L(θ(t))
Learning rate     Most critical hyperparameter; start with 0.1
Variants          Batch (stable), SGD (fast), Mini-batch (best of both)
Momentum          Accelerates convergence, smooths gradients
Convergence       Guaranteed for convex functions with proper η
Debugging         Loss ↑ → reduce η; Loss flat → increase η or add momentum

Gradient descent is the workhorse of machine learning optimization. Understanding its variants, hyperparameters, and convergence properties is essential for training effective models.