Gradient Descent Theory
Gradient descent is the fundamental optimization algorithm used to train machine learning models. It iteratively adjusts model parameters to minimize a loss function by following the direction of steepest descent.
Mathematical Foundation
The Core Idea
Given a differentiable loss function L(θ) where θ represents model parameters, gradient descent finds parameters that minimize the loss:
θ* = argmin_θ L(θ)
The algorithm works by repeatedly taking steps proportional to the negative gradient of the loss function:
θ(t+1) = θ(t) - η ∇L(θ(t))
Where:
- θ(t): Parameters at iteration t
- η: Learning rate (step size)
- ∇L(θ(t)): Gradient of loss with respect to parameters
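To make the update rule concrete, here is a minimal self-contained sketch in plain Rust (independent of aprender) that minimizes the 1-D quadratic L(θ) = (θ - 3)², whose gradient is 2(θ - 3):

fn main() {
    let eta = 0.1_f32;       // learning rate η
    let mut theta = 0.0_f32; // initial parameter θ(0)

    for t in 0..50 {
        let grad = 2.0 * (theta - 3.0); // ∇L(θ) for L(θ) = (θ - 3)²
        theta -= eta * grad;            // θ(t+1) = θ(t) - η ∇L(θ(t))
        if t % 10 == 0 {
            println!("t = {t:2}, θ = {theta:.4}");
        }
    }
    // θ converges toward the minimizer θ* = 3
}

Each iteration applies exactly the update rule above; on this function, η = 0.1 gives θ(t+1) - 3 = (1 - 2η)(θ(t) - 3), a contraction factor of 0.8 per step.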
Why the Negative Gradient?
The gradient ∇L(θ) points in the direction of steepest ascent (maximum local increase in loss). Moving along the negative gradient is therefore the direction of steepest descent, which locally decreases the loss fastest.
Intuition: Imagine standing on a mountain in thick fog. You can feel the slope beneath your feet but can't see the valley. Gradient descent is like repeatedly taking a step in the direction that slopes most steeply downward.
Variants of Gradient Descent
1. Batch Gradient Descent (BGD)
Computes the gradient using the entire training dataset:
∇L(θ) = (1/N) Σ_{i=1..N} ∇L_i(θ)
Advantages:
- Stable convergence (smooth trajectory)
- Guaranteed to converge to global minimum (convex functions)
- Theoretical guarantees
Disadvantages:
- Slow for large datasets (must process all samples)
- Memory intensive
- May converge to poor local minima (non-convex functions)
When to use: Small datasets (N < 10,000), convex optimization problems
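For illustration, a self-contained sketch of batch gradient descent on 1-D linear regression with MSE (plain Rust; the toy data below is made up):

fn main() {
    // Hypothetical toy dataset: y ≈ 2x
    let xs = [1.0_f32, 2.0, 3.0, 4.0];
    let ys = [2.1_f32, 3.9, 6.2, 8.0];
    let n = xs.len() as f32;

    let eta = 0.01_f32;
    let mut w = 0.0_f32;

    for _ in 0..100 {
        // Full-batch gradient of MSE: (2/N) Σ x_i (w x_i - y_i)
        let grad: f32 = xs
            .iter()
            .zip(&ys)
            .map(|(x, y)| 2.0 * x * (w * x - y))
            .sum::<f32>()
            / n;
        w -= eta * grad;
    }
    println!("fitted slope w ≈ {w:.3}"); // close to 2
}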
2. Stochastic Gradient Descent (SGD)
Computes gradient using a single random sample at each iteration:
∇L(θ) ≈ ∇L_i(θ) where i ~ Uniform(1..N)
Advantages:
- Fast updates (one sample per iteration)
- Can escape shallow local minima (noise helps exploration)
- Memory efficient
- Online learning capable
Disadvantages:
- Noisy convergence (zig-zagging trajectory)
- May not converge exactly to minimum
- Requires learning rate decay
When to use: Large datasets, online learning, non-convex optimization
aprender implementation:
use aprender::optim::SGD;

let mut optimizer = SGD::new(0.01)  // learning rate η = 0.01
    .with_momentum(0.9);            // momentum coefficient β = 0.9

// In training loop:
let gradients = compute_gradients(&params, &data);
optimizer.step(&mut params, &gradients);
3. Mini-Batch Gradient Descent
Computes gradient using a small batch of samples (typically 32-256):
∇L(θ) ≈ (1/B) Σ_{i∈batch} ∇L_i(θ), where B << N
Advantages:
- Balance between BGD stability and SGD speed
- Vectorized operations (GPU/SIMD acceleration)
- Reduced variance compared to SGD
- Memory efficient
Disadvantages:
- Batch size is a hyperparameter to tune
- Still has some noise
When to use: Default choice for most ML problems, deep learning
Batch size guidelines:
- Small batches (32-64): Better generalization, more noise
- Large batches (128-512): Faster convergence, more stable
- Powers of 2: Better hardware utilization
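A from-scratch sketch of the mini-batch loop (plain Rust, hypothetical toy data; in practice you would also shuffle the data each epoch):

fn main() {
    // Hypothetical toy dataset: y = 2x, x ∈ [0.1, 10.0]
    let data: Vec<(f32, f32)> = (1..=100)
        .map(|i| (i as f32 / 10.0, 2.0 * (i as f32) / 10.0))
        .collect();

    let batch_size = 32;
    let eta = 0.005_f32;
    let mut w = 0.0_f32;

    for _epoch in 0..50 {
        for batch in data.chunks(batch_size) {
            // Mini-batch gradient: (1/B) Σ_{i∈batch} 2 x_i (w x_i - y_i)
            let grad: f32 = batch
                .iter()
                .map(|(x, y)| 2.0 * x * (w * x - y))
                .sum::<f32>()
                / batch.len() as f32;
            w -= eta * grad; // θ(t+1) = θ(t) - η ∇L_batch(θ(t))
        }
    }
    println!("fitted slope w ≈ {w:.3}"); // approaches 2
}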
The Learning Rate
The learning rate η is the most critical hyperparameter in gradient descent.
Too Small Learning Rate
η = 0.001 (very small)
Loss over time:
1000 ┤
900 ┤
800 ┤
700 ┤●
600 ┤ ●
500 ┤ ●
400 ┤ ●●●●●●●●●●●●●●● ← Slow convergence
└─────────────────────→
Iterations (10,000)
Problem: Training is very slow and may not converge within the time budget.
Too Large Learning Rate
η = 1.0 (very large)
Loss over time:
1000 ┤ ●
900 ┤ ● ●
800 ┤ ● ●
700 ┤ ● ● ← Oscillation
600 ┤● ●
└──────────→
Iterations
Problem: Loss oscillates or diverges and never reaches the minimum.
Optimal Learning Rate
η = 0.1 (just right)
Loss over time:
1000 ┤●
800 ┤ ●●
600 ┤ ●●●
400 ┤ ●●●● ← Smooth, fast convergence
200 ┤ ●●●●
└───────────────→
Iterations
Guidelines for choosing η:
- Start with η = 0.1 and adjust by factors of 10
- Use learning rate schedules (decay over time)
- Monitor loss: if exploding → reduce η; if stagnating → increase η
- Try adaptive methods (Adam, RMSprop) that auto-tune η
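These regimes are easy to reproduce yourself; the sketch below (plain Rust) reruns the earlier quadratic example under several learning rates and prints the final loss for each:

fn main() {
    // Try several learning rates on L(θ) = (θ - 3)², gradient 2(θ - 3)
    for &eta in &[0.001_f32, 0.01, 0.1, 1.0] {
        let mut theta = 0.0_f32;
        for _ in 0..100 {
            theta -= eta * 2.0 * (theta - 3.0);
        }
        let loss = (theta - 3.0).powi(2);
        println!("η = {eta:<6} final loss = {loss:.6}");
    }
    // Small η barely moves; η = 1.0 overshoots (on this function it
    // oscillates between θ = 0 and θ = 6 forever); η = 0.1 converges fast.
}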
Convergence Analysis
Convex Functions
For convex loss functions (e.g., linear regression with MSE), gradient descent with fixed learning rate converges to the global minimum:
L(θ(t)) - L(θ*) ≤ C / t
Where C is a constant. The gap to the optimal loss decreases as 1/t.
Convergence rate: O(1/t) for fixed learning rate
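For reference, one standard version of this guarantee (a classical result, assuming L is convex with a Lipschitz-continuous gradient of constant β and a fixed step η ≤ 1/β) makes the constant explicit:

L(θ(t)) - L(θ*) ≤ ||θ(0) - θ*||² / (2ηt)

so C is governed by the initial distance to the optimum and by the step size.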
Non-Convex Functions
For non-convex functions (e.g., neural networks), gradient descent may converge to:
- Local minimum
- Saddle point
- Plateau region
No guarantees of finding the global minimum, but SGD's noise helps escape poor local minima.
Stopping Criteria
When to stop iterating?
1. Gradient magnitude: Stop when ||∇L(θ)|| < ε
   - ε = 1e-4 is a typical threshold
2. Loss change: Stop when |L(t) - L(t-1)| < ε
   - Measures improvement per iteration
3. Maximum iterations: Stop after T iterations
   - Prevents infinite loops
4. Validation loss: Stop when validation loss stops improving
   - Prevents overfitting (early stopping)
aprender example:
// SGD with convergence monitoring
let mut optimizer = SGD::new(0.01);
let mut prev_loss = f32::INFINITY;
let tolerance = 1e-4;

for epoch in 0..max_epochs {
    let loss = compute_loss(&model, &data);

    // Check convergence (loss-change criterion)
    if (prev_loss - loss).abs() < tolerance {
        println!("Converged at epoch {}", epoch);
        break;
    }

    let gradients = compute_gradients(&model, &data);
    optimizer.step(&mut model.params, &gradients);
    prev_loss = loss;
}
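The gradient-magnitude criterion plugs into the same loop; grad_norm below is an illustrative helper, not an aprender API:

/// L2 norm of a gradient vector: ||∇L(θ)||
fn grad_norm(grads: &[f32]) -> f32 {
    grads.iter().map(|g| g * g).sum::<f32>().sqrt()
}

// Inside the training loop, replacing the loss-change check:
// if grad_norm(&gradients) < 1e-4 { break; }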
Common Pitfalls and Solutions
1. Exploding Gradients
Problem: Gradients become very large, causing parameters to explode.
Symptoms:
- Loss becomes NaN or infinity
- Parameters grow to extreme values
- Occurs in deep networks or RNNs
Solutions:
- Reduce learning rate
- Gradient clipping: g ← min(g, threshold)
- Use batch normalization
- Better initialization (Xavier, He)
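Clipping by global norm is a common variant of the clipping idea above; a minimal sketch in plain Rust (whether aprender provides this built in is not assumed here):

/// Scale the gradient so its L2 norm does not exceed `max_norm`.
fn clip_by_global_norm(grads: &mut [f32], max_norm: f32) {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
}

fn main() {
    let mut grads = vec![3.0_f32, 4.0]; // ||g|| = 5
    clip_by_global_norm(&mut grads, 1.0);
    println!("{grads:?}"); // [0.6, 0.8], norm = 1
}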
2. Vanishing Gradients
Problem: Gradients become very small, preventing parameter updates.
Symptoms:
- Loss stops decreasing but hasn't converged
- Parameters barely change
- Occurs in very deep networks
Solutions:
- Use ReLU activation (instead of sigmoid/tanh)
- Skip connections (ResNet architecture)
- Batch normalization
- Better initialization
3. Learning Rate Decay
Strategy: Start with large learning rate, gradually decrease it.
Common schedules:
// 1. Step decay: Reduce by factor every K epochs
η(t) = η₀ × 0.1^(floor(t / K))
// 2. Exponential decay: Smooth reduction
η(t) = η₀ × e^(-λt)
// 3. 1/t decay: Theoretical convergence guarantee
η(t) = η₀ / (1 + λt)
// 4. Cosine annealing: Cyclical with restarts
η(t) = η_min + 0.5(η_max - η_min)(1 + cos(πt/T))
aprender pattern (manual implementation):
let initial_lr: f32 = 0.1;
let decay_rate: f32 = 0.95;

for epoch in 0..num_epochs {
    // Exponential decay: η(t) = η₀ × 0.95^t
    let lr = initial_lr * decay_rate.powi(epoch as i32);
    // Plain SGD holds no state, so rebuilding it each epoch is safe;
    // with momentum, keep one optimizer and update its rate instead.
    let mut optimizer = SGD::new(lr);

    // Training step
    optimizer.step(&mut params, &gradients);
}
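Cosine annealing (schedule 4 above) is just as simple to compute manually; a sketch, with η_min, η_max, and T as example values:

use std::f32::consts::PI;

/// η(t) = η_min + 0.5 (η_max - η_min)(1 + cos(πt/T))
fn cosine_lr(t: usize, t_max: usize, eta_min: f32, eta_max: f32) -> f32 {
    eta_min + 0.5 * (eta_max - eta_min) * (1.0 + (PI * t as f32 / t_max as f32).cos())
}

fn main() {
    for t in [0, 25, 50, 75, 100] {
        println!("t = {t:3}, η = {:.4}", cosine_lr(t, 100, 0.001, 0.1));
    }
    // Decays smoothly from η_max at t = 0 to η_min at t = T
}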
4. Saddle Points
Problem: The gradient is zero, but the point is not a minimum.
Example: L(x, y) = x² - y² has zero gradient at the origin, yet the surface curves upward along the x-axis and downward along the y-axis (a horse-saddle shape), so the origin is neither a minimum nor a maximum.
Solutions:
- Add momentum (helps escape saddle points)
- Use SGD noise to explore
- Second-order methods (Newton, L-BFGS)
Momentum Enhancement
Standard SGD can be slow in regions with:
- High curvature (steep in some directions, flat in others)
- Noisy gradients
Momentum accelerates convergence by accumulating past gradients:
v(t) = βv(t-1) + ∇L(θ(t)) // Velocity accumulation
θ(t+1) = θ(t) - η v(t) // Update with velocity
Where β ∈ [0, 1] is the momentum coefficient (typically 0.9).
Effect:
- Smooths out noisy gradients
- Accelerates in consistent directions
- Dampens oscillations
Analogy: A ball rolling down a hill builds momentum and doesn't stop at small bumps.
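A from-scratch sketch of the two momentum equations above (plain Rust, single parameter for clarity):

fn main() {
    let (eta, beta) = (0.1_f32, 0.9_f32);
    let mut theta = 0.0_f32;
    let mut v = 0.0_f32; // velocity buffer

    for t in 0..30 {
        let grad = 2.0 * (theta - 3.0); // ∇L for L(θ) = (θ - 3)²
        v = beta * v + grad;            // v(t) = β v(t-1) + ∇L(θ(t))
        theta -= eta * v;               // θ(t+1) = θ(t) - η v(t)
        if t % 10 == 0 {
            println!("t = {t:2}, θ = {theta:.4}");
        }
    }
}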
aprender implementation:
let mut optimizer = SGD::new(0.01)
    .with_momentum(0.9); // β = 0.9

// Momentum is applied automatically in step()
optimizer.step(&mut params, &gradients);
Practical Guidelines
Choosing Gradient Descent Variant
| Dataset Size | Recommendation | Reason |
|---|---|---|
| N < 1,000 | Batch GD | Fast enough, stable convergence |
| N = 1K-100K | Mini-batch GD (32-128) | Good balance |
| N > 100K | Mini-batch GD (128-512) | Leverage vectorization |
| Streaming data | SGD | Online learning required |
Hyperparameter Tuning Checklist
1. Learning rate η:
   - Start: 0.1
   - Grid search: [0.001, 0.01, 0.1, 1.0]
   - Use a learning rate finder
2. Momentum β:
   - Default: 0.9
   - Range: [0.5, 0.9, 0.99]
3. Batch size B:
   - Default: 32 or 64
   - Range: [16, 32, 64, 128, 256]
   - Powers of 2 for hardware efficiency
4. Learning rate schedule:
   - Option 1: Fixed (simple baseline)
   - Option 2: Step decay every 10-30 epochs
   - Option 3: Cosine annealing (state-of-the-art)
Debugging Convergence Issues
- Loss increasing: Learning rate too large → reduce η by 10x
- Loss stagnating: Learning rate too small or stuck in a local minimum → increase η by 2x or add momentum
- Loss NaN: Exploding gradients → reduce η, clip gradients, check data preprocessing
- Slow convergence: Poor learning rate or no momentum → use an adaptive optimizer (Adam) or add momentum
Connection to aprender
The aprender::optim::SGD implementation provides:
use aprender::optim::{SGD, Optimizer};

// Create SGD optimizer
let mut sgd = SGD::new(learning_rate)
    .with_momentum(momentum_coef);

// In training loop:
for epoch in 0..num_epochs {
    for batch in data_loader {
        // 1. Forward pass
        let predictions = model.predict(&batch.x);

        // 2. Compute loss and gradients
        let loss = loss_fn(predictions, batch.y);
        let grads = compute_gradients(&model, &batch);

        // 3. Update parameters using gradient descent
        sgd.step(&mut model.params, &grads);
    }
}
Key methods:
- SGD::new(η): Create optimizer with learning rate
- with_momentum(β): Add momentum coefficient
- step(&mut params, &grads): Perform one gradient descent step
- reset(): Reset momentum buffers
Further Reading
- Theory: Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent"
- Momentum: Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods"
- Adaptive methods: See Advanced Optimizers Theory
Related Examples
- Optimizer Demo - Visualizing SGD with momentum
- Logistic Regression - SGD for classification
- Regularized Regression - Coordinate descent vs SGD
Summary
| Concept | Key Takeaway |
|---|---|
| Core algorithm | θ(t+1) = θ(t) - η ∇L(θ(t)) |
| Learning rate | Most critical hyperparameter; start with 0.1 |
| Variants | Batch (stable), SGD (fast), Mini-batch (best of both) |
| Momentum | Accelerates convergence, smooths gradients |
| Convergence | Guaranteed for convex functions with proper η |
| Debugging | Loss ↑ → reduce η; Loss flat → increase η or add momentum |
Gradient descent is the workhorse of machine learning optimization. Understanding its variants, hyperparameters, and convergence properties is essential for training effective models.