Gradient Descent Theory
Gradient descent is the fundamental optimization algorithm used to train machine learning models. It iteratively adjusts model parameters to minimize a loss function by following the direction of steepest descent.
Mathematical Foundation
The Core Idea
Given a differentiable loss function L(θ) where θ represents model parameters, gradient descent finds parameters that minimize the loss:
θ* = argmin_θ L(θ)
The algorithm works by repeatedly taking steps proportional to the negative gradient of the loss function:
θ(t+1) = θ(t) - η ∇L(θ(t))
Where:
- θ(t): Parameters at iteration t
- η: Learning rate (step size)
- ∇L(θ(t)): Gradient of loss with respect to parameters
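To make the update rule concrete, here is a minimal self-contained sketch in plain Rust (independent of aprender) that minimizes the 1-D quadratic L(θ) = (θ - 3)², whose gradient is 2(θ - 3):

fn main() {
    let eta = 0.1_f32;       // learning rate η
    let mut theta = 0.0_f32; // initial parameter θ(0)

    for t in 0..50 {
        let grad = 2.0 * (theta - 3.0); // ∇L(θ) for L(θ) = (θ - 3)²
        theta -= eta * grad;            // θ(t+1) = θ(t) - η ∇L(θ(t))
        if t % 10 == 0 {
            println!("t = {t:2}, θ = {theta:.4}");
        }
    }
    // θ converges toward the minimizer θ* = 3
}

Each iteration applies exactly the update rule above; on this function, η = 0.1 gives θ(t+1) - 3 = (1 - 2η)(θ(t) - 3), a contraction factor of 0.8 per step.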
Why the Negative Gradient?
The gradient ∇L(θ) points in the direction of steepest ascent (maximum local increase in loss). Moving along the negative gradient is therefore the direction of steepest descent, which locally decreases the loss fastest.
Intuition: Imagine standing on a mountain in thick fog. You can feel the slope beneath your feet but can't see the valley. Gradient descent is like repeatedly taking a step in the direction that slopes most steeply downward.
Variants of Gradient Descent
1. Batch Gradient Descent (BGD)
Computes the gradient using the entire training dataset:
∇L(θ) = (1/N) Σ_{i=1..N} ∇L_i(θ)
Advantages:
- Stable convergence (smooth trajectory)
- Guaranteed to converge to global minimum (convex functions)
- Theoretical guarantees
Disadvantages:
- Slow for large datasets (must process all samples)
- Memory intensive
- May converge to poor local minima (non-convex functions)
When to use: Small datasets (N < 10,000), convex optimization problems
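For illustration, a self-contained sketch of batch gradient descent on 1-D linear regression with MSE (plain Rust; the toy data below is made up):

fn main() {
    // Hypothetical toy dataset: y ≈ 2x
    let xs = [1.0_f32, 2.0, 3.0, 4.0];
    let ys = [2.1_f32, 3.9, 6.2, 8.0];
    let n = xs.len() as f32;

    let eta = 0.01_f32;
    let mut w = 0.0_f32;

    for _ in 0..100 {
        // Full-batch gradient of MSE: (2/N) Σ x_i (w x_i - y_i)
        let grad: f32 = xs
            .iter()
            .zip(&ys)
            .map(|(x, y)| 2.0 * x * (w * x - y))
            .sum::<f32>()
            / n;
        w -= eta * grad;
    }
    println!("fitted slope w ≈ {w:.3}"); // close to 2
}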
2. Stochastic Gradient Descent (SGD)
Computes gradient using a single random sample at each iteration:
∇L(θ) ≈ ∇L_i(θ) where i ~ Uniform(1..N)
Advantages:
- Fast updates (one sample per iteration)
- Can escape shallow local minima (noise helps exploration)
- Memory efficient
- Online learning capable
Disadvantages:
- Noisy convergence (zig-zagging trajectory)
- May not converge exactly to minimum
- Requires learning rate decay
When to use: Large datasets, online learning, non-convex optimization
aprender implementation:
use aprender::optim::SGD;

let mut optimizer = SGD::new(0.01)  // learning rate η = 0.01
    .with_momentum(0.9);            // momentum coefficient β = 0.9

// In training loop:
let gradients = compute_gradients(&params, &data);
optimizer.step(&mut params, &gradients);
3. Mini-Batch Gradient Descent
Computes gradient using a small batch of samples (typically 32-256):
∇L(θ) ≈ (1/B) Σ_{i∈batch} ∇L_i(θ), where B << N
Advantages:
- Balance between BGD stability and SGD speed
- Vectorized operations (GPU/SIMD acceleration)
- Reduced variance compared to SGD
- Memory efficient
Disadvantages:
- Batch size is a hyperparameter to tune
- Still has some noise
When to use: Default choice for most ML problems, deep learning
Batch size guidelines:
- Small batches (32-64): Better generalization, more noise
- Large batches (128-512): Faster convergence, more stable
- Powers of 2: Better hardware utilization
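A from-scratch sketch of the mini-batch loop (plain Rust, hypothetical toy data; in practice you would also shuffle the data each epoch):

fn main() {
    // Hypothetical toy dataset: y = 2x, x ∈ [0.1, 10.0]
    let data: Vec<(f32, f32)> = (1..=100)
        .map(|i| (i as f32 / 10.0, 2.0 * (i as f32) / 10.0))
        .collect();

    let batch_size = 32;
    let eta = 0.005_f32;
    let mut w = 0.0_f32;

    for _epoch in 0..50 {
        for batch in data.chunks(batch_size) {
            // Mini-batch gradient: (1/B) Σ_{i∈batch} 2 x_i (w x_i - y_i)
            let grad: f32 = batch
                .iter()
                .map(|(x, y)| 2.0 * x * (w * x - y))
                .sum::<f32>()
                / batch.len() as f32;
            w -= eta * grad; // θ(t+1) = θ(t) - η ∇L_batch(θ(t))
        }
    }
    println!("fitted slope w ≈ {w:.3}"); // approaches 2
}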
The Learning Rate
The learning rate η is the most critical hyperparameter in gradient descent.
Too Small Learning Rate
η = 0.001 (very small)
Loss over time:
1000 ┤
900 ┤
800 ┤
700 ┤●
600 ┤ ●
500 ┤ ●
400 ┤ ●●●●●●●●●●●●●●● ← Slow convergence
└─────────────────────→
Iterations (10,000)
Problem: Training is very slow and may not converge within the time budget.
Too Large Learning Rate
η = 1.0 (very large)
Loss over time:
1000 ┤ ●
900 ┤ ● ●
800 ┤ ● ●
700 ┤ ● ● ← Oscillation
600 ┤● ●
└──────────→
Iterations
Problem: Loss oscillates or diverges and never reaches the minimum.
Optimal Learning Rate
η = 0.1 (just right)
Loss over time:
1000 ┤●
800 ┤ ●●
600 ┤ ●●●
400 ┤ ●●●● ← Smooth, fast convergence
200 ┤ ●●●●
└───────────────→
Iterations
Guidelines for choosing η:
- Start with η = 0.1 and adjust by factors of 10
- Use learning rate schedules (decay over time)
- Monitor loss: if exploding → reduce η; if stagnating → increase η
- Try adaptive methods (Adam, RMSprop) that auto-tune η
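These regimes are easy to reproduce yourself; the sketch below (plain Rust) reruns the earlier quadratic example under several learning rates and prints the final loss for each:

fn main() {
    // Try several learning rates on L(θ) = (θ - 3)², gradient 2(θ - 3)
    for &eta in &[0.001_f32, 0.01, 0.1, 1.0] {
        let mut theta = 0.0_f32;
        for _ in 0..100 {
            theta -= eta * 2.0 * (theta - 3.0);
        }
        let loss = (theta - 3.0).powi(2);
        println!("η = {eta:<6} final loss = {loss:.6}");
    }
    // Small η barely moves; η = 1.0 overshoots (on this function it
    // oscillates between θ = 0 and θ = 6 forever); η = 0.1 converges fast.
}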
Convergence Analysis
Convex Functions
For convex loss functions (e.g., linear regression with MSE), gradient descent with fixed learning rate converges to the global minimum:
L(θ(t)) - L(θ*) ≤ C / t
Where C is a constant. The gap to the optimal loss decreases as 1/t.
Convergence rate: O(1/t) for fixed learning rate
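For reference, one standard version of this guarantee (a classical result, assuming L is convex with a Lipschitz-continuous gradient of constant β and a fixed step η ≤ 1/β) makes the constant explicit:

L(θ(t)) - L(θ*) ≤ ||θ(0) - θ*||² / (2ηt)

so C is governed by the initial distance to the optimum and by the step size.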
Non-Convex Functions
For non-convex functions (e.g., neural networks), gradient descent may converge to:
- Local minimum
- Saddle point
- Plateau region
No guarantees of finding the global minimum, but SGD's noise helps escape poor local minima.
Stopping Criteria
When to stop iterating?
1. Gradient magnitude: Stop when ||∇L(θ)|| < ε
   - ε = 1e-4 is a typical threshold
2. Loss change: Stop when |L(t) - L(t-1)| < ε
   - Measures improvement per iteration
3. Maximum iterations: Stop after T iterations
   - Prevents infinite loops
4. Validation loss: Stop when validation loss stops improving
   - Prevents overfitting (early stopping)
aprender example:
// SGD with convergence monitoring
let mut optimizer = SGD::new(0.01);
let mut prev_loss = f32::INFINITY;
let tolerance = 1e-4;

for epoch in 0..max_epochs {
    let loss = compute_loss(&model, &data);

    // Check convergence (loss-change criterion)
    if (prev_loss - loss).abs() < tolerance {
        println!("Converged at epoch {}", epoch);
        break;
    }

    let gradients = compute_gradients(&model, &data);
    optimizer.step(&mut model.params, &gradients);
    prev_loss = loss;
}
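The gradient-magnitude criterion plugs into the same loop; grad_norm below is an illustrative helper, not an aprender API:

/// L2 norm of a gradient vector: ||∇L(θ)||
fn grad_norm(grads: &[f32]) -> f32 {
    grads.iter().map(|g| g * g).sum::<f32>().sqrt()
}

// Inside the training loop, replacing the loss-change check:
// if grad_norm(&gradients) < 1e-4 { break; }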
Common Pitfalls and Solutions
1. Exploding Gradients
Problem: Gradients become very large, causing parameters to explode.
Symptoms:
- Loss becomes NaN or infinity
- Parameters grow to extreme values
- Occurs in deep networks or RNNs
Solutions:
- Reduce learning rate
- Gradient clipping: g ← min(g, threshold)
- Use batch normalization
- Better initialization (Xavier, He)
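Clipping by global norm is a common variant of the clipping idea above; a minimal sketch in plain Rust (whether aprender provides this built in is not assumed here):

/// Scale the gradient so its L2 norm does not exceed `max_norm`.
fn clip_by_global_norm(grads: &mut [f32], max_norm: f32) {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
}

fn main() {
    let mut grads = vec![3.0_f32, 4.0]; // ||g|| = 5
    clip_by_global_norm(&mut grads, 1.0);
    println!("{grads:?}"); // [0.6, 0.8], norm = 1
}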
2. Vanishing Gradients
Problem: Gradients become very small, preventing parameter updates.
Symptoms:
- Loss stops decreasing but hasn't converged
- Parameters barely change
- Occurs in very deep networks
Solutions:
- Use ReLU activation (instead of sigmoid/tanh)
- Skip connections (ResNet architecture)
- Batch normalization
- Better initialization
3. Learning Rate Decay
Strategy: Start with large learning rate, gradually decrease it.
Common schedules:
// 1. Step decay: Reduce by factor every K epochs
η(t) = η₀ × 0.1^(floor(t / K))
// 2. Exponential decay: Smooth reduction
η(t) = η₀ × e^(-λt)
// 3. 1/t decay: Theoretical convergence guarantee
η(t) = η₀ / (1 + λt)
// 4. Cosine annealing: Cyclical with restarts
η(t) = η_min + 0.5(η_max - η_min)(1 + cos(πt/T))
aprender pattern (manual implementation):
let initial_lr: f32 = 0.1;
let decay_rate: f32 = 0.95;

for epoch in 0..num_epochs {
    // Exponential decay: η(t) = η₀ × 0.95^t
    let lr = initial_lr * decay_rate.powi(epoch as i32);
    // Plain SGD holds no state, so rebuilding it each epoch is safe;
    // with momentum, keep one optimizer and update its rate instead.
    let mut optimizer = SGD::new(lr);

    // Training step
    optimizer.step(&mut params, &gradients);
}
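Cosine annealing (schedule 4 above) is just as simple to compute manually; a sketch, with η_min, η_max, and T as example values:

use std::f32::consts::PI;

/// η(t) = η_min + 0.5 (η_max - η_min)(1 + cos(πt/T))
fn cosine_lr(t: usize, t_max: usize, eta_min: f32, eta_max: f32) -> f32 {
    eta_min + 0.5 * (eta_max - eta_min) * (1.0 + (PI * t as f32 / t_max as f32).cos())
}

fn main() {
    for t in [0, 25, 50, 75, 100] {
        println!("t = {t:3}, η = {:.4}", cosine_lr(t, 100, 0.001, 0.1));
    }
    // Decays smoothly from η_max at t = 0 to η_min at t = T
}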
4. Saddle Points
Problem: The gradient is zero, but the point is not a minimum.
Example: L(x, y) = x² - y² has zero gradient at the origin, yet the surface curves upward along the x-axis and downward along the y-axis (a horse-saddle shape), so the origin is neither a minimum nor a maximum.
Solutions:
- Add momentum (helps escape saddle points)
- Use SGD noise to explore
- Second-order methods (Newton, L-BFGS)
Momentum Enhancement
Standard SGD can be slow in regions with:
- High curvature (steep in some directions, flat in others)
- Noisy gradients
Momentum accelerates convergence by accumulating past gradients:
v(t) = βv(t-1) + ∇L(θ(t)) // Velocity accumulation
θ(t+1) = θ(t) - η v(t) // Update with velocity
Where β ∈ [0, 1] is the momentum coefficient (typically 0.9).
Effect:
- Smooths out noisy gradients
- Accelerates in consistent directions
- Dampens oscillations
Analogy: A ball rolling down a hill builds momentum and doesn't stop at small bumps.
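A from-scratch sketch of the two momentum equations above (plain Rust, single parameter for clarity):

fn main() {
    let (eta, beta) = (0.1_f32, 0.9_f32);
    let mut theta = 0.0_f32;
    let mut v = 0.0_f32; // velocity buffer

    for t in 0..30 {
        let grad = 2.0 * (theta - 3.0); // ∇L for L(θ) = (θ - 3)²
        v = beta * v + grad;            // v(t) = β v(t-1) + ∇L(θ(t))
        theta -= eta * v;               // θ(t+1) = θ(t) - η v(t)
        if t % 10 == 0 {
            println!("t = {t:2}, θ = {theta:.4}");
        }
    }
}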
aprender implementation:
let mut optimizer = SGD::new(0.01)
    .with_momentum(0.9); // β = 0.9

// Momentum is applied automatically in step()
optimizer.step(&mut params, &gradients);
Practical Guidelines
Choosing Gradient Descent Variant
| Dataset Size | Recommendation | Reason |
|---|---|---|
| N < 1,000 | Batch GD | Fast enough, stable convergence |
| N = 1K-100K | Mini-batch GD (32-128) | Good balance |
| N > 100K | Mini-batch GD (128-512) | Leverage vectorization |
| Streaming data | SGD | Online learning required |
Hyperparameter Tuning Checklist
1. Learning rate η:
   - Start: 0.1
   - Grid search: [0.001, 0.01, 0.1, 1.0]
   - Use a learning rate finder
2. Momentum β:
   - Default: 0.9
   - Range: [0.5, 0.9, 0.99]
3. Batch size B:
   - Default: 32 or 64
   - Range: [16, 32, 64, 128, 256]
   - Powers of 2 for hardware efficiency
4. Learning rate schedule:
   - Option 1: Fixed (simple baseline)
   - Option 2: Step decay every 10-30 epochs
   - Option 3: Cosine annealing (state-of-the-art)
Debugging Convergence Issues
- Loss increasing: Learning rate too large → reduce η by 10x
- Loss stagnating: Learning rate too small or stuck in a local minimum → increase η by 2x or add momentum
- Loss NaN: Exploding gradients → reduce η, clip gradients, check data preprocessing
- Slow convergence: Poor learning rate or no momentum → use an adaptive optimizer (Adam) or add momentum
Connection to aprender
The aprender::optim::SGD implementation provides:
use aprender::optim::{SGD, Optimizer};

// Create SGD optimizer
let mut sgd = SGD::new(learning_rate)
    .with_momentum(momentum_coef);

// In training loop:
for epoch in 0..num_epochs {
    for batch in data_loader {
        // 1. Forward pass
        let predictions = model.predict(&batch.x);

        // 2. Compute loss and gradients
        let loss = loss_fn(predictions, batch.y);
        let grads = compute_gradients(&model, &batch);

        // 3. Update parameters using gradient descent
        sgd.step(&mut model.params, &grads);
    }
}
Key methods:
- SGD::new(η): Create optimizer with learning rate
- with_momentum(β): Add momentum coefficient
- step(&mut params, &grads): Perform one gradient descent step
- reset(): Reset momentum buffers
Further Reading
- Theory: Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent"
- Momentum: Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods"
- Adaptive methods: See Advanced Optimizers Theory
Related Examples
- Optimizer Demo - Visualizing SGD with momentum
- Logistic Regression - SGD for classification
- Regularized Regression - Coordinate descent vs SGD
Summary
| Concept | Key Takeaway |
|---|---|
| Core algorithm | θ(t+1) = θ(t) - η ∇L(θ(t)) |
| Learning rate | Most critical hyperparameter; start with 0.1 |
| Variants | Batch (stable), SGD (fast), Mini-batch (best of both) |
| Momentum | Accelerates convergence, smooths gradients |
| Convergence | Guaranteed for convex functions with proper η |
| Debugging | Loss ↑ → reduce η; Loss flat → increase η or add momentum |
Gradient descent is the workhorse of machine learning optimization. Understanding its variants, hyperparameters, and convergence properties is essential for training effective models.