Case Study: Batch Optimization

This example demonstrates batch optimization algorithms for minimizing smooth, differentiable objective functions using gradient and Hessian information.

Overview

Batch optimization algorithms process the entire dataset at once (as opposed to stochastic/mini-batch methods). This example covers three gradient-based batch methods:

  • L-BFGS: Limited-memory BFGS (quasi-Newton method)
  • Conjugate Gradient: CG with multiple β formulas
  • Damped Newton: Newton's method with a finite-difference Hessian and step damping

Test Functions

The examples use classic optimization test functions:

Rosenbrock Function

f(x,y) = (1-x)² + 100(y-x²)²

Global minimum at (1, 1). Its narrow, curved valley makes it challenging for optimizers.
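
For reference, a minimal standalone Rust sketch of the function and its analytic gradient (not tied to the example's own types):

    /// Rosenbrock: f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2, minimum at (1, 1).
    fn rosenbrock(p: &[f64; 2]) -> f64 {
        let (x, y) = (p[0], p[1]);
        (1.0 - x).powi(2) + 100.0 * (y - x * x).powi(2)
    }

    /// Analytic gradient of the Rosenbrock function.
    fn rosenbrock_grad(p: &[f64; 2]) -> [f64; 2] {
        let (x, y) = (p[0], p[1]);
        [
            -2.0 * (1.0 - x) - 400.0 * x * (y - x * x),
            200.0 * (y - x * x),
        ]
    }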

Sphere Function

f(x) = Σ x_i²

Convex quadratic with its global minimum at the origin. An easy test case: all optimizers should converge quickly.
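
A standalone sketch for arbitrary dimension (the example uses the 5D case):

    /// Sphere: f(x) = sum_i x_i^2, gradient 2 * x, minimum at the origin.
    fn sphere(x: &[f64]) -> f64 {
        x.iter().map(|v| v * v).sum()
    }

    fn sphere_grad(x: &[f64]) -> Vec<f64> {
        x.iter().map(|v| 2.0 * v).collect()
    }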

Booth Function

f(x,y) = (x + 2y - 7)² + (2x + y - 5)²

Global minimum at (1, 3) with f(1, 3) = 0.
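
Again as a standalone sketch, with the gradient written out:

    /// Booth: f(x, y) = (x + 2y - 7)^2 + (2x + y - 5)^2, minimum at (1, 3).
    fn booth(p: &[f64; 2]) -> f64 {
        let (x, y) = (p[0], p[1]);
        (x + 2.0 * y - 7.0).powi(2) + (2.0 * x + y - 5.0).powi(2)
    }

    /// Analytic gradient of the Booth function.
    fn booth_grad(p: &[f64; 2]) -> [f64; 2] {
        let (x, y) = (p[0], p[1]);
        let (a, b) = (x + 2.0 * y - 7.0, 2.0 * x + y - 5.0);
        [2.0 * a + 4.0 * b, 4.0 * a + 2.0 * b]
    }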

Examples Covered

1. Rosenbrock Function with Different Optimizers

Compares L-BFGS, Conjugate Gradient (Polak-Ribière and Fletcher-Reeves), and Damped Newton on the challenging Rosenbrock function.

2. Sphere Function (5D)

Tests all optimizers on a simple convex quadratic to verify correct implementation and fast convergence.

3. Booth Function

Demonstrates convergence on a moderately difficult quadratic problem.

4. Convergence Comparison

Runs optimizers from different initial points to analyze convergence behavior and robustness.

5. Optimizer Configuration

Shows how to configure the following (see the illustrative sketch after this list):

  • L-BFGS history size (m)
  • CG periodic restart
  • Damped Newton finite difference epsilon
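
The exact configuration API depends on the example's own types; the hypothetical structs below only illustrate what the three knobs control.

    // Hypothetical settings, shown only to illustrate the knobs above;
    // the example defines its own configuration types.
    struct LbfgsSettings {
        /// Number of stored (s, y) correction pairs; larger m gives a better
        /// curvature model at O(m * n) extra memory and work per iteration.
        history_size: usize, // typically 5-20
    }

    struct CgSettings {
        /// Reset the search direction to steepest descent every `restart_every`
        /// iterations (often n) to avoid stalling on non-quadratic objectives.
        restart_every: usize,
    }

    struct DampedNewtonSettings {
        /// Step used by the finite-difference Hessian approximation; too small
        /// amplifies round-off, too large biases the curvature estimate.
        fd_epsilon: f64, // e.g. 1e-6
    }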

Key Insights

L-BFGS

  • Memory: Stores the m most recent step/gradient-difference pairs (typically m = 10); see the two-loop sketch after this list
  • Convergence: Fast in practice on smooth problems (linear in theory; full-memory BFGS is superlinear)
  • Use case: General-purpose, large-scale optimization
  • Cost: O(mn) per iteration
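
The core of L-BFGS is the two-loop recursion, which applies the implicit inverse-Hessian approximation using only the m stored pairs. A generic sketch, not the example's implementation:

    /// Two-loop recursion: returns the search direction -H * g, where H is the
    /// implicit inverse-Hessian approximation built from the last m pairs
    /// s_k = x_{k+1} - x_k and y_k = g_{k+1} - g_k (oldest first in `s`, `y`).
    fn lbfgs_direction(g: &[f64], s: &[Vec<f64>], y: &[Vec<f64>]) -> Vec<f64> {
        let dot = |a: &[f64], b: &[f64]| a.iter().zip(b).map(|(u, v)| u * v).sum::<f64>();
        let m = s.len();
        let mut q = g.to_vec();
        let mut alpha = vec![0.0; m];
        // First loop: newest pair to oldest.
        for i in (0..m).rev() {
            let rho = 1.0 / dot(&y[i], &s[i]);
            alpha[i] = rho * dot(&s[i], &q);
            for (qj, yj) in q.iter_mut().zip(&y[i]) {
                *qj -= alpha[i] * yj;
            }
        }
        // Scale by the initial Hessian guess H0 = gamma * I.
        let gamma = if m > 0 {
            dot(&s[m - 1], &y[m - 1]) / dot(&y[m - 1], &y[m - 1])
        } else {
            1.0
        };
        let mut r: Vec<f64> = q.iter().map(|v| gamma * v).collect();
        // Second loop: oldest pair to newest.
        for i in 0..m {
            let rho = 1.0 / dot(&y[i], &s[i]);
            let beta = rho * dot(&y[i], &r);
            for (rj, sj) in r.iter_mut().zip(&s[i]) {
                *rj += (alpha[i] - beta) * sj;
            }
        }
        r.iter().map(|v| -v).collect()
    }

A line search along the returned direction (for example under the Wolfe conditions) then produces the next iterate.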

Conjugate Gradient

  • Formulas: Polak-Ribière, Fletcher-Reeves, Hestenes-Stiefel
  • Memory: O(n) only (no history storage)
  • Convergence: Linear for quadratics, can stall on non-quadratics
  • Use case: When memory is limited, or Hessian is expensive
  • Tip: Periodic restart (every n iterations) helps on non-quadratic problems (see the sketch after this list)
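
As an illustration, a generic Polak-Ribière direction update with clipping and a periodic restart (not the example's exact code):

    /// Polak-Ribiere (PR+) update: d_new = -g_new + beta * d_old, with beta
    /// clipped at zero and a full restart to steepest descent every n iterations.
    fn cg_direction_pr(
        g_new: &[f64],
        g_old: &[f64],
        d_old: &[f64],
        iter: usize,
        n: usize,
    ) -> Vec<f64> {
        let dot = |a: &[f64], b: &[f64]| a.iter().zip(b).map(|(u, v)| u * v).sum::<f64>();
        let beta = (dot(g_new, g_new) - dot(g_new, g_old)) / dot(g_old, g_old);
        let beta = if iter % n == 0 { 0.0 } else { beta.max(0.0) };
        g_new.iter().zip(d_old).map(|(g, d)| -g + beta * d).collect()
    }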

Damped Newton

  • Hessian: Approximated via finite differences (see the sketch after this list)
  • Convergence: Quadratic near minimum (fastest locally)
  • Use case: High-accuracy solutions, few variables
  • Cost: O(n²) Hessian approximation per iteration
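
A generic sketch of a central-difference Hessian built from gradient evaluations; the Newton step d solving H d = -g is then damped (shortened or regularized) to keep iterations stable. Not the example's exact code:

    /// Central-difference Hessian approximation from a gradient function.
    /// Costs 2 * n gradient evaluations; the result is symmetrized.
    fn fd_hessian(grad: impl Fn(&[f64]) -> Vec<f64>, x: &[f64], eps: f64) -> Vec<Vec<f64>> {
        let n = x.len();
        let mut h = vec![vec![0.0; n]; n];
        for j in 0..n {
            let (mut xp, mut xm) = (x.to_vec(), x.to_vec());
            xp[j] += eps;
            xm[j] -= eps;
            let (gp, gm) = (grad(&xp), grad(&xm));
            for i in 0..n {
                h[i][j] = (gp[i] - gm[i]) / (2.0 * eps);
            }
        }
        // Average off-diagonal pairs to enforce symmetry.
        for i in 0..n {
            for j in (i + 1)..n {
                let avg = 0.5 * (h[i][j] + h[j][i]);
                h[i][j] = avg;
                h[j][i] = avg;
            }
        }
        h
    }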

Convergence Comparison

Method           Rosenbrock iterations   Sphere iterations   Memory
L-BFGS           ~40-60                  ~10-15              O(mn)
CG-PR            ~80-120                 ~5-10               O(n)
CG-FR            ~100-150                ~8-12               O(n)
Damped Newton    ~20-30                  ~3-5                O(n²)

Running the Example

cargo run --example batch_optimization

The example runs all test functions with all optimizers, displaying:

  • Convergence status
  • Iteration count
  • Final solution
  • Objective value
  • Gradient norm
  • Elapsed time

Optimization Tips

  1. L-BFGS is the default choice for most smooth optimization problems
  2. Use CG when memory is constrained (large n)
  3. Use Damped Newton for high accuracy on smaller problems
  4. Always try multiple starting points to avoid local minima
  5. Monitor the gradient norm: it should decrease to near zero at the optimum (see the helper sketch after this list)
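
For tip 5, the check is simply the Euclidean norm of the gradient compared against a small tolerance; a minimal helper:

    /// Euclidean norm of the gradient; stop when it falls below a tolerance
    /// such as 1e-6 (scaled to the problem if needed).
    fn grad_norm(g: &[f64]) -> f64 {
        g.iter().map(|v| v * v).sum::<f64>().sqrt()
    }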

Code Location

See examples/batch_optimization.rs for full implementation.