Case Study: Convex Optimization
This example demonstrates Phase 2 convex optimization methods designed for composite problems with non-smooth regularization.
Overview
Two specialized algorithms are covered:
- FISTA (Fast Iterative Shrinkage-Thresholding Algorithm)
- Coordinate Descent
Both methods excel at solving composite optimization:
minimize f(x) + g(x)
where f is smooth (differentiable) and g is "simple" (has easy proximal operator).
Mathematical Background
FISTA
Problem: minimize f(x) + g(x)
Key idea: Proximal gradient method with Nesterov acceleration
Achieves: O(1/k²) convergence (faster than standard gradient descent's O(1/k))
Proximal operator: prox_g(v, α) = argmin_x {½‖x - v‖² + α·g(x)}
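With y^(1) = x^(0) and t₁ = 1, one FISTA iteration (in the standard Beck–Teboulle form) reads:

x^(k) = prox_g(y^(k) − α·∇f(y^(k)), α)
t_{k+1} = (1 + √(1 + 4·t_k²)) / 2
y^(k+1) = x^(k) + ((t_k − 1)/t_{k+1})·(x^(k) − x^(k−1))

The extrapolated point y^(k) carries the Nesterov momentum; evaluating the gradient there, rather than at x^(k), is what lifts the rate from the plain proximal gradient method's O(1/k) to O(1/k²).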
Coordinate Descent
Problem: minimize f(x)
Key idea: Update one coordinate at a time
Algorithm: x_i^(k+1) = argmin_z f(x_1^(k+1), …, x_{i-1}^(k+1), z, x_{i+1}^(k), …, x_n^(k)) — in a cyclic sweep, coordinates before i use their already-updated values.
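This reduces each sweep to one one-dimensional minimization per coordinate. A minimal skeleton (an illustrative sketch, not this crate's API; `argmin_1d` stands in for whatever 1D solver the problem admits):

```rust
// Generic cyclic coordinate descent skeleton (illustrative sketch).
// `argmin_1d(x, i)` returns the minimizer of f along coordinate i,
// holding the other coordinates of x fixed.
fn coordinate_descent(
    mut x: Vec<f64>,
    sweeps: usize,
    argmin_1d: impl Fn(&[f64], usize) -> f64,
) -> Vec<f64> {
    for _ in 0..sweeps {
        for i in 0..x.len() {
            let xi = argmin_1d(&x, i); // one exact 1D minimization per coordinate
            x[i] = xi;
        }
    }
    x
}
```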
Particularly effective when:
- Coordinate updates have closed-form solutions
- Problem dimension is very high (n >> m)
- Hessian is expensive to compute
Examples Covered
1. Lasso Regression with FISTA
Problem: minimize ½‖Ax - b‖² + λ‖x‖₁
The classic Lasso problem:
- Smooth part: f(x) = ½‖Ax - b‖² (least squares)
- Non-smooth part: g(x) = λ‖x‖₁ (L1 regularization for sparsity)
- Proximal operator: Soft-thresholding
Demonstrates sparse recovery with only 3 non-zero coefficients out of 20 features.
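A minimal FISTA sketch for this problem, using plain `Vec<f64>` arithmetic (illustrative names, not the crate's actual API; convergence requires the step size α ≤ 1/L with L = ‖AᵀA‖):

```rust
// FISTA for: minimize 0.5*||A x - b||^2 + lambda*||x||_1  (illustrative sketch)

fn matvec(a: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    a.iter()
        .map(|row| row.iter().zip(x).map(|(aij, xj)| aij * xj).sum())
        .collect()
}

fn grad_least_squares(a: &[Vec<f64>], b: &[f64], x: &[f64]) -> Vec<f64> {
    // gradient of f(x) = 0.5*||Ax - b||^2 is A^T (Ax - b)
    let r: Vec<f64> = matvec(a, x).iter().zip(b).map(|(axi, bi)| axi - bi).collect();
    (0..x.len())
        .map(|j| a.iter().zip(&r).map(|(row, ri)| row[j] * ri).sum())
        .collect()
}

fn soft_threshold(v: f64, t: f64) -> f64 {
    // prox of t*|.|, applied to one component
    if v > t { v - t } else if v < -t { v + t } else { 0.0 }
}

fn fista_lasso(a: &[Vec<f64>], b: &[f64], lambda: f64, step: f64, iters: usize) -> Vec<f64> {
    let n = a[0].len();
    let mut x = vec![0.0; n]; // current iterate
    let mut y = x.clone();    // extrapolated point
    let mut t = 1.0;          // momentum parameter
    for _ in 0..iters {
        // proximal gradient step at the extrapolated point y
        let g = grad_least_squares(a, b, &y);
        let x_new: Vec<f64> = y.iter().zip(&g)
            .map(|(yi, gi)| soft_threshold(yi - step * gi, step * lambda))
            .collect();
        // Nesterov momentum update
        let t_new = 0.5 * (1.0 + (1.0 + 4.0 * t * t).sqrt());
        y = x_new.iter().zip(&x)
            .map(|(xn, xo)| xn + (t - 1.0) / t_new * (xn - xo))
            .collect();
        x = x_new;
        t = t_new;
    }
    x
}
```

Note that the gradient is evaluated at the extrapolated point `y`, not at `x`; that single design choice is the acceleration.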
2. Non-Negative Least Squares with FISTA
Problem: minimize ½‖Ax - b‖² subject to x ≥ 0
Applications:
- Spectral unmixing
- Image processing
- Chemometrics
Uses projection onto the non-negative orthant as the proximal operator.
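Relative to the Lasso sketch above, only the proximal step changes: here g is the indicator of {x : x ≥ 0}, whose prox is elementwise Euclidean projection (again a sketch, not the crate's API):

```rust
// Replaces soft_threshold in the FISTA loop above for the NNLS problem.
fn project_nonnegative(v: f64) -> f64 {
    v.max(0.0) // max(v, 0), applied to one component
}
```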
3. High-Dimensional Lasso with Coordinate Descent
Problem: minimize ½‖Ax - b‖² + λ‖x‖₁ (n >> m)
With 100 features and only 30 samples, this example demonstrates:
- Coordinate Descent efficiency in high dimensions
- Closed-form soft-thresholding updates
- Sparse recovery (5 non-zero out of 100)
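Each coordinate update is closed-form: with residual r = b − Ax, the update is x_j ← S(a_jᵀr + ‖a_j‖²·x_j, λ) / ‖a_j‖², where S is soft-thresholding and a_j is the j-th column of A. A sketch reusing `soft_threshold` from the FISTA example above (illustrative, not the crate's API; assumes no zero columns):

```rust
// Cyclic coordinate descent for Lasso with incremental residual updates.
fn cd_lasso(a: &[Vec<f64>], b: &[f64], lambda: f64, sweeps: usize) -> Vec<f64> {
    let (m, n) = (a.len(), a[0].len());
    // precompute column norms ||a_j||^2 once (see Practical Tips)
    let col_sq: Vec<f64> = (0..n)
        .map(|j| a.iter().map(|row| row[j] * row[j]).sum())
        .collect();
    let mut x = vec![0.0; n];
    let mut r: Vec<f64> = b.to_vec(); // residual r = b - A x (x = 0 initially)
    for _ in 0..sweeps {
        for j in 0..n {
            // rho = a_j^T r + ||a_j||^2 * x_j  (correlation with the partial residual)
            let rho: f64 = (0..m).map(|i| a[i][j] * r[i]).sum::<f64>() + col_sq[j] * x[j];
            let x_new = soft_threshold(rho, lambda) / col_sq[j];
            if x_new != x[j] {
                // maintain the residual incrementally: r += a_j * (x_j_old - x_j_new)
                for i in 0..m {
                    r[i] += a[i][j] * (x[j] - x_new);
                }
                x[j] = x_new;
            }
        }
    }
    x
}
```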
4. Box-Constrained Quadratic Programming
Problem: minimize ½xᵀQx - cᵀx subject to l ≤ x ≤ u
Coordinate Descent with projection (a sketch follows this list):
- Each coordinate update is a simple 1D optimization
- Project onto box constraints [l, u]
- Track active constraints (variables at bounds)
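A sketch of the per-coordinate step (illustrative names, not the crate's API; assumes Q_ii > 0 for every i):

```rust
// Coordinate descent with box projection for
//   minimize 0.5*x^T Q x - c^T x   s.t.  l <= x <= u
fn cd_box_qp(q: &[Vec<f64>], c: &[f64], l: &[f64], u: &[f64], sweeps: usize) -> Vec<f64> {
    let n = c.len();
    let mut x = vec![0.0; n]; // becomes feasible after the first sweep of clamped updates
    for _ in 0..sweeps {
        for i in 0..n {
            // exact unconstrained 1D minimizer: x_i = (c_i - sum_{j != i} Q_ij x_j) / Q_ii
            let s: f64 = (0..n).filter(|&j| j != i).map(|j| q[i][j] * x[j]).sum();
            // project onto [l_i, u_i]; coordinates left at a bound form the active set
            x[i] = ((c[i] - s) / q[i][i]).clamp(l[i], u[i]);
        }
    }
    x
}
```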
5. FISTA vs Coordinate Descent Comparison
Side-by-side comparison on the same Lasso problem:
- Convergence behavior
- Computational cost
- Solution quality
Proximal Operators
Key proximal operators used in examples:
Soft-Thresholding (L1 norm)
prox::soft_threshold(v, λ) acts elementwise:
  v_i − λ   if v_i >  λ
  0         if |v_i| ≤ λ
  v_i + λ   if v_i < −λ
Non-negative Projection
prox::nonnegative(v) = max(v, 0)
Box Projection
prox::box(v, l, u) = clamp(v, l, u)
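In Rust, elementwise sketches of these three operators might look as follows (illustrative names; the crate's actual prox module may differ):

```rust
/// Soft-thresholding: prox of λ‖·‖₁, applied componentwise.
fn soft_threshold_vec(v: &[f64], lambda: f64) -> Vec<f64> {
    v.iter()
        .map(|&vi| {
            if vi > lambda { vi - lambda }
            else if vi < -lambda { vi + lambda }
            else { 0.0 }
        })
        .collect()
}

/// Projection onto the non-negative orthant: max(v, 0) componentwise.
fn nonnegative(v: &[f64]) -> Vec<f64> {
    v.iter().map(|&vi| vi.max(0.0)).collect()
}

/// Projection onto the box [l, u]: clamp(v, l, u) componentwise.
fn box_project(v: &[f64], l: &[f64], u: &[f64]) -> Vec<f64> {
    v.iter()
        .zip(l)
        .zip(u)
        .map(|((&vi, &li), &ui)| vi.clamp(li, ui))
        .collect()
}
```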
Performance Comparison
| Method | Problem Type | Iterations | Memory | Best For |
|---|---|---|---|---|
| FISTA | Composite f+g | Low (~50-200) | O(n) | General composite problems |
| Coordinate Descent | Separable updates | Medium (~100-500) | O(n) | High-dimensional (n >> m) |
Key Insights
When to Use FISTA
- ✅ General composite optimization (smooth + non-smooth)
- ✅ Fast O(1/k²) convergence with Nesterov acceleration
- ✅ Works well for medium-scale problems
- ✅ Proximal operator available in closed form
- ❌ Requires Lipschitz constant estimation (step size tuning)
When to Use Coordinate Descent
- ✅ High-dimensional problems (n >> m)
- ✅ Coordinate updates have closed-form solutions
- ✅ Very simple implementation
- ✅ No global gradients needed
- ❌ Slower convergence rate than FISTA
- ❌ Performance depends on coordinate ordering
Convergence Analysis
Both methods track the following per run (a record sketch follows the list):
- Iterations: Number of outer iterations
- Objective value: Final f(x) + g(x)
- Sparsity: Number of non-zero coefficients (for Lasso)
- Constraint violation: ‖max(0, -x)‖ for non-negativity
- Elapsed time: Total optimization time
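For concreteness, these quantities could be collected in a small per-run record like this (illustrative only, not the crate's actual result type):

```rust
/// Per-run diagnostics (illustrative sketch).
struct RunStats {
    iterations: usize,            // number of outer iterations performed
    objective: f64,               // final value of f(x) + g(x)
    nonzeros: usize,              // sparsity: count of |x_i| above a small threshold
    constraint_violation: f64,    // e.g. ||max(0, -x)|| for x >= 0
    elapsed: std::time::Duration, // total wall-clock optimization time
}
```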
Running the Examples
```bash
cargo run --example convex_optimization
```
The examples demonstrate:
- Lasso with FISTA (20 features, 50 samples)
- Non-negative LS with FISTA (10 features, 30 samples)
- High-dimensional Lasso with CD (100 features, 30 samples)
- Box-constrained QP with CD (15 variables)
- FISTA vs CD comparison (30 features, 50 samples)
Practical Tips
For FISTA
- Step size: α = 1/L (with L the Lipschitz constant of ∇f; L = ‖A‖² for least squares) is the standard choice; when L is unknown, use backtracking line search (sketched after this list)
- Tolerance: Set to 1e-4 to 1e-6 depending on accuracy needs
- Restart: Adaptive restart (reset the momentum when the objective increases) damps the oscillations Nesterov acceleration can introduce
- Acceleration: Always use Nesterov momentum for faster convergence
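The backtracking rule mentioned in the step-size tip halves the step until the standard proximal-gradient sufficient-decrease condition holds. A sketch, assuming helpers `f`, `grad_f`, and a `prox` closure like those in the earlier examples (all names illustrative):

```rust
// Shrinks `step` until
//   f(x+) <= f(y) + <grad_f(y), x+ - y> + ||x+ - y||^2 / (2*step),
// then returns the accepted point and step size.
fn backtrack_step(
    y: &[f64],
    f: impl Fn(&[f64]) -> f64,
    grad_f: impl Fn(&[f64]) -> Vec<f64>,
    prox: impl Fn(&[f64], f64) -> Vec<f64>,
    mut step: f64,
) -> (Vec<f64>, f64) {
    let g = grad_f(y);
    let fy = f(y);
    loop {
        // candidate proximal gradient step with the current step size
        let v: Vec<f64> = y.iter().zip(&g).map(|(yi, gi)| yi - step * gi).collect();
        let x_plus = prox(&v, step);
        let d: Vec<f64> = x_plus.iter().zip(y).map(|(xi, yi)| xi - yi).collect();
        let model = fy
            + g.iter().zip(&d).map(|(gi, di)| gi * di).sum::<f64>()
            + d.iter().map(|di| di * di).sum::<f64>() / (2.0 * step);
        if f(&x_plus) <= model {
            return (x_plus, step); // sufficient decrease: accept this step size
        }
        step *= 0.5; // otherwise halve and retry
    }
}
```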
For Coordinate Descent
- Ordering: Cyclic (1,2,...,n) is simplest, random can help
- Convergence: Check ‖x^k - x^{k-1}‖ < tol for stopping
- Updates: Precompute any expensive quantities (e.g., column norms)
- Warm starts: Initialize with previous solution when solving sequence of problems
Comparison Summary
Solution Quality: Both methods find nearly identical solutions (‖x_FISTA - x_CD‖ < 1e-5)
Speed:
- FISTA: Faster for moderate n (~30-100)
- Coordinate Descent: Faster for large n (>100)
Memory:
- FISTA: O(n) gradient storage
- Coordinate Descent: O(n) solution only
Ease of Use:
- FISTA: Requires step size tuning
- Coordinate Descent: Requires coordinate update implementation
Code Location
See examples/convex_optimization.rs for full implementation.