Case Study: Convex Optimization
This example demonstrates Phase 2 convex optimization methods designed for composite problems with non-smooth regularization.
Overview
Two specialized algorithms are covered:
- FISTA (Fast Iterative Shrinkage-Thresholding Algorithm)
- Coordinate Descent
Both methods excel at solving composite optimization:
minimize f(x) + g(x)
where f is smooth (differentiable) and g is "simple" (has easy proximal operator).
Mathematical Background
FISTA
Problem: minimize f(x) + g(x)
Key idea: Proximal gradient method with Nesterov acceleration
Achieves: O(1/k²) convergence (faster than standard gradient descent's O(1/k))
Proximal operator: prox_g(v, α) = argmin_x {½‖x - v‖² + α·g(x)}
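With y^(1) = x^(0) and t₁ = 1, one FISTA iteration (in the standard Beck–Teboulle form) reads:

x^(k) = prox_g(y^(k) − α·∇f(y^(k)), α)
t_{k+1} = (1 + √(1 + 4·t_k²)) / 2
y^(k+1) = x^(k) + ((t_k − 1)/t_{k+1})·(x^(k) − x^(k−1))

The extrapolated point y^(k) carries the Nesterov momentum; evaluating the gradient there, rather than at x^(k), is what lifts the rate from the plain proximal gradient method's O(1/k) to O(1/k²).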
Coordinate Descent
Problem: minimize f(x)
Key idea: Update one coordinate at a time
Algorithm: x_i^(k+1) = argmin_z f(x_1^(k+1), …, x_{i-1}^(k+1), z, x_{i+1}^(k), …, x_n^(k)) — in a cyclic sweep, coordinates before i use their already-updated values.
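This reduces each sweep to one one-dimensional minimization per coordinate. A minimal skeleton (an illustrative sketch, not this crate's API; `argmin_1d` stands in for whatever 1D solver the problem admits):

```rust
// Generic cyclic coordinate descent skeleton (illustrative sketch).
// `argmin_1d(x, i)` returns the minimizer of f along coordinate i,
// holding the other coordinates of x fixed.
fn coordinate_descent(
    mut x: Vec<f64>,
    sweeps: usize,
    argmin_1d: impl Fn(&[f64], usize) -> f64,
) -> Vec<f64> {
    for _ in 0..sweeps {
        for i in 0..x.len() {
            let xi = argmin_1d(&x, i); // one exact 1D minimization per coordinate
            x[i] = xi;
        }
    }
    x
}
```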
Particularly effective when:
- Coordinate updates have closed-form solutions
- Problem dimension is very high (n >> m)
- Hessian is expensive to compute
Examples Covered
1. Lasso Regression with FISTA
Problem: minimize ½‖Ax - b‖² + λ‖x‖₁
The classic Lasso problem:
- Smooth part: f(x) = ½‖Ax - b‖² (least squares)
- Non-smooth part: g(x) = λ‖x‖₁ (L1 regularization for sparsity)
- Proximal operator: Soft-thresholding
Demonstrates sparse recovery with only 3 non-zero coefficients out of 20 features.
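A minimal FISTA sketch for this problem, using plain `Vec<f64>` arithmetic (illustrative names, not the crate's actual API; convergence requires the step size α ≤ 1/L with L = ‖AᵀA‖):

```rust
// FISTA for: minimize 0.5*||A x - b||^2 + lambda*||x||_1  (illustrative sketch)

fn matvec(a: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    a.iter()
        .map(|row| row.iter().zip(x).map(|(aij, xj)| aij * xj).sum())
        .collect()
}

fn grad_least_squares(a: &[Vec<f64>], b: &[f64], x: &[f64]) -> Vec<f64> {
    // gradient of f(x) = 0.5*||Ax - b||^2 is A^T (Ax - b)
    let r: Vec<f64> = matvec(a, x).iter().zip(b).map(|(axi, bi)| axi - bi).collect();
    (0..x.len())
        .map(|j| a.iter().zip(&r).map(|(row, ri)| row[j] * ri).sum())
        .collect()
}

fn soft_threshold(v: f64, t: f64) -> f64 {
    // prox of t*|.|, applied to one component
    if v > t { v - t } else if v < -t { v + t } else { 0.0 }
}

fn fista_lasso(a: &[Vec<f64>], b: &[f64], lambda: f64, step: f64, iters: usize) -> Vec<f64> {
    let n = a[0].len();
    let mut x = vec![0.0; n]; // current iterate
    let mut y = x.clone();    // extrapolated point
    let mut t = 1.0;          // momentum parameter
    for _ in 0..iters {
        // proximal gradient step at the extrapolated point y
        let g = grad_least_squares(a, b, &y);
        let x_new: Vec<f64> = y.iter().zip(&g)
            .map(|(yi, gi)| soft_threshold(yi - step * gi, step * lambda))
            .collect();
        // Nesterov momentum update
        let t_new = 0.5 * (1.0 + (1.0 + 4.0 * t * t).sqrt());
        y = x_new.iter().zip(&x)
            .map(|(xn, xo)| xn + (t - 1.0) / t_new * (xn - xo))
            .collect();
        x = x_new;
        t = t_new;
    }
    x
}
```

Note that the gradient is evaluated at the extrapolated point `y`, not at `x`; that single design choice is the acceleration.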
2. Non-Negative Least Squares with FISTA
Problem: minimize ½‖Ax - b‖² subject to x ≥ 0
Applications:
- Spectral unmixing
- Image processing
- Chemometrics
Uses projection onto the non-negative orthant as the proximal operator.
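Relative to the Lasso sketch above, only the proximal step changes: here g is the indicator of {x : x ≥ 0}, whose prox is elementwise Euclidean projection (again a sketch, not the crate's API):

```rust
// Replaces soft_threshold in the FISTA loop above for the NNLS problem.
fn project_nonnegative(v: f64) -> f64 {
    v.max(0.0) // max(v, 0), applied to one component
}
```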
3. High-Dimensional Lasso with Coordinate Descent
Problem: minimize ½‖Ax - b‖² + λ‖x‖₁ (n >> m)
With 100 features and only 30 samples, this example demonstrates:
- Coordinate Descent efficiency in high dimensions
- Closed-form soft-thresholding updates
- Sparse recovery (5 non-zero out of 100)
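Each coordinate update is closed-form: with residual r = b − Ax, the update is x_j ← S(a_jᵀr + ‖a_j‖²·x_j, λ) / ‖a_j‖², where S is soft-thresholding and a_j is the j-th column of A. A sketch reusing `soft_threshold` from the FISTA example above (illustrative, not the crate's API; assumes no zero columns):

```rust
// Cyclic coordinate descent for Lasso with incremental residual updates.
fn cd_lasso(a: &[Vec<f64>], b: &[f64], lambda: f64, sweeps: usize) -> Vec<f64> {
    let (m, n) = (a.len(), a[0].len());
    // precompute column norms ||a_j||^2 once (see Practical Tips)
    let col_sq: Vec<f64> = (0..n)
        .map(|j| a.iter().map(|row| row[j] * row[j]).sum())
        .collect();
    let mut x = vec![0.0; n];
    let mut r: Vec<f64> = b.to_vec(); // residual r = b - A x (x = 0 initially)
    for _ in 0..sweeps {
        for j in 0..n {
            // rho = a_j^T r + ||a_j||^2 * x_j  (correlation with the partial residual)
            let rho: f64 = (0..m).map(|i| a[i][j] * r[i]).sum::<f64>() + col_sq[j] * x[j];
            let x_new = soft_threshold(rho, lambda) / col_sq[j];
            if x_new != x[j] {
                // maintain the residual incrementally: r += a_j * (x_j_old - x_j_new)
                for i in 0..m {
                    r[i] += a[i][j] * (x[j] - x_new);
                }
                x[j] = x_new;
            }
        }
    }
    x
}
```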
4. Box-Constrained Quadratic Programming
Problem: minimize ½xᵀQx - cᵀx subject to l ≤ x ≤ u
Coordinate Descent with projection (a sketch follows this list):
- Each coordinate update is a simple 1D optimization
- Project onto box constraints [l, u]
- Track active constraints (variables at bounds)
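A sketch of the per-coordinate step (illustrative names, not the crate's API; assumes Q_ii > 0 for every i):

```rust
// Coordinate descent with box projection for
//   minimize 0.5*x^T Q x - c^T x   s.t.  l <= x <= u
fn cd_box_qp(q: &[Vec<f64>], c: &[f64], l: &[f64], u: &[f64], sweeps: usize) -> Vec<f64> {
    let n = c.len();
    let mut x = vec![0.0; n]; // becomes feasible after the first sweep of clamped updates
    for _ in 0..sweeps {
        for i in 0..n {
            // exact unconstrained 1D minimizer: x_i = (c_i - sum_{j != i} Q_ij x_j) / Q_ii
            let s: f64 = (0..n).filter(|&j| j != i).map(|j| q[i][j] * x[j]).sum();
            // project onto [l_i, u_i]; coordinates left at a bound form the active set
            x[i] = ((c[i] - s) / q[i][i]).clamp(l[i], u[i]);
        }
    }
    x
}
```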
5. FISTA vs Coordinate Descent Comparison
Side-by-side comparison on the same Lasso problem:
- Convergence behavior
- Computational cost
- Solution quality
Proximal Operators
Key proximal operators used in examples:
Soft-Thresholding (L1 norm)
prox::soft_threshold(v, λ) acts elementwise:
  v_i − λ   if v_i >  λ
  0         if |v_i| ≤ λ
  v_i + λ   if v_i < −λ
Non-negative Projection
prox::nonnegative(v) = max(v, 0)
Box Projection
prox::box(v, l, u) = clamp(v, l, u)
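In Rust, elementwise sketches of these three operators might look as follows (illustrative names; the crate's actual prox module may differ):

```rust
/// Soft-thresholding: prox of λ‖·‖₁, applied componentwise.
fn soft_threshold_vec(v: &[f64], lambda: f64) -> Vec<f64> {
    v.iter()
        .map(|&vi| {
            if vi > lambda { vi - lambda }
            else if vi < -lambda { vi + lambda }
            else { 0.0 }
        })
        .collect()
}

/// Projection onto the non-negative orthant: max(v, 0) componentwise.
fn nonnegative(v: &[f64]) -> Vec<f64> {
    v.iter().map(|&vi| vi.max(0.0)).collect()
}

/// Projection onto the box [l, u]: clamp(v, l, u) componentwise.
fn box_project(v: &[f64], l: &[f64], u: &[f64]) -> Vec<f64> {
    v.iter()
        .zip(l)
        .zip(u)
        .map(|((&vi, &li), &ui)| vi.clamp(li, ui))
        .collect()
}
```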
Performance Comparison
| Method | Problem Type | Iterations | Memory | Best For |
|---|---|---|---|---|
| FISTA | Composite f+g | Low (~50-200) | O(n) | General composite problems |
| Coordinate Descent | Separable updates | Medium (~100-500) | O(n) | High-dimensional (n >> m) |
Key Insights
When to Use FISTA
- ✅ General composite optimization (smooth + non-smooth)
- ✅ Fast O(1/k²) convergence with Nesterov acceleration
- ✅ Works well for medium-scale problems
- ✅ Proximal operator available in closed form
- ❌ Requires Lipschitz constant estimation (step size tuning)
When to Use Coordinate Descent
- ✅ High-dimensional problems (n >> m)
- ✅ Coordinate updates have closed-form solutions
- ✅ Very simple implementation
- ✅ No global gradients needed
- ❌ Slower convergence rate than FISTA
- ❌ Performance depends on coordinate ordering
Convergence Analysis
Both methods track the following per run (a record sketch follows the list):
- Iterations: Number of outer iterations
- Objective value: Final f(x) + g(x)
- Sparsity: Number of non-zero coefficients (for Lasso)
- Constraint violation: ‖max(0, -x)‖ for non-negativity
- Elapsed time: Total optimization time
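For concreteness, these quantities could be collected in a small per-run record like this (illustrative only, not the crate's actual result type):

```rust
/// Per-run diagnostics (illustrative sketch).
struct RunStats {
    iterations: usize,            // number of outer iterations performed
    objective: f64,               // final value of f(x) + g(x)
    nonzeros: usize,              // sparsity: count of |x_i| above a small threshold
    constraint_violation: f64,    // e.g. ||max(0, -x)|| for x >= 0
    elapsed: std::time::Duration, // total wall-clock optimization time
}
```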
Running the Examples
```bash
cargo run --example convex_optimization
```
The examples demonstrate:
- Lasso with FISTA (20 features, 50 samples)
- Non-negative LS with FISTA (10 features, 30 samples)
- High-dimensional Lasso with CD (100 features, 30 samples)
- Box-constrained QP with CD (15 variables)
- FISTA vs CD comparison (30 features, 50 samples)
Practical Tips
For FISTA
- Step size: α = 1/L (with L the Lipschitz constant of ∇f; L = ‖A‖² for least squares) is the standard choice; when L is unknown, use backtracking line search (sketched after this list)
- Tolerance: Set to 1e-4 to 1e-6 depending on accuracy needs
- Restart: Adaptive restart (reset the momentum when the objective increases) damps the oscillations Nesterov acceleration can introduce
- Acceleration: Always use Nesterov momentum for faster convergence
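The backtracking rule mentioned in the step-size tip halves the step until the standard proximal-gradient sufficient-decrease condition holds. A sketch, assuming helpers `f`, `grad_f`, and a `prox` closure like those in the earlier examples (all names illustrative):

```rust
// Shrinks `step` until
//   f(x+) <= f(y) + <grad_f(y), x+ - y> + ||x+ - y||^2 / (2*step),
// then returns the accepted point and step size.
fn backtrack_step(
    y: &[f64],
    f: impl Fn(&[f64]) -> f64,
    grad_f: impl Fn(&[f64]) -> Vec<f64>,
    prox: impl Fn(&[f64], f64) -> Vec<f64>,
    mut step: f64,
) -> (Vec<f64>, f64) {
    let g = grad_f(y);
    let fy = f(y);
    loop {
        // candidate proximal gradient step with the current step size
        let v: Vec<f64> = y.iter().zip(&g).map(|(yi, gi)| yi - step * gi).collect();
        let x_plus = prox(&v, step);
        let d: Vec<f64> = x_plus.iter().zip(y).map(|(xi, yi)| xi - yi).collect();
        let model = fy
            + g.iter().zip(&d).map(|(gi, di)| gi * di).sum::<f64>()
            + d.iter().map(|di| di * di).sum::<f64>() / (2.0 * step);
        if f(&x_plus) <= model {
            return (x_plus, step); // sufficient decrease: accept this step size
        }
        step *= 0.5; // otherwise halve and retry
    }
}
```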
For Coordinate Descent
- Ordering: Cyclic (1,2,...,n) is simplest, random can help
- Convergence: Check ‖x^k - x^{k-1}‖ < tol for stopping
- Updates: Precompute any expensive quantities (e.g., column norms)
- Warm starts: Initialize with previous solution when solving sequence of problems
Comparison Summary
Solution Quality: Both methods find nearly identical solutions (‖x_FISTA - x_CD‖ < 1e-5)
Speed:
- FISTA: Faster for moderate n (~30-100)
- Coordinate Descent: Faster for large n (>100)
Memory:
- FISTA: O(n) gradient storage
- Coordinate Descent: O(n) solution only
Ease of Use:
- FISTA: Requires step size tuning
- Coordinate Descent: Requires coordinate update implementation
Code Location
See examples/convex_optimization.rs for full implementation.