Case Study: Convex Optimization

This example demonstrates Phase 2 convex optimization methods designed for composite problems with non-smooth regularization.

Overview

Two specialized algorithms are covered:

  • FISTA (Fast Iterative Shrinkage-Thresholding Algorithm)
  • Coordinate Descent

Both methods excel at solving composite optimization problems of the form:

minimize f(x) + g(x)

where f is smooth (differentiable) and g is "simple" (its proximal operator is cheap to evaluate).

Mathematical Background

FISTA

Problem: minimize f(x) + g(x)

Key idea: Proximal gradient method with Nesterov acceleration

Achieves: O(1/k²) convergence (versus O(1/k) for the unaccelerated proximal gradient method, ISTA)

Proximal operator: prox_g(v, α) = argmin_x {½‖x - v‖² + α·g(x)}
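
The whole method is a short loop. A minimal std-only sketch, with ∇f and prox_g supplied as closures (fista, grad_f, and prox are illustrative names, not this crate's API):

// Minimal FISTA sketch: proximal gradient with Nesterov acceleration.
// `grad_f` evaluates ∇f; `prox(v, α)` evaluates prox_g(v, α) as defined above.
fn fista(
    grad_f: impl Fn(&[f64]) -> Vec<f64>,
    prox: impl Fn(&[f64], f64) -> Vec<f64>,
    x0: Vec<f64>,
    alpha: f64, // step size; ideally 1/L for an L-Lipschitz gradient
    iters: usize,
) -> Vec<f64> {
    let mut x = x0.clone(); // current iterate x^(k)
    let mut y = x0;         // extrapolated point
    let mut t = 1.0;        // momentum parameter t_k
    for _ in 0..iters {
        // Proximal gradient step taken at the extrapolated point y.
        let g = grad_f(&y);
        let v: Vec<f64> = y.iter().zip(&g).map(|(yi, gi)| yi - alpha * gi).collect();
        let x_next = prox(&v, alpha);
        // Nesterov momentum: t_{k+1} = (1 + sqrt(1 + 4·t_k²)) / 2.
        let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t * t).sqrt());
        y = x_next
            .iter()
            .zip(&x)
            .map(|(xn, xo)| xn + (t - 1.0) / t_next * (xn - xo))
            .collect();
        x = x_next;
        t = t_next;
    }
    x
}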

Coordinate Descent

Problem: minimize f(x)

Key idea: Update one coordinate at a time

Algorithm: x_i^(k+1) = argmin_z f(x_1^(k+1), ..., x_{i-1}^(k+1), z, x_{i+1}^(k), ..., x_n^(k)), where coordinates before i already use their updated values in a cyclic sweep.

Particularly effective when:

  • Coordinate updates have closed-form solutions
  • Problem dimension is very high (n >> m)
  • Hessian is expensive to compute
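
In code, a cyclic sweep is just a loop over coordinates with a user-supplied 1D minimizer. A minimal sketch (coordinate_descent and min_1d are illustrative names, not this crate's API):

// Cyclic coordinate descent sketch. `min_1d(x, i)` returns
// argmin_z f(x_1, ..., x_{i-1}, z, x_{i+1}, ..., x_n), assumed cheap
// (ideally available in closed form).
fn coordinate_descent(
    min_1d: impl Fn(&[f64], usize) -> f64,
    mut x: Vec<f64>,
    tol: f64,
    max_iters: usize,
) -> Vec<f64> {
    for _ in 0..max_iters {
        let mut max_change = 0.0_f64;
        for i in 0..x.len() {
            // Minimize over coordinate i with all others held fixed.
            let z = min_1d(&x, i);
            max_change = max_change.max((z - x[i]).abs());
            x[i] = z;
        }
        if max_change < tol {
            break; // no coordinate moved appreciably: converged
        }
    }
    x
}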

Examples Covered

1. Lasso Regression with FISTA

Problem: minimize ½‖Ax - b‖² + λ‖x‖₁

The classic Lasso problem:

  • Smooth part: f(x) = ½‖Ax - b‖² (least squares)
  • Non-smooth part: g(x) = λ‖x‖₁ (L1 regularization for sparsity)
  • Proximal operator: Soft-thresholding

Demonstrates sparse recovery with only 3 non-zero coefficients out of 20 features.
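
For concreteness, here is a sketch of the two ingredients a FISTA routine needs for this problem, using plain Vec<f64> and a hypothetical row-major layout for A (illustrative, not this crate's API):

// Smooth part: ∇f(x) = Aᵀ(Ax − b), with `a` stored as rows.
fn lasso_grad(a: &[Vec<f64>], b: &[f64], x: &[f64]) -> Vec<f64> {
    // Residual r = Ax − b.
    let r: Vec<f64> = a
        .iter()
        .zip(b)
        .map(|(row, bi)| row.iter().zip(x).map(|(aij, xj)| aij * xj).sum::<f64>() - bi)
        .collect();
    // Gradient Aᵀr.
    (0..x.len())
        .map(|j| a.iter().zip(&r).map(|(row, ri)| row[j] * ri).sum::<f64>())
        .collect()
}

// Non-smooth part: prox of αλ‖·‖₁ is soft-thresholding at level αλ.
fn lasso_prox(v: &[f64], alpha: f64, lambda: f64) -> Vec<f64> {
    let t = alpha * lambda;
    v.iter()
        .map(|&vi| vi.signum() * (vi.abs() - t).max(0.0))
        .collect()
}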

2. Non-Negative Least Squares with FISTA

Problem: minimize ½‖Ax - b‖² subject to x ≥ 0

Applications:

  • Spectral unmixing
  • Image processing
  • Chemometrics

Uses projection onto the non-negative orthant as the proximal operator.
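
Because g is the indicator of the non-negative orthant, its proximal operator ignores the step size entirely; it is plain projection. A minimal sketch of the closure such a solver would use:

// prox for g = indicator(x ≥ 0): elementwise projection; α plays no role.
fn prox_nonnegative(v: &[f64], _alpha: f64) -> Vec<f64> {
    v.iter().map(|&vi| vi.max(0.0)).collect()
}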

3. High-Dimensional Lasso with Coordinate Descent

Problem: minimize ½‖Ax - b‖² + λ‖x‖₁ (n >> m)

With 100 features and only 30 samples (n >> m), this example demonstrates:

  • Coordinate Descent efficiency in high dimensions
  • Closed-form soft-thresholding updates (sketched after this list)
  • Sparse recovery (5 non-zero out of 100)
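
Each coordinate update has the closed form x_j = soft_threshold(a_jᵀ r_{-j}, λ) / ‖a_j‖², where a_j is column j of A and r_{-j} is the residual with coordinate j's contribution added back. A sketch of one sweep (illustrative; assumes non-zero columns, stored column-major):

// One cyclic sweep of coordinate descent for ½‖Ax − b‖² + λ‖x‖₁.
// `a[j]` is column j of A; the residual r = b − Ax is maintained incrementally.
fn lasso_cd_sweep(a: &[Vec<f64>], b: &[f64], x: &mut [f64], lambda: f64) {
    let m = b.len();
    let mut r: Vec<f64> = (0..m)
        .map(|i| b[i] - a.iter().zip(x.iter()).map(|(col, xj)| col[i] * xj).sum::<f64>())
        .collect();
    for j in 0..x.len() {
        let col = &a[j];
        let norm_sq: f64 = col.iter().map(|c| c * c).sum(); // precompute in practice
        // ρ_j = a_jᵀ(r + a_j·x_j): correlation with the partial residual.
        let rho: f64 = col.iter().zip(&r).map(|(c, ri)| c * (ri + c * x[j])).sum();
        // Closed-form soft-thresholding update.
        let x_new = rho.signum() * (rho.abs() - lambda).max(0.0) / norm_sq;
        let delta = x_new - x[j];
        if delta != 0.0 {
            for i in 0..m {
                r[i] -= col[i] * delta; // keep r = b − Ax consistent
            }
        }
        x[j] = x_new;
    }
}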

4. Box-Constrained Quadratic Programming

Problem: minimize ½xᵀQx - cᵀx subject to l ≤ x ≤ u

Coordinate Descent with projection (one sweep is sketched after this list):

  • Each coordinate update is a simple 1D optimization
  • Project onto box constraints [l, u]
  • Track active constraints (variables at bounds)
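
With Q_ii > 0, each 1D subproblem is a parabola whose constrained minimizer is simply the unconstrained minimizer clamped to [l_i, u_i]. A sketch of one sweep (illustrative; assumes symmetric Q with positive diagonal):

// One cyclic sweep for ½xᵀQx − cᵀx on l ≤ x ≤ u (`q` is row-major).
fn box_qp_sweep(q: &[Vec<f64>], c: &[f64], l: &[f64], u: &[f64], x: &mut [f64]) {
    let n = x.len();
    for i in 0..n {
        // Unconstrained 1D minimizer: x_i = (c_i − Σ_{j≠i} Q_ij x_j) / Q_ii.
        let off_diag: f64 = (0..n).filter(|&j| j != i).map(|j| q[i][j] * x[j]).sum();
        // Clamp to the box: exact 1D minimizer under the constraint.
        x[i] = ((c[i] - off_diag) / q[i][i]).clamp(l[i], u[i]);
    }
}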

5. FISTA vs Coordinate Descent Comparison

Side-by-side comparison on the same Lasso problem:

  • Convergence behavior
  • Computational cost
  • Solution quality

Proximal Operators

Key proximal operators used in examples:

Soft-Thresholding (L1 norm)

prox::soft_threshold(v, λ)_i = {
    v_i - λ  if v_i > λ
    0        if |v_i| ≤ λ
    v_i + λ  if v_i < -λ
}

Non-negative Projection

prox::nonnegative(v) = max(v, 0)

Box Projection

prox::box(v, l, u) = clamp(v, l, u)
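
Std-only Rust sketches of these three operators (illustrative; the actual prox helpers in the example may differ):

// Soft-thresholding: prox of λ‖·‖₁, applied elementwise.
fn soft_threshold(v: &[f64], lambda: f64) -> Vec<f64> {
    v.iter()
        .map(|&vi| vi.signum() * (vi.abs() - lambda).max(0.0))
        .collect()
}

// Projection onto the non-negative orthant: prox of the indicator of x ≥ 0.
fn nonnegative(v: &[f64]) -> Vec<f64> {
    v.iter().map(|&vi| vi.max(0.0)).collect()
}

// Projection onto the box [l, u]: prox of the indicator of l ≤ x ≤ u.
fn box_projection(v: &[f64], l: &[f64], u: &[f64]) -> Vec<f64> {
    v.iter()
        .zip(l.iter().zip(u))
        .map(|(&vi, (&li, &ui))| vi.clamp(li, ui))
        .collect()
}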

Performance Comparison

Method               Problem Type        Iterations          Memory   Best For
FISTA                Composite f + g     Low (~50-200)       O(n)     General composite problems
Coordinate Descent   Separable updates   Medium (~100-500)   O(n)     High-dimensional (n >> m)

Key Insights

When to Use FISTA

  • ✅ General composite optimization (smooth + non-smooth)
  • ✅ Fast O(1/k²) convergence with Nesterov acceleration
  • ✅ Works well for medium-scale problems
  • ✅ Proximal operator available in closed form
  • ❌ Requires Lipschitz constant estimation (step size tuning)

When to Use Coordinate Descent

  • ✅ High-dimensional problems (n >> m)
  • ✅ Coordinate updates have closed-form solutions
  • ✅ Very simple implementation
  • ✅ No global gradients needed
  • ❌ Slower convergence rate than FISTA
  • ❌ Performance depends on coordinate ordering

Convergence Analysis

Both methods track:

  • Iterations: Number of outer iterations
  • Objective value: Final f(x) + g(x)
  • Sparsity: Number of non-zero coefficients (for Lasso)
  • Constraint violation: ‖max(0, -x)‖ for non-negativity
  • Elapsed time: Total optimization time

Running the Examples

cargo run --example convex_optimization

The examples demonstrate:

  1. Lasso with FISTA (20 features, 50 samples)
  2. Non-negative LS with FISTA (10 features, 30 samples)
  3. High-dimensional Lasso with CD (100 features, 30 samples)
  4. Box-constrained QP with CD (15 variables)
  5. FISTA vs CD comparison (30 features, 50 samples)

Practical Tips

For FISTA

  1. Step size: Start with α = 0.01 (or 1/L when the Lipschitz constant L of ∇f is known); backtracking line search is sketched after this list
  2. Tolerance: Set to 1e-4 to 1e-6 depending on accuracy needs
  3. Restart: Implement adaptive restart for non-strongly convex problems
  4. Acceleration: Always use Nesterov momentum for faster convergence
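
A sketch of the backtracking search mentioned in tip 1, shrinking α until the standard sufficient-decrease condition holds (names are illustrative, not this crate's API):

// Backtracking line search for one proximal gradient step: accept x⁺ once
// f(x⁺) ≤ f(y) + ∇f(y)ᵀ(x⁺ − y) + ‖x⁺ − y‖² / (2α), halving α otherwise.
// In production, add a floor on α to guarantee termination.
fn backtrack(
    f: impl Fn(&[f64]) -> f64,
    grad_f: impl Fn(&[f64]) -> Vec<f64>,
    prox: impl Fn(&[f64], f64) -> Vec<f64>,
    y: &[f64],
    mut alpha: f64,
) -> (Vec<f64>, f64) {
    let (fy, g) = (f(y), grad_f(y));
    loop {
        let v: Vec<f64> = y.iter().zip(&g).map(|(yi, gi)| yi - alpha * gi).collect();
        let x = prox(&v, alpha);
        let diff: Vec<f64> = x.iter().zip(y).map(|(xi, yi)| xi - yi).collect();
        let lin: f64 = g.iter().zip(&diff).map(|(gi, di)| gi * di).sum();
        let quad: f64 = diff.iter().map(|d| d * d).sum::<f64>() / (2.0 * alpha);
        if f(&x) <= fy + lin + quad {
            return (x, alpha); // sufficient decrease: accept this step size
        }
        alpha *= 0.5;
    }
}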

For Coordinate Descent

  1. Ordering: Cyclic (1, 2, ..., n) is simplest; random permutations can improve convergence
  2. Convergence: Stop when ‖x^(k) - x^(k-1)‖ < tol
  3. Updates: Precompute expensive quantities (e.g., column norms ‖a_j‖²)
  4. Warm starts: Initialize with the previous solution when solving a sequence of related problems (sketched after this list)
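
A sketch of the warm-start pattern from tip 4, solving a decreasing λ path; solve stands for any Lasso solver, e.g. repeated sweeps of the coordinate update sketched earlier (illustrative, not this crate's API):

// Warm-started regularization path: each λ starts from the previous solution,
// so later solves typically need only a few sweeps.
fn lasso_path(
    solve: impl Fn(Vec<f64>, f64) -> Vec<f64>, // hypothetical: (x0, λ) -> x*
    lambdas: &[f64],                           // typically sorted large → small
    n: usize,
) -> Vec<Vec<f64>> {
    let mut x = vec![0.0; n]; // at large λ the zero vector is near-optimal
    let mut path = Vec::with_capacity(lambdas.len());
    for &lam in lambdas {
        x = solve(x, lam); // warm start from the previous λ's solution
        path.push(x.clone());
    }
    path
}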

Comparison Summary

Solution Quality: Both methods find nearly identical solutions (‖x_FISTA - x_CD‖ < 1e-5)

Speed:

  • FISTA: Faster for moderate n (~30-100)
  • Coordinate Descent: Faster for large n (>100)

Memory:

  • FISTA: O(n) gradient storage
  • Coordinate Descent: O(n) solution only

Ease of Use:

  • FISTA: Requires step size tuning
  • Coordinate Descent: Requires coordinate update implementation

Code Location

See examples/convex_optimization.rs for full implementation.