# Metaheuristics Theory
Metaheuristics are high-level problem-solving strategies for optimization problems where exact algorithms are impractical. Unlike gradient-based methods, they don't require derivatives and can escape local optima.
## Why Metaheuristics?

Traditional optimization methods each have limitations:
| Method | Limitation |
|---|---|
| Gradient Descent | Requires differentiable objectives |
| Newton's Method | Requires Hessian computation |
| Convex Optimization | Assumes convex landscape |
| Grid Search | Exponential scaling with dimensions |
Metaheuristics address these limitations:
- Derivative-free: Work with black-box objectives
- Global search: Escape local optima
- Versatile: Handle mixed continuous/discrete spaces
## Algorithm Categories

### Perturbative Metaheuristics

Modify complete solutions through perturbation operators:
```text
┌─────────────────────────────────────────────────┐
│                Population-Based                 │
│ ┌─────────────────┐   ┌─────────────────────┐   │
│ │ Differential    │   │ Particle Swarm      │   │
│ │ Evolution (DE)  │   │ Optimization (PSO)  │   │
│ │                 │   │                     │   │
│ │ v = a + F(b-c)  │   │ v = wv + c₁r₁(p-x)  │   │
│ │                 │   │     + c₂r₂(g-x)     │   │
│ └─────────────────┘   └─────────────────────┘   │
│                                                 │
│ ┌─────────────────┐   ┌─────────────────────┐   │
│ │ Genetic         │   │ CMA-ES              │   │
│ │ Algorithm (GA)  │   │                     │   │
│ │                 │   │ Covariance Matrix   │   │
│ │ Selection →     │   │ Adaptation          │   │
│ │ Crossover →     │   │                     │   │
│ │ Mutation        │   │ N(m, σ²C)           │   │
│ └─────────────────┘   └─────────────────────┘   │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│                 Single-Solution                 │
│ ┌─────────────────┐   ┌─────────────────────┐   │
│ │ Simulated       │   │ Hill Climbing       │   │
│ │ Annealing (SA)  │   │                     │   │
│ │                 │   │ Always accept       │   │
│ │ Accept worse    │   │ improvements        │   │
│ │ with P=e^(-Δ/T) │   │                     │   │
│ └─────────────────┘   └─────────────────────┘   │
└─────────────────────────────────────────────────┘
```
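The SA acceptance rule in the diagram is the Metropolis criterion. A minimal sketch using the `rand` crate (not Aprender's API; `t` is the current temperature, and the cooling schedule is omitted):

```rust
use rand::Rng;

/// Metropolis acceptance: always take improvements, and accept a
/// worsening move of size `delta` (Δ > 0) with probability e^(-Δ/t).
fn accept(delta: f64, t: f64, rng: &mut impl Rng) -> bool {
    delta <= 0.0 || rng.gen::<f64>() < (-delta / t).exp()
}
```

Hill climbing is the t → 0 limit of this rule: only improvements are ever accepted.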
### Constructive and Memory-Based Metaheuristics

Constructive methods build solutions incrementally, component by component; memory-based methods such as Tabu Search instead steer a local search using search history:
```text
┌─────────────────────────────────────────────────┐
│ Ant Colony Optimization (ACO)                   │
│                                                 │
│ τᵢⱼ(t+1) = (1-ρ)τᵢⱼ(t) + Δτᵢⱼ                   │
│                                                 │
│ Pheromone guides probabilistic construction     │
│ Best for: TSP, routing, scheduling              │
└─────────────────────────────────────────────────┘
```
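The pheromone update in the box is a one-line evaporation-plus-deposit rule per edge. A sketch over a dense pheromone matrix (illustrative only, not Aprender's internal representation):

```rust
/// One pheromone update: evaporate every τᵢⱼ by rate `rho`, then add the
/// deposits Δτᵢⱼ accumulated from the ants that traversed edge (i, j).
fn update_pheromone(tau: &mut [Vec<f64>], delta: &[Vec<f64>], rho: f64) {
    for (row, d_row) in tau.iter_mut().zip(delta) {
        for (t, d) in row.iter_mut().zip(d_row) {
            *t = (1.0 - rho) * *t + *d;
        }
    }
}
```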
```text
┌─────────────────────────────────────────────────┐
│ Tabu Search                                     │
│                                                 │
│ Memory-based local search                       │
│ Tabu list prevents cycling                      │
│ Aspiration criteria allow exceptions            │
└─────────────────────────────────────────────────┘
```
## Differential Evolution (DE)
DE is the primary algorithm in Aprender's metaheuristics module. It's particularly effective for continuous hyperparameter optimization.
### Algorithm

For each target vector xᵢ in the population:

1. Mutation: v = xₐ + F·(xᵦ - xᵧ) builds the mutant (donor) vector from the scaled difference of two random members
2. Crossover: u = binomial(xᵢ, v, CR) mixes target and donor into the trial vector
3. Selection: xᵢ' = u if f(u) ≤ f(xᵢ), i.e. greedy replacement
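To make the three steps concrete, here is a compact DE/rand/1/bin generation over a plain `Vec<Vec<f64>>` population. This is a self-contained sketch using the `rand` crate, not the module's actual implementation; it assumes a population of at least four vectors:

```rust
use rand::Rng;

/// One generation of DE/rand/1/bin. `f` is the black-box objective;
/// `pop` and `fit` hold the population and its fitness values.
fn de_step<F: Fn(&[f64]) -> f64>(
    pop: &mut Vec<Vec<f64>>, fit: &mut Vec<f64>,
    f: &F, scale: f64, cr: f64, rng: &mut impl Rng,
) {
    let (np, dim) = (pop.len(), pop[0].len());
    for i in 0..np {
        // Pick three distinct indices a, b, c, all different from i.
        let mut idx = [0usize; 3];
        for k in 0..3 {
            loop {
                let r = rng.gen_range(0..np);
                if r != i && !idx[..k].contains(&r) { idx[k] = r; break; }
            }
        }
        let (a, b, c) = (idx[0], idx[1], idx[2]);
        // Mutation + binomial crossover build the trial vector u.
        let jrand = rng.gen_range(0..dim); // guarantees >= 1 donor gene
        let u: Vec<f64> = (0..dim).map(|j| {
            if j == jrand || rng.gen::<f64>() < cr {
                pop[a][j] + scale * (pop[b][j] - pop[c][j]) // donor gene
            } else {
                pop[i][j] // inherited from the target
            }
        }).collect();
        // Greedy selection: keep the trial only if it is no worse.
        let fu = f(&u);
        if fu <= fit[i] { pop[i] = u; fit[i] = fu; }
    }
}
```

Because replacement is greedy, the best individual is never lost, so DE needs no separate elitism mechanism.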
### Mutation Strategies
| Strategy | Formula | Characteristics |
|---|---|---|
| DE/rand/1/bin | v = xₐ + F(xᵦ - xᵧ) | Good exploration |
| DE/best/1/bin | v = x_best + F(xₐ - xᵦ) | Fast convergence |
| DE/current-to-best/1/bin | v = xᵢ + F(x_best - xᵢ) + F(xₐ - xᵦ) | Balanced |
| DE/rand/2/bin | v = xₐ + F(xᵦ - xᵧ) + F(xδ - xε) | More exploration |
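The table rows map one-to-one onto code. A hypothetical `donor` helper (not part of Aprender's API; the random indices in `r` are assumed pre-sampled, mutually distinct, and different from both `i` and `best`) could dispatch on the strategy:

```rust
enum Strategy { Rand1, Best1, CurrentToBest1, Rand2 }

/// Build the donor vector v for target index `i` per the table above.
fn donor(pop: &[Vec<f64>], i: usize, best: usize, r: &[usize],
         f: f64, s: &Strategy) -> Vec<f64> {
    (0..pop[0].len()).map(|j| match s {
        Strategy::Rand1 => pop[r[0]][j] + f * (pop[r[1]][j] - pop[r[2]][j]),
        Strategy::Best1 => pop[best][j] + f * (pop[r[0]][j] - pop[r[1]][j]),
        Strategy::CurrentToBest1 => pop[i][j]
            + f * (pop[best][j] - pop[i][j])
            + f * (pop[r[0]][j] - pop[r[1]][j]),
        Strategy::Rand2 => pop[r[0]][j]
            + f * (pop[r[1]][j] - pop[r[2]][j])
            + f * (pop[r[3]][j] - pop[r[4]][j]),
    }).collect()
}
```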
### Adaptive Variants
JADE (Zhang & Sanderson, 2009):
- Adapts F and CR based on successful mutations
- External archive of inferior solutions
- μ_F updated via Lehmer mean
- μ_CR updated via weighted arithmetic mean
SHADE (Tanabe & Fukunaga, 2013):
- Success-history based parameter adaptation
- Circular memory buffer for F and CR
- More robust than JADE on multimodal functions
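A sketch of the JADE-style mean updates (after Zhang & Sanderson, 2009, simplified to a plain arithmetic mean for CR; `c` is the adaptation rate, typically 0.1; not Aprender's internal code):

```rust
/// Lehmer mean sum(x²)/sum(x): biases μ_F toward larger successful F.
fn lehmer_mean(xs: &[f64]) -> f64 {
    xs.iter().map(|x| x * x).sum::<f64>() / xs.iter().sum::<f64>()
}

/// Update the sampler means from this generation's successful F and CR
/// values (ok_f and ok_cr are collected in pairs, so equal length).
fn adapt(mu_f: &mut f64, mu_cr: &mut f64, ok_f: &[f64], ok_cr: &[f64], c: f64) {
    if !ok_f.is_empty() {
        *mu_f = (1.0 - c) * *mu_f + c * lehmer_mean(ok_f);
        *mu_cr = (1.0 - c) * *mu_cr
            + c * ok_cr.iter().sum::<f64>() / ok_cr.len() as f64;
    }
}
```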
## Search Space Abstraction

Aprender uses a unified `SearchSpace` enum:
```rust
pub enum SearchSpace {
    // Continuous optimization (HPO, function optimization)
    Continuous { dim: usize, lower: Vec<f64>, upper: Vec<f64> },
    // Mixed continuous/discrete (neural architecture search)
    Mixed { dim: usize, lower: Vec<f64>, upper: Vec<f64>, discrete_dims: Vec<usize> },
    // Binary optimization (feature selection)
    Binary { dim: usize },
    // Permutation problems (TSP, scheduling)
    Permutation { size: usize },
    // Graph problems (routing, network design)
    Graph { num_nodes: usize, adjacency: Vec<Vec<(usize, f64)>>, heuristic: Option<Vec<Vec<f64>>> },
}
```
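For example, constructing a continuous space and a binary mask (the import path is assumed from the usage example later in this section, and the bounds are illustrative):

```rust
use aprender::automl::SearchSpace; // path assumed, as in the usage example below

// 3-D continuous box: learning rate, L2 penalty, dropout (illustrative bounds)
let space = SearchSpace::Continuous {
    dim: 3,
    lower: vec![1e-5, 0.0, 0.0],
    upper: vec![1e-1, 1.0, 0.9],
};

// Binary mask over 20 candidate features
let mask = SearchSpace::Binary { dim: 20 };
```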
## Budget Control

Three termination strategies:
```rust
pub enum Budget {
    // Precise evaluation counting (recommended for benchmarks)
    Evaluations(usize),
    // Generation/iteration based
    Iterations(usize),
    // Early stopping with convergence detection
    Convergence {
        patience: usize,        // iterations without improvement
        min_delta: f64,         // minimum improvement threshold
        max_evaluations: usize, // safety bound
    },
}
```
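A sketch of how a driver loop might consult a budget. The `done` helper is hypothetical, not part of the API; `stagnant` counts iterations without a `min_delta`-sized improvement, which is why `min_delta` itself is consumed by the caller rather than matched here:

```rust
/// Hypothetical termination check against the Budget variants above.
fn done(b: &Budget, evals: usize, iters: usize, stagnant: usize) -> bool {
    match b {
        Budget::Evaluations(max) => evals >= *max,
        Budget::Iterations(max) => iters >= *max,
        Budget::Convergence { patience, max_evaluations, .. } => {
            stagnant >= *patience || evals >= *max_evaluations
        }
    }
}
```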
## Active Learning (Muda Elimination)

Traditional batch generation (a "Push System") produces many redundant samples. Active Learning implements a "Pull System", generating samples only while uncertainty remains high (Settles, 2009); muda is the lean-manufacturing term for this kind of waste.
```text
┌─────────────────────────────────────────────────────────────┐
│ Push System (Wasteful)        Pull System (Lean)             │
│ ┌─────────────────────┐       ┌─────────────────────┐        │
│ │ Generate 100K       │       │ Generate batch      │        │
│ │ samples blindly     │       │ while uncertain     │        │
│ │          ↓          │       │          ↓          │        │
│ │ 90% redundant       │       │ Evaluate & update   │        │
│ │ (low info gain)     │       │          ↓          │        │
│ │          ↓          │       │ Check uncertainty   │        │
│ │ Wasted compute      │       │          ↓          │        │
│ └─────────────────────┘       │ Stop when confident │        │
│                               └─────────────────────┘        │
└─────────────────────────────────────────────────────────────┘
```
### Uncertainty Estimation

Uncertainty is estimated with the coefficient of variation (CV = σ/μ) of recent scores:
- Low CV: Consistent scores → high confidence → stop
- High CV: Variable scores → low confidence → continue
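The rule translates directly into code. A minimal sketch (population variance, and assuming the scores' mean is away from zero; Aprender's estimator may differ, e.g. by windowing recent scores):

```rust
/// Coefficient of variation of `scores`: σ/|μ|.
/// The caller stops once this drops below the uncertainty threshold.
fn coefficient_of_variation(scores: &[f64]) -> f64 {
    let n = scores.len() as f64;
    let mean = scores.iter().sum::<f64>() / n;
    let var = scores.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n;
    var.sqrt() / mean.abs()
}
```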
### Usage

```rust
use aprender::automl::{ActiveLearningSearch, DESearch, SearchStrategy};

let base = DESearch::new(10_000).with_jade();
let mut search = ActiveLearningSearch::new(base)
    .with_uncertainty_threshold(0.1) // Stop when CV < 0.1
    .with_min_samples(20);           // Need at least 20 samples

// Pull system loop
while !search.should_stop() {
    let trials = search.suggest(&space, 10);
    if trials.is_empty() { break; }
    let results = evaluate(&trials);
    search.update(&results); // Updates uncertainty estimate
}
// Stops early when confidence saturates
```
## When to Use Metaheuristics

### Good Use Cases
- Hyperparameter Optimization: Learning rate, regularization, architecture choices
- Black-box Functions: Simulations, expensive experiments
- Multimodal Landscapes: Many local optima
- Mixed Search Spaces: Continuous + categorical variables
### When to Prefer Other Methods
- Convex Problems: Use convex optimizers (faster convergence)
- Differentiable Objectives: Gradient methods are more efficient
- Very Low Budget: Random search may be comparable
- High Dimensions (>100): Consider Bayesian optimization
## Benchmark Functions
Standard test functions for algorithm comparison:
| Function | Formula | Characteristics |
|---|---|---|
| Sphere | f(x) = Σxᵢ² | Unimodal, separable |
| Rosenbrock | f(x) = Σ[100(xᵢ₊₁-xᵢ²)² + (1-xᵢ)²] | Unimodal, narrow valley |
| Rastrigin | f(x) = 10n + Σ[xᵢ²-10cos(2πxᵢ)] | Highly multimodal |
| Ackley | f(x) = -20exp(-0.2√(Σxᵢ²/n)) - exp(Σcos(2πxᵢ)/n) + 20 + e | Multimodal, nearly flat |
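For reference, here are the Sphere and Rastrigin rows as Rust functions (sketches; Rosenbrock and Ackley follow the same pattern from their formulas above):

```rust
use std::f64::consts::PI;

/// Sphere: unimodal and separable, global minimum f(0) = 0.
fn sphere(x: &[f64]) -> f64 {
    x.iter().map(|xi| xi * xi).sum()
}

/// Rastrigin: highly multimodal, global minimum f(0) = 0.
fn rastrigin(x: &[f64]) -> f64 {
    10.0 * x.len() as f64
        + x.iter().map(|xi| xi * xi - 10.0 * (2.0 * PI * xi).cos()).sum::<f64>()
}
```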
## References

- Storn, R. & Price, K. (1997). "Differential Evolution - A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces." Journal of Global Optimization, 11(4), 341-359.
- Zhang, J. & Sanderson, A.C. (2009). "JADE: Adaptive Differential Evolution with Optional External Archive." IEEE Transactions on Evolutionary Computation, 13(5), 945-958.
- Tanabe, R. & Fukunaga, A. (2013). "Success-History Based Parameter Adaptation for Differential Evolution." IEEE Congress on Evolutionary Computation, 71-78.
- Kennedy, J. & Eberhart, R. (1995). "Particle Swarm Optimization." IEEE International Conference on Neural Networks, 1942-1948.
- Hansen, N. (2016). "The CMA Evolution Strategy: A Tutorial." arXiv:1604.00772.
- Settles, B. (2009). "Active Learning Literature Survey." University of Wisconsin-Madison Computer Sciences Technical Report 1648.