# Metaheuristics Theory
Metaheuristics are high-level problem-solving strategies for optimization problems where exact algorithms are impractical. Unlike gradient-based methods, they don't require derivatives and can escape local optima.
## Why Metaheuristics?

Traditional optimization methods each have limitations:
| Method | Limitation |
|---|---|
| Gradient Descent | Requires differentiable objectives |
| Newton's Method | Requires Hessian computation |
| Convex Optimization | Assumes convex landscape |
| Grid Search | Exponential scaling with dimensions |
Metaheuristics address these limitations:
- Derivative-free: Work with black-box objectives
- Global search: Escape local optima
- Versatile: Handle mixed continuous/discrete spaces
## Algorithm Categories

### Perturbative Metaheuristics

Modify complete solutions through perturbation operators:
```text
┌─────────────────────────────────────────────────┐
│                Population-Based                 │
│ ┌─────────────────┐   ┌─────────────────────┐   │
│ │ Differential    │   │ Particle Swarm      │   │
│ │ Evolution (DE)  │   │ Optimization (PSO)  │   │
│ │                 │   │                     │   │
│ │ v = a + F(b-c)  │   │ v = wv + c₁r₁(p-x)  │   │
│ │                 │   │     + c₂r₂(g-x)     │   │
│ └─────────────────┘   └─────────────────────┘   │
│                                                 │
│ ┌─────────────────┐   ┌─────────────────────┐   │
│ │ Genetic         │   │ CMA-ES              │   │
│ │ Algorithm (GA)  │   │                     │   │
│ │                 │   │ Covariance Matrix   │   │
│ │ Selection →     │   │ Adaptation          │   │
│ │ Crossover →     │   │                     │   │
│ │ Mutation        │   │ N(m, σ²C)           │   │
│ └─────────────────┘   └─────────────────────┘   │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│                 Single-Solution                 │
│ ┌─────────────────┐   ┌─────────────────────┐   │
│ │ Simulated       │   │ Hill Climbing       │   │
│ │ Annealing (SA)  │   │                     │   │
│ │                 │   │ Always accept       │   │
│ │ Accept worse    │   │ improvements        │   │
│ │ with P=e^(-Δ/T) │   │                     │   │
│ └─────────────────┘   └─────────────────────┘   │
└─────────────────────────────────────────────────┘
```
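The SA acceptance rule in the diagram is the Metropolis criterion. A minimal sketch using the `rand` crate (not Aprender's API; `t` is the current temperature, and the cooling schedule is omitted):

```rust
use rand::Rng;

/// Metropolis acceptance: always take improvements, and accept a
/// worsening move of size `delta` (Δ > 0) with probability e^(-Δ/t).
fn accept(delta: f64, t: f64, rng: &mut impl Rng) -> bool {
    delta <= 0.0 || rng.gen::<f64>() < (-delta / t).exp()
}
```

Hill climbing is the t → 0 limit of this rule: only improvements are ever accepted.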
### Constructive and Memory-Based Metaheuristics

Constructive methods build solutions incrementally, component by component; memory-based methods such as Tabu Search instead steer a local search using search history:
```text
┌─────────────────────────────────────────────────┐
│ Ant Colony Optimization (ACO)                   │
│                                                 │
│ τᵢⱼ(t+1) = (1-ρ)τᵢⱼ(t) + Δτᵢⱼ                   │
│                                                 │
│ Pheromone guides probabilistic construction     │
│ Best for: TSP, routing, scheduling              │
└─────────────────────────────────────────────────┘
```
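The pheromone update in the box is a one-line evaporation-plus-deposit rule per edge. A sketch over a dense pheromone matrix (illustrative only, not Aprender's internal representation):

```rust
/// One pheromone update: evaporate every τᵢⱼ by rate `rho`, then add the
/// deposits Δτᵢⱼ accumulated from the ants that traversed edge (i, j).
fn update_pheromone(tau: &mut [Vec<f64>], delta: &[Vec<f64>], rho: f64) {
    for (row, d_row) in tau.iter_mut().zip(delta) {
        for (t, d) in row.iter_mut().zip(d_row) {
            *t = (1.0 - rho) * *t + *d;
        }
    }
}
```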
```text
┌─────────────────────────────────────────────────┐
│ Tabu Search                                     │
│                                                 │
│ Memory-based local search                       │
│ Tabu list prevents cycling                      │
│ Aspiration criteria allow exceptions            │
└─────────────────────────────────────────────────┘
```
## Differential Evolution (DE)
DE is the primary algorithm in Aprender's metaheuristics module. It's particularly effective for continuous hyperparameter optimization.
### Algorithm

For each target vector xᵢ in the population:

1. Mutation: v = xₐ + F·(xᵦ - xᵧ) builds the mutant (donor) vector from the scaled difference of two random members
2. Crossover: u = binomial(xᵢ, v, CR) mixes target and donor into the trial vector
3. Selection: xᵢ' = u if f(u) ≤ f(xᵢ), i.e. greedy replacement
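To make the three steps concrete, here is a compact DE/rand/1/bin generation over a plain `Vec<Vec<f64>>` population. This is a self-contained sketch using the `rand` crate, not the module's actual implementation; it assumes a population of at least four vectors:

```rust
use rand::Rng;

/// One generation of DE/rand/1/bin. `f` is the black-box objective;
/// `pop` and `fit` hold the population and its fitness values.
fn de_step<F: Fn(&[f64]) -> f64>(
    pop: &mut Vec<Vec<f64>>, fit: &mut Vec<f64>,
    f: &F, scale: f64, cr: f64, rng: &mut impl Rng,
) {
    let (np, dim) = (pop.len(), pop[0].len());
    for i in 0..np {
        // Pick three distinct indices a, b, c, all different from i.
        let mut idx = [0usize; 3];
        for k in 0..3 {
            loop {
                let r = rng.gen_range(0..np);
                if r != i && !idx[..k].contains(&r) { idx[k] = r; break; }
            }
        }
        let (a, b, c) = (idx[0], idx[1], idx[2]);
        // Mutation + binomial crossover build the trial vector u.
        let jrand = rng.gen_range(0..dim); // guarantees >= 1 donor gene
        let u: Vec<f64> = (0..dim).map(|j| {
            if j == jrand || rng.gen::<f64>() < cr {
                pop[a][j] + scale * (pop[b][j] - pop[c][j]) // donor gene
            } else {
                pop[i][j] // inherited from the target
            }
        }).collect();
        // Greedy selection: keep the trial only if it is no worse.
        let fu = f(&u);
        if fu <= fit[i] { pop[i] = u; fit[i] = fu; }
    }
}
```

Because replacement is greedy, the best individual is never lost, so DE needs no separate elitism mechanism.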
### Mutation Strategies
| Strategy | Formula | Characteristics |
|---|---|---|
| DE/rand/1/bin | v = xₐ + F(xᵦ - xᵧ) | Good exploration |
| DE/best/1/bin | v = x_best + F(xₐ - xᵦ) | Fast convergence |
| DE/current-to-best/1/bin | v = xᵢ + F(x_best - xᵢ) + F(xₐ - xᵦ) | Balanced |
| DE/rand/2/bin | v = xₐ + F(xᵦ - xᵧ) + F(xδ - xε) | More exploration |
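The table rows map one-to-one onto code. A hypothetical `donor` helper (not part of Aprender's API; the random indices in `r` are assumed pre-sampled, mutually distinct, and different from both `i` and `best`) could dispatch on the strategy:

```rust
enum Strategy { Rand1, Best1, CurrentToBest1, Rand2 }

/// Build the donor vector v for target index `i` per the table above.
fn donor(pop: &[Vec<f64>], i: usize, best: usize, r: &[usize],
         f: f64, s: &Strategy) -> Vec<f64> {
    (0..pop[0].len()).map(|j| match s {
        Strategy::Rand1 => pop[r[0]][j] + f * (pop[r[1]][j] - pop[r[2]][j]),
        Strategy::Best1 => pop[best][j] + f * (pop[r[0]][j] - pop[r[1]][j]),
        Strategy::CurrentToBest1 => pop[i][j]
            + f * (pop[best][j] - pop[i][j])
            + f * (pop[r[0]][j] - pop[r[1]][j]),
        Strategy::Rand2 => pop[r[0]][j]
            + f * (pop[r[1]][j] - pop[r[2]][j])
            + f * (pop[r[3]][j] - pop[r[4]][j]),
    }).collect()
}
```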
### Adaptive Variants
JADE (Zhang & Sanderson, 2009):
- Adapts F and CR based on successful mutations
- External archive of inferior solutions
- μ_F updated via Lehmer mean
- μ_CR updated via weighted arithmetic mean
SHADE (Tanabe & Fukunaga, 2013):
- Success-history based parameter adaptation
- Circular memory buffer for F and CR
- More robust than JADE on multimodal functions
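A sketch of the JADE-style mean updates (after Zhang & Sanderson, 2009, simplified to a plain arithmetic mean for CR; `c` is the adaptation rate, typically 0.1; not Aprender's internal code):

```rust
/// Lehmer mean sum(x²)/sum(x): biases μ_F toward larger successful F.
fn lehmer_mean(xs: &[f64]) -> f64 {
    xs.iter().map(|x| x * x).sum::<f64>() / xs.iter().sum::<f64>()
}

/// Update the sampler means from this generation's successful F and CR
/// values (ok_f and ok_cr are collected in pairs, so equal length).
fn adapt(mu_f: &mut f64, mu_cr: &mut f64, ok_f: &[f64], ok_cr: &[f64], c: f64) {
    if !ok_f.is_empty() {
        *mu_f = (1.0 - c) * *mu_f + c * lehmer_mean(ok_f);
        *mu_cr = (1.0 - c) * *mu_cr
            + c * ok_cr.iter().sum::<f64>() / ok_cr.len() as f64;
    }
}
```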
## Search Space Abstraction

Aprender uses a unified `SearchSpace` enum:
```rust
pub enum SearchSpace {
    // Continuous optimization (HPO, function optimization)
    Continuous { dim: usize, lower: Vec<f64>, upper: Vec<f64> },
    // Mixed continuous/discrete (neural architecture search)
    Mixed { dim: usize, lower: Vec<f64>, upper: Vec<f64>, discrete_dims: Vec<usize> },
    // Binary optimization (feature selection)
    Binary { dim: usize },
    // Permutation problems (TSP, scheduling)
    Permutation { size: usize },
    // Graph problems (routing, network design)
    Graph { num_nodes: usize, adjacency: Vec<Vec<(usize, f64)>>, heuristic: Option<Vec<Vec<f64>>> },
}
```
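For example, constructing a continuous space and a binary mask (the import path is assumed from the usage example later in this section, and the bounds are illustrative):

```rust
use aprender::automl::SearchSpace; // path assumed, as in the usage example below

// 3-D continuous box: learning rate, L2 penalty, dropout (illustrative bounds)
let space = SearchSpace::Continuous {
    dim: 3,
    lower: vec![1e-5, 0.0, 0.0],
    upper: vec![1e-1, 1.0, 0.9],
};

// Binary mask over 20 candidate features
let mask = SearchSpace::Binary { dim: 20 };
```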
## Budget Control

Three termination strategies:
```rust
pub enum Budget {
    // Precise evaluation counting (recommended for benchmarks)
    Evaluations(usize),
    // Generation/iteration based
    Iterations(usize),
    // Early stopping with convergence detection
    Convergence {
        patience: usize,        // iterations without improvement
        min_delta: f64,         // minimum improvement threshold
        max_evaluations: usize, // safety bound
    },
}
```
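A sketch of how a driver loop might consult a budget. The `done` helper is hypothetical, not part of the API; `stagnant` counts iterations without a `min_delta`-sized improvement, which is why `min_delta` itself is consumed by the caller rather than matched here:

```rust
/// Hypothetical termination check against the Budget variants above.
fn done(b: &Budget, evals: usize, iters: usize, stagnant: usize) -> bool {
    match b {
        Budget::Evaluations(max) => evals >= *max,
        Budget::Iterations(max) => iters >= *max,
        Budget::Convergence { patience, max_evaluations, .. } => {
            stagnant >= *patience || evals >= *max_evaluations
        }
    }
}
```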
## Active Learning (Muda Elimination)

Traditional batch generation (a "Push System") produces many redundant samples. Active Learning implements a "Pull System", generating samples only while uncertainty remains high (Settles, 2009); muda is the lean-manufacturing term for this kind of waste.
```text
┌─────────────────────────────────────────────────────────────┐
│ Push System (Wasteful)        Pull System (Lean)             │
│ ┌─────────────────────┐       ┌─────────────────────┐        │
│ │ Generate 100K       │       │ Generate batch      │        │
│ │ samples blindly     │       │ while uncertain     │        │
│ │          ↓          │       │          ↓          │        │
│ │ 90% redundant       │       │ Evaluate & update   │        │
│ │ (low info gain)     │       │          ↓          │        │
│ │          ↓          │       │ Check uncertainty   │        │
│ │ Wasted compute      │       │          ↓          │        │
│ └─────────────────────┘       │ Stop when confident │        │
│                               └─────────────────────┘        │
└─────────────────────────────────────────────────────────────┘
```
### Uncertainty Estimation

Uncertainty is estimated with the coefficient of variation (CV = σ/μ) of recent scores:
- Low CV: Consistent scores → high confidence → stop
- High CV: Variable scores → low confidence → continue
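The rule translates directly into code. A minimal sketch (population variance, and assuming the scores' mean is away from zero; Aprender's estimator may differ, e.g. by windowing recent scores):

```rust
/// Coefficient of variation of `scores`: σ/|μ|.
/// The caller stops once this drops below the uncertainty threshold.
fn coefficient_of_variation(scores: &[f64]) -> f64 {
    let n = scores.len() as f64;
    let mean = scores.iter().sum::<f64>() / n;
    let var = scores.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n;
    var.sqrt() / mean.abs()
}
```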
### Usage

```rust
use aprender::automl::{ActiveLearningSearch, DESearch, SearchStrategy};

let base = DESearch::new(10_000).with_jade();
let mut search = ActiveLearningSearch::new(base)
    .with_uncertainty_threshold(0.1) // Stop when CV < 0.1
    .with_min_samples(20);           // Need at least 20 samples

// Pull system loop
while !search.should_stop() {
    let trials = search.suggest(&space, 10);
    if trials.is_empty() { break; }
    let results = evaluate(&trials);
    search.update(&results); // Updates uncertainty estimate
}
// Stops early when confidence saturates
```
## When to Use Metaheuristics

### Good Use Cases
- Hyperparameter Optimization: Learning rate, regularization, architecture choices
- Black-box Functions: Simulations, expensive experiments
- Multimodal Landscapes: Many local optima
- Mixed Search Spaces: Continuous + categorical variables
### When to Prefer Other Methods
- Convex Problems: Use convex optimizers (faster convergence)
- Differentiable Objectives: Gradient methods are more efficient
- Very Low Budget: Random search may be comparable
- High Dimensions (>100): Consider Bayesian optimization
## Benchmark Functions
Standard test functions for algorithm comparison:
| Function | Formula | Characteristics |
|---|---|---|
| Sphere | f(x) = Σxᵢ² | Unimodal, separable |
| Rosenbrock | f(x) = Σ[100(xᵢ₊₁-xᵢ²)² + (1-xᵢ)²] | Unimodal, narrow valley |
| Rastrigin | f(x) = 10n + Σ[xᵢ²-10cos(2πxᵢ)] | Highly multimodal |
| Ackley | f(x) = -20exp(-0.2√(Σxᵢ²/n)) - exp(Σcos(2πxᵢ)/n) + 20 + e | Multimodal, nearly flat |
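For reference, here are the Sphere and Rastrigin rows as Rust functions (sketches; Rosenbrock and Ackley follow the same pattern from their formulas above):

```rust
use std::f64::consts::PI;

/// Sphere: unimodal and separable, global minimum f(0) = 0.
fn sphere(x: &[f64]) -> f64 {
    x.iter().map(|xi| xi * xi).sum()
}

/// Rastrigin: highly multimodal, global minimum f(0) = 0.
fn rastrigin(x: &[f64]) -> f64 {
    10.0 * x.len() as f64
        + x.iter().map(|xi| xi * xi - 10.0 * (2.0 * PI * xi).cos()).sum::<f64>()
}
```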
## References

- Storn, R. & Price, K. (1997). "Differential Evolution - A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces." Journal of Global Optimization, 11(4), 341-359.
- Zhang, J. & Sanderson, A.C. (2009). "JADE: Adaptive Differential Evolution with Optional External Archive." IEEE Transactions on Evolutionary Computation, 13(5), 945-958.
- Tanabe, R. & Fukunaga, A. (2013). "Success-History Based Parameter Adaptation for Differential Evolution." IEEE Congress on Evolutionary Computation, 71-78.
- Kennedy, J. & Eberhart, R. (1995). "Particle Swarm Optimization." IEEE International Conference on Neural Networks, 1942-1948.
- Hansen, N. (2016). "The CMA Evolution Strategy: A Tutorial." arXiv:1604.00772.
- Settles, B. (2009). "Active Learning Literature Survey." University of Wisconsin-Madison Computer Sciences Technical Report 1648.