Chaos Engineering for ML Systems

Chaos engineering tests the resilience of ML systems by intentionally injecting failures and verifying that models degrade gracefully under adverse conditions.

Why Chaos for ML?

ML systems have unique failure modes:

| Failure | Traditional system | ML system |
|---|---|---|
| Network partition | Timeout, retry | Stale model, wrong predictions |
| CPU spike | Slow response | Inference latency spike |
| Memory pressure | OOM crash | Model unload, cold start |
| Data corruption | Parse error | Silent wrong predictions |

Chaos Principles

1. Build Hypothesis

"The model should maintain >95% accuracy when inference latency exceeds 100ms."

2. Vary Real-World Events

  • Network delays
  • Resource exhaustion
  • Model version mismatches
  • Input data anomalies

3. Run in Production (Carefully)

Run experiments in production or in production-like environments, always with safeguards in place.

4. Minimize Blast Radius

Start small, expand gradually.

ML-Specific Chaos Experiments

Model Degradation

// Inject noise into model weights
fn chaos_weight_noise(model: &mut Model, std: f32) {
    for param in model.parameters_mut() {
        let noise = random_normal(param.shape(), 0.0, std);
        param.add_(&noise);
    }
}

Test: Does accuracy degrade gracefully or catastrophically?
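
A minimal sketch of the experiment loop, assuming the Model type implements Clone and that a Dataset type and an evaluate_accuracy helper exist in your evaluation harness (both are hypothetical here):

// Sweep the noise level and record accuracy at each step.
fn weight_noise_sweep(baseline: &Model, eval_set: &Dataset) -> Vec<(f32, f32)> {
    let mut results = Vec::new();
    for noise_std in [0.0, 0.01, 0.05, 0.1, 0.5] {
        // Never perturb the serving copy; work on a clone.
        let mut model = baseline.clone();
        chaos_weight_noise(&mut model, noise_std);
        results.push((noise_std, evaluate_accuracy(&model, eval_set)));
    }
    // Graceful: accuracy falls smoothly with the noise level. Catastrophic: a sharp cliff.
    results
}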

Input Perturbation

// Add bounded random noise to inputs (uniform in [-epsilon, epsilon])
fn chaos_input_noise(input: &mut Tensor, epsilon: f32) {
    let noise = random_uniform(input.shape(), -epsilon, epsilon);
    input.add_(&noise);
}

Latency Injection

fn chaos_latency(base_latency: Duration) -> Duration {
    // 10% chance of a 10x latency spike
    let multiplier = if rand::random::<f64>() < 0.1 {
        10.0
    } else {
        1.0
    };
    base_latency.mul_f64(multiplier)
}

Feature Dropout

// Simulate missing features
fn chaos_feature_dropout(features: &mut Tensor, drop_rate: f32) {
    let mask = random_bernoulli(features.shape(), 1.0 - drop_rate);
    features.mul_(&mask);
}

Chaos Scenarios

1. Model Loading Failure

Experiment: Block model download
Expected: Fall back to cached model or default behavior
Metric: Error rate during failover
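
A minimal sketch of the expected failover path; download_model, load_cached_model, and ModelError are hypothetical helpers standing in for the real loading code:

// Prefer the freshly downloaded model, but fall back to the last known-good
// cached copy when the download is blocked by the chaos experiment.
fn load_model_with_fallback(url: &str, cache_dir: &str) -> Result<Model, ModelError> {
    match download_model(url) {
        Ok(model) => Ok(model),
        Err(err) => {
            eprintln!("model download failed ({err}), falling back to cache");
            load_cached_model(cache_dir)
        }
    }
}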

2. Stale Model

Experiment: Serve outdated model version
Expected: Accuracy within acceptable bounds
Metric: Prediction drift from current model

3. Inference Timeout

Experiment: Add 5s delay to inference
Expected: Return cached/default prediction
Metric: User experience degradation
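
One way to exercise this behaviour, sketched with std threads and a channel deadline; model.predict matches the usage later in this section, while cached_prediction, Prediction, and the Send bounds on Model and Tensor are assumptions:

use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

// Run inference on a worker thread and fall back to a cached prediction
// if it does not answer within the deadline.
fn predict_with_deadline(model: Arc<Model>, input: Tensor, deadline: Duration) -> Prediction {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(model.predict(&input));
    });
    match rx.recv_timeout(deadline) {
        Ok(prediction) => prediction,
        // Timed out: serve the cached/default prediction instead of blocking the caller.
        Err(_) => cached_prediction(),
    }
}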

4. OOM During Inference

Experiment: Exhaust memory mid-batch
Expected: Graceful degradation, not crash
Metric: Recovery time

5. Data Pipeline Failure

Experiment: Corrupt feature pipeline output
Expected: Detect anomaly, reject inputs
Metric: False positive/negative rate
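
A minimal validation gate for detecting corrupted pipeline output, assuming features arrive as a plain f32 slice and that the valid bounds come from training statistics (the bounds here are illustrative):

// Reject feature vectors containing NaN/inf values or values outside
// the range observed during training.
fn validate_features(features: &[f32], min: f32, max: f32) -> Result<(), String> {
    for (i, &x) in features.iter().enumerate() {
        if !x.is_finite() {
            return Err(format!("feature {i} is not finite: {x}"));
        }
        if x < min || x > max {
            return Err(format!("feature {i} out of range: {x}"));
        }
    }
    Ok(())
}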

Implementation

Fault Injection Points

Input → [Chaos: Corruption] → Preprocessing → [Chaos: Delay] → Model → [Chaos: Noise] → Output

Chaos Flags

pub struct ChaosConfig {
    pub enabled: bool,
    pub latency_injection: Option<Duration>,
    pub error_rate: f32,
    pub weight_noise_std: f32,
    pub feature_drop_rate: f32,
}
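
A sketch of how these flags could gate the injection points from the diagram above, reusing chaos_feature_dropout from earlier; the thread::sleep delay and the call site are illustrative:

use std::thread;

// Apply whichever chaos behaviours the config enables before inference runs.
fn maybe_inject_chaos(config: &ChaosConfig, features: &mut Tensor) {
    if !config.enabled {
        return;
    }
    if let Some(delay) = config.latency_injection {
        thread::sleep(delay);                                       // Chaos: Delay
    }
    if config.feature_drop_rate > 0.0 {
        chaos_feature_dropout(features, config.feature_drop_rate);  // Chaos: Corruption
    }
    // error_rate and weight_noise_std would gate similar hooks.
}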

Controlled Rollout

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn should_inject_chaos(user_id: &str, experiment: &str) -> bool {
    // Consistent hashing: the same user/experiment pair always gets the same decision
    let mut hasher = DefaultHasher::new();
    format!("{}:{}", user_id, experiment).hash(&mut hasher);
    hasher.finish() % 100 < 5  // 5% of traffic
}

Monitoring During Chaos

| Metric | Normal | During Chaos | Action |
|---|---|---|---|
| Accuracy | 95% | >90% | Continue |
| Accuracy | 95% | <80% | Halt |
| Latency p99 | 100ms | <500ms | Continue |
| Error rate | 0.1% | <1% | Continue |

Automatic Halt

fn chaos_watchdog(metrics: &Metrics) -> bool {
    if metrics.error_rate > 0.05 {
        eprintln!("Halting chaos: error rate too high");
        return false;  // Stop chaos
    }
    true  // Continue
}

Game Days

Scheduled chaos exercises:

  1. Announce the game day
  2. Define success criteria
  3. Execute chaos scenarios
  4. Observe system behavior
  5. Retrospect and improve

Chaos Libraries

Rust

use renacer::chaos::{inject_latency, corrupt_tensor};

#[chaos_experiment]
fn test_model_resilience() {
    inject_latency(Duration::from_millis(100));
    let result = model.predict(&input);
    assert!(result.confidence > 0.5);
}

Integration

[features]
chaos-basic = []
chaos-network = ["chaos-basic"]
chaos-byzantine = ["chaos-basic"]
chaos-full = ["chaos-network", "chaos-byzantine"]
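
With features laid out this way, chaos code can be compiled out of release builds entirely; a sketch of the gating (the function name is illustrative):

// Compiled in only when the chaos-network feature is enabled.
#[cfg(feature = "chaos-network")]
fn inject_network_chaos(latency: std::time::Duration) {
    std::thread::sleep(latency);
}

// No-op stub when the feature is off, so call sites need no cfg of their own.
#[cfg(not(feature = "chaos-network"))]
fn inject_network_chaos(_latency: std::time::Duration) {}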

Best Practices

  1. Start in staging, not production
  2. Small blast radius initially
  3. Monitor everything during experiments
  4. Automatic halt on critical metrics
  5. Document findings and fixes
  6. Regular game days (quarterly)
