Chaos Engineering for ML Systems

Chaos engineering tests the resilience of ML systems by intentionally injecting failures and verifying that models degrade gracefully under adverse conditions.

Why Chaos for ML?

ML systems have unique failure modes:

| Failure | Traditional system | ML system |
|---|---|---|
| Network partition | Timeout, retry | Stale model, wrong predictions |
| CPU spike | Slow response | Inference latency spike |
| Memory pressure | OOM crash | Model unload, cold start |
| Data corruption | Parse error | Silent wrong predictions |

Chaos Principles

1. Build Hypothesis

"The model should maintain >95% accuracy when inference latency exceeds 100ms."

2. Vary Real-World Events

  • Network delays
  • Resource exhaustion
  • Model version mismatches
  • Input data anomalies

3. Run in Production (Carefully)

Run experiments in production or in production-like environments, always with safeguards in place.

4. Minimize Blast Radius

Start small, expand gradually.

ML-Specific Chaos Experiments

Model Degradation

// Inject noise into model weights
fn chaos_weight_noise(model: &mut Model, std: f32) {
    for param in model.parameters_mut() {
        let noise = random_normal(param.shape(), 0.0, std);
        param.add_(&noise);
    }
}

Test: Does accuracy degrade gracefully or catastrophically?
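
A minimal sketch of the experiment loop, assuming the Model type implements Clone and that a Dataset type and an evaluate_accuracy helper exist in your evaluation harness (both are hypothetical here):

// Sweep the noise level and record accuracy at each step.
fn weight_noise_sweep(baseline: &Model, eval_set: &Dataset) -> Vec<(f32, f32)> {
    let mut results = Vec::new();
    for noise_std in [0.0, 0.01, 0.05, 0.1, 0.5] {
        // Never perturb the serving copy; work on a clone.
        let mut model = baseline.clone();
        chaos_weight_noise(&mut model, noise_std);
        results.push((noise_std, evaluate_accuracy(&model, eval_set)));
    }
    // Graceful: accuracy falls smoothly with the noise level. Catastrophic: a sharp cliff.
    results
}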

Input Perturbation

// Add bounded random noise to inputs (uniform in [-epsilon, epsilon])
fn chaos_input_noise(input: &mut Tensor, epsilon: f32) {
    let noise = random_uniform(input.shape(), -epsilon, epsilon);
    input.add_(&noise);
}

Latency Injection

fn chaos_latency(base_latency: Duration) -> Duration {
    // 10% chance of a 10x latency spike
    let multiplier = if rand::random::<f64>() < 0.1 {
        10.0
    } else {
        1.0
    };
    base_latency.mul_f64(multiplier)
}

Feature Dropout

// Simulate missing features
fn chaos_feature_dropout(features: &mut Tensor, drop_rate: f32) {
    let mask = random_bernoulli(features.shape(), 1.0 - drop_rate);
    features.mul_(&mask);
}

Chaos Scenarios

1. Model Loading Failure

Experiment: Block model download
Expected: Fall back to cached model or default behavior
Metric: Error rate during failover
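
A minimal sketch of the expected failover path; download_model, load_cached_model, and ModelError are hypothetical helpers standing in for the real loading code:

// Prefer the freshly downloaded model, but fall back to the last known-good
// cached copy when the download is blocked by the chaos experiment.
fn load_model_with_fallback(url: &str, cache_dir: &str) -> Result<Model, ModelError> {
    match download_model(url) {
        Ok(model) => Ok(model),
        Err(err) => {
            eprintln!("model download failed ({err}), falling back to cache");
            load_cached_model(cache_dir)
        }
    }
}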

2. Stale Model

Experiment: Serve outdated model version
Expected: Accuracy within acceptable bounds
Metric: Prediction drift from current model

3. Inference Timeout

Experiment: Add 5s delay to inference
Expected: Return cached/default prediction
Metric: User experience degradation
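
One way to exercise this behaviour, sketched with std threads and a channel deadline; model.predict matches the usage later in this section, while cached_prediction, Prediction, and the Send bounds on Model and Tensor are assumptions:

use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

// Run inference on a worker thread and fall back to a cached prediction
// if it does not answer within the deadline.
fn predict_with_deadline(model: Arc<Model>, input: Tensor, deadline: Duration) -> Prediction {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(model.predict(&input));
    });
    match rx.recv_timeout(deadline) {
        Ok(prediction) => prediction,
        // Timed out: serve the cached/default prediction instead of blocking the caller.
        Err(_) => cached_prediction(),
    }
}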

4. OOM During Inference

Experiment: Exhaust memory mid-batch
Expected: Graceful degradation, not crash
Metric: Recovery time

5. Data Pipeline Failure

Experiment: Corrupt feature pipeline output
Expected: Detect anomaly, reject inputs
Metric: False positive/negative rate
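
A minimal validation gate for detecting corrupted pipeline output, assuming features arrive as a plain f32 slice and that the valid bounds come from training statistics (the bounds here are illustrative):

// Reject feature vectors containing NaN/inf values or values outside
// the range observed during training.
fn validate_features(features: &[f32], min: f32, max: f32) -> Result<(), String> {
    for (i, &x) in features.iter().enumerate() {
        if !x.is_finite() {
            return Err(format!("feature {i} is not finite: {x}"));
        }
        if x < min || x > max {
            return Err(format!("feature {i} out of range: {x}"));
        }
    }
    Ok(())
}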

Implementation

Fault Injection Points

Input → [Chaos: Corruption] → Preprocessing → [Chaos: Delay] → Model → [Chaos: Noise] → Output

Chaos Flags

pub struct ChaosConfig {
    pub enabled: bool,
    pub latency_injection: Option<Duration>,
    pub error_rate: f32,
    pub weight_noise_std: f32,
    pub feature_drop_rate: f32,
}
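
A sketch of how these flags could gate the injection points from the diagram above, reusing chaos_feature_dropout from earlier; the thread::sleep delay and the call site are illustrative:

use std::thread;

// Apply whichever chaos behaviours the config enables before inference runs.
fn maybe_inject_chaos(config: &ChaosConfig, features: &mut Tensor) {
    if !config.enabled {
        return;
    }
    if let Some(delay) = config.latency_injection {
        thread::sleep(delay);                                       // Chaos: Delay
    }
    if config.feature_drop_rate > 0.0 {
        chaos_feature_dropout(features, config.feature_drop_rate);  // Chaos: Corruption
    }
    // error_rate and weight_noise_std would gate similar hooks.
}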

Controlled Rollout

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn should_inject_chaos(user_id: &str, experiment: &str) -> bool {
    // Consistent hashing: the same user/experiment pair always gets the same decision
    let mut hasher = DefaultHasher::new();
    format!("{}:{}", user_id, experiment).hash(&mut hasher);
    hasher.finish() % 100 < 5  // 5% of traffic
}

Monitoring During Chaos

| Metric | Normal | During Chaos | Action |
|---|---|---|---|
| Accuracy | 95% | >90% | Continue |
| Accuracy | 95% | <80% | Halt |
| Latency p99 | 100ms | <500ms | Continue |
| Error rate | 0.1% | <1% | Continue |

Automatic Halt

fn chaos_watchdog(metrics: &Metrics) -> bool {
    if metrics.error_rate > 0.05 {
        eprintln!("Halting chaos: error rate too high");
        return false;  // Stop chaos
    }
    true  // Continue
}

Game Days

Scheduled chaos exercises:

  1. Announce the game day
  2. Define success criteria
  3. Execute chaos scenarios
  4. Observe system behavior
  5. Retrospect and improve

Chaos Libraries

Rust

use renacer::chaos::{inject_latency, corrupt_tensor};

#[chaos_experiment]
fn test_model_resilience() {
    inject_latency(Duration::from_millis(100));
    let result = model.predict(&input);
    assert!(result.confidence > 0.5);
}

Integration

[features]
chaos-basic = []
chaos-network = ["chaos-basic"]
chaos-byzantine = ["chaos-basic"]
chaos-full = ["chaos-network", "chaos-byzantine"]
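
With features laid out this way, chaos code can be compiled out of release builds entirely; a sketch of the gating (the function name is illustrative):

// Compiled in only when the chaos-network feature is enabled.
#[cfg(feature = "chaos-network")]
fn inject_network_chaos(latency: std::time::Duration) {
    std::thread::sleep(latency);
}

// No-op stub when the feature is off, so call sites need no cfg of their own.
#[cfg(not(feature = "chaos-network"))]
fn inject_network_chaos(_latency: std::time::Duration) {}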

Best Practices

  1. Start in staging, not production
  2. Small blast radius initially
  3. Monitor everything during experiments
  4. Automatic halt on critical metrics
  5. Document findings and fixes
  6. Regular game days (quarterly)
