# Chaos Engineering for ML Systems

Chaos engineering tests the resilience of ML systems by intentionally injecting failures and verifying that models degrade gracefully under adverse conditions.
## Why Chaos for ML?

ML systems have unique failure modes:
| Failure | Traditional system | ML system |
|---|---|---|
| Network partition | Timeout, retry | Stale model, wrong predictions |
| CPU spike | Slow response | Inference latency spike |
| Memory pressure | OOM crash | Model unload, cold start |
| Data corruption | Parse error | Silent wrong predictions |
## Chaos Principles

### 1. Build a Hypothesis
"The model should maintain >95% accuracy when inference latency exceeds 100ms."
### 2. Vary Real-World Events

Inject the kinds of events the system actually encounters, for example (modeled as an enum in the sketch below):

- Network delays
- Resource exhaustion
- Model version mismatches
- Input data anomalies
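A minimal sketch of that enum, so experiments can be parameterized over event types (names are illustrative):

```rust
/// Real-world events worth simulating; extend as new failure modes are discovered.
#[allow(dead_code)]
enum ChaosEvent {
    NetworkDelay { added_ms: u64 },
    ResourceExhaustion { memory_fraction: f32 },
    ModelVersionMismatch { served: String, expected: String },
    InputAnomaly { description: String },
}
```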
### 3. Run in Production (Carefully)

Start in production-like environments, and run in production only with safeguards such as automatic halts in place.
### 4. Minimize Blast Radius

Start with a small slice of traffic and expand gradually as confidence grows.
## ML-Specific Chaos Experiments

### Model Degradation
```rust
// Inject noise into model weights
fn chaos_weight_noise(model: &mut Model, std: f32) {
    for param in model.parameters_mut() {
        let noise = random_normal(param.shape(), 0.0, std);
        param.add_(&noise);
    }
}
```
Test: Does accuracy degrade gracefully or catastrophically?
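One way to answer that is to sweep the noise level and plot accuracy against it; a graceful system loses accuracy smoothly rather than collapsing past some threshold. A sketch, assuming the `Model` pseudocode above plus hypothetical `Dataset` and `evaluate_accuracy` helpers:

```rust
// Sweep increasing weight noise and record accuracy at each level.
fn degradation_curve(base_model: &Model, eval_set: &Dataset) -> Vec<(f32, f32)> {
    let mut curve = Vec::new();
    for std in [0.0, 0.01, 0.05, 0.1, 0.5] {
        let mut model = base_model.clone(); // perturb a copy, never the live model
        chaos_weight_noise(&mut model, std);
        curve.push((std, evaluate_accuracy(&model, eval_set)));
    }
    curve
}
```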
### Input Perturbation

```rust
// Add bounded random noise to inputs (a cheap stand-in for adversarial perturbations)
fn chaos_input_noise(input: &mut Tensor, epsilon: f32) {
    let noise = random_uniform(input.shape(), -epsilon, epsilon);
    input.add_(&noise);
}
```
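A simple way to use this in a test is to measure how often a prediction changes under perturbation. The helper below is a sketch that reuses the hypothetical `Model`/`Tensor` types from the snippets above:

```rust
// Fraction of inputs whose prediction flips when perturbed.
fn prediction_flip_rate(model: &Model, inputs: &[Tensor], epsilon: f32) -> f32 {
    let mut flips = 0;
    for input in inputs {
        let clean = model.predict(input);
        let mut noisy = input.clone();
        chaos_input_noise(&mut noisy, epsilon);
        if model.predict(&noisy) != clean {
            flips += 1;
        }
    }
    flips as f32 / inputs.len() as f32
}
```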
### Latency Injection

```rust
use std::time::Duration;

// Occasionally multiply the base latency to simulate tail-latency spikes
// (uses the `rand` crate for randomness).
fn chaos_latency(base_latency: Duration) -> Duration {
    // 10% chance of a 10x latency spike
    let multiplier: f32 = if rand::random::<f32>() < 0.1 { 10.0 } else { 1.0 };
    base_latency.mul_f32(multiplier) // `Duration * f32` is not defined; use `mul_f32`
}
```
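In a serving path this is usually applied as a sleep in front of the real call; a sketch where the `infer` closure stands in for the actual model invocation:

```rust
use std::thread::sleep;
use std::time::Duration;

// Wrap an inference call with injected latency.
fn predict_with_chaos<F: FnOnce() -> f32>(infer: F, base_latency: Duration) -> f32 {
    sleep(chaos_latency(base_latency)); // chaos_latency defined above
    infer()
}
```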
### Feature Dropout

```rust
// Simulate missing features
fn chaos_feature_dropout(features: &mut Tensor, drop_rate: f32) {
    let mask = random_bernoulli(features.shape(), 1.0 - drop_rate);
    features.mul_(&mask);
}
```
## Chaos Scenarios

### 1. Model Loading Failure

- Experiment: Block model download
- Expected: Fall back to cached model or default behavior
- Metric: Error rate during failover
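A typical failover path for this scenario: try to fetch the current model, and fall back to the last-known-good cached copy if the download is blocked. The function and path names below are illustrative, not from a specific library:

```rust
// Prefer a freshly downloaded model; otherwise use the cached last-known-good copy.
fn load_model_with_fallback() -> Result<Model, LoadError> {
    match download_model("s3://models/current") {
        Ok(model) => Ok(model),
        Err(err) => {
            log::warn!("model download failed ({err}); using cached model");
            load_cached_model("/var/cache/model/last_good")
        }
    }
}
```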
### 2. Stale Model

- Experiment: Serve outdated model version
- Expected: Accuracy within acceptable bounds
- Metric: Prediction drift from current model
### 3. Inference Timeout

- Experiment: Add 5s delay to inference
- Expected: Return cached/default prediction
- Metric: User experience degradation
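One way to implement the expected behavior is to run inference against a deadline and return a cached or default prediction on timeout. A sketch using a worker thread and a channel; the 200 ms deadline and `fallback` value are illustrative:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run inference with a deadline; fall back to a cached/default prediction on timeout.
fn predict_with_deadline(infer: impl FnOnce() -> f32 + Send + 'static, fallback: f32) -> f32 {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(infer()); // `infer` stands in for the real model call
    });
    rx.recv_timeout(Duration::from_millis(200)).unwrap_or(fallback)
}
```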
### 4. OOM During Inference

- Experiment: Exhaust memory mid-batch
- Expected: Graceful degradation, not a crash
- Metric: Recovery time
### 5. Data Pipeline Failure

- Experiment: Corrupt feature pipeline output
- Expected: Detect anomaly, reject inputs
- Metric: False positive/negative rate
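Detection can be as simple as validating feature vectors before they reach the model; a sketch with illustrative bounds (real bounds should come from training-time statistics):

```rust
// Reject feature vectors that look corrupted: non-finite values or values far
// outside the expected range.
fn validate_features(features: &[f32]) -> Result<(), String> {
    for (i, &x) in features.iter().enumerate() {
        if !x.is_finite() {
            return Err(format!("feature {i} is not finite"));
        }
        if x.abs() > 1e6 {
            return Err(format!("feature {i} out of expected range: {x}"));
        }
    }
    Ok(())
}
```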
## Implementation

### Fault Injection Points
```text
Input → [Chaos: Corruption] → Preprocessing → [Chaos: Delay] → Model → [Chaos: Noise] → Output
```
### Chaos Flags

```rust
use std::time::Duration;

pub struct ChaosConfig {
    pub enabled: bool,                       // master switch for all fault injection
    pub latency_injection: Option<Duration>, // extra delay added before inference
    pub error_rate: f32,                     // fraction of requests given an injected error
    pub weight_noise_std: f32,               // std-dev of Gaussian noise added to weights
    pub feature_drop_rate: f32,              // fraction of features randomly zeroed
}
```
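A sketch of how these flags might be consulted in the serving path. The `Tensor` type, `chaos_feature_dropout` helper, and `rand` usage follow the pseudocode above; `weight_noise_std` would typically be applied once at model-load time rather than per request:

```rust
use std::thread::sleep;

// Illustrative error type for injected failures.
enum ChaosError {
    InjectedError,
}

// Apply configured faults before running inference.
fn apply_chaos(config: &ChaosConfig, features: &mut Tensor) -> Result<(), ChaosError> {
    if !config.enabled {
        return Ok(());
    }
    if let Some(delay) = config.latency_injection {
        sleep(delay); // latency injection
    }
    if config.feature_drop_rate > 0.0 {
        chaos_feature_dropout(features, config.feature_drop_rate); // feature dropout
    }
    if rand::random::<f32>() < config.error_rate {
        return Err(ChaosError::InjectedError); // injected request failure
    }
    Ok(())
}
```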
### Controlled Rollout

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Consistent hashing: the same user/experiment pair always gets the same decision.
fn should_inject_chaos(user_id: &str, experiment: &str) -> bool {
    let mut hasher = DefaultHasher::new();
    format!("{}:{}", user_id, experiment).hash(&mut hasher);
    hasher.finish() % 100 < 5 // enroll ~5% of traffic
}
```
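Hashing the user ID together with the experiment name keeps each user consistently inside or outside a given experiment across requests, so behavior under chaos can be compared against a stable control group, and different experiments sample different 5% slices of traffic.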
## Monitoring During Chaos
| Metric | Normal | During Chaos | Action |
|---|---|---|---|
| Accuracy | 95% | >90% | Continue |
| Accuracy | 95% | <80% | Halt |
| Latency p99 | 100ms | <500ms | Continue |
| Error rate | 0.1% | <1% | Continue |
### Automatic Halt

```rust
// Returns false when chaos injection should be halted.
fn chaos_watchdog(metrics: &Metrics) -> bool {
    if metrics.error_rate > 0.05 {
        log::warn!("Halting chaos: error rate too high");
        return false; // stop injecting faults
    }
    true // continue the experiment
}
```
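The watchdog is only useful if something polls it and flips the chaos flag off. A sketch of that loop, where `fetch_metrics` is a hypothetical metrics source:

```rust
use std::thread::sleep;
use std::time::Duration;

// Poll metrics periodically and disable all fault injection once the watchdog trips.
fn run_watchdog(config: &mut ChaosConfig) {
    while config.enabled {
        let metrics = fetch_metrics();
        if !chaos_watchdog(&metrics) {
            config.enabled = false; // halt chaos immediately
        }
        sleep(Duration::from_secs(10));
    }
}
```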
## Game Days
Scheduled chaos exercises:
- Announce the game day
- Define success criteria
- Execute chaos scenarios
- Observe system behavior
- Retrospect and improve
## Chaos Libraries

### Rust
```rust
use renacer::chaos::{inject_latency, corrupt_tensor};

#[chaos_experiment]
fn test_model_resilience() {
    inject_latency(Duration::from_millis(100));
    let result = model.predict(&input);
    assert!(result.confidence > 0.5);
}
```
### Integration

```toml
[features]
chaos-basic = []
chaos-network = ["chaos-basic"]
chaos-byzantine = ["chaos-basic"]
chaos-full = ["chaos-network", "chaos-byzantine"]
```
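Feature flags like these let chaos hooks be compiled out of production binaries entirely. A sketch of how a hook might be gated (the function name and delay are illustrative):

```rust
// Present only in builds with `--features chaos-network`.
#[cfg(feature = "chaos-network")]
fn maybe_inject_network_delay() {
    std::thread::sleep(std::time::Duration::from_millis(250));
}

// No-op in builds without the chaos feature.
#[cfg(not(feature = "chaos-network"))]
fn maybe_inject_network_delay() {}
```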
## Best Practices
- Start in staging, not production
- Small blast radius initially
- Monitor everything during experiments
- Automatic halt on critical metrics
- Document findings and fixes
- Regular game days (quarterly)
## References
- Basiri, A., et al. (2016). "Chaos Engineering." IEEE Software.
- Principles of Chaos Engineering: https://principlesofchaos.org
- Renacer (Rust chaos library): https://crates.io/crates/renacer