Case Study: Pipeline Verification System
This case study demonstrates aprender's pipeline verification system for ML model debugging. It implements the Toyota Way's Jidoka principle: built-in quality with an automatic stop on the first defect.
The Problem
When porting ML models between frameworks (PyTorch to Rust, ONNX to native, etc.), subtle numerical differences can cascade through the pipeline:
| Stage | Issue | Symptom |
|---|---|---|
| Preprocessing | Normalization sign flip | Complete output inversion |
| Encoder | Precision loss | Gradual drift in deeper layers |
| Attention | Softmax overflow | NaN propagation |
| Output | Quantization error | Wrong predictions |
Without per-stage checks, the only signal is a single "wrong output" at the end of a 10-stage pipeline, which gives no hint about which stage introduced the error.
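To make the "Softmax overflow" row concrete, here is a minimal, standard-library-only sketch of how a naive softmax turns large logits into NaN, next to the usual max-subtraction fix. The function names are illustrative and are not part of aprender.

```rust
/// Naive softmax: exponentiates raw logits, so large inputs overflow to infinity
/// and the division becomes inf / inf = NaN.
fn softmax_naive(logits: &[f32]) -> Vec<f32> {
    let exps: Vec<f32> = logits.iter().map(|&x| x.exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Stable softmax: subtracts the maximum logit first, so every exponent is <= 0
/// and `exp` can no longer overflow.
fn softmax_stable(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

For logits such as `[1000.0, 1001.0, 1002.0]`, the naive version returns only NaN values while the stable version returns finite probabilities that sum to 1. Once a NaN appears, every later stage inherits it, which is why checking each stage separately helps.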
The Solution: Stage-by-Stage Ground Truth Verification
The verify module provides systematic comparison at each pipeline stage:
use aprender::verify::{Pipeline, GroundTruth, Tolerance};
let pipeline = Pipeline::builder("whisper-tiny")
.stage("mel")
.ground_truth_stats(-0.215, 0.448) // Expected mean, std
.tolerance(Tolerance::percent(5.0)) // 5% tolerance
.build_stage()
.stage("encoder")
.ground_truth_stats(0.0, 0.8)
.tolerance(Tolerance::percent(10.0))
.build_stage()
.build()
.expect("Pipeline definition error");
// Verify outputs against ground truth
let report = pipeline.verify(|stage_name| {
match stage_name {
"mel" => Some(GroundTruth::from_stats(-0.210, 0.450)),
"encoder" => Some(GroundTruth::from_stats(0.01, 0.78)),
_ => None,
}
});
assert!(report.all_passed());
Complete Example
Run: cargo run --example pipeline_verification
//! Pipeline Verification Example
//!
//! Demonstrates the verify module for ML pipeline debugging with:
//! - Stage-by-stage ground truth comparison
//! - Multiple tolerance types (percent, stats, cosine, KL divergence)
//! - Jidoka-style stop-on-failure behavior
//! - Detailed diagnostic output for failures
//!
//! Run with: `cargo run --example pipeline_verification`
use aprender::verify::{Delta, GroundTruth, Pipeline, StageStatus, Tolerance, VerifyReport};
fn main() {
println!("=== Pipeline Verification System ===\n");
println!("Toyota Way: Jidoka - Built-in quality with automatic stop on defect\n");
demo_basic_pipeline();
demo_failure_detection();
demo_continue_on_failure();
demo_stats_tolerance();
demo_ground_truth_from_data();
demo_cosine_similarity();
demo_kl_divergence();
demo_whisper_pipeline();
print_summary();
}
/// Part 1: Basic Pipeline with Percent Tolerance
fn demo_basic_pipeline() {
println!("--- Part 1: Basic Pipeline (Percent Tolerance) ---\n");
let pipeline = Pipeline::builder("audio-encoder")
.stage("mel_spectrogram")
.ground_truth_stats(-0.215, 0.448)
.tolerance(Tolerance::percent(5.0))
.description("Mel spectrogram extraction")
.build_stage()
.stage("encoder_layer_1")
.ground_truth_stats(0.0, 1.0)
.tolerance(Tolerance::percent(10.0))
.description("First encoder transformer layer")
.build_stage()
.stage("encoder_layer_2")
.ground_truth_stats(0.0, 1.0)
.tolerance(Tolerance::percent(10.0))
.description("Second encoder transformer layer")
.build_stage()
.build()
.expect("Failed to build pipeline");
println!("Pipeline: {}", pipeline.name());
println!("Stages: {}\n", pipeline.stages().len());
// Simulate outputs that pass verification
let report = pipeline.verify(|stage_name| match stage_name {
"mel_spectrogram" => Some(GroundTruth::from_stats(-0.210, 0.450)),
"encoder_layer_1" => Some(GroundTruth::from_stats(0.02, 0.98)),
"encoder_layer_2" => Some(GroundTruth::from_stats(-0.01, 1.02)),
_ => None,
});
print_report(&report);
}
/// Part 2: Detecting Sign Flip Errors
fn demo_failure_detection() {
println!("\n--- Part 2: Detecting Sign Flip Errors ---\n");
let pipeline = Pipeline::builder("audio-encoder")
.stage("mel_spectrogram")
.ground_truth_stats(-0.215, 0.448)
.tolerance(Tolerance::percent(5.0))
.build_stage()
.stage("encoder_layer_1")
.ground_truth_stats(0.0, 1.0)
.tolerance(Tolerance::percent(10.0))
.build_stage()
.stage("encoder_layer_2")
.ground_truth_stats(0.0, 1.0)
.tolerance(Tolerance::percent(10.0))
.build_stage()
.build()
.expect("Failed to build pipeline");
// Simulate a sign flip error in mel spectrogram
let report = pipeline.verify(|stage_name| match stage_name {
"mel_spectrogram" => Some(GroundTruth::from_stats(0.184, 0.448)), // SIGN FLIPPED!
"encoder_layer_1" | "encoder_layer_2" => Some(GroundTruth::from_stats(0.0, 1.0)),
_ => None,
});
print_report(&report);
// Show diagnosis for the failure
if let Some(failure) = report.first_failure() {
println!("\nDiagnosis for '{}' failure:", failure.name());
for diag in failure.diagnose() {
println!(" - {diag}");
}
}
}
/// Part 3: Continue-on-Failure Mode
fn demo_continue_on_failure() {
println!("\n--- Part 3: Continue-on-Failure Mode ---\n");
let pipeline = Pipeline::builder("full-analysis")
.stage("stage_a")
.ground_truth_stats(0.0, 1.0)
.tolerance(Tolerance::percent(5.0))
.build_stage()
.stage("stage_b")
.ground_truth_stats(0.0, 1.0)
.tolerance(Tolerance::percent(5.0))
.build_stage()
.stage("stage_c")
.ground_truth_stats(0.0, 1.0)
.tolerance(Tolerance::percent(5.0))
.build_stage()
.continue_on_failure() // Disable Jidoka for full analysis
.build()
.expect("Failed to build pipeline");
let report = pipeline.verify(|stage_name| match stage_name {
"stage_a" => Some(GroundTruth::from_stats(0.5, 1.0)), // FAIL
"stage_b" => Some(GroundTruth::from_stats(0.0, 0.98)), // PASS
"stage_c" => Some(GroundTruth::from_stats(0.3, 1.0)), // FAIL
_ => None,
});
println!("With continue_on_failure(), all stages are evaluated:");
print_report(&report);
}
/// Part 4: Stats-Based Tolerance
fn demo_stats_tolerance() {
println!("\n--- Part 4: Stats-Based Tolerance ---\n");
let pipeline = Pipeline::builder("precision-check")
.stage("high_precision")
.ground_truth_stats(0.0, 1.0)
.tolerance(Tolerance::stats(0.01, 0.02)) // Very tight
.build_stage()
.stage("normal_precision")
.ground_truth_stats(0.0, 1.0)
.tolerance(Tolerance::stats(0.1, 0.1)) // Normal tolerance
.build_stage()
.build()
.expect("Failed to build pipeline");
let report = pipeline.verify(|stage_name| match stage_name {
"high_precision" => Some(GroundTruth::from_stats(0.005, 1.01)),
"normal_precision" => Some(GroundTruth::from_stats(0.05, 0.95)),
_ => None,
});
print_report(&report);
}
/// Part 5: Ground Truth from Raw Data
fn demo_ground_truth_from_data() {
println!("\n--- Part 5: Ground Truth from Raw Data ---\n");
let reference_output = vec![0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0];
let gt = GroundTruth::from_slice(&reference_output);
println!("Ground truth computed from raw data:");
println!(" Mean: {:.4}", gt.mean());
println!(" Std: {:.4}", gt.std());
println!(" Min: {:.4}", gt.min());
println!(" Max: {:.4}", gt.max());
let our_output = vec![0.12, 0.19, 0.31, 0.38, 0.52, 0.58, 0.71, 0.79, 0.91, 0.98];
let our = GroundTruth::from_slice(&our_output);
let delta = Delta::compute(&our, &gt);
println!("\nDelta analysis:");
println!(" Mean delta: {:.4}", delta.mean_delta());
println!(" Std delta: {:.4}", delta.std_delta());
println!(" Percent: {:.2}%", delta.percent());
println!(" Sign flip: {}", delta.is_sign_flipped());
if let Some(cos) = delta.cosine() {
println!(" Cosine sim: {cos:.4}");
}
}
/// Part 6: Cosine Similarity Tolerance
fn demo_cosine_similarity() {
println!("\n--- Part 6: Cosine Similarity Tolerance ---\n");
let vec_a = vec![1.0, 2.0, 3.0, 4.0, 5.0];
let vec_b = vec![1.1, 1.9, 3.1, 3.9, 5.1];
let vec_c = vec![-1.0, -2.0, -3.0, -4.0, -5.0];
println!("Cosine similarity comparisons:");
println!(
" vec_a vs vec_b (similar): {:.4}",
Delta::cosine_similarity(&vec_a, &vec_b)
);
println!(
" vec_a vs vec_c (opposite): {:.4}",
Delta::cosine_similarity(&vec_a, &vec_c)
);
println!(
" vec_a vs vec_a (identical): {:.4}",
Delta::cosine_similarity(&vec_a, &vec_a)
);
}
/// Part 7: KL Divergence for Probability Distributions
fn demo_kl_divergence() {
println!("\n--- Part 7: KL Divergence ---\n");
let p = vec![0.25, 0.25, 0.25, 0.25]; // Uniform
let q = vec![0.5, 0.25, 0.125, 0.125]; // Skewed
println!("KL divergence (distribution comparison):");
println!(" Uniform vs Uniform: {:.4}", Delta::kl_divergence(&p, &p));
println!(" Uniform vs Skewed: {:.4}", Delta::kl_divergence(&p, &q));
}
/// Part 8: Real-World Whisper Pipeline Example
fn demo_whisper_pipeline() {
println!("\n--- Part 8: Whisper Pipeline (Real-World) ---\n");
let pipeline = Pipeline::builder("whisper-tiny")
.stage("mel")
.ground_truth_stats(-0.215, 0.448)
.tolerance(Tolerance::percent(5.0))
.description("Log-mel spectrogram (80 mel bins)")
.build_stage()
.stage("encoder_out")
.ground_truth_stats(0.0, 0.8)
.tolerance(Tolerance::percent(10.0))
.description("Encoder final output")
.build_stage()
.stage("decoder_logits")
.ground_truth_stats(0.0, 15.0)
.tolerance(Tolerance::percent(15.0))
.description("Decoder output logits")
.build_stage()
.stage("probs")
.ground_truth_stats(0.0001, 0.01)
.tolerance(Tolerance::percent(20.0))
.description("Softmax probabilities")
.build_stage()
.build()
.expect("Failed to build Whisper pipeline");
let report = pipeline.verify(|stage| match stage {
"mel" => Some(GroundTruth::from_stats(-0.220, 0.445)),
"encoder_out" => Some(GroundTruth::from_stats(0.01, 0.78)),
"decoder_logits" => Some(GroundTruth::from_stats(-0.5, 14.2)),
"probs" => Some(GroundTruth::from_stats(0.00012, 0.009)),
_ => None,
});
println!("Whisper-tiny pipeline verification:");
print_report(&report);
}
fn print_summary() {
println!("\n=== Summary ===\n");
println!("Pipeline verification enables:");
println!(" 1. Stage-by-stage ground truth comparison");
println!(" 2. Multiple tolerance types (percent, stats, cosine, KL)");
println!(" 3. Jidoka: Stop on first failure (or continue for full analysis)");
println!(" 4. Automatic diagnosis (sign flips, distribution shifts)");
println!(" 5. Visual reporting with pass/fail/skip status");
println!("\nUse cases:");
println!(" - ML model porting (PyTorch -> Rust)");
println!(" - Quantization validation");
println!(" - CI/CD regression testing");
println!(" - Audio/vision pipeline debugging");
println!("\n=== Done ===");
}
/// Print a verification report with colored output
fn print_report(report: &VerifyReport) {
println!("{}", report.summary());
println!();
for result in report.results() {
let status = result.status();
let icon = status.icon();
let color = status.color();
let reset = "\x1b[0m";
print!(" {color}{icon}{reset} {}", result.name());
if let Some(delta) = result.delta() {
print!(" (delta: {:.2}%)", delta.percent());
}
if status == StageStatus::Skipped {
print!(" [skipped due to prior failure]");
}
println!();
}
}
Key Features
1. Jidoka: Stop-on-First-Failure
By default, verification stops at the first failure (Toyota Way: stop the line when a defect is detected):
// Default: Jidoka enabled
let pipeline = Pipeline::builder("model")
.stage("a").ground_truth_stats(0.0, 1.0).tolerance(Tolerance::percent(5.0)).build_stage()
.stage("b").ground_truth_stats(0.0, 1.0).tolerance(Tolerance::percent(5.0)).build_stage()
.stage("c").ground_truth_stats(0.0, 1.0).tolerance(Tolerance::percent(5.0)).build_stage()
.build()?;
// If stage "a" fails, "b" and "c" are skipped
// This prevents cascading failures from obscuring the root cause
For full analysis of all stages:
let pipeline = Pipeline::builder("full-analysis")
.stage("a").build_stage()
.stage("b").build_stage()
.stage("c").build_stage()
.continue_on_failure() // Evaluate ALL stages regardless of failures
.build()?;
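The difference between the two modes can be sketched without the library. The snippet below is an illustration only: the `Status` enum and `run_stages` function are hypothetical names, and each stage is reduced to a boolean "within tolerance" result.

```rust
/// Hypothetical per-stage outcome, analogous to pass/fail/skip in a report.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Passed,
    Failed,
    Skipped,
}

/// Evaluate stages in order. `checks[i]` is true when stage `i` is within tolerance.
/// With `continue_on_failure == false`, every stage after the first failure is skipped.
fn run_stages(checks: &[bool], continue_on_failure: bool) -> Vec<Status> {
    let mut halted = false;
    checks
        .iter()
        .map(|&ok| {
            if halted {
                Status::Skipped
            } else if ok {
                Status::Passed
            } else {
                // Stop the line unless the caller asked for a full analysis.
                halted = !continue_on_failure;
                Status::Failed
            }
        })
        .collect()
}
```

With checks `[false, true, false]`, the default mode yields `Failed, Skipped, Skipped`, while continue-on-failure yields `Failed, Passed, Failed`. The first form points at a single root cause; the second is useful when several independent defects are suspected.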
2. Multiple Tolerance Types
// Simple percent tolerance
Tolerance::percent(5.0)
// Separate mean/std thresholds (for high-precision stages)
Tolerance::stats(0.01, 0.02) // mean delta <= 0.01, std delta <= 0.02
// Cosine similarity minimum (for embedding comparisons)
Tolerance::cosine(0.99) // Require cosine similarity >= 0.99
// KL divergence threshold (for probability distributions)
Tolerance::kl_divergence(0.1)
// Custom multi-criteria tolerance
Tolerance::custom()
.percent(10.0)
.mean_delta(0.1)
.cosine_min(0.95)
.build()
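How the library turns a tolerance into a pass/fail decision is not spelled out here, so the following is a sketch under assumptions. It models only the percent and stats variants, and it assumes the percent deviation is the absolute mean delta scaled by the reference standard deviation. That choice stays defined when the expected mean is 0.0, and it is consistent with the 89.1% figure in the diagnosis example on this page. The names `Stats`, `Tol`, `percent_delta`, and `within` are hypothetical.

```rust
/// Summary statistics for one stage output.
#[derive(Debug, Clone, Copy)]
struct Stats {
    mean: f64,
    std: f64,
}

/// Hypothetical tolerance kinds mirroring two of the constructors above.
enum Tol {
    Percent(f64),
    Stats { mean_max: f64, std_max: f64 },
}

/// Assumed definition: |mean delta| as a percentage of the reference std.
/// A reference std of zero makes the result infinite or NaN.
fn percent_delta(ours: Stats, reference: Stats) -> f64 {
    (ours.mean - reference.mean).abs() / reference.std * 100.0
}

/// Returns true when `ours` is within `tol` of `reference`.
fn within(tol: &Tol, ours: Stats, reference: Stats) -> bool {
    match *tol {
        Tol::Percent(max) => percent_delta(ours, reference) <= max,
        Tol::Stats { mean_max, std_max } => {
            (ours.mean - reference.mean).abs() <= mean_max
                && (ours.std - reference.std).abs() <= std_max
        }
    }
}
```

Under this definition, a mel stage with reference stats (-0.215, 0.448) and observed stats (-0.210, 0.450) deviates by about 1.1% and passes a 5% tolerance, while the sign-flipped mean 0.184 deviates by about 89.1% and fails.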
3. Ground Truth from Multiple Sources
// From known statistics (e.g., from reference implementation docs)
let gt = GroundTruth::from_stats(mean, std);
// From raw data (computed automatically)
let reference_output = vec![0.1, 0.2, 0.3, 0.4, 0.5];
let gt = GroundTruth::from_slice(&reference_output);
// Full statistics available
println!("Mean: {}, Std: {}, Min: {}, Max: {}",
gt.mean(), gt.std(), gt.min(), gt.max());
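For readers who want to reproduce such statistics by hand, for example to cross-check numbers exported from a reference implementation, here is a minimal standard-library sketch. It assumes the population standard deviation (dividing by n); whether the library divides by n or n - 1 is not stated here. `Summary` and `summarize` are hypothetical names.

```rust
/// Summary statistics of a slice, as a plain struct.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Summary {
    mean: f64,
    std: f64,
    min: f64,
    max: f64,
}

/// Computes mean, population std, min, and max. Returns None for empty input.
fn summarize(data: &[f64]) -> Option<Summary> {
    if data.is_empty() {
        return None;
    }
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    // Population variance: average squared distance from the mean.
    let var = data.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let min = data.iter().copied().fold(f64::INFINITY, f64::min);
    let max = data.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    Some(Summary { mean, std: var.sqrt(), min, max })
}
```

For `[0.1, 0.2, 0.3, 0.4, 0.5]` this gives a mean of 0.3, a variance of 0.02, a minimum of 0.1, and a maximum of 0.5.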
4. Delta Analysis
use aprender::verify::Delta;
let our = GroundTruth::from_slice(&our_output);
let reference = GroundTruth::from_slice(&ref_output);
let delta = Delta::compute(&our, &reference);
// Statistical deltas
println!("Mean delta: {:.4}", delta.mean_delta());
println!("Std delta: {:.4}", delta.std_delta());
println!("Percent: {:.2}%", delta.percent());
// Sign flip detection (common bug in normalization)
if delta.is_sign_flipped() {
println!("WARNING: Sign flip detected!");
}
// Vector similarity
if let Some(cos) = delta.cosine() {
println!("Cosine similarity: {:.4}", cos);
}
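One plausible sign-flip rule, shown here as an assumption and not as the library's actual logic, is that the two means have opposite signs and neither is negligibly small. The function name and the `eps` parameter are hypothetical.

```rust
/// Hypothetical sign-flip rule: both means are clearly non-zero (|mean| > eps)
/// and they have opposite signs.
fn is_sign_flipped(our_mean: f64, ref_mean: f64, eps: f64) -> bool {
    our_mean.abs() > eps
        && ref_mean.abs() > eps
        && our_mean.signum() != ref_mean.signum()
}
```

The `eps` guard matters for stages whose expected mean is close to 0.0, such as normalized encoder outputs, where the sign of the mean is mostly noise and should not trigger a warning.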
5. Distribution Comparison
// Cosine similarity for direction comparison
let cos = Delta::cosine_similarity(&vec_a, &vec_b);
// KL divergence for probability distributions
let kl = Delta::kl_divergence(&probs_a, &probs_b);
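The two measures follow their textbook definitions, sketched below with the standard library only. This is not the library's code: the element type, the logarithm base (natural log here), and the handling of edge cases are assumptions, and these versions return `Option` to make mismatched lengths and zero vectors explicit.

```rust
/// Cosine similarity: dot(a, b) / (|a| * |b|).
/// Returns None on length mismatch or when either vector has zero norm.
fn cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        None
    } else {
        Some(dot / (norm_a * norm_b))
    }
}

/// KL(P || Q) = sum of p_i * ln(p_i / q_i), in nats.
/// Terms with p_i = 0 contribute nothing; q_i = 0 with p_i > 0 gives infinity.
fn kl_divergence(p: &[f64], q: &[f64]) -> Option<f64> {
    if p.len() != q.len() {
        return None;
    }
    let mut total = 0.0;
    for (&pi, &qi) in p.iter().zip(q) {
        if pi > 0.0 {
            if qi <= 0.0 {
                return Some(f64::INFINITY);
            }
            total += pi * (pi / qi).ln();
        }
    }
    Some(total)
}
```

A vector compared with its own negation has a cosine similarity of -1, which makes cosine a sharp detector for sign flips in embeddings. KL divergence is asymmetric, so the argument order matters when comparing a reference distribution with an observed one.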
6. Automatic Diagnosis
When a stage fails, the system provides diagnostic hints:
if let Some(failure) = report.first_failure() {
println!("Failed stage: {}", failure.name());
for diagnosis in failure.diagnose() {
println!(" - {}", diagnosis);
}
}
Example output:
Diagnosis for 'mel_spectrogram' failure:
- Stage 'mel_spectrogram' failed with delta 89.1%
- Sign is FLIPPED (positive vs negative)
- Likely cause: Normalization formula error
- Check: Log base, subtraction order, sign convention
Real-World Use Case: Whisper Model Porting
let whisper = Pipeline::builder("whisper-tiny")
.stage("mel")
.ground_truth_stats(-0.215, 0.448)
.tolerance(Tolerance::percent(5.0))
.description("Log-mel spectrogram (80 mel bins)")
.build_stage()
.stage("encoder_out")
.ground_truth_stats(0.0, 0.8)
.tolerance(Tolerance::percent(10.0))
.description("Encoder final output")
.build_stage()
.stage("decoder_logits")
.ground_truth_stats(0.0, 15.0)
.tolerance(Tolerance::percent(15.0))
.description("Decoder output logits")
.build_stage()
.stage("probs")
.ground_truth_stats(0.0001, 0.01)
.tolerance(Tolerance::percent(20.0))
.description("Softmax probabilities")
.build_stage()
.build()?;
// Verify our implementation's stage outputs against the reference ground truth
let report = whisper.verify(|stage| {
get_stage_output_from_our_implementation(stage)
});
if !report.all_passed() {
eprintln!("Verification failed!");
eprintln!("{}", report.summary());
if let Some(first_fail) = report.first_failure() {
eprintln!("\nFirst failure at: {}", first_fail.name());
for diag in first_fail.diagnose() {
eprintln!(" {}", diag);
}
}
}
Pipeline Verification in CI/CD
#[test]
fn test_model_regression() {
let pipeline = load_verification_pipeline();
let report = pipeline.verify(|stage| {
run_inference_stage(stage)
});
assert!(
report.all_passed(),
"Model regression detected: {}",
report.summary()
);
}
API Reference
Pipeline Builder
| Method | Description |
|---|---|
| Pipeline::builder(name) | Create new pipeline |
| .stage(name) | Add a stage |
| .ground_truth_stats(mean, std) | Set expected statistics |
| .ground_truth(gt) | Set full ground truth |
| .tolerance(t) | Set tolerance threshold |
| .description(desc) | Add human-readable description |
| .build_stage() | Finish stage, return to pipeline |
| .continue_on_failure() | Disable Jidoka |
| .build() | Build the pipeline |
Tolerance Types
| Type | Use Case |
|---|---|
| Tolerance::percent(n) | General purpose, % deviation |
| Tolerance::stats(m, s) | Precision-critical stages |
| Tolerance::cosine(min) | Embedding/vector comparisons |
| Tolerance::kl_divergence(max) | Probability distributions |
| Tolerance::custom() | Multi-criteria validation |
Report Methods
| Method | Returns |
|---|---|
| report.all_passed() | bool |
| report.first_failure() | Option<&StageResult> |
| report.passed_count() | usize |
| report.failed_count() | usize |
| report.skipped_count() | usize |
| report.summary() | String (colored) |
| report.results() | &[StageResult] |
Toyota Way Principles Applied
- Jidoka (Built-in Quality): Stop-on-first-failure prevents cascading errors
- Genchi Genbutsu (Go and See): Stage-by-stage inspection reveals actual divergence points
- Kaizen (Continuous Improvement): CI/CD integration catches regressions early
- Visual Management: Colored output with pass/fail/skip icons