# Model-Level Inference Tracing (Phase 13)
Model-level tracing provides deep visibility into transformer inference behavior, complementing brick-level profiling. While BrickProfiler tracks computational performance (timing, throughput), ModelTracer tracks semantic behavior—what the model is computing and why.
## Overview
Five complementary tracing systems can be enabled independently:
| Trace Type | Purpose | Overhead | Output |
|---|---|---|---|
| LayerActivationTrace | Detect NaN/explosion/vanishing | ~2% | Statistics per layer |
| AttentionWeightTrace | Debug context/repetition | ~5% | Sparse attention matrix |
| LogitEvolutionTrace | Understand token selection | ~3% | Per-layer logit ranks |
| QuantizationErrorTrace | Measure quantization impact | ~10% | MSE vs FP32 reference |
| KvCacheStateTrace | Debug context window | ~1% | Cache utilization stats |
## Quick Start

```rust
use trueno::brick::{ModelTracer, ModelTracerConfig, LayerActivationTrace, TensorStats};

// Create tracer with lightweight config (activations + KV cache)
let config = ModelTracerConfig::lightweight();
let mut tracer = ModelTracer::new(config);

// During inference forward pass
tracer.begin_forward(position);

// After each layer
let mut layer_trace = LayerActivationTrace::new(layer_idx);
layer_trace.input_stats = TensorStats::from_slice(&input_tensor);
layer_trace.output_stats = TensorStats::from_slice(&output_tensor);
tracer.record_layer_activation(layer_trace);

// End forward and check for anomalies
if let Some(anomaly) = tracer.end_forward() {
    log::warn!("Anomaly detected: {}", anomaly);
}

// Get summary
println!("{}", tracer.summary());
```
## TensorStats (MLT-01)

Computes tensor statistics in a single pass without storing the tensor:

```rust
use trueno::brick::TensorStats;

let data = vec![1.0, 2.0, 3.0, 4.0, 5.0];
let stats = TensorStats::from_slice(&data);

println!("count: {}", stats.count);         // 5
println!("min: {}", stats.min);             // 1.0
println!("max: {}", stats.max);             // 5.0
println!("mean: {}", stats.mean);           // 3.0
println!("std: {}", stats.std);             // 1.58
println!("l2_norm: {}", stats.l2_norm);     // 7.42
println!("nan_count: {}", stats.nan_count); // 0
println!("inf_count: {}", stats.inf_count); // 0
```
### Anomaly Detection

```rust
// Detect NaN
let nan_data = vec![1.0, f32::NAN, 3.0];
let stats = TensorStats::from_slice(&nan_data);
assert!(stats.has_anomaly());
assert!(stats.anomaly_description().unwrap().contains("NaN"));

// Detect explosion (values > 1e6)
let explode_data = vec![1.0, 1e7, 3.0];
let stats = TensorStats::from_slice(&explode_data);
assert!(stats.has_anomaly());

// Detect vanishing (std < 1e-6)
let vanish_data = vec![1.0, 1.0, 1.0];
let stats = TensorStats::from_slice(&vanish_data);
assert!(stats.is_vanishing());
```
## LayerActivationTrace

Track activations through transformer layers:

```rust
use trueno::brick::{LayerActivationTrace, ModelActivationTrace, TensorStats};

let mut model_trace = ModelActivationTrace::with_capacity(32);

for layer_idx in 0..32 {
    let mut layer = LayerActivationTrace::new(layer_idx);

    // Record stats at each stage
    layer.input_stats = TensorStats::from_slice(&input);
    layer.post_norm_stats = TensorStats::from_slice(&after_norm);
    layer.post_attn_stats = TensorStats::from_slice(&after_attn);
    layer.post_ffn_stats = TensorStats::from_slice(&after_ffn);
    layer.output_stats = TensorStats::from_slice(&output);

    // Track residual connection health
    layer.residual_ratio = compute_residual_ratio(&output, &after_attn);

    model_trace.add_layer(layer);
}

model_trace.finalize();

if model_trace.has_anomaly {
    println!("Anomaly: {}", model_trace.anomaly_desc.unwrap());
}
```
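The `compute_residual_ratio` call above stands in for whatever metric you use to gauge how much of a layer's output is just the carried-over residual stream; it is not part of the tracing API. A minimal sketch, assuming the ratio is defined as the L2 norm of the pre-FFN residual divided by the L2 norm of the layer output (values near 1.0 mean the block's own transformation contributed almost nothing, matching the residual-dominance rule below):

```rust
/// Hypothetical helper, not part of trueno: ratio of the residual (skip)
/// stream's L2 norm to the layer output's L2 norm.
fn compute_residual_ratio(output: &[f32], residual: &[f32]) -> f32 {
    let l2 = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let out_norm = l2(output);
    if out_norm == 0.0 {
        return 0.0;
    }
    l2(residual) / out_norm
}
```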
### Anomaly Rules

- **NaN detected**: `nan_count > 0`
- **Explosion**: `max.abs() > 1e6` or `std > 1e4`
- **Vanishing**: `std < 1e-6` (after warmup layers)
- **Residual dominance**: `residual_ratio > 0.99` (skip connection bypass)
## AttentionWeightTrace (MLT-02)

Debug attention patterns without storing full matrices:

```rust
use trueno::brick::{AttentionWeightTrace, AttentionTraceConfig};

// Create trace from attention weights
let weights = vec![0.4, 0.1, 0.05, 0.05, 0.2, 0.1, 0.1];
let trace = AttentionWeightTrace::from_weights(
    0,        // layer_idx
    0,        // head_idx
    6,        // query_pos (current token)
    &weights,
    5,        // top_k
);

println!("Top-5 positions: {:?}", trace.top_k_positions);
println!("Top-5 weights: {:?}", trace.top_k_weights);
println!("Tail mass: {}", trace.tail_mass);
println!("Entropy: {}", trace.entropy);

// Diagnostic patterns
if trace.is_attention_sink(0.3) {
    println!("Warning: Attention sink on BOS token");
}
if trace.has_recency_bias(5, 0.7) {
    println!("Warning: Strong recency bias (repetition risk)");
}
```
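The weights in this example sum to 1.0. Assuming `tail_mass` is the attention mass left outside the top-k entries, the five largest weights here are 0.4, 0.2, 0.1, 0.1, 0.1, leaving a tail mass of 0.05 + 0.05 = 0.1. And since 0.4 of the mass sits on position 0, a sink threshold of 0.3 (if interpreted as the minimum weight on the first position) would be expected to trigger the warning above.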
### Configure Selective Tracing

```rust
let config = AttentionTraceConfig {
    top_k: 10,
    layers: Some(vec![0, 15, 31]), // Only trace specific layers
    heads: Some(vec![0, 1]),       // Only trace specific heads
    weight_threshold: 0.01,
};

if config.should_trace_layer(15) && config.should_trace_head(0) {
    // Record trace
}
```
## LogitEvolutionTrace (MLT-03)

Track how token probabilities evolve through layers:

```rust
use trueno::brick::{LogitEvolutionTrace, TokenLogitEvolution};

let mut trace = LogitEvolutionTrace::new(100, 0.7, 0.9);

// Track specific tokens
let token = trace.track_token(42, "hello".to_string());
token.record_layer(0.5, 500); // logit, rank at layer 0
token.record_layer(1.0, 200); // layer 1
token.record_layer(3.0, 10);  // layer 2
token.record_layer(5.0, 1);   // layer 3

// Find where the token's fate was decided
if let Some(layer) = token.decisive_layer() {
    println!("Token 'hello' rank changed most at layer {}", layer);
}

// Compute rank directly
let logits = vec![1.0, 5.0, 3.0, 2.0, 4.0];
let rank = LogitEvolutionTrace::compute_rank(&logits, 1); // token 1 has highest logit
assert_eq!(rank, 0);
```
## QuantizationErrorTrace (MLT-04)

Measure quantization impact:

```rust
use trueno::brick::{QuantizationErrorTrace, QuantType, BrickId};

let reference = vec![1.0, 2.0, 3.0, 4.0];
let quantized = vec![1.02, 1.98, 3.05, 3.95];

let trace = QuantizationErrorTrace::compute(
    BrickId::QkvProjection,
    5,
    &quantized,
    &reference,
    QuantType::Q4_K,
);

println!("MSE: {:.6}", trace.mse);
println!("Cosine similarity: {:.6}", trace.cosine_similarity);
println!("SNR: {:.1} dB", trace.snr_db);

// Thresholds (from llama.cpp Q4_K validation)
if trace.is_acceptable() {
    println!("Quantization acceptable (cosine > 0.995)");
} else if trace.is_warning() {
    println!("Quantization warning (0.99 < cosine < 0.995)");
} else {
    println!("Quantization CRITICAL (cosine < 0.99)");
}
```
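With these vectors and the standard element-wise definitions, the distortion is tiny: MSE = (0.02² + 0.02² + 0.05² + 0.05²) / 4 ≈ 0.00145 and cosine similarity ≈ 0.9999, comfortably above the 0.995 acceptance threshold.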
### Quantization Types

```rust
use trueno::brick::QuantType;

// Get bits per element
assert_eq!(QuantType::F32.bits_per_element(), 32.0);
assert_eq!(QuantType::Q4_K.bits_per_element(), 4.5);

// Compression ratios
println!("Q4_K compression: {:.1}x", QuantType::Q4_K.compression_ratio());
// Output: Q4_K compression: 7.1x
```
## KvCacheStateTrace (MLT-05)

Monitor KV cache behavior:

```rust
use trueno::brick::{KvCacheStateTrace, KvCacheSessionTrace};

let mut session = KvCacheSessionTrace::default();

for step in 0..100 {
    let mut trace = KvCacheStateTrace::new(step, 2048);
    trace.valid_positions = step + 1;
    trace.cache_size_bytes = (step + 1) * 4096;
    trace.cache_hit_rate = 0.95;
    trace.evictions_this_step = if step > 90 { 1 } else { 0 };
    session.add_step(trace);
}

println!("Total evictions: {}", session.total_evictions);
println!("Peak memory: {} KB", session.peak_memory_bytes / 1024);
println!("Avg hit rate: {:.1}%", session.avg_hit_rate * 100.0);

// Detect thrashing
if session.has_thrashing(50, 0.5) {
    println!("WARNING: Context thrashing detected");
}
```
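Assuming `total_evictions` and `peak_memory_bytes` simply aggregate the per-step fields, this loop would report 9 evictions (one per step for steps 91 through 99) and a peak of 100 × 4096 bytes = 400 KB.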
## Unified ModelTracer

Combine all trace types:

```rust
use trueno::brick::{ModelTracer, ModelTracerConfig};

// Full tracing (debugging)
let config = ModelTracerConfig::full();

// Lightweight (production)
let config = ModelTracerConfig::lightweight();

// Disabled (zero overhead)
let config = ModelTracerConfig::default();
assert!(!config.is_enabled());

// Custom
let config = ModelTracerConfig {
    trace_activations: true,
    trace_attention: false,   // Skip attention tracing
    trace_logits: true,
    trace_quant_error: false, // Too expensive for production
    trace_kv_cache: true,
    ..Default::default()
};

let mut tracer = ModelTracer::new(config);

// During inference
tracer.begin_forward(position);
// ... record traces ...
if let Some(anomaly) = tracer.end_forward() {
    log::warn!("Anomaly: {}", anomaly);
}

// Summary
println!("{}", tracer.summary());
```
## Running the Example

```bash
cargo run --example model_tracing
```
## Performance Considerations

- **Zero-cost when disabled**: `ModelTracerConfig::default()` produces no overhead
- **Lightweight for production**: Use `ModelTracerConfig::lightweight()` (~3% overhead)
- **TensorStats is single-pass**: Uses Welford's algorithm, no extra allocations (see the sketch after this list)
- **AttentionWeightTrace is sparse**: Only stores the top-k weights, not the full attention matrix
- **Enable selectively**: Use `AttentionTraceConfig` to trace only specific layers/heads
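The single-pass claim refers to Welford's online algorithm for mean and variance. As a minimal illustration of that algorithm (not the crate's actual implementation, and with field names of our own choosing), a running accumulator looks like this:

```rust
/// Illustrative single-pass (Welford) accumulator for mean and variance.
struct RunningStats {
    count: usize,
    mean: f64,
    m2: f64, // sum of squared deviations from the current mean
}

impl RunningStats {
    fn new() -> Self {
        Self { count: 0, mean: 0.0, m2: 0.0 }
    }

    fn push(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        let delta2 = x - self.mean;
        self.m2 += delta * delta2;
    }

    /// Sample (Bessel-corrected) variance, matching the 1.58 std shown earlier.
    fn variance(&self) -> f64 {
        if self.count > 1 { self.m2 / (self.count - 1) as f64 } else { 0.0 }
    }

    fn std(&self) -> f64 {
        self.variance().sqrt()
    }
}

fn main() {
    let mut stats = RunningStats::new();
    for x in [1.0, 2.0, 3.0, 4.0, 5.0] {
        stats.push(x);
    }
    assert!((stats.mean - 3.0).abs() < 1e-12);
    assert!((stats.std() - 1.5811).abs() < 1e-3);
}
```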
## Falsification Tests
The implementation includes comprehensive tests (F250-F275):
- F250: TensorStats correctness
- F251: NaN detection (100% recall)
- F252: Explosion detection
- F253: Attention top-k sorting
- F258: Cosine similarity range
- F263: Overhead bounds
- F272: Bit-exactness
Run with:
```bash
cargo test test_f250 --lib
cargo test test_f251 --lib
# ... etc
```
## See Also
- BrickProfiler - Computational performance profiling
- ML Tuner - Learned kernel selection
- Spec E.11 - Full specification