Model-Level Inference Tracing (Phase 13)

Model-level tracing provides deep visibility into transformer inference behavior, complementing brick-level profiling. While BrickProfiler tracks computational performance (timing, throughput), ModelTracer tracks semantic behavior—what the model is computing and why.

Overview

Five complementary tracing systems can be enabled independently:

Trace Type              Purpose                          Overhead  Output
LayerActivationTrace    Detect NaN/explosion/vanishing   ~2%       Statistics per layer
AttentionWeightTrace    Debug context/repetition         ~5%       Sparse attention matrix
LogitEvolutionTrace     Understand token selection       ~3%       Per-layer logit ranks
QuantizationErrorTrace  Measure quantization impact      ~10%      MSE vs FP32 reference
KvCacheStateTrace       Debug context window             ~1%       Cache utilization stats

Quick Start

use trueno::brick::{ModelTracer, ModelTracerConfig, LayerActivationTrace, TensorStats};

// Create tracer with lightweight config (activations + KV cache)
let config = ModelTracerConfig::lightweight();
let mut tracer = ModelTracer::new(config);

// During inference forward pass
tracer.begin_forward(position);

// After each layer
let mut layer_trace = LayerActivationTrace::new(layer_idx);
layer_trace.input_stats = TensorStats::from_slice(&input_tensor);
layer_trace.output_stats = TensorStats::from_slice(&output_tensor);
tracer.record_layer_activation(layer_trace);

// End forward and check for anomalies
if let Some(anomaly) = tracer.end_forward() {
    log::warn!("Anomaly detected: {}", anomaly);
}

// Get summary
println!("{}", tracer.summary());

TensorStats (MLT-01)

Computes tensor statistics in a single pass without storing the tensor:

use trueno::brick::TensorStats;

let data = vec![1.0, 2.0, 3.0, 4.0, 5.0];
let stats = TensorStats::from_slice(&data);

println!("count: {}", stats.count);       // 5
println!("min: {}", stats.min);           // 1.0
println!("max: {}", stats.max);           // 5.0
println!("mean: {}", stats.mean);         // 3.0
println!("std: {}", stats.std);           // 1.58
println!("l2_norm: {}", stats.l2_norm);   // 7.42
println!("nan_count: {}", stats.nan_count); // 0
println!("inf_count: {}", stats.inf_count); // 0

Anomaly Detection

// Detect NaN
let nan_data = vec![1.0, f32::NAN, 3.0];
let stats = TensorStats::from_slice(&nan_data);
assert!(stats.has_anomaly());
assert!(stats.anomaly_description().unwrap().contains("NaN"));

// Detect explosion (values > 1e6)
let explode_data = vec![1.0, 1e7, 3.0];
let stats = TensorStats::from_slice(&explode_data);
assert!(stats.has_anomaly());

// Detect vanishing (std < 1e-6)
let vanish_data = vec![1.0, 1.0, 1.0];
let stats = TensorStats::from_slice(&vanish_data);
assert!(stats.is_vanishing());

LayerActivationTrace

Track activations through transformer layers:

use trueno::brick::{LayerActivationTrace, ModelActivationTrace, TensorStats};

let mut model_trace = ModelActivationTrace::with_capacity(32);

for layer_idx in 0..32 {
    let mut layer = LayerActivationTrace::new(layer_idx);

    // Record stats at each stage
    layer.input_stats = TensorStats::from_slice(&input);
    layer.post_norm_stats = TensorStats::from_slice(&after_norm);
    layer.post_attn_stats = TensorStats::from_slice(&after_attn);
    layer.post_ffn_stats = TensorStats::from_slice(&after_ffn);
    layer.output_stats = TensorStats::from_slice(&output);

    // Track residual connection health
    layer.residual_ratio = compute_residual_ratio(&output, &after_attn);

    model_trace.add_layer(layer);
}

model_trace.finalize();
if model_trace.has_anomaly {
    println!("Anomaly: {}", model_trace.anomaly_desc.unwrap());
}
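
The compute_residual_ratio helper in the loop above is not part of the crate; it stands in for whatever measure of residual-stream dominance your model code uses. A minimal sketch of one plausible definition (the L2 norm of the residual input relative to the L2 norm of the layer output, clamped to 1.0):

// Hypothetical helper: fraction of the layer output explained by the
// residual (skip) path. Illustrative only; define it to match your model.
fn compute_residual_ratio(output: &[f32], residual_input: &[f32]) -> f32 {
    let l2 = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let out_norm = l2(output);
    if out_norm == 0.0 {
        return 0.0;
    }
    (l2(residual_input) / out_norm).min(1.0)
}

A ratio near 1.0 means the layer's attention and FFN contributions are negligible, which is what the residual-dominance rule below flags.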

Anomaly Rules

  • NaN detected: nan_count > 0
  • Explosion: max.abs() > 1e6 or std > 1e4
  • Vanishing: std < 1e-6 (after warmup layers)
  • Residual dominance: residual_ratio > 0.99 (skip connection bypass); see the sketch below
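
These rules can be checked directly against the TensorStats fields shown earlier. The sketch below mirrors the list above and is purely illustrative; the crate's own detection logic (including its warmup handling) may differ:

use trueno::brick::TensorStats;

// Illustrative anomaly check mirroring the rules above (not the crate's code).
fn describe_anomaly(stats: &TensorStats, residual_ratio: f32, layer_idx: usize) -> Option<String> {
    // Assumption: the first couple of layers count as warmup for the
    // vanishing check; the actual warmup count is not specified here.
    const WARMUP_LAYERS: usize = 2;
    if stats.nan_count > 0 {
        return Some(format!("NaN detected ({} values)", stats.nan_count));
    }
    if stats.max.abs() > 1e6 || stats.std > 1e4 {
        return Some("activation explosion".to_string());
    }
    if layer_idx >= WARMUP_LAYERS && stats.std < 1e-6 {
        return Some("vanishing activations".to_string());
    }
    if residual_ratio > 0.99 {
        return Some("residual dominance (output ~= skip connection)".to_string());
    }
    None
}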

AttentionWeightTrace (MLT-02)

Debug attention patterns without storing full matrices:

use trueno::brick::{AttentionWeightTrace, AttentionTraceConfig};

// Create trace from attention weights
let weights = vec![0.4, 0.1, 0.05, 0.05, 0.2, 0.1, 0.1];
let trace = AttentionWeightTrace::from_weights(
    0,       // layer_idx
    0,       // head_idx
    6,       // query_pos (current token)
    &weights,
    5,       // top_k
);

println!("Top-5 positions: {:?}", trace.top_k_positions);
println!("Top-5 weights: {:?}", trace.top_k_weights);
println!("Tail mass: {}", trace.tail_mass);
println!("Entropy: {}", trace.entropy);

// Diagnostic patterns
if trace.is_attention_sink(0.3) {
    println!("Warning: Attention sink on BOS token");
}
if trace.has_recency_bias(5, 0.7) {
    println!("Warning: Strong recency bias (repetition risk)");
}
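
For intuition, the summary fields can be derived from a single attention row as follows. This is an illustrative sketch of the idea (keep the top-k entries, then aggregate the rest), not the crate's implementation of AttentionWeightTrace::from_weights:

// Illustrative: summarize one attention row as its top-k entries plus
// tail mass and Shannon entropy, instead of storing the full row.
fn summarize_attention(weights: &[f32], top_k: usize) -> (Vec<(usize, f32)>, f32, f32) {
    let mut indexed: Vec<(usize, f32)> = weights.iter().copied().enumerate().collect();
    // Sort by weight, descending (assumes no NaN weights).
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let top: Vec<(usize, f32)> = indexed.into_iter().take(top_k).collect();
    // Tail mass: probability mass not covered by the top-k entries.
    let top_mass: f32 = top.iter().map(|(_, w)| *w).sum();
    let tail_mass = (1.0 - top_mass).max(0.0);
    // Shannon entropy of the full row: low entropy = sharply focused attention.
    let entropy = -weights
        .iter()
        .filter(|&&w| w > 0.0)
        .map(|&w| w * w.ln())
        .sum::<f32>();
    (top, tail_mass, entropy)
}

With the seven weights from the example above and top_k = 5, this keeps positions 0, 4, 1, 5, 6 and reports a tail mass of roughly 0.1; the crate's tie-breaking for equal weights may differ.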

Configure Selective Tracing

let config = AttentionTraceConfig {
    top_k: 10,
    layers: Some(vec![0, 15, 31]),  // Only trace specific layers
    heads: Some(vec![0, 1]),         // Only trace specific heads
    weight_threshold: 0.01,
};

if config.should_trace_layer(15) && config.should_trace_head(0) {
    // Record trace
}

LogitEvolutionTrace (MLT-03)

Track how token probabilities evolve through layers:

use trueno::brick::{LogitEvolutionTrace, TokenLogitEvolution};

let mut trace = LogitEvolutionTrace::new(100, 0.7, 0.9);

// Track specific tokens
let token = trace.track_token(42, "hello".to_string());
token.record_layer(0.5, 500);  // logit, rank at layer 0
token.record_layer(1.0, 200);  // layer 1
token.record_layer(3.0, 10);   // layer 2
token.record_layer(5.0, 1);    // layer 3

// Find where the token's fate was decided
if let Some(layer) = token.decisive_layer() {
    println!("Token 'hello' rank changed most at layer {}", layer);
}

// Compute rank directly
let logits = vec![1.0, 5.0, 3.0, 2.0, 4.0];
let rank = LogitEvolutionTrace::compute_rank(&logits, 1); // token 1 has highest logit
assert_eq!(rank, 0);
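
decisive_layer() identifies where the token's rank moved the most. A minimal sketch of one plausible definition (the layer with the largest single-step rank change); the crate's actual criterion may differ, for example by weighting relative rather than absolute changes:

// Illustrative: pick the layer whose update produced the largest rank jump.
fn decisive_layer(ranks_per_layer: &[usize]) -> Option<usize> {
    ranks_per_layer
        .windows(2)
        .enumerate()
        .max_by_key(|(_, pair)| (pair[0] as i64 - pair[1] as i64).abs())
        .map(|(i, _)| i + 1) // report the layer that produced the jump
}

For the ranks recorded above (500, 200, 10, 1), this definition points at layer 1, where the rank improved by 300.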

QuantizationErrorTrace (MLT-04)

Measure quantization impact:

use trueno::brick::{QuantizationErrorTrace, QuantType, BrickId};

let reference = vec![1.0, 2.0, 3.0, 4.0];
let quantized = vec![1.02, 1.98, 3.05, 3.95];

let trace = QuantizationErrorTrace::compute(
    BrickId::QkvProjection,
    5,
    &quantized,
    &reference,
    QuantType::Q4_K,
);

println!("MSE: {:.6}", trace.mse);
println!("Cosine similarity: {:.6}", trace.cosine_similarity);
println!("SNR: {:.1} dB", trace.snr_db);

// Thresholds (from llama.cpp Q4_K validation)
if trace.is_acceptable() {
    println!("Quantization acceptable (cosine > 0.995)");
} else if trace.is_warning() {
    println!("Quantization warning (0.99 < cosine < 0.995)");
} else {
    println!("Quantization CRITICAL (cosine < 0.99)");
}
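
All three metrics are standard: mean squared error against the FP32 reference, cosine similarity between the two tensors, and signal-to-noise ratio in decibels. A minimal sketch of how they can be computed from two equal-length slices (illustrative, not the crate's implementation):

// Illustrative computation of the quantization-error metrics reported above.
fn quant_error_metrics(quantized: &[f32], reference: &[f32]) -> (f32, f32, f32) {
    assert_eq!(quantized.len(), reference.len());
    let n = reference.len() as f32;
    let mse = quantized
        .iter()
        .zip(reference)
        .map(|(q, r)| (q - r) * (q - r))
        .sum::<f32>()
        / n;
    let dot: f32 = quantized.iter().zip(reference).map(|(q, r)| q * r).sum();
    let l2 = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let cosine = dot / (l2(quantized) * l2(reference));
    // SNR in dB: reference signal power over quantization-error power.
    let signal_power = reference.iter().map(|x| x * x).sum::<f32>() / n;
    let snr_db = 10.0 * (signal_power / mse).log10();
    (mse, cosine, snr_db)
}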

Quantization Types

use trueno::brick::QuantType;

// Get bits per element
assert_eq!(QuantType::F32.bits_per_element(), 32.0);
assert_eq!(QuantType::Q4_K.bits_per_element(), 4.5);

// Compression ratio relative to FP32 (32 bits / 4.5 bits ≈ 7.1)
println!("Q4_K compression: {:.1}x", QuantType::Q4_K.compression_ratio());
// Output: Q4_K compression: 7.1x

KvCacheStateTrace (MLT-05)

Monitor KV cache behavior:

use trueno::brick::{KvCacheStateTrace, KvCacheSessionTrace};

let mut session = KvCacheSessionTrace::default();

for step in 0..100 {
    let mut trace = KvCacheStateTrace::new(step, 2048);
    trace.valid_positions = step + 1;
    trace.cache_size_bytes = (step + 1) * 4096;
    trace.cache_hit_rate = 0.95;
    trace.evictions_this_step = if step > 90 { 1 } else { 0 };

    session.add_step(trace);
}

println!("Total evictions: {}", session.total_evictions);
println!("Peak memory: {} KB", session.peak_memory_bytes / 1024);
println!("Avg hit rate: {:.1}%", session.avg_hit_rate * 100.0);

// Detect thrashing
if session.has_thrashing(50, 0.5) {
    println!("WARNING: Context thrashing detected");
}

Unified ModelTracer

Combine all trace types:

use trueno::brick::{ModelTracer, ModelTracerConfig};

// Full tracing (debugging)
let config = ModelTracerConfig::full();

// Lightweight (production)
let config = ModelTracerConfig::lightweight();

// Disabled (zero overhead)
let config = ModelTracerConfig::default();
assert!(!config.is_enabled());

// Custom
let config = ModelTracerConfig {
    trace_activations: true,
    trace_attention: false,  // Skip attention tracing
    trace_logits: true,
    trace_quant_error: false,  // Too expensive for production
    trace_kv_cache: true,
    ..Default::default()
};

let mut tracer = ModelTracer::new(config);

// During inference
tracer.begin_forward(position);
// ... record traces ...
if let Some(anomaly) = tracer.end_forward() {
    log::warn!("Anomaly: {}", anomaly);
}

// Summary
println!("{}", tracer.summary());

Running the Example

cargo run --example model_tracing

Performance Considerations

  1. Zero-cost when disabled: ModelTracerConfig::default() produces no overhead
  2. Lightweight for production: Use ModelTracerConfig::lightweight() (~3% overhead)
  3. TensorStats is single-pass: Uses Welford's algorithm, no extra allocations (see the sketch after this list)
  4. AttentionWeightTrace is sparse: Only stores top-k, not full attention matrix
  5. Enable selectively: Use AttentionTraceConfig to trace only specific layers/heads
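
Item 3 refers to Welford's online algorithm, which lets TensorStats accumulate mean and standard deviation in the same single pass that tracks min, max, and NaN/Inf counts. A standalone sketch of the idea (not the crate's code):

// Welford's online algorithm: one pass over the data, O(1) extra memory.
fn mean_and_std(data: &[f32]) -> (f32, f32) {
    let mut count = 0u64;
    let mut mean = 0.0f64;
    let mut m2 = 0.0f64; // running sum of squared deviations from the mean
    for &x in data {
        count += 1;
        let x = x as f64;
        let delta = x - mean;
        mean += delta / count as f64;
        m2 += delta * (x - mean);
    }
    let variance = if count > 1 { m2 / (count - 1) as f64 } else { 0.0 };
    (mean as f32, variance.sqrt() as f32)
}

For [1.0, 2.0, 3.0, 4.0, 5.0] this returns a mean of 3.0 and a std of about 1.58, matching the sample (n - 1) convention shown in the TensorStats example.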

Falsification Tests

The implementation includes comprehensive tests (F250-F275):

  • F250: TensorStats correctness
  • F251: NaN detection (100% recall)
  • F252: Explosion detection
  • F253: Attention top-k sorting
  • F258: Cosine similarity range
  • F263: Overhead bounds
  • F272: Bit-exactness

Run with:

cargo test test_f250 --lib
cargo test test_f251 --lib
# ... etc

See Also