# Profiling

## The Real Profiling Mandate

Trueno enforces a strict "Real Profiling" mandate: all performance metrics reported by the ecosystem MUST be measured, not derived.

**Forbidden:** calculating per-brick time by taking total throughput and multiplying by a budget fraction. **Required:** measuring start/end times for every operation, with full synchronization.

### Why?

Simulated or derived metrics mask bottlenecks. If you assume an operation takes 10% of the time, you will never discover when it actually takes 50% due to a regression.
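The distinction is easy to demonstrate with nothing but `std::time::Instant`. A minimal, self-contained sketch (the function names are illustrative, not trueno APIs):

```rust
use std::time::Instant;

// Derived (forbidden): assume the op is some fixed fraction of a known total.
fn derived_ns(total_ns: u128, assumed_fraction: f64) -> u128 {
    (total_ns as f64 * assumed_fraction) as u128
}

// Measured (required): wrap the operation in start/end timestamps.
fn measured_ns<F: FnOnce()>(op: F) -> u128 {
    let start = Instant::now();
    op();
    start.elapsed().as_nanos()
}

fn main() {
    // An operation whose real cost we only learn by measuring it.
    let real = measured_ns(|| {
        let mut acc = 0u64;
        for i in 0..5_000_000u64 {
            acc = acc.wrapping_add(i);
        }
        std::hint::black_box(acc);
    });
    // Pretend this op is 10% of a batch that took twice as long.
    let guessed = derived_ns(real * 2, 0.10);
    println!("measured: {} ns, derived guess: {} ns", real, guessed);
    // The derived figure is pinned at 10% no matter what the op really costs.
    assert!(guessed < real);
}
```

The derived number never moves when the operation regresses; only the measured one does.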
## BrickProfiler v2 (PAR-200)

The BrickProfiler is the core profiling tool built into trueno. Version 2 (PAR-200) introduces O(1) hot-path profiling with deferred sync support.

### Key Features

| Feature | v1 | v2 (PAR-200) |
|---|---|---|
| Brick lookup | HashMap | `BrickId` enum, O(1) |
| GPU sync | Immediate (~200% overhead) | Deferred (~5% overhead) |
| Category aggregation | Manual | Automatic (Norm, Attention, FFN) |
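The v2 lookup can be pictured as a C-like enum indexing a fixed-size stats array, so the hot path is a single indexed load instead of a hash. The following is an illustrative model with a hypothetical three-brick subset, not trueno's actual definitions:

```rust
// Hypothetical subset of brick IDs; trueno defines the full set.
#[derive(Clone, Copy)]
enum BrickId {
    RmsNorm = 0,
    QkvProjection = 1,
    AttentionScore = 2,
}
const BRICK_COUNT: usize = 3;

#[derive(Default, Clone, Copy)]
struct BrickStats {
    total_ns: u64,
    count: u64,
}

struct Profiler {
    // Fixed-size array: no hashing, just one indexed load per record.
    stats: [BrickStats; BRICK_COUNT],
}

impl Profiler {
    fn record(&mut self, id: BrickId, ns: u64) {
        let s = &mut self.stats[id as usize]; // enum discriminant as index
        s.total_ns += ns;
        s.count += 1;
    }
}

fn main() {
    let mut p = Profiler { stats: [BrickStats::default(); BRICK_COUNT] };
    p.record(BrickId::QkvProjection, 250);
    p.record(BrickId::QkvProjection, 260);
    let s = p.stats[BrickId::QkvProjection as usize];
    assert_eq!((s.count, s.total_ns), (2, 510));
    println!("ok");
}
```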
### BrickId Enum

```rust
use trueno::{BrickProfiler, BrickId, BrickCategory, SyncMode};

let mut profiler = BrickProfiler::new();
profiler.enable();

// O(1) brick timing with enum-based lookup
let timer = profiler.start_brick(BrickId::QkvProjection);
// ... perform QKV projection ...
profiler.stop_brick(timer, num_elements);

// Category breakdown
let cats = profiler.category_stats();
println!(
    "Attention: {:.1}%",
    cats[BrickCategory::Attention as usize].percentage(profiler.total_ns())
);
```
### Deferred Sync Mode

For GPU workloads, immediate synchronization after every operation adds ~200% overhead. Deferred sync batches measurements:

```rust
profiler.set_sync_mode(SyncMode::Deferred);
profiler.reset_epoch();

// Record without sync (timestamps only)
let start = profiler.elapsed_ns();
// ... GPU kernel launch ...
profiler.record_deferred(BrickId::AttentionScore, start, elements);

// Single sync point at end of layer/batch
let end = profiler.elapsed_ns();
profiler.finalize(end); // Apply all pending measurements
```

Sync modes:

- `Immediate`: sync after every brick (~200% overhead, accurate per-brick)
- `PerLayer`: sync once per transformer layer (~20% overhead)
- `Deferred`: single sync at batch end (~5% overhead)
- `None`: no timing (0% overhead, for production)
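The deferral pattern itself is small: the hot path only pushes raw timestamps into a pending queue, and one sync point resolves them all. A std-only sketch of the idea (names and layout are illustrative; a real GPU implementation would resolve per-event end timestamps, e.g. CUDA events, at the sync rather than sharing one end time):

```rust
// Pending measurement: resolved only when finalize() runs at the sync point.
struct Pending {
    brick: usize, // brick index (stand-in for BrickId)
    start_ns: u64,
    elements: u64,
}

struct DeferredProfiler {
    pending: Vec<Pending>,
    totals_ns: Vec<u64>,
}

impl DeferredProfiler {
    fn new(bricks: usize) -> Self {
        Self { pending: Vec::new(), totals_ns: vec![0; bricks] }
    }

    // Hot path: a push, nothing else. No device sync, no blocking.
    fn record_deferred(&mut self, brick: usize, start_ns: u64, elements: u64) {
        self.pending.push(Pending { brick, start_ns, elements });
    }

    // Cold path: one sync at batch end applies every pending measurement.
    fn finalize(&mut self, end_ns: u64) {
        for p in self.pending.drain(..) {
            self.totals_ns[p.brick] += end_ns.saturating_sub(p.start_ns);
            let _ = p.elements; // a full implementation aggregates these too
        }
    }
}

fn main() {
    let mut prof = DeferredProfiler::new(2);
    prof.record_deferred(0, 100, 4096);
    prof.record_deferred(1, 300, 4096);
    prof.finalize(500); // single sync point
    assert_eq!(prof.totals_ns, vec![400, 200]);
    assert!(prof.pending.is_empty());
    println!("ok");
}
```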
### Running the Example

```shell
cargo run --example brick_profiler_v2
```

Output:

```text
=== PAR-200: BrickProfiler v2 Demo ===

Per-Brick Timing:
Brick            Avg (µs)   Total (µs)   Count
----------------------------------------------------
RmsNorm             104.6        313.9       3
QkvProjection       253.8        761.3       3
...

Category Breakdown:
Category     Avg (µs)     Pct   Samples
--------------------------------------------
Norm            104.6    8.3%         3
Attention       228.7   36.4%         6
FFN             348.0   55.3%         6
```
### Integration with Realizar

The realizar inference engine integrates BrickProfiler v2:

```rust
// In CudaExecutor
executor.set_profiler_sync_mode(SyncMode::Deferred);

// During forward pass
let timer = executor.start_brick_id(BrickId::QkvProjection);
// ... kernel launch ...
executor.stop_brick_id(timer, hidden_dim as u64);
```
## Falsification Protocols (F101-F110)

To prove profiling is real, we apply Popperian falsification:

- F101: `BrickId::COUNT == 15` (all brick types defined)
- F102: Category mapping correct for all BrickIds
- F103: Deferred mode accumulates pending measurements
- F104: `finalize()` clears the pending queue
- F105: Zero overhead when disabled
- F106: Array indexing is O(1)
- F107: Thread-safe (`Send + Sync`)
- F108: `BrickIdTimer` fits in 32 bytes
- F109: `elapsed_ns()` is monotonic
- F110: Category stats sum correctly
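Several of these falsifiers reduce to one-line assertions. A sketch of F108 and F109 as plain Rust checks, using an illustrative timer layout since the real `BrickIdTimer` lives in trueno:

```rust
use std::mem::size_of;
use std::time::Instant;

// Illustrative timer layout, not trueno's definition.
#[allow(dead_code)]
struct BrickIdTimer {
    brick: u16,
    _pad: u16,
    start_ns: u64,
    elements: u64,
}

fn main() {
    // F108: the timer must stay small enough to pass around by value cheaply.
    assert!(size_of::<BrickIdTimer>() <= 32);

    // F109: elapsed time must never go backwards; Instant guarantees
    // monotonicity, unlike wall-clock time.
    let clock = Instant::now();
    let a = clock.elapsed().as_nanos();
    let b = clock.elapsed().as_nanos();
    assert!(b >= a);

    println!("F108, F109 hold");
}
```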
## Execution Path Graph (PAR-201)

BrickProfiler v2 also supports execution path graphs for tracking the full hierarchy:

```text
Layer(0)
├─► Brick(QkvProjection) ─────► Kernel(batched_q4k_gemv, ptx_hash=0x7a3b...)
│                                └─► PTX source lookup
└─► Brick(AttentionScore) ────► Kernel(incremental_attention, ptx_hash=0x9f1c...)
```
### Enabling Graph Recording

```rust
use trueno::{BrickProfiler, BrickId, ExecutionNode, PtxRegistry};

let mut profiler = BrickProfiler::new();
profiler.enable();
profiler.enable_graph(); // Enable execution graph tracking

// Record layer scope
profiler.graph_push_scope(ExecutionNode::Layer { index: 0 });

// Record brick
let timer = profiler.start_brick(BrickId::QkvProjection);
// ... work ...
profiler.stop_brick(timer, elements);
profiler.graph_record_brick(BrickId::QkvProjection, timing_ns, elements);

// Record kernel launch
profiler.graph_record_kernel(
    "batched_q4k_gemv",
    ptx_hash,
    (32, 1, 1),  // grid
    (256, 1, 1), // block
    4096,        // shared_mem
);

profiler.graph_pop_scope();
```
### PTX Registry

Track PTX source code for debugging:

```rust
let mut registry = PtxRegistry::new();
registry.register("kernel_name", ptx_source, Some(Path::new("src/kernel.ptx")));

// Lookup by hash
let hash = PtxRegistry::hash_ptx(ptx_source);
let source = registry.lookup(hash);
```
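The registry pattern is a content-hash keyed map. A std-only sketch, with `DefaultHasher` standing in for whatever hash scheme `PtxRegistry` actually uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Default)]
struct Registry {
    by_hash: HashMap<u64, String>, // hash -> PTX source
}

impl Registry {
    // Content hash of the PTX text (DefaultHasher stands in for the real scheme).
    fn hash_ptx(src: &str) -> u64 {
        let mut h = DefaultHasher::new();
        src.hash(&mut h);
        h.finish()
    }

    fn register(&mut self, src: &str) -> u64 {
        let hash = Self::hash_ptx(src);
        self.by_hash.insert(hash, src.to_string());
        hash
    }

    fn lookup(&self, hash: u64) -> Option<&str> {
        self.by_hash.get(&hash).map(|s| s.as_str())
    }
}

fn main() {
    let mut reg = Registry::default();
    let src = ".visible .entry batched_q4k_gemv() { ret; }";
    let hash = reg.register(src);
    assert_eq!(reg.lookup(hash), Some(src));
    assert_eq!(reg.lookup(hash ^ 1), None);
    println!("ok");
}
```

Keying by content hash means a kernel seen in a profile can be traced back to its exact PTX even if the same name was compiled with different options.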
### Visualization

#### Option 1: Headless ASCII Tree (CI/CD, Automation)

Zero-dependency tree visualization for testing and automation:

```rust
let graph = profiler.execution_graph();
let tree = graph.to_ascii_tree();
println!("{}", tree);

// Output can be used for:
// - Snapshot tests (deterministic output)
// - CI/CD logs
// - File export
std::fs::write("execution_tree.txt", &tree)?;
```

Output:

```text
Layer 0
├── RmsNorm 50.0µs (4096 elem)
│   └── rmsnorm_kernel <<<16,256,1>>> smem=1024B
├── QkvProjection 200.0µs (4096 elem)
│   └── batched_q4k_gemv <<<32,256,1>>> smem=4096B
```
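Rendering such a tree with zero dependencies is a short recursive walk that picks `├──` or `└──` per child and extends the prefix with `│   ` or spaces. An illustrative renderer over a minimal node type (not the trueno implementation):

```rust
struct Node {
    label: String,
    children: Vec<Node>,
}

// Recursive walk emitting branch glyphs, mirroring the output above.
fn render(node: &Node, prefix: &str, out: &mut String) {
    let n = node.children.len();
    for (i, child) in node.children.iter().enumerate() {
        let last = i == n - 1;
        out.push_str(prefix);
        out.push_str(if last { "└── " } else { "├── " });
        out.push_str(&child.label);
        out.push('\n');
        // Continue the vertical rule only while siblings remain below.
        let next = format!("{}{}", prefix, if last { "    " } else { "│   " });
        render(child, &next, out);
    }
}

fn to_ascii_tree(root: &Node) -> String {
    let mut out = format!("{}\n", root.label);
    render(root, "", &mut out);
    out
}

fn main() {
    let root = Node {
        label: "Layer 0".into(),
        children: vec![Node {
            label: "RmsNorm 50.0µs".into(),
            children: vec![Node { label: "rmsnorm_kernel".into(), children: vec![] }],
        }],
    };
    let tree = to_ascii_tree(&root);
    assert!(tree.contains("└── RmsNorm 50.0µs"));
    assert!(tree.contains("    └── rmsnorm_kernel"));
    print!("{}", tree);
}
```

Because the walk is deterministic, the output is stable enough for snapshot tests.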
#### Option 2: Interactive TUI (presentar-terminal)

Native TUI widget for interactive exploration (requires the `presentar-tui` feature):

```rust
use trueno::ExecutionGraph;
use presentar_terminal::{Tree, TuiApp};

// Convert execution graph to tree widget
let tree_node = profiler.execution_graph().to_tree_node();
let tree = Tree::new().with_root(tree_node).expand_all();
// Use in TUI app or render headless via HeadlessCanvas
```
#### Option 3: Graphviz DOT Export

Export to Graphviz DOT format for SVG rendering:

```rust
// In code
let dot = profiler.graph_to_dot();
std::fs::write("graph.dot", dot)?;
```

```shell
# Visualize
dot -Tsvg graph.dot -o graph.svg
```

### Running the Example

```shell
# Headless ASCII tree (default, no dependencies)
cargo run --example execution_graph

# With presentar-terminal TreeNode
cargo run --example execution_graph --features presentar-tui
```
## Backend-Specific Profiling (CPU/SIMD/GPU)

Different compute backends require different profiling approaches. See the full specification in docs/specifications/ml-tuner-bricks.md (Appendix E.8).

### Instrumentation Status

| Backend | Path | BrickProfiler | Overhead |
|---|---|---|---|
| CUDA | `CudaExecutor::forward()` | Full | ~5% (deferred) |
| CPU | `forward()` | None | N/A |
| CPU | `forward_profiled()` | Full | ~10% |
| SIMD | trueno ops | Per-op | ~2% |

**Key insight:** the legacy CPU `forward()` function lacks BrickProfiler instrumentation. For CPU profiling, use `forward_profiled()` or add instrumentation manually.
### SIMD Backend Profiling

Profile SIMD operations at the brick level:

```rust
use trueno::{BrickProfiler, BrickId};

let mut profiler = BrickProfiler::new();
profiler.enable();

// Profile SIMD operation
let timer = profiler.start_brick(BrickId::RmsNorm);
trueno::simd::rms_norm_avx2(&input, &mut output); // AVX2 backend
profiler.stop_brick(timer, input.len() as u64);

// Throughput: elements per nanosecond x 1000 = Melem/s
let stats = profiler.stats_for(BrickId::RmsNorm);
let throughput = stats.total_elements as f64 / stats.total_ns as f64 * 1000.0;
println!("RmsNorm: {:.2} Melem/s", throughput);
```
### Backend Comparison

Compare performance across backends:

```rust
use trueno::{BrickProfiler, BrickId, detect_backend, Backend};

let backend = detect_backend();
let mut profiler = BrickProfiler::new();
profiler.enable();

// Same brick, different backends
match backend {
    Backend::Avx512 => { /* AVX-512 path */ }
    Backend::Avx2 => { /* AVX2 path */ }
    Backend::Neon => { /* ARM NEON path */ }
    _ => { /* Scalar fallback */ }
}

// Report includes backend name
println!("Backend: {:?}", backend);
println!("{}", profiler.report());
```
### Backend-Specific Roofline

Different backends have different theoretical peaks:

| Backend | Peak TFLOPS (FP32) | Memory BW (GB/s) |
|---|---|---|
| RTX 4090 | 83.0 | 1008 |
| AVX-512 | ~2.0 | ~100 |
| AVX2 | ~0.5 | ~50 |
| ARM NEON | ~0.2 | ~40 |
| Scalar | ~0.1 | ~25 |

```rust
// Backend-aware roofline distance
let distance = match backend {
    Backend::Cuda => graph.roofline_distance(83.0, 1008.0),
    Backend::Avx512 => graph.roofline_distance(2.0, 100.0),
    Backend::Avx2 => graph.roofline_distance(0.5, 50.0),
    _ => graph.roofline_distance(0.1, 25.0),
};
```
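These peaks plug into the standard roofline model: attainable throughput is the minimum of the compute peak and memory bandwidth times arithmetic intensity. The sketch below makes the intensity explicit and defines distance as the unused fraction of the roofline; trueno's `roofline_distance(peak, bw)` takes only two arguments, so its exact definition may differ:

```rust
// Classic roofline: attainable GFLOP/s = min(peak, bw * intensity).
// peak_tflops in TFLOP/s, bw_gbs in GB/s, intensity in FLOP/byte.
fn attainable_gflops(peak_tflops: f64, bw_gbs: f64, intensity: f64) -> f64 {
    (peak_tflops * 1000.0).min(bw_gbs * intensity)
}

// Distance as the fraction of the roofline left unused (0 = on the roof).
fn roofline_distance(achieved_gflops: f64, peak_tflops: f64, bw_gbs: f64, intensity: f64) -> f64 {
    1.0 - achieved_gflops / attainable_gflops(peak_tflops, bw_gbs, intensity)
}

fn main() {
    // A quantized GEMV is memory-bound: low intensity, so bandwidth sets
    // the roof. RTX 4090 figures from the table above: 83 TFLOPS, 1008 GB/s.
    let intensity = 0.5; // FLOP/byte, illustrative
    let roof = attainable_gflops(83.0, 1008.0, intensity);
    assert_eq!(roof, 504.0); // 1008 * 0.5, far below the 83,000 GFLOP/s compute peak

    let d = roofline_distance(252.0, 83.0, 1008.0, intensity);
    assert!((d - 0.5).abs() < 1e-9); // half the bandwidth roof is unused
    println!("roof = {} GFLOP/s, distance = {}", roof, d);
}
```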
## Critical Path Analysis (Phase 9)

Identify true bottlenecks vs parallelizable work:

```rust
let graph = profiler.execution_graph();

// Get critical path
let (critical_path, total_ns) = graph.critical_path();
println!(
    "Critical path: {} nodes, {:.2}ms",
    critical_path.len(),
    total_ns as f64 / 1_000_000.0
);

// Find parallelization opportunities
let slack = graph.compute_slack();
for (node_id, slack_ns) in &slack {
    if *slack_ns > 0 {
        println!("Node {} can be parallelized (slack: {}µs)", node_id.0, slack_ns / 1000);
    }
}

// Formatted summary
println!("{}", graph.critical_path_summary());
```
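Critical path and slack are standard longest-path quantities on a DAG: a node's slack is how long it could be delayed without lengthening the whole run. A std-only sketch over a small dependency graph (not the `ExecutionGraph` internals):

```rust
// Earliest finish time per node, given per-node cost and edges (dep -> node).
// Nodes are assumed topologically ordered by index (true for recorded traces).
fn finish_times(cost: &[u64], deps: &[Vec<usize>]) -> Vec<u64> {
    let mut finish = vec![0u64; cost.len()];
    for i in 0..cost.len() {
        let start = deps[i].iter().map(|&d| finish[d]).max().unwrap_or(0);
        finish[i] = start + cost[i];
    }
    finish
}

// Slack: critical-path length minus the longest path through each node.
fn slack(cost: &[u64], deps: &[Vec<usize>]) -> Vec<u64> {
    let finish = finish_times(cost, deps);
    let total = *finish.iter().max().unwrap();
    // Reverse pass: longest downstream tail starting after each node.
    let mut tail = vec![0u64; cost.len()];
    for i in (0..cost.len()).rev() {
        for &d in &deps[i] {
            tail[d] = tail[d].max(tail[i] + cost[i]);
        }
    }
    (0..cost.len()).map(|i| total - (finish[i] + tail[i])).collect()
}

fn main() {
    // Diamond DAG: 0 -> {1, 2} -> 3. Node 1 is expensive, node 2 cheap.
    let cost = [10, 30, 5, 10];
    let deps = vec![vec![], vec![0], vec![0], vec![1, 2]];
    let s = slack(&cost, &deps);
    // Only the cheap branch (node 2) can slip without delaying the end.
    assert_eq!(s, vec![0, 0, 25, 0]);
    println!("slack = {:?}", s);
}
```

Zero-slack nodes form the critical path; positive-slack nodes are the parallelization opportunities the loop above reports.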
## Tools
- presentar-terminal Tree: Native TUI tree widget for hierarchical execution graphs.
- cbtop: The primary visualization tool for ComputeBrick pipelines. Supports backend-specific profiling display.
- perf / flamegraph: For CPU-side overhead analysis.
- nsight: For deep GPU kernel inspection (external to the pure Rust stack).