# Profiling

## The Real Profiling Mandate

Trueno enforces a strict "Real Profiling" mandate: all performance metrics reported by the ecosystem MUST be measured, not derived.

**Forbidden:** calculating per-brick time by taking total throughput and multiplying by a budget fraction. **Required:** measuring start/end times for every operation, with full synchronization.

### Why?

Simulated or derived metrics mask bottlenecks. If you assume an operation takes 10% of the time, you will never discover when it actually takes 50% due to a regression.
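The distinction is easy to demonstrate with nothing but `std::time::Instant`. A minimal, self-contained sketch (the function names are illustrative, not trueno APIs):

```rust
use std::time::Instant;

// Derived (forbidden): assume the op is some fixed fraction of a known total.
fn derived_ns(total_ns: u128, assumed_fraction: f64) -> u128 {
    (total_ns as f64 * assumed_fraction) as u128
}

// Measured (required): wrap the operation in start/end timestamps.
fn measured_ns<F: FnOnce()>(op: F) -> u128 {
    let start = Instant::now();
    op();
    start.elapsed().as_nanos()
}

fn main() {
    // An operation whose real cost we only learn by measuring it.
    let real = measured_ns(|| {
        let mut acc = 0u64;
        for i in 0..5_000_000u64 {
            acc = acc.wrapping_add(i);
        }
        std::hint::black_box(acc);
    });
    // Pretend this op is 10% of a batch that took twice as long.
    let guessed = derived_ns(real * 2, 0.10);
    println!("measured: {} ns, derived guess: {} ns", real, guessed);
    // The derived figure is pinned at 10% no matter what the op really costs.
    assert!(guessed < real);
}
```

The derived number never moves when the operation regresses; only the measured one does.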
## BrickProfiler v2 (PAR-200)

The BrickProfiler is the core profiling tool built into trueno. Version 2 (PAR-200) introduces O(1) hot-path profiling with deferred sync support.

### Key Features

| Feature | v1 | v2 (PAR-200) |
|---|---|---|
| Brick lookup | HashMap | `BrickId` enum, O(1) |
| GPU sync | Immediate (~200% overhead) | Deferred (~5% overhead) |
| Category aggregation | Manual | Automatic (Norm, Attention, FFN) |
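The v2 lookup can be pictured as a C-like enum indexing a fixed-size stats array, so the hot path is a single indexed load instead of a hash. The following is an illustrative model with a hypothetical three-brick subset, not trueno's actual definitions:

```rust
// Hypothetical subset of brick IDs; trueno defines the full set.
#[derive(Clone, Copy)]
enum BrickId {
    RmsNorm = 0,
    QkvProjection = 1,
    AttentionScore = 2,
}
const BRICK_COUNT: usize = 3;

#[derive(Default, Clone, Copy)]
struct BrickStats {
    total_ns: u64,
    count: u64,
}

struct Profiler {
    // Fixed-size array: no hashing, just one indexed load per record.
    stats: [BrickStats; BRICK_COUNT],
}

impl Profiler {
    fn record(&mut self, id: BrickId, ns: u64) {
        let s = &mut self.stats[id as usize]; // enum discriminant as index
        s.total_ns += ns;
        s.count += 1;
    }
}

fn main() {
    let mut p = Profiler { stats: [BrickStats::default(); BRICK_COUNT] };
    p.record(BrickId::QkvProjection, 250);
    p.record(BrickId::QkvProjection, 260);
    let s = p.stats[BrickId::QkvProjection as usize];
    assert_eq!((s.count, s.total_ns), (2, 510));
    println!("ok");
}
```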
### BrickId Enum

```rust
use trueno::{BrickProfiler, BrickId, BrickCategory, SyncMode};

let mut profiler = BrickProfiler::new();
profiler.enable();

// O(1) brick timing with enum-based lookup
let timer = profiler.start_brick(BrickId::QkvProjection);
// ... perform QKV projection ...
profiler.stop_brick(timer, num_elements);

// Category breakdown
let cats = profiler.category_stats();
println!(
    "Attention: {:.1}%",
    cats[BrickCategory::Attention as usize].percentage(profiler.total_ns())
);
```
### Deferred Sync Mode

For GPU workloads, immediate synchronization after every operation adds ~200% overhead. Deferred sync batches measurements:

```rust
profiler.set_sync_mode(SyncMode::Deferred);
profiler.reset_epoch();

// Record without sync (timestamps only)
let start = profiler.elapsed_ns();
// ... GPU kernel launch ...
profiler.record_deferred(BrickId::AttentionScore, start, elements);

// Single sync point at end of layer/batch
let end = profiler.elapsed_ns();
profiler.finalize(end); // Apply all pending measurements
```

Sync modes:

- `Immediate`: sync after every brick (~200% overhead, accurate per-brick)
- `PerLayer`: sync once per transformer layer (~20% overhead)
- `Deferred`: single sync at batch end (~5% overhead)
- `None`: no timing (0% overhead, for production)
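The deferral pattern itself is small: the hot path only pushes raw timestamps into a pending queue, and one sync point resolves them all. A std-only sketch of the idea (names and layout are illustrative; a real GPU implementation would resolve per-event end timestamps, e.g. CUDA events, at the sync rather than sharing one end time):

```rust
// Pending measurement: resolved only when finalize() runs at the sync point.
struct Pending {
    brick: usize, // brick index (stand-in for BrickId)
    start_ns: u64,
    elements: u64,
}

struct DeferredProfiler {
    pending: Vec<Pending>,
    totals_ns: Vec<u64>,
}

impl DeferredProfiler {
    fn new(bricks: usize) -> Self {
        Self { pending: Vec::new(), totals_ns: vec![0; bricks] }
    }

    // Hot path: a push, nothing else. No device sync, no blocking.
    fn record_deferred(&mut self, brick: usize, start_ns: u64, elements: u64) {
        self.pending.push(Pending { brick, start_ns, elements });
    }

    // Cold path: one sync at batch end applies every pending measurement.
    fn finalize(&mut self, end_ns: u64) {
        for p in self.pending.drain(..) {
            self.totals_ns[p.brick] += end_ns.saturating_sub(p.start_ns);
            let _ = p.elements; // a full implementation aggregates these too
        }
    }
}

fn main() {
    let mut prof = DeferredProfiler::new(2);
    prof.record_deferred(0, 100, 4096);
    prof.record_deferred(1, 300, 4096);
    prof.finalize(500); // single sync point
    assert_eq!(prof.totals_ns, vec![400, 200]);
    assert!(prof.pending.is_empty());
    println!("ok");
}
```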
### Running the Example

```shell
cargo run --example brick_profiler_v2
```

Output:

```text
=== PAR-200: BrickProfiler v2 Demo ===

Per-Brick Timing:
Brick            Avg (µs)   Total (µs)   Count
----------------------------------------------------
RmsNorm             104.6        313.9       3
QkvProjection       253.8        761.3       3
...

Category Breakdown:
Category     Avg (µs)     Pct   Samples
--------------------------------------------
Norm            104.6    8.3%         3
Attention       228.7   36.4%         6
FFN             348.0   55.3%         6
```
### Integration with Realizar

The realizar inference engine integrates BrickProfiler v2:

```rust
// In CudaExecutor
executor.set_profiler_sync_mode(SyncMode::Deferred);

// During forward pass
let timer = executor.start_brick_id(BrickId::QkvProjection);
// ... kernel launch ...
executor.stop_brick_id(timer, hidden_dim as u64);
```
## Falsification Protocols (F101-F110)

To prove profiling is real, we apply Popperian falsification:

- F101: `BrickId::COUNT == 15` (all brick types defined)
- F102: Category mapping correct for all BrickIds
- F103: Deferred mode accumulates pending measurements
- F104: `finalize()` clears the pending queue
- F105: Zero overhead when disabled
- F106: Array indexing is O(1)
- F107: Thread-safe (`Send + Sync`)
- F108: `BrickIdTimer` fits in 32 bytes
- F109: `elapsed_ns()` is monotonic
- F110: Category stats sum correctly
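Several of these falsifiers reduce to one-line assertions. A sketch of F108 and F109 as plain Rust checks, using an illustrative timer layout since the real `BrickIdTimer` lives in trueno:

```rust
use std::mem::size_of;
use std::time::Instant;

// Illustrative timer layout, not trueno's definition.
#[allow(dead_code)]
struct BrickIdTimer {
    brick: u16,
    _pad: u16,
    start_ns: u64,
    elements: u64,
}

fn main() {
    // F108: the timer must stay small enough to pass around by value cheaply.
    assert!(size_of::<BrickIdTimer>() <= 32);

    // F109: elapsed time must never go backwards; Instant guarantees
    // monotonicity, unlike wall-clock time.
    let clock = Instant::now();
    let a = clock.elapsed().as_nanos();
    let b = clock.elapsed().as_nanos();
    assert!(b >= a);

    println!("F108, F109 hold");
}
```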
## Execution Path Graph (PAR-201)

BrickProfiler v2 also supports execution path graphs for tracking the full hierarchy:

```text
Layer(0)
├─► Brick(QkvProjection) ─────► Kernel(batched_q4k_gemv, ptx_hash=0x7a3b...)
│                                └─► PTX source lookup
└─► Brick(AttentionScore) ────► Kernel(incremental_attention, ptx_hash=0x9f1c...)
```
### Enabling Graph Recording

```rust
use trueno::{BrickProfiler, BrickId, ExecutionNode, PtxRegistry};

let mut profiler = BrickProfiler::new();
profiler.enable();
profiler.enable_graph(); // Enable execution graph tracking

// Record layer scope
profiler.graph_push_scope(ExecutionNode::Layer { index: 0 });

// Record brick
let timer = profiler.start_brick(BrickId::QkvProjection);
// ... work ...
profiler.stop_brick(timer, elements);
profiler.graph_record_brick(BrickId::QkvProjection, timing_ns, elements);

// Record kernel launch
profiler.graph_record_kernel(
    "batched_q4k_gemv",
    ptx_hash,
    (32, 1, 1),  // grid
    (256, 1, 1), // block
    4096,        // shared_mem
);

profiler.graph_pop_scope();
```
### PTX Registry

Track PTX source code for debugging:

```rust
let mut registry = PtxRegistry::new();
registry.register("kernel_name", ptx_source, Some(Path::new("src/kernel.ptx")));

// Lookup by hash
let hash = PtxRegistry::hash_ptx(ptx_source);
let source = registry.lookup(hash);
```
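The registry pattern is a content-hash keyed map. A std-only sketch, with `DefaultHasher` standing in for whatever hash scheme `PtxRegistry` actually uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Default)]
struct Registry {
    by_hash: HashMap<u64, String>, // hash -> PTX source
}

impl Registry {
    // Content hash of the PTX text (DefaultHasher stands in for the real scheme).
    fn hash_ptx(src: &str) -> u64 {
        let mut h = DefaultHasher::new();
        src.hash(&mut h);
        h.finish()
    }

    fn register(&mut self, src: &str) -> u64 {
        let hash = Self::hash_ptx(src);
        self.by_hash.insert(hash, src.to_string());
        hash
    }

    fn lookup(&self, hash: u64) -> Option<&str> {
        self.by_hash.get(&hash).map(|s| s.as_str())
    }
}

fn main() {
    let mut reg = Registry::default();
    let src = ".visible .entry batched_q4k_gemv() { ret; }";
    let hash = reg.register(src);
    assert_eq!(reg.lookup(hash), Some(src));
    assert_eq!(reg.lookup(hash ^ 1), None);
    println!("ok");
}
```

Keying by content hash means a kernel seen in a profile can be traced back to its exact PTX even if the same name was compiled with different options.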
### Visualization

#### Option 1: Headless ASCII Tree (CI/CD, Automation)

Zero-dependency tree visualization for testing and automation:

```rust
let graph = profiler.execution_graph();
let tree = graph.to_ascii_tree();
println!("{}", tree);

// Output can be used for:
// - Snapshot tests (deterministic output)
// - CI/CD logs
// - File export
std::fs::write("execution_tree.txt", &tree)?;
```

Output:

```text
Layer 0
├── RmsNorm 50.0µs (4096 elem)
│   └── rmsnorm_kernel <<<16,256,1>>> smem=1024B
├── QkvProjection 200.0µs (4096 elem)
│   └── batched_q4k_gemv <<<32,256,1>>> smem=4096B
```
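Rendering such a tree with zero dependencies is a short recursive walk that picks `├──` or `└──` per child and extends the prefix with `│   ` or spaces. An illustrative renderer over a minimal node type (not the trueno implementation):

```rust
struct Node {
    label: String,
    children: Vec<Node>,
}

// Recursive walk emitting branch glyphs, mirroring the output above.
fn render(node: &Node, prefix: &str, out: &mut String) {
    let n = node.children.len();
    for (i, child) in node.children.iter().enumerate() {
        let last = i == n - 1;
        out.push_str(prefix);
        out.push_str(if last { "└── " } else { "├── " });
        out.push_str(&child.label);
        out.push('\n');
        // Continue the vertical rule only while siblings remain below.
        let next = format!("{}{}", prefix, if last { "    " } else { "│   " });
        render(child, &next, out);
    }
}

fn to_ascii_tree(root: &Node) -> String {
    let mut out = format!("{}\n", root.label);
    render(root, "", &mut out);
    out
}

fn main() {
    let root = Node {
        label: "Layer 0".into(),
        children: vec![Node {
            label: "RmsNorm 50.0µs".into(),
            children: vec![Node { label: "rmsnorm_kernel".into(), children: vec![] }],
        }],
    };
    let tree = to_ascii_tree(&root);
    assert!(tree.contains("└── RmsNorm 50.0µs"));
    assert!(tree.contains("    └── rmsnorm_kernel"));
    print!("{}", tree);
}
```

Because the walk is deterministic, the output is stable enough for snapshot tests.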
#### Option 2: Interactive TUI (presentar-terminal)

Native TUI widget for interactive exploration (requires the `presentar-tui` feature):

```rust
use trueno::ExecutionGraph;
use presentar_terminal::{Tree, TuiApp};

// Convert execution graph to tree widget
let tree_node = profiler.execution_graph().to_tree_node();
let tree = Tree::new().with_root(tree_node).expand_all();
// Use in TUI app or render headless via HeadlessCanvas
```
#### Option 3: Graphviz DOT Export

Export to Graphviz DOT format for SVG rendering:

```rust
// In code
let dot = profiler.graph_to_dot();
std::fs::write("graph.dot", dot)?;
```

```shell
# Visualize
dot -Tsvg graph.dot -o graph.svg
```

### Running the Example

```shell
# Headless ASCII tree (default, no dependencies)
cargo run --example execution_graph

# With presentar-terminal TreeNode
cargo run --example execution_graph --features presentar-tui
```
## Backend-Specific Profiling (CPU/SIMD/GPU)

Different compute backends require different profiling approaches. See the full specification in docs/specifications/ml-tuner-bricks.md (Appendix E.8).

### Instrumentation Status

| Backend | Path | BrickProfiler | Overhead |
|---|---|---|---|
| CUDA | `CudaExecutor::forward()` | Full | ~5% (deferred) |
| CPU | `forward()` | None | N/A |
| CPU | `forward_profiled()` | Full | ~10% |
| SIMD | trueno ops | Per-op | ~2% |

**Key insight:** the legacy CPU `forward()` function lacks BrickProfiler instrumentation. For CPU profiling, use `forward_profiled()` or add instrumentation manually.
### SIMD Backend Profiling

Profile SIMD operations at the brick level:

```rust
use trueno::{BrickProfiler, BrickId};

let mut profiler = BrickProfiler::new();
profiler.enable();

// Profile SIMD operation
let timer = profiler.start_brick(BrickId::RmsNorm);
trueno::simd::rms_norm_avx2(&input, &mut output); // AVX2 backend
profiler.stop_brick(timer, input.len() as u64);

// Throughput: elements per nanosecond x 1000 = Melem/s
let stats = profiler.stats_for(BrickId::RmsNorm);
let throughput = stats.total_elements as f64 / stats.total_ns as f64 * 1000.0;
println!("RmsNorm: {:.2} Melem/s", throughput);
```
### Backend Comparison

Compare performance across backends:

```rust
use trueno::{BrickProfiler, BrickId, detect_backend, Backend};

let backend = detect_backend();
let mut profiler = BrickProfiler::new();
profiler.enable();

// Same brick, different backends
match backend {
    Backend::Avx512 => { /* AVX-512 path */ }
    Backend::Avx2 => { /* AVX2 path */ }
    Backend::Neon => { /* ARM NEON path */ }
    _ => { /* Scalar fallback */ }
}

// Report includes backend name
println!("Backend: {:?}", backend);
println!("{}", profiler.report());
```
### Backend-Specific Roofline

Different backends have different theoretical peaks:

| Backend | Peak TFLOPS (FP32) | Memory BW (GB/s) |
|---|---|---|
| RTX 4090 | 83.0 | 1008 |
| AVX-512 | ~2.0 | ~100 |
| AVX2 | ~0.5 | ~50 |
| ARM NEON | ~0.2 | ~40 |
| Scalar | ~0.1 | ~25 |

```rust
// Backend-aware roofline distance
let distance = match backend {
    Backend::Cuda => graph.roofline_distance(83.0, 1008.0),
    Backend::Avx512 => graph.roofline_distance(2.0, 100.0),
    Backend::Avx2 => graph.roofline_distance(0.5, 50.0),
    _ => graph.roofline_distance(0.1, 25.0),
};
```
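These peaks plug into the standard roofline model: attainable throughput is the minimum of the compute peak and memory bandwidth times arithmetic intensity. The sketch below makes the intensity explicit and defines distance as the unused fraction of the roofline; trueno's `roofline_distance(peak, bw)` takes only two arguments, so its exact definition may differ:

```rust
// Classic roofline: attainable GFLOP/s = min(peak, bw * intensity).
// peak_tflops in TFLOP/s, bw_gbs in GB/s, intensity in FLOP/byte.
fn attainable_gflops(peak_tflops: f64, bw_gbs: f64, intensity: f64) -> f64 {
    (peak_tflops * 1000.0).min(bw_gbs * intensity)
}

// Distance as the fraction of the roofline left unused (0 = on the roof).
fn roofline_distance(achieved_gflops: f64, peak_tflops: f64, bw_gbs: f64, intensity: f64) -> f64 {
    1.0 - achieved_gflops / attainable_gflops(peak_tflops, bw_gbs, intensity)
}

fn main() {
    // A quantized GEMV is memory-bound: low intensity, so bandwidth sets
    // the roof. RTX 4090 figures from the table above: 83 TFLOPS, 1008 GB/s.
    let intensity = 0.5; // FLOP/byte, illustrative
    let roof = attainable_gflops(83.0, 1008.0, intensity);
    assert_eq!(roof, 504.0); // 1008 * 0.5, far below the 83,000 GFLOP/s compute peak

    let d = roofline_distance(252.0, 83.0, 1008.0, intensity);
    assert!((d - 0.5).abs() < 1e-9); // half the bandwidth roof is unused
    println!("roof = {} GFLOP/s, distance = {}", roof, d);
}
```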
## Critical Path Analysis (Phase 9)

Identify true bottlenecks vs parallelizable work:

```rust
let graph = profiler.execution_graph();

// Get critical path
let (critical_path, total_ns) = graph.critical_path();
println!(
    "Critical path: {} nodes, {:.2}ms",
    critical_path.len(),
    total_ns as f64 / 1_000_000.0
);

// Find parallelization opportunities
let slack = graph.compute_slack();
for (node_id, slack_ns) in &slack {
    if *slack_ns > 0 {
        println!("Node {} can be parallelized (slack: {}µs)", node_id.0, slack_ns / 1000);
    }
}

// Formatted summary
println!("{}", graph.critical_path_summary());
```
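Critical path and slack are standard longest-path quantities on a DAG: a node's slack is how long it could be delayed without lengthening the whole run. A std-only sketch over a small dependency graph (not the `ExecutionGraph` internals):

```rust
// Earliest finish time per node, given per-node cost and edges (dep -> node).
// Nodes are assumed topologically ordered by index (true for recorded traces).
fn finish_times(cost: &[u64], deps: &[Vec<usize>]) -> Vec<u64> {
    let mut finish = vec![0u64; cost.len()];
    for i in 0..cost.len() {
        let start = deps[i].iter().map(|&d| finish[d]).max().unwrap_or(0);
        finish[i] = start + cost[i];
    }
    finish
}

// Slack: critical-path length minus the longest path through each node.
fn slack(cost: &[u64], deps: &[Vec<usize>]) -> Vec<u64> {
    let finish = finish_times(cost, deps);
    let total = *finish.iter().max().unwrap();
    // Reverse pass: longest downstream tail starting after each node.
    let mut tail = vec![0u64; cost.len()];
    for i in (0..cost.len()).rev() {
        for &d in &deps[i] {
            tail[d] = tail[d].max(tail[i] + cost[i]);
        }
    }
    (0..cost.len()).map(|i| total - (finish[i] + tail[i])).collect()
}

fn main() {
    // Diamond DAG: 0 -> {1, 2} -> 3. Node 1 is expensive, node 2 cheap.
    let cost = [10, 30, 5, 10];
    let deps = vec![vec![], vec![0], vec![0], vec![1, 2]];
    let s = slack(&cost, &deps);
    // Only the cheap branch (node 2) can slip without delaying the end.
    assert_eq!(s, vec![0, 0, 25, 0]);
    println!("slack = {:?}", s);
}
```

Zero-slack nodes form the critical path; positive-slack nodes are the parallelization opportunities the loop above reports.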
## Tools
- presentar-terminal Tree: Native TUI tree widget for hierarchical execution graphs.
- cbtop: The primary visualization tool for ComputeBrick pipelines. Supports backend-specific profiling display.
- perf / flamegraph: For CPU-side overhead analysis.
- nsight: For deep GPU kernel inspection (external to the pure Rust stack).