Execution Path Graph

The Execution Path Graph (PAR-201) tracks the full hierarchy of operations during inference: Layer → Brick → Kernel → PTX. This enables precise profiling and bottleneck detection.

Running the Example

# Basic (headless ASCII tree)
cargo run --example execution_graph

# With presentar-terminal TreeNode
cargo run --example execution_graph --features presentar-tui

Headless ASCII Tree

Zero-dependency visualization for CI/CD and automation:

use trueno::{BrickProfiler, BrickId, ExecutionNode};

let mut profiler = BrickProfiler::new();
profiler.enable();
profiler.enable_graph();

// Record a transformer layer
profiler.graph_push_scope(ExecutionNode::Layer { index: 0 });

// Record a brick with its kernel
profiler.graph_push_scope(ExecutionNode::Brick {
    id: BrickId::QkvProjection,
    timing_ns: 200_000,
    elements: 4096,
});
profiler.graph_record_kernel(
    "batched_q4k_gemv",
    0xDEADBEEF,
    (32, 1, 1),   // grid
    (256, 1, 1),  // block
    4096,         // shared_mem
);
profiler.graph_pop_scope(); // pop brick
profiler.graph_pop_scope(); // pop layer

// Render to ASCII (no dependencies)
let tree = profiler.execution_graph().to_ascii_tree();
println!("{}", tree);

Output:

Layer 0
└── QkvProjection  200.0µs (4096 elem)
    └── batched_q4k_gemv  <<<32,256,1>>> smem=4096B

Full Example Output

Execution Graph
├── Layer 0
│   ├── RmsNorm  50.0µs (4096 elem)
│   │   └── rmsnorm_kernel  <<<16,256,1>>> smem=1024B
│   ├── QkvProjection  200.0µs (4096 elem)
│   │   └── batched_q4k_gemv  <<<32,256,1>>> smem=4096B
│   ├── AttentionScore  150.0µs (4096 elem)
│   │   └── incremental_attention  <<<8,256,1>>> smem=2048B
│   └── GateProjection  300.0µs (4096 elem)
│       └── batched_q6k_gemv  <<<64,256,1>>> smem=8192B
└── Layer 1
    ├── RmsNorm  50.0µs (4096 elem)
    │   └── rmsnorm_kernel  <<<16,256,1>>> smem=1024B
    ...

Use Cases

Use Case          | Method           | Dependencies
------------------|------------------|----------------------
CI/CD logs        | to_ascii_tree()  | None
Snapshot tests    | to_ascii_tree()  | None
File export       | to_ascii_tree()  | None
Interactive TUI   | to_tree_node()   | presentar-tui feature
SVG visualization | to_dot()         | External graphviz
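
The snapshot-test row deserves a concrete illustration: compare the rendered tree against a golden string. A minimal, self-contained sketch (the hard-coded `render` stands in for `profiler.execution_graph().to_ascii_tree()`, and the golden string is invented):

```rust
// Snapshot-test sketch: compare the rendered ASCII tree to a golden string.
// `render` stands in for `profiler.execution_graph().to_ascii_tree()`.
fn render() -> String {
    "Layer 0\n└── QkvProjection  200.0µs (4096 elem)\n".to_string()
}

fn main() {
    let golden = "Layer 0\n└── QkvProjection  200.0µs (4096 elem)\n";
    assert_eq!(render(), golden, "execution-graph snapshot changed");
    println!("snapshot ok");
}
```

Because the output is plain text with no timestamps or pointers beyond what the profiler records, it diffs cleanly in CI.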

PTX Registry

Track PTX source code for kernel debugging:

use std::path::Path;
use trueno::PtxRegistry;

let ptx_source = "/* PTX text, e.g. loaded from a .ptx file */";
let mut registry = PtxRegistry::new();
registry.register("kernel_name", ptx_source, Some(Path::new("src/kernel.ptx")));

// Look up the source later by content hash
let hash = PtxRegistry::hash_ptx(ptx_source);
let source = registry.lookup(hash);
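
Under the hood this is content-addressed storage: sources are stored and retrieved by a hash of their text. The idea can be sketched with a plain `HashMap` (FNV-1a is used here for illustration; trueno's actual hash function and types are not shown in this document):

```rust
use std::collections::HashMap;

// Sketch of a content-addressed source registry. FNV-1a stands in for
// whatever hash trueno actually uses.
fn fnv1a(s: &str) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

struct Registry {
    sources: HashMap<u64, String>,
}

impl Registry {
    fn new() -> Self {
        Registry { sources: HashMap::new() }
    }
    // Store the source under its content hash and return the hash.
    fn register(&mut self, src: &str) -> u64 {
        let hash = fnv1a(src);
        self.sources.insert(hash, src.to_string());
        hash
    }
    // Retrieve a source by hash, if it was registered.
    fn lookup(&self, hash: u64) -> Option<&String> {
        self.sources.get(&hash)
    }
}

fn main() {
    let mut reg = Registry::new();
    let hash = reg.register(".visible .entry rmsnorm_kernel() { ret; }");
    assert!(reg.lookup(hash).is_some());
    assert!(reg.lookup(hash ^ 1).is_none()); // unknown hash → no source
    println!("lookup ok");
}
```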

Graphviz Export

# Generate a DOT file from the example binary
cargo run --example execution_graph 2>/dev/null | grep -A1000 "digraph" > graph.dot

# Render to SVG (requires graphviz)
dot -Tsvg graph.dot -o graph.svg

Or in code:

let dot = profiler.graph_to_dot();
std::fs::write("graph.dot", dot)?;

Query Helpers

let graph = profiler.execution_graph();

// Find all kernel nodes
for (id, node) in graph.kernel_nodes() {
    println!("{}: {:?}", id.0, node);
}

// Find the slowest kernel-backed brick
if let Some((_id, node, timing_ns)) = graph.slowest_kernel() {
    println!("Bottleneck: {:?} at {}µs", node, timing_ns / 1000);
}

// Check scope balance
assert!(graph.is_scope_balanced());

Critical Path Analysis (Phase 9)

Identify true bottlenecks vs parallelizable work using longest-path analysis:

use trueno::ExecutionGraph;

// After recording execution...
let (critical_path, total_ns) = graph.critical_path();

println!("Critical path: {} nodes, {:.2}ms total",
    critical_path.len(),
    total_ns as f64 / 1_000_000.0);

// Get formatted summary with parallelization opportunities
println!("{}", graph.critical_path_summary());

Output:

Critical Path: 0.70ms (3 nodes)
──────────────────────────────────────────────────
┌ RmsNorm (100.0µs)
│ QkvProjection (200.0µs)
└ GateProjection (400.0µs)

Parallelization Opportunities (high slack):
  AttentionScore slack=100.0µs
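
The analysis itself is a longest-path computation over the timed DAG: the critical path is the chain of dependent nodes with the largest cumulative duration. A self-contained sketch with made-up node durations (trueno's internal graph representation will differ):

```rust
// Critical-path sketch: longest path by cumulative duration, via a
// topological dynamic program. Assumes edges (u, v) satisfy u < v and are
// sorted by source, so finish[u] is final before u's out-edges are relaxed.
fn critical_path(durations: &[u64], edges: &[(usize, usize)]) -> (Vec<usize>, u64) {
    let n = durations.len();
    // finish[v] = longest cumulative time of any path ending at v
    let mut finish: Vec<u64> = durations.to_vec();
    let mut pred: Vec<Option<usize>> = vec![None; n];
    for &(u, v) in edges {
        if finish[u] + durations[v] > finish[v] {
            finish[v] = finish[u] + durations[v];
            pred[v] = Some(u);
        }
    }
    // Walk back from the node with the largest finish time.
    let (mut cur, &total) = finish.iter().enumerate().max_by_key(|&(_, &d)| d).unwrap();
    let mut path = vec![cur];
    while let Some(p) = pred[cur] {
        path.push(p);
        cur = p;
    }
    path.reverse();
    (path, total)
}

fn main() {
    // Toy graph (all times in ns): 0 RmsNorm 100µs, 1 QkvProjection 200µs,
    // 2 AttentionScore 150µs, 3 GateProjection 400µs.
    let durations = [100_000, 200_000, 150_000, 400_000];
    // 0 → {1, 2}, {1, 2} → 3: node 2 runs in parallel with node 1.
    let edges = [(0, 1), (0, 2), (1, 3), (2, 3)];
    let (path, total) = critical_path(&durations, &edges);
    assert_eq!(path, vec![0, 1, 3]); // the 200µs branch dominates
    assert_eq!(total, 700_000);      // 0.70ms over 3 nodes
    println!("critical path {:?}, total {}ns", path, total);
}
```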

Slack Calculation

Nodes with positive slack can be parallelized without affecting total time:

let slack = graph.compute_slack();

for (node_id, slack_ns) in &slack {
    if *slack_ns > 0 {
        println!("Node {} can be delayed by {}µs", node_id.0, slack_ns / 1000);
    }
}
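
Slack falls out of a forward and a backward longest-path pass: slack(v) = critical-path length minus the longest path that goes through v. A self-contained sketch with toy durations (an illustration of the standard technique, not trueno's implementation):

```rust
// Slack sketch: slack(v) = total - (longest path through v). Assumes edges
// (u, v) satisfy u < v and are sorted by source.
fn compute_slack(durations: &[u64], edges: &[(usize, usize)]) -> Vec<u64> {
    let n = durations.len();
    // Forward pass: longest path ending at v (inclusive of v).
    let mut finish = durations.to_vec();
    for &(u, v) in edges {
        finish[v] = finish[v].max(finish[u] + durations[v]);
    }
    // Backward pass: longest path strictly after v.
    let mut tail = vec![0u64; n];
    for &(u, v) in edges.iter().rev() {
        tail[u] = tail[u].max(tail[v] + durations[v]);
    }
    // Critical-path length = max over v of the path through v.
    let total = (0..n).map(|v| finish[v] + tail[v]).max().unwrap();
    (0..n).map(|v| total - (finish[v] + tail[v])).collect()
}

fn main() {
    // Toy graph (ns): 0 RmsNorm, 1 QkvProjection, 2 AttentionScore,
    // 3 GateProjection; node 2 runs in parallel with node 1.
    let durations = [100_000, 200_000, 150_000, 400_000];
    let edges = [(0, 1), (0, 2), (1, 3), (2, 3)];
    let slack = compute_slack(&durations, &edges);
    // Only AttentionScore (node 2) is off the critical path: 50µs of slack.
    assert_eq!(slack, vec![0, 0, 50_000, 0]);
    println!("slack (ns): {:?}", slack);
}
```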

Roofline Integration

Measure distance from theoretical peak performance:

// Device: RTX 4090 (83 TFLOPS, 1008 GB/s)
let distances = graph.roofline_distance(83.0, 1008.0);

for (node_id, distance) in &distances {
    let efficiency = (1.0 - distance) * 100.0;
    println!("Kernel {} at {:.1}% of roofline", node_id.0, efficiency);
}
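
The distance metric follows the standard roofline model: attainable throughput is the minimum of peak compute and arithmetic intensity times memory bandwidth, and distance measures how far achieved throughput falls below that ceiling. A sketch of the formula (trueno's exact definition may differ):

```rust
// Standard roofline model: a kernel with arithmetic intensity `ai`
// (FLOPs/byte) can attain at most min(peak_tflops, ai × bandwidth).
// Distance-from-roofline is 1 - achieved/attainable.
fn roofline_distance(peak_tflops: f64, bw_gbs: f64, ai: f64, achieved_tflops: f64) -> f64 {
    let memory_ceiling = ai * bw_gbs / 1000.0; // GB/s × FLOPs/byte → TFLOPS
    let attainable = peak_tflops.min(memory_ceiling);
    1.0 - achieved_tflops / attainable
}

fn main() {
    // RTX 4090-class device: 83 TFLOPS peak, 1008 GB/s bandwidth.
    // At ai = 50 FLOPs/byte the memory ceiling is 50.4 TFLOPS < 83 TFLOPS,
    // so the kernel is memory-bound and measured against 50.4, not 83.
    let d = roofline_distance(83.0, 1008.0, 50.0, 42.0);
    assert!((d - (1.0 - 42.0 / 50.4)).abs() < 1e-9);
    println!("distance = {:.3}, efficiency = {:.1}%", d, (1.0 - d) * 100.0);
}
```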

Record kernels with roofline metrics:

graph.record_kernel_launch_with_metrics(
    "matmul_kernel",
    ptx_hash,
    (128, 1, 1),      // grid
    (256, 1, 1),      // block
    16384,            // shared_mem
    150_000,          // timing_ns
    50.0,             // arithmetic_intensity (FLOPs/byte)
    42.0,             // achieved_tflops
);

Data Movement Tracking

Track H2D/D2H/D2D transfers and detect wasteful ping-pong patterns:

use trueno::TransferDirection;

// Record transfers
graph.record_transfer("host_weights", "device_weights",
    4 * 1024 * 1024, // 4MB
    TransferDirection::H2D,
    Some(50_000)); // 50µs

// Detect ping-pong anti-pattern
let ping_pongs = graph.detect_ping_pong();
if !ping_pongs.is_empty() {
    println!("Warning: {} wasteful transfer patterns detected", ping_pongs.len());
}
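
A ping-pong is a buffer copied to the device and then straight back to the host (or the reverse), which usually means the intermediate work could have stayed on one side. A self-contained sketch of the detection idea over a transfer log (an illustration, not trueno's implementation):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Dir {
    H2D,
    D2H,
}

// Ping-pong sketch: flag consecutive transfers of the same buffer that
// reverse direction (H2D then D2H, or the other way round).
fn detect_ping_pong(log: &[(&str, Dir)]) -> Vec<String> {
    let mut flagged = Vec::new();
    for w in log.windows(2) {
        let ((a, da), (b, db)) = (w[0], w[1]);
        if a == b && da != db {
            flagged.push(a.to_string());
        }
    }
    flagged
}

fn main() {
    let log = [
        ("weights", Dir::H2D),
        ("activations", Dir::H2D),
        ("activations", Dir::D2H), // ping-pong: uploaded, then straight back
        ("logits", Dir::D2H),
    ];
    let hits = detect_ping_pong(&log);
    assert_eq!(hits, vec!["activations".to_string()]);
    println!("{} wasteful transfer patterns detected", hits.len());
}
```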

Edge Types

Edge Type | Purpose
----------|------------------------------------------
Contains  | Layer contains bricks
Launches  | Brick launches kernel
Calls     | Function calls function
Sequence  | Sequential execution
DependsOn | CUDA event dependency
Transfer  | Memory transfer with bytes and direction
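
One natural way to model these edge types is an enum whose Transfer variant carries its payload. A sketch based on the table (variant names are taken from the table; trueno's actual definition and payloads may differ):

```rust
// Sketch of an edge-type enum mirroring the table; payloads are assumptions.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum TransferDirection {
    H2D,
    D2H,
    D2D,
}

#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum EdgeType {
    Contains,                                        // layer → brick
    Launches,                                        // brick → kernel
    Calls,                                           // function → function
    Sequence,                                        // sequential execution
    DependsOn,                                       // CUDA event dependency
    Transfer { bytes: u64, dir: TransferDirection }, // memory transfer
}

fn main() {
    let e = EdgeType::Transfer { bytes: 4 * 1024 * 1024, dir: TransferDirection::H2D };
    // Only Transfer edges carry a byte count, so size queries can pattern-match.
    assert!(matches!(e, EdgeType::Transfer { bytes, .. } if bytes == 4_194_304));
    println!("{:?}", e);
}
```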

Integration with realizar

The execution graph integrates with the realizar inference engine:

// In CudaExecutor
executor.set_profiler_sync_mode(SyncMode::Deferred);

// During forward pass - graph records automatically
let timer = executor.start_brick_id(BrickId::QkvProjection);
// ... kernel launch ...
executor.stop_brick_id(timer, hidden_dim as u64);

// Export graph after inference
let graph = executor.profiler().execution_graph();
println!("{}", graph.to_ascii_tree());

// Phase 9: Analyze critical path
println!("{}", graph.critical_path_summary());