PTX Optimization Passes

This chapter documents the PTX optimization passes in trueno-gpu, aligned with NVIDIA's official CUDA Tile IR (CUDA Toolkit 13.1).

Overview

The trueno_gpu::ptx::optimize module provides four optimization passes:

Pass                   Description                  Benefit
FMA Fusion             mul + add → fma              Reduced latency, single rounding
Loop Splitting         Conditional loop splitting   Eliminates branch divergence
Token-Based Ordering   Memory dependency tracking   Barrier elimination
Tile Validation        Power-of-two constraints     Prevents register pressure

FMA Fusion Pass

The FMA (Fused Multiply-Add) fusion pass detects mul + add instruction patterns and fuses them into a single fma instruction.
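At the PTX level this turns a mul.f32 followed by a dependent add.f32 into a single fma.rn.f32. A minimal sketch of the matching core, using simplified stand-in types rather than the real PtxInstruction (the actual pass in fma_fusion.rs also checks operand types and other uses of the intermediate register):

#[derive(Clone, Copy, PartialEq)]
enum Op { Mul, Add, Fma }

#[derive(Clone, Copy)]
struct Instr { op: Op, dst: u32, srcs: [u32; 3] } // unused src slots are 0

// Fuse `t = a * b; d = t + c` into `d = fma(a, b, c)`.
fn fuse_adjacent(instrs: &[Instr]) -> Vec<Instr> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < instrs.len() {
        if let [m, a, ..] = &instrs[i..] {
            // Pattern: the mul result feeds the immediately following add.
            if m.op == Op::Mul && a.op == Op::Add && a.srcs[0] == m.dst {
                out.push(Instr { op: Op::Fma, dst: a.dst, srcs: [m.srcs[0], m.srcs[1], a.srcs[1]] });
                i += 2;
                continue;
            }
        }
        out.push(instrs[i]);
        i += 1;
    }
    out
}

The key legality condition is that the mul result feeds only the add; otherwise the intermediate value must stay live and the pair cannot be collapsed.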

Benefits

  • Latency: Single instruction instead of two
  • Precision: Single rounding operation (IEEE 754 compliant)
  • Throughput: Utilizes GPU FMA units efficiently

Example

use trueno_gpu::ptx::optimize::fma_fusion;
use trueno_gpu::ptx::{Operand, PtxInstruction, PtxOp, PtxType, VirtualReg};

// Create mul + add pattern
let r0 = VirtualReg::new(0, PtxType::F32);
let r1 = VirtualReg::new(1, PtxType::F32);
let r2 = VirtualReg::new(2, PtxType::F32);
let r3 = VirtualReg::new(3, PtxType::F32);

let mul = PtxInstruction::new(PtxOp::Mul, PtxType::F32)
    .dst(Operand::Reg(r2.clone()))
    .src(Operand::Reg(r0.clone()))
    .src(Operand::Reg(r1.clone()));

let add = PtxInstruction::new(PtxOp::Add, PtxType::F32)
    .dst(Operand::Reg(r3))
    .src(Operand::Reg(r2))
    .src(Operand::ImmF32(1.0));

// Fuse to single FMA instruction
let fused = fma_fusion::pass(vec![mul, add]);
assert_eq!(fused.len(), 1); // mul + add → fma

Academic Reference

Based on Click & Paleczny (1995) "A Simple Graph-Based Intermediate Representation" for SSA pattern matching.

Loop Splitting Pass

The loop splitting pass analyzes conditional loops and identifies opportunities to split them at condition boundaries, eliminating branch divergence in GPU warps.
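The payoff is easiest to see on a scalar analogue of the transformation itself (heavy and light are placeholder bodies, not library API):

fn heavy(_i: usize) {} // stand-in for loads/stores/MMA
fn light(_i: usize) {} // stand-in for cheap arithmetic

// Before: the branch is re-evaluated every iteration, and threads in a
// warp whose `i` straddles `split` diverge onto different paths.
fn fused(n: usize, split: usize) {
    for i in 0..n {
        if i < split { heavy(i); } else { light(i); }
    }
}

// After: two branch-free loops; every warp executes a single path.
fn split_at(n: usize, split: usize) {
    let s = split.min(n);
    for i in 0..s { heavy(i); }
    for i in s..n { light(i); }
}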

Heavy Operations

The following operations mark a loop body as heavy, making a split profitable (a sketch of the resulting check follows the list):

  • Ld - Memory loads
  • St - Memory stores
  • WmmaMma - Tensor Core MMA
  • WmmaLoadA, WmmaLoadB, WmmaLoadC - WMMA fragment loads
  • WmmaStoreD - WMMA fragment stores
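A minimal sketch of that check, using a self-contained op enum that mirrors the list above; it assumes the heuristic is "heavy op present and more than one iteration", and the crate's exact weighting may differ:

enum Op { Ld, St, WmmaMma, WmmaLoadA, WmmaLoadB, WmmaLoadC, WmmaStoreD, Add, Mul }

// Splitting pays off only when the body does enough work per iteration
// to amortize the duplicated loop overhead.
fn is_split_profitable(body: &[Op], trip_count: u32) -> bool {
    let has_heavy_op = body.iter().any(|op| matches!(op,
        Op::Ld | Op::St | Op::WmmaMma |
        Op::WmmaLoadA | Op::WmmaLoadB | Op::WmmaLoadC | Op::WmmaStoreD));
    has_heavy_op && trip_count > 1
}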

Example

use trueno_gpu::ptx::optimize::loop_split;
use trueno_gpu::ptx::{PtxInstruction, PtxOp, PtxType, CmpOp};

// Check profitability
let heavy_op = PtxInstruction::new(PtxOp::Ld, PtxType::F32);
assert!(loop_split::is_split_profitable(&[heavy_op], 10));

let light_op = PtxInstruction::new(PtxOp::Add, PtxType::F32);
assert!(!loop_split::is_split_profitable(&[light_op], 10));

// Split point alignment for non-unit steps
assert_eq!(loop_split::align_split_point(5, 0, 4), 8);
assert_eq!(loop_split::align_split_point(8, 0, 4), 8);

// Loop predicate conversion
assert_eq!(
    loop_split::LoopPredicate::from_cmp_op(CmpOp::Lt),
    Some(loop_split::LoopPredicate::LessThan)
);
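align_split_point rounds the split point up to the next step boundary measured from the loop start. A hypothetical one-function equivalent, consistent with the two assertions above:

// Round `point` up to the next multiple of `step`, counting from `start`.
fn align_up(point: u32, start: u32, step: u32) -> u32 {
    start + (point - start).div_ceil(step) * step
}

assert_eq!(align_up(5, 0, 4), 8); // 5 rounds up to the 4-step boundary at 8
assert_eq!(align_up(8, 0, 4), 8); // already aligned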

NVIDIA Reference

Aligned with LoopSplit.cpp from NVIDIA CUDA Tile IR (CUDA Toolkit 13.1).

Token-Based Ordering (TKO)

Token-Based Ordering provides explicit memory dependency tracking, enabling compiler-driven barrier elimination.

Memory Ordering Semantics

Ordering   PTX Modifier   Description
Weak       .weak          No ordering guarantees
Relaxed    .relaxed       Relaxed consistency
Acquire    .acquire       Acquire semantics
Release    .release       Release semantics

Memory Scopes

Scope     PTX Modifier   Description
Thread    .cta           Thread-local
Block     .cta           Block-local
Cluster   .cluster       Cluster-local
Device    .gpu           Device-wide
System    .sys           System-wide
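
Ordering and scope compose directly in the emitted instruction; for instance, an acquire load at device scope carries both modifiers. A short sketch built on the accessors demonstrated in the example below (the crate's actual emitter may assemble opcodes differently):

use trueno_gpu::ptx::optimize::tko;

let ordering = tko::MemoryOrdering::Acquire;
let scope = tko::MemoryScope::Device;

// "ld" + ".acquire" + ".gpu" → the opcode prefix of the final load
let opcode = format!("ld{}{}", ordering.to_ptx_modifier(), scope.to_ptx_scope());
assert_eq!(opcode, "ld.acquire.gpu");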

Example

use trueno_gpu::ptx::optimize::tko;

// Create tokens for memory operations
let t1 = tko::Token::new();
let t2 = tko::Token::new();
let t3 = tko::Token::new();

// Join tokens at synchronization point
let joined = tko::join_tokens(&[t1, t2, t3]);

// Memory ordering
let ordering = tko::MemoryOrdering::Acquire;
assert_eq!(ordering.to_ptx_modifier(), ".acquire");

// Memory scope
let scope = tko::MemoryScope::Device;
assert_eq!(scope.to_ptx_scope(), ".gpu");

// Token graph with cycle detection
let mut graph = tko::TokenGraph::new();
let ta = tko::Token::new();
let tb = tko::Token::new();
let tc = tko::Token::new();

graph.create_token(ta);
graph.create_token(tb);
graph.create_token(tc);
graph.add_dependency(tb, ta);
graph.add_dependency(tc, tb);

assert!(!graph.has_cycle()); // No deadlock

graph.add_dependency(ta, tc);
assert!(graph.has_cycle()); // DEADLOCK!
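
The cycle check is a classic three-color depth-first search over the dependency edges. A standalone sketch of the idea (not the internals of tko::TokenGraph):

use std::collections::HashMap;

// node → nodes it depends on
type Edges = HashMap<u32, Vec<u32>>;

fn has_cycle(edges: &Edges) -> bool {
    // 0 = unvisited, 1 = on the current DFS path, 2 = fully explored
    fn dfs(n: u32, edges: &Edges, color: &mut HashMap<u32, u8>) -> bool {
        match color.get(&n).copied().unwrap_or(0) {
            1 => return true,  // back edge onto the active path: cycle
            2 => return false, // already known cycle-free
            _ => {}
        }
        color.insert(n, 1);
        for &m in edges.get(&n).into_iter().flatten() {
            if dfs(m, edges, color) {
                return true;
            }
        }
        color.insert(n, 2);
        false
    }
    let mut color = HashMap::new();
    edges.keys().any(|&n| dfs(n, edges, &mut color))
}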

NVIDIA Reference

Aligned with memory_consistency_ops.mlir from NVIDIA CUDA Tile IR.

Tile Validation

Tile validation enforces constraints to prevent register pressure issues and compilation hangs.

Constraints

  1. Power-of-two dimensions: Required for efficient GPU scheduling
  2. Maximum tile elements: 16M elements to prevent register spills
  3. Maximum single dimension: 4096 to prevent degenerate shapes
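
A sketch of the checks these constraints imply, using the limits from the list above (hypothetical helper; the real validator in tile_validation.rs may structure its errors differently):

fn validate_shape(dims: &[u64]) -> Result<(), String> {
    const MAX_DIM: u64 = 4096;
    const MAX_ELEMENTS: u64 = 16 * 1024 * 1024; // 16M

    for &d in dims {
        if !d.is_power_of_two() {
            return Err(format!("dimension {d} is not a power of two"));
        }
        if d > MAX_DIM {
            return Err(format!("dimension {d} exceeds the {MAX_DIM} limit"));
        }
    }
    let elements: u64 = dims.iter().product();
    if elements > MAX_ELEMENTS {
        return Err(format!("{elements} elements exceed the 16M limit"));
    }
    Ok(())
}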

WMMA Valid Shapes

Shape       Description
M16N16K16   Standard 16×16×16
M8N32K16    Alternate 8×32×16
M32N8K16    Alternate 32×8×16

Example

use trueno_gpu::ptx::optimize::tile_validation;
use trueno_gpu::ptx::WmmaShape;

// Valid shapes
assert!(tile_validation::validate_shape(&[16, 16]).is_ok());
assert!(tile_validation::validate_shape(&[32, 32]).is_ok());
assert!(tile_validation::validate_shape(&[64, 64]).is_ok());

// Invalid shapes
assert!(tile_validation::validate_shape(&[17, 16]).is_err()); // Not power of two
assert!(tile_validation::validate_shape(&[100, 100]).is_err());

// WMMA shapes
let valid_wmma = WmmaShape::M16N16K16;
assert!(tile_validation::validate_wmma_shape(&valid_wmma).is_ok());

let invalid_wmma = WmmaShape { m: 24, n: 24, k: 16 };
assert!(tile_validation::validate_wmma_shape(&invalid_wmma).is_err());

Academic Reference

Based on Volkov & Demmel (2008) "Benchmarking GPUs to Tune Dense Linear Algebra".

Running the Example

cargo run --example ptx_optimize

Output:

╔══════════════════════════════════════════════════════════════╗
║     PTX Optimization Passes (NVIDIA CUDA Tile IR Aligned)    ║
╚══════════════════════════════════════════════════════════════╝

1️⃣  FMA FUSION PASS
   Input:  2 instructions (mul + add)
   Output: 1 instruction (fma)

2️⃣  LOOP SPLITTING PASS
   Heavy ops trigger split: true
   Light ops trigger split: false

3️⃣  TOKEN-BASED ORDERING (TKO)
   Tokens created with unique IDs
   Cycle detection: working

4️⃣  TILE VALIDATION
   Power-of-two shapes: OK
   Invalid shapes: rejected

✅ All optimization demos completed successfully!

Specification

Full specification: cuda-tile-behavior.md (v1.4.0)

Coverage

Module               Coverage
fma_fusion.rs        93.75%
loop_split.rs        99.80%
tko.rs               94.29%
tile_validation.rs   88.64%
Total                94.28%