PTX Optimization Passes

This chapter documents the PTX optimization passes in trueno-gpu, aligned with NVIDIA's official CUDA Tile IR (CUDA Toolkit 13.1).

Overview

The trueno_gpu::ptx::optimize module provides four optimization passes:

Pass                   Description                  Benefit
FMA Fusion             mul + add → fma              Reduced latency, single rounding
Loop Splitting         Conditional loop splitting   Eliminates branch divergence
Token-Based Ordering   Memory dependency tracking   Barrier elimination
Tile Validation        Power-of-two constraints     Prevents register pressure

FMA Fusion Pass

The FMA (Fused Multiply-Add) fusion pass detects mul + add instruction patterns and fuses them into a single fma instruction.
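At the PTX level this turns a mul.f32 followed by a dependent add.f32 into a single fma.rn.f32. A minimal sketch of the matching core, using simplified stand-in types rather than the real PtxInstruction (the actual pass in fma_fusion.rs also checks operand types and other uses of the intermediate register):

#[derive(Clone, Copy, PartialEq)]
enum Op { Mul, Add, Fma }

#[derive(Clone, Copy)]
struct Instr { op: Op, dst: u32, srcs: [u32; 3] } // unused src slots are 0

// Fuse `t = a * b; d = t + c` into `d = fma(a, b, c)`.
fn fuse_adjacent(instrs: &[Instr]) -> Vec<Instr> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < instrs.len() {
        if let [m, a, ..] = &instrs[i..] {
            // Pattern: the mul result feeds the immediately following add.
            if m.op == Op::Mul && a.op == Op::Add && a.srcs[0] == m.dst {
                out.push(Instr { op: Op::Fma, dst: a.dst, srcs: [m.srcs[0], m.srcs[1], a.srcs[1]] });
                i += 2;
                continue;
            }
        }
        out.push(instrs[i]);
        i += 1;
    }
    out
}

The key legality condition is that the mul result feeds only the add; otherwise the intermediate value must stay live and the pair cannot be collapsed.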

Benefits

  • Latency: Single instruction instead of two
  • Precision: Single rounding operation (IEEE 754 compliant)
  • Throughput: Utilizes GPU FMA units efficiently

Example

use trueno_gpu::ptx::optimize::fma_fusion;
use trueno_gpu::ptx::{Operand, PtxInstruction, PtxOp, PtxType, VirtualReg};

// Create mul + add pattern
let r0 = VirtualReg::new(0, PtxType::F32);
let r1 = VirtualReg::new(1, PtxType::F32);
let r2 = VirtualReg::new(2, PtxType::F32);
let r3 = VirtualReg::new(3, PtxType::F32);

let mul = PtxInstruction::new(PtxOp::Mul, PtxType::F32)
    .dst(Operand::Reg(r2.clone()))
    .src(Operand::Reg(r0.clone()))
    .src(Operand::Reg(r1.clone()));

let add = PtxInstruction::new(PtxOp::Add, PtxType::F32)
    .dst(Operand::Reg(r3))
    .src(Operand::Reg(r2))
    .src(Operand::ImmF32(1.0));

// Fuse to single FMA instruction
let fused = fma_fusion::pass(vec![mul, add]);
assert_eq!(fused.len(), 1); // mul + add → fma

Academic Reference

Based on Click & Paleczny (1995) "A Simple Graph-Based Intermediate Representation" for SSA pattern matching.

Loop Splitting Pass

The loop splitting pass analyzes conditional loops and identifies opportunities to split them at condition boundaries, eliminating branch divergence in GPU warps.
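The payoff is easiest to see on a scalar analogue of the transformation itself (heavy and light are placeholder bodies, not library API):

fn heavy(_i: usize) {} // stand-in for loads/stores/MMA
fn light(_i: usize) {} // stand-in for cheap arithmetic

// Before: the branch is re-evaluated every iteration, and threads in a
// warp whose `i` straddles `split` diverge onto different paths.
fn fused(n: usize, split: usize) {
    for i in 0..n {
        if i < split { heavy(i); } else { light(i); }
    }
}

// After: two branch-free loops; every warp executes a single path.
fn split_at(n: usize, split: usize) {
    let s = split.min(n);
    for i in 0..s { heavy(i); }
    for i in s..n { light(i); }
}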

Heavy Operations

The following operations mark a loop body as heavy, making a split profitable (a sketch of the resulting check follows the list):

  • Ld - Memory loads
  • St - Memory stores
  • WmmaMma - Tensor Core MMA
  • WmmaLoadA, WmmaLoadB, WmmaLoadC - WMMA fragment loads
  • WmmaStoreD - WMMA fragment stores
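A minimal sketch of that check, using a self-contained op enum that mirrors the list above; it assumes the heuristic is "heavy op present and more than one iteration", and the crate's exact weighting may differ:

enum Op { Ld, St, WmmaMma, WmmaLoadA, WmmaLoadB, WmmaLoadC, WmmaStoreD, Add, Mul }

// Splitting pays off only when the body does enough work per iteration
// to amortize the duplicated loop overhead.
fn is_split_profitable(body: &[Op], trip_count: u32) -> bool {
    let has_heavy_op = body.iter().any(|op| matches!(op,
        Op::Ld | Op::St | Op::WmmaMma |
        Op::WmmaLoadA | Op::WmmaLoadB | Op::WmmaLoadC | Op::WmmaStoreD));
    has_heavy_op && trip_count > 1
}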

Example

use trueno_gpu::ptx::optimize::loop_split;
use trueno_gpu::ptx::{PtxInstruction, PtxOp, PtxType, CmpOp};

// Check profitability
let heavy_op = PtxInstruction::new(PtxOp::Ld, PtxType::F32);
assert!(loop_split::is_split_profitable(&[heavy_op], 10));

let light_op = PtxInstruction::new(PtxOp::Add, PtxType::F32);
assert!(!loop_split::is_split_profitable(&[light_op], 10));

// Split point alignment for non-unit steps
assert_eq!(loop_split::align_split_point(5, 0, 4), 8);
assert_eq!(loop_split::align_split_point(8, 0, 4), 8);

// Loop predicate conversion
assert_eq!(
    loop_split::LoopPredicate::from_cmp_op(CmpOp::Lt),
    Some(loop_split::LoopPredicate::LessThan)
);
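align_split_point rounds the split point up to the next step boundary measured from the loop start. A hypothetical one-function equivalent, consistent with the two assertions above:

// Round `point` up to the next multiple of `step`, counting from `start`.
fn align_up(point: u32, start: u32, step: u32) -> u32 {
    start + (point - start).div_ceil(step) * step
}

assert_eq!(align_up(5, 0, 4), 8); // 5 rounds up to the 4-step boundary at 8
assert_eq!(align_up(8, 0, 4), 8); // already aligned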

NVIDIA Reference

Aligned with LoopSplit.cpp from NVIDIA CUDA Tile IR (CUDA Toolkit 13.1).

Token-Based Ordering (TKO)

Token-Based Ordering provides explicit memory dependency tracking, enabling compiler-driven barrier elimination.

Memory Ordering Semantics

Ordering   PTX Modifier   Description
Weak       .weak          No ordering guarantees
Relaxed    .relaxed       Relaxed consistency
Acquire    .acquire       Acquire semantics
Release    .release       Release semantics

Memory Scopes

Scope     PTX Modifier   Description
Thread    .cta           Thread-local
Block     .cta           Block-local
Cluster   .cluster       Cluster-local
Device    .gpu           Device-wide
System    .sys           System-wide
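
Ordering and scope compose directly in the emitted instruction; for instance, an acquire load at device scope carries both modifiers. A short sketch built on the accessors demonstrated in the example below (the crate's actual emitter may assemble opcodes differently):

use trueno_gpu::ptx::optimize::tko;

let ordering = tko::MemoryOrdering::Acquire;
let scope = tko::MemoryScope::Device;

// "ld" + ".acquire" + ".gpu" → the opcode prefix of the final load
let opcode = format!("ld{}{}", ordering.to_ptx_modifier(), scope.to_ptx_scope());
assert_eq!(opcode, "ld.acquire.gpu");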

Example

use trueno_gpu::ptx::optimize::tko;

// Create tokens for memory operations
let t1 = tko::Token::new();
let t2 = tko::Token::new();
let t3 = tko::Token::new();

// Join tokens at synchronization point
let joined = tko::join_tokens(&[t1, t2, t3]);

// Memory ordering
let ordering = tko::MemoryOrdering::Acquire;
assert_eq!(ordering.to_ptx_modifier(), ".acquire");

// Memory scope
let scope = tko::MemoryScope::Device;
assert_eq!(scope.to_ptx_scope(), ".gpu");

// Token graph with cycle detection
let mut graph = tko::TokenGraph::new();
let ta = tko::Token::new();
let tb = tko::Token::new();
let tc = tko::Token::new();

graph.create_token(ta);
graph.create_token(tb);
graph.create_token(tc);
graph.add_dependency(tb, ta);
graph.add_dependency(tc, tb);

assert!(!graph.has_cycle()); // No deadlock

graph.add_dependency(ta, tc);
assert!(graph.has_cycle()); // DEADLOCK!
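
The cycle check is a classic three-color depth-first search over the dependency edges. A standalone sketch of the idea (not the internals of tko::TokenGraph):

use std::collections::HashMap;

// node → nodes it depends on
type Edges = HashMap<u32, Vec<u32>>;

fn has_cycle(edges: &Edges) -> bool {
    // 0 = unvisited, 1 = on the current DFS path, 2 = fully explored
    fn dfs(n: u32, edges: &Edges, color: &mut HashMap<u32, u8>) -> bool {
        match color.get(&n).copied().unwrap_or(0) {
            1 => return true,  // back edge onto the active path: cycle
            2 => return false, // already known cycle-free
            _ => {}
        }
        color.insert(n, 1);
        for &m in edges.get(&n).into_iter().flatten() {
            if dfs(m, edges, color) {
                return true;
            }
        }
        color.insert(n, 2);
        false
    }
    let mut color = HashMap::new();
    edges.keys().any(|&n| dfs(n, edges, &mut color))
}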

NVIDIA Reference

Aligned with memory_consistency_ops.mlir from NVIDIA CUDA Tile IR.

Tile Validation

Tile validation enforces constraints to prevent register pressure issues and compilation hangs.

Constraints

  1. Power-of-two dimensions: Required for efficient GPU scheduling
  2. Maximum tile elements: 16M elements to prevent register spills
  3. Maximum single dimension: 4096 to prevent degenerate shapes
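
A sketch of the checks these constraints imply, using the limits from the list above (hypothetical helper; the real validator in tile_validation.rs may structure its errors differently):

fn validate_shape(dims: &[u64]) -> Result<(), String> {
    const MAX_DIM: u64 = 4096;
    const MAX_ELEMENTS: u64 = 16 * 1024 * 1024; // 16M

    for &d in dims {
        if !d.is_power_of_two() {
            return Err(format!("dimension {d} is not a power of two"));
        }
        if d > MAX_DIM {
            return Err(format!("dimension {d} exceeds the {MAX_DIM} limit"));
        }
    }
    let elements: u64 = dims.iter().product();
    if elements > MAX_ELEMENTS {
        return Err(format!("{elements} elements exceed the 16M limit"));
    }
    Ok(())
}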

WMMA Valid Shapes

Shape       Description
M16N16K16   Standard 16×16×16
M8N32K16    Alternate 8×32×16
M32N8K16    Alternate 32×8×16

Example

use trueno_gpu::ptx::optimize::tile_validation;
use trueno_gpu::ptx::WmmaShape;

// Valid shapes
assert!(tile_validation::validate_shape(&[16, 16]).is_ok());
assert!(tile_validation::validate_shape(&[32, 32]).is_ok());
assert!(tile_validation::validate_shape(&[64, 64]).is_ok());

// Invalid shapes
assert!(tile_validation::validate_shape(&[17, 16]).is_err()); // Not power of two
assert!(tile_validation::validate_shape(&[100, 100]).is_err());

// WMMA shapes
let valid_wmma = WmmaShape::M16N16K16;
assert!(tile_validation::validate_wmma_shape(&valid_wmma).is_ok());

let invalid_wmma = WmmaShape { m: 24, n: 24, k: 16 };
assert!(tile_validation::validate_wmma_shape(&invalid_wmma).is_err());

Academic Reference

Based on Volkov & Demmel (2008) "Benchmarking GPUs to Tune Dense Linear Algebra".

Running the Example

cargo run --example ptx_optimize

Output:

╔══════════════════════════════════════════════════════════════╗
║     PTX Optimization Passes (NVIDIA CUDA Tile IR Aligned)    ║
╚══════════════════════════════════════════════════════════════╝

1️⃣  FMA FUSION PASS
   Input:  2 instructions (mul + add)
   Output: 1 instruction (fma)

2️⃣  LOOP SPLITTING PASS
   Heavy ops trigger split: true
   Light ops trigger split: false

3️⃣  TOKEN-BASED ORDERING (TKO)
   Tokens created with unique IDs
   Cycle detection: working

4️⃣  TILE VALIDATION
   Power-of-two shapes: OK
   Invalid shapes: rejected

✅ All optimization demos completed successfully!

Specification

Full specification: cuda-tile-behavior.md (v1.4.0)

Coverage

Module               Coverage
fma_fusion.rs        93.75%
loop_split.rs        99.80%
tko.rs               94.29%
tile_validation.rs   88.64%
Total                94.28%