PTX Register Allocation Architecture

This chapter explains trueno-gpu's approach to register allocation, which delegates physical register assignment to NVIDIA's ptxas compiler. This is a pragmatic design that leverages 30+ years of GPU compiler optimization.

The Traditional Compiler Problem

In traditional compilers (like LLVM for x86), you must map an infinite number of variables to a finite set of physical registers (e.g., RAX, RDI, RSI on x86-64). This requires complex algorithms:

  • Graph Coloring: Model register interference as a graph, color with K colors (K = number of physical registers)
  • Linear Scan: Faster but less optimal allocation for JIT compilers

These algorithms are complex to implement correctly and require significant engineering effort.

Trueno's Strategy: Virtual Registers + ptxas

Trueno takes a different approach that leverages PTX's design as a virtual ISA:

┌─────────────────────────────────────────────────────────────┐
│  Trueno PTX Builder (Rust)                                  │
│  - Allocates unlimited virtual registers (%f0, %f1, ...)    │
│  - Tracks liveness for pressure REPORTING                   │
│  - Emits SSA-style PTX                                      │
└─────────────────────────────────────────────────────────────┘
                             │
                        PTX Source
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│  NVIDIA ptxas (JIT Compiler)                                │
│  - Graph coloring for physical register allocation          │
│  - Register spilling to local memory if needed              │
│  - Dead code elimination, constant folding, etc.            │
└─────────────────────────────────────────────────────────────┘
                             │
                        SASS Binary
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│  GPU Execution                                              │
└─────────────────────────────────────────────────────────────┘

How It Works

  1. Virtual Register Allocation: Each operation allocates a new virtual register with a monotonically increasing ID:
// In trueno-gpu's KernelBuilder
pub fn add_f32(&mut self, a: VirtualReg, b: VirtualReg) -> VirtualReg {
    // Allocate NEW virtual register (SSA style)
    let dst = self.registers.allocate_virtual(PtxType::F32);
    self.instructions.push(
        PtxInstruction::new(PtxOp::Add, PtxType::F32)
            .dst(Operand::Reg(dst))
            .src(Operand::Reg(a))
            .src(Operand::Reg(b))
    );
    dst  // Return %f2, %f3, %f4, etc.
}
  1. Per-Type Namespaces: PTX requires separate register namespaces per type:
TypePrefixExample
.f32%f%f0, %f1, %f2
.f64%fd%fd0, %fd1
.u32%r%r0, %r1
.u64%rd%rd0, %rd1
.pred%p%p0, %p1
  1. Emitted PTX: The builder emits register declarations and instructions:
.visible .entry vector_add(
    .param .u64 a_ptr,
    .param .u64 b_ptr,
    .param .u64 c_ptr,
    .param .u32 n
) {
    .reg .f32  %f<3>;    // Virtual registers %f0, %f1, %f2
    .reg .u32  %r<5>;    // Virtual registers %r0-4
    .reg .u64  %rd<7>;   // Virtual registers %rd0-6
    .reg .pred %p<1>;    // Predicate register %p0

    // Instructions use virtual registers
    mov.u32 %r0, %tid.x;
    mov.u32 %r1, %ctaid.x;
    // ...
    add.rn.f32 %f2, %f0, %f1;
    // ...
}
  1. ptxas Does the Rest: NVIDIA's ptxas compiler:
    • Builds an interference graph from virtual register liveness
    • Performs graph coloring to assign physical registers
    • Generates spill code if necessary (to .local memory)
    • Applies optimization passes

Why This Design?

1. Pragmatism (Avoid Muda)

NVIDIA has invested 30+ years into GPU compiler optimization. Reimplementing graph coloring would be:

  • Redundant (ptxas already does it)
  • Inferior (we can't match NVIDIA's GPU-specific knowledge)
  • Wasteful engineering effort (Muda in Toyota terms)

2. PTX is Designed for This

PTX (Parallel Thread Execution) is explicitly designed as a virtual ISA:

  • Unlimited virtual registers
  • SSA (Static Single Assignment) form
  • Meant to be lowered by a backend compiler

From the PTX ISA documentation:

"PTX defines a virtual machine and ISA for general purpose parallel thread execution."

3. Focus on What Matters

Trueno focuses on:

  • Algorithm correctness: Ensuring SIMD/GPU operations produce correct results
  • High-level optimization: Tiling, kernel fusion, memory access patterns
  • Developer experience: Safe, ergonomic Rust API

Low-level optimization (register allocation, instruction scheduling) is delegated to specialized tools.

Register Pressure Monitoring

While we don't perform graph coloring, we DO track liveness for diagnostics:

pub struct RegisterAllocator {
    type_counters: HashMap<PtxType, u32>,
    live_ranges: HashMap<(PtxType, u32), LiveRange>,
    spill_count: usize,  // Muda tracking
}

impl RegisterAllocator {
    pub fn pressure_report(&self) -> RegisterPressure {
        RegisterPressure {
            max_live: self.allocated.len(),
            spill_count: self.spill_count,
            utilization: max_live as f64 / 256.0,
        }
    }
}

Why Track Pressure?

  1. Developer Warnings: Alert when kernels exceed 256 registers/thread
  2. Occupancy Estimation: High register usage reduces concurrent threads
  3. Performance Debugging: Identify kernels that may suffer from register spills

GPU Register Limits

ArchitectureRegisters/ThreadRegisters/SM
Volta (sm_70)25665,536
Turing (sm_75)25665,536
Ampere (sm_80)25665,536
Ada (sm_89)25665,536

Occupancy Impact: If a kernel uses 64 registers/thread, an SM with 65,536 registers can run 1024 threads. If it uses 128 registers/thread, only 512 threads can run concurrently.

In-Place Operations for Register Reuse

For loops and accumulators, SSA-style allocation wastes registers:

// SSA style - allocates new register each iteration
for _ in 0..1000 {
    let new_sum = ctx.add_f32(sum, val);  // New register each time!
    sum = new_sum;
}

We provide in-place operations that reuse registers:

// In-place style - reuses existing register
let acc = ctx.mov_f32_imm(0.0);  // Allocate once
for _ in 0..1000 {
    ctx.add_f32_inplace(acc, val);  // Reuses %f0
}

Available In-Place Operations

OperationUse Case
add_u32_inplace(dst, imm)Loop counters
add_f32_inplace(dst, src)Accumulators
fma_f32_inplace(dst, a, b)GEMM accumulation
max_f32_inplace(dst, src)Online softmax
mul_f32_inplace(dst, src)Scaling
div_f32_inplace(dst, src)Normalization
shr_u32_inplace(dst, imm)Stride halving

Potential Future Enhancements

The current design delegates all register allocation to ptxas. Potential future enhancements (tracked in GitHub Issue #66):

1. Greedy Register Reuse

For kernels exceeding 256 registers, we could implement simple liveness-based reuse:

// Hypothetical future API
let allocator = RegisterAllocator::new()
    .with_reuse_strategy(ReuseStrategy::Greedy);

This would reuse %r2 after its last use, reducing virtual register count.

2. ptxas Output Parsing

Parse cuobjdump --dump-resource-usage output to validate:

  • Expected vs actual register usage
  • Spill detection
  • Occupancy calculation

3. Occupancy Calculator

Integrate NVIDIA's occupancy calculator to predict SM utilization before runtime.

Best Practices

1. Use In-Place Operations for Loops

// Good - register reuse
let i = ctx.mov_u32_imm(0);
ctx.label("loop");
// ... loop body ...
ctx.add_u32_inplace(i, 1);  // Reuses %r0
ctx.branch("loop");

// Bad - register explosion
let mut i = ctx.mov_u32_imm(0);
ctx.label("loop");
// ... loop body ...
i = ctx.add_u32(i, 1);  // New register each iteration!
ctx.branch("loop");

2. Limit Unroll Factors

Each unrolled iteration adds registers. Balance throughput vs pressure:

// High pressure - 8x unroll
for i in 0..8 {
    let val = ctx.ld_global_f32(addr[i]);
    ctx.fma_f32_inplace(acc, val, weights[i]);
}

// Lower pressure - 4x unroll (often sufficient)
for i in 0..4 {
    let val = ctx.ld_global_f32(addr[i]);
    ctx.fma_f32_inplace(acc, val, weights[i]);
}

3. Use Shared Memory for Large Temporaries

Instead of keeping many values in registers, stage through shared memory:

// Use shared memory tile instead of many registers
let tile = ctx.alloc_shared::<f32>(TILE_SIZE * TILE_SIZE);

4. Monitor Kernel Complexity

For complex kernels, check register pressure:

let pressure = kernel.registers.pressure_report();
if pressure.utilization > 0.5 {
    eprintln!("Warning: High register pressure ({:.0}%)",
              pressure.utilization * 100.0);
}

Running the Example

cargo run -p trueno-gpu --example register_allocation

This demonstrates:

  1. Simple kernel with low register pressure
  2. Complex kernel with higher pressure (unrolled dot product)
  3. In-place operations for register reuse
  4. Architectural trade-offs

References