PTX Register Allocation Architecture

This chapter explains trueno-gpu's approach to register allocation, which delegates physical register assignment to NVIDIA's ptxas compiler. This is a pragmatic design that leverages 30+ years of GPU compiler optimization.

The Traditional Compiler Problem

In traditional compilers (like LLVM for x86), you must map an infinite number of variables to a finite set of physical registers (e.g., RAX, RDI, RSI on x86-64). This requires complex algorithms:

Graph Coloring: Model register interference as a graph, color with K colors (K = number of physical registers)
Linear Scan: Faster but less optimal allocation for JIT compilers

These algorithms are complex to implement correctly and require significant engineering effort.

Trueno's Strategy: Virtual Registers + ptxas

Trueno takes a different approach that leverages PTX's design as a virtual ISA:

┌─────────────────────────────────────────────────────────────┐
│  Trueno PTX Builder (Rust)                                  │
│  - Allocates unlimited virtual registers (%f0, %f1, ...)    │
│  - Tracks liveness for pressure REPORTING                   │
│  - Emits SSA-style PTX                                      │
└─────────────────────────────────────────────────────────────┘
                             │
                        PTX Source
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│  NVIDIA ptxas (JIT Compiler)                                │
│  - Graph coloring for physical register allocation          │
│  - Register spilling to local memory if needed              │
│  - Dead code elimination, constant folding, etc.            │
└─────────────────────────────────────────────────────────────┘
                             │
                        SASS Binary
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│  GPU Execution                                              │
└─────────────────────────────────────────────────────────────┘

How It Works

Virtual Register Allocation: Each operation allocates a new virtual register with a monotonically increasing ID:

// In trueno-gpu's KernelBuilder
pub fn add_f32(&mut self, a: VirtualReg, b: VirtualReg) -> VirtualReg {
    // Allocate NEW virtual register (SSA style)
    let dst = self.registers.allocate_virtual(PtxType::F32);
    self.instructions.push(
        PtxInstruction::new(PtxOp::Add, PtxType::F32)
            .dst(Operand::Reg(dst))
            .src(Operand::Reg(a))
            .src(Operand::Reg(b))
    );
    dst  // Return %f2, %f3, %f4, etc.
}

Per-Type Namespaces: PTX requires separate register namespaces per type:

Type	Prefix	Example
`.f32`	`%f`	`%f0`, `%f1`, `%f2`
`.f64`	`%fd`	`%fd0`, `%fd1`
`.u32`	`%r`	`%r0`, `%r1`
`.u64`	`%rd`	`%rd0`, `%rd1`
`.pred`	`%p`	`%p0`, `%p1`

Emitted PTX: The builder emits register declarations and instructions:

.visible .entry vector_add(
    .param .u64 a_ptr,
    .param .u64 b_ptr,
    .param .u64 c_ptr,
    .param .u32 n
) {
    .reg .f32  %f<3>;    // Virtual registers %f0, %f1, %f2
    .reg .u32  %r<5>;    // Virtual registers %r0-4
    .reg .u64  %rd<7>;   // Virtual registers %rd0-6
    .reg .pred %p<1>;    // Predicate register %p0

    // Instructions use virtual registers
    mov.u32 %r0, %tid.x;
    mov.u32 %r1, %ctaid.x;
    // ...
    add.rn.f32 %f2, %f0, %f1;
    // ...
}

ptxas Does the Rest: NVIDIA's ptxas compiler:
- Builds an interference graph from virtual register liveness
- Performs graph coloring to assign physical registers
- Generates spill code if necessary (to .local memory)
- Applies optimization passes

Why This Design?

1. Pragmatism (Avoid Muda)

NVIDIA has invested 30+ years into GPU compiler optimization. Reimplementing graph coloring would be:

Redundant (ptxas already does it)
Inferior (we can't match NVIDIA's GPU-specific knowledge)
Wasteful engineering effort (Muda in Toyota terms)

2. PTX is Designed for This

PTX (Parallel Thread Execution) is explicitly designed as a virtual ISA:

Unlimited virtual registers
SSA (Static Single Assignment) form
Meant to be lowered by a backend compiler

From the PTX ISA documentation:

"PTX defines a virtual machine and ISA for general purpose parallel thread execution."

3. Focus on What Matters

Trueno focuses on:

Algorithm correctness: Ensuring SIMD/GPU operations produce correct results
High-level optimization: Tiling, kernel fusion, memory access patterns
Developer experience: Safe, ergonomic Rust API

Low-level optimization (register allocation, instruction scheduling) is delegated to specialized tools.

Register Pressure Monitoring

While we don't perform graph coloring, we DO track liveness for diagnostics:

pub struct RegisterAllocator {
    type_counters: HashMap<PtxType, u32>,
    live_ranges: HashMap<(PtxType, u32), LiveRange>,
    spill_count: usize,  // Muda tracking
}

impl RegisterAllocator {
    pub fn pressure_report(&self) -> RegisterPressure {
        RegisterPressure {
            max_live: self.allocated.len(),
            spill_count: self.spill_count,
            utilization: max_live as f64 / 256.0,
        }
    }
}

Why Track Pressure?

Developer Warnings: Alert when kernels exceed 256 registers/thread
Occupancy Estimation: High register usage reduces concurrent threads
Performance Debugging: Identify kernels that may suffer from register spills

GPU Register Limits

Architecture	Registers/Thread	Registers/SM
Volta (sm_70)	256	65,536
Turing (sm_75)	256	65,536
Ampere (sm_80)	256	65,536
Ada (sm_89)	256	65,536

Occupancy Impact: If a kernel uses 64 registers/thread, an SM with 65,536 registers can run 1024 threads. If it uses 128 registers/thread, only 512 threads can run concurrently.

In-Place Operations for Register Reuse

For loops and accumulators, SSA-style allocation wastes registers:

// SSA style - allocates new register each iteration
for _ in 0..1000 {
    let new_sum = ctx.add_f32(sum, val);  // New register each time!
    sum = new_sum;
}

We provide in-place operations that reuse registers:

// In-place style - reuses existing register
let acc = ctx.mov_f32_imm(0.0);  // Allocate once
for _ in 0..1000 {
    ctx.add_f32_inplace(acc, val);  // Reuses %f0
}

Available In-Place Operations

Operation	Use Case
`add_u32_inplace(dst, imm)`	Loop counters
`add_f32_inplace(dst, src)`	Accumulators
`fma_f32_inplace(dst, a, b)`	GEMM accumulation
`max_f32_inplace(dst, src)`	Online softmax
`mul_f32_inplace(dst, src)`	Scaling
`div_f32_inplace(dst, src)`	Normalization
`shr_u32_inplace(dst, imm)`	Stride halving

Potential Future Enhancements

The current design delegates all register allocation to ptxas. Potential future enhancements (tracked in GitHub Issue #66):

1. Greedy Register Reuse

For kernels exceeding 256 registers, we could implement simple liveness-based reuse:

// Hypothetical future API
let allocator = RegisterAllocator::new()
    .with_reuse_strategy(ReuseStrategy::Greedy);

This would reuse %r2 after its last use, reducing virtual register count.

2. ptxas Output Parsing

Parse cuobjdump --dump-resource-usage output to validate:

Expected vs actual register usage
Spill detection
Occupancy calculation

// Good - register reuse
let i = ctx.mov_u32_imm(0);
ctx.label("loop");
// ... loop body ...
ctx.add_u32_inplace(i, 1);  // Reuses %r0
ctx.branch("loop");

// Bad - register explosion
let mut i = ctx.mov_u32_imm(0);
ctx.label("loop");
// ... loop body ...
i = ctx.add_u32(i, 1);  // New register each iteration!
ctx.branch("loop");

2. Limit Unroll Factors

Each unrolled iteration adds registers. Balance throughput vs pressure:

// High pressure - 8x unroll
for i in 0..8 {
    let val = ctx.ld_global_f32(addr[i]);
    ctx.fma_f32_inplace(acc, val, weights[i]);
}

// Lower pressure - 4x unroll (often sufficient)
for i in 0..4 {
    let val = ctx.ld_global_f32(addr[i]);
    ctx.fma_f32_inplace(acc, val, weights[i]);
}

3. Use Shared Memory for Large Temporaries

Instead of keeping many values in registers, stage through shared memory:

// Use shared memory tile instead of many registers
let tile = ctx.alloc_shared::<f32>(TILE_SIZE * TILE_SIZE);

4. Monitor Kernel Complexity

For complex kernels, check register pressure:

let pressure = kernel.registers.pressure_report();
if pressure.utilization > 0.5 {
    eprintln!("Warning: High register pressure ({:.0}%)",
              pressure.utilization * 100.0);
}

Running the Example

cargo run -p trueno-gpu --example register_allocation

This demonstrates:

Simple kernel with low register pressure
Complex kernel with higher pressure (unrolled dot product)
In-place operations for register reuse
Architectural trade-offs

References

PTX ISA Documentation
CUDA Occupancy Calculator
GitHub Issue #66: Liveness-Based Register Reuse
Example: trueno-gpu/examples/register_allocation.rs

Trueno - High-Performance SIMD/GPU Compute Library