PTX Register Allocation Architecture
This chapter explains trueno-gpu's approach to register allocation, which delegates physical register assignment to NVIDIA's ptxas compiler. This pragmatic design leverages NVIDIA's decades of investment in GPU compiler engineering.
The Traditional Compiler Problem
In traditional compilers (like LLVM for x86), you must map an infinite number of variables to a finite set of physical registers (e.g., RAX, RDI, RSI on x86-64). This requires complex algorithms:
- Graph Coloring: Model register interference as a graph, color with K colors (K = number of physical registers)
- Linear Scan: Faster but produces lower-quality allocations; common in JIT compilers
These algorithms are complex to implement correctly and require significant engineering effort.
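To make the graph-coloring idea concrete, here is a minimal greedy-coloring sketch: each node is a virtual register, edges mark registers live at the same time, and equal colors stand in for equal physical registers. This is an illustration only; production allocators (including ptxas) use far more sophisticated heuristics and spill handling.

```rust
// Greedy graph coloring over an interference graph.
// `interference[i]` lists the registers live at the same time as i.
fn greedy_color(interference: &[Vec<usize>], k: usize) -> Option<Vec<usize>> {
    let mut colors = vec![usize::MAX; interference.len()];
    for v in 0..interference.len() {
        // Colors already taken by interfering neighbors.
        let taken: Vec<usize> = interference[v]
            .iter()
            .map(|&n| colors[n])
            .filter(|&c| c != usize::MAX)
            .collect();
        // Lowest free color; None means a spill would be required.
        let color = (0..k).find(|c| !taken.contains(c))?;
        colors[v] = color;
    }
    Some(colors)
}

fn main() {
    // Register 0 interferes with 1 and 2, but 1 and 2 are never
    // live simultaneously: two colors (physical registers) suffice.
    let graph = vec![vec![1, 2], vec![0], vec![0]];
    let colors = greedy_color(&graph, 2).expect("colorable with 2 colors");
    assert_ne!(colors[0], colors[1]);
    assert_ne!(colors[0], colors[2]);
    println!("coloring: {:?}", colors);
}
```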
Trueno's Strategy: Virtual Registers + ptxas
Trueno takes a different approach that leverages PTX's design as a virtual ISA:
┌─────────────────────────────────────────────────────────────┐
│ Trueno PTX Builder (Rust) │
│ - Allocates unlimited virtual registers (%f0, %f1, ...) │
│ - Tracks liveness for pressure REPORTING │
│ - Emits SSA-style PTX │
└─────────────────────────────────────────────────────────────┘
│
PTX Source
│
▼
┌─────────────────────────────────────────────────────────────┐
│ NVIDIA ptxas (JIT Compiler) │
│ - Graph coloring for physical register allocation │
│ - Register spilling to local memory if needed │
│ - Dead code elimination, constant folding, etc. │
└─────────────────────────────────────────────────────────────┘
│
SASS Binary
│
▼
┌─────────────────────────────────────────────────────────────┐
│ GPU Execution │
└─────────────────────────────────────────────────────────────┘
How It Works
- Virtual Register Allocation: Each operation allocates a new virtual register with a monotonically increasing ID:
// In trueno-gpu's KernelBuilder
pub fn add_f32(&mut self, a: VirtualReg, b: VirtualReg) -> VirtualReg {
// Allocate NEW virtual register (SSA style)
let dst = self.registers.allocate_virtual(PtxType::F32);
self.instructions.push(
PtxInstruction::new(PtxOp::Add, PtxType::F32)
.dst(Operand::Reg(dst))
.src(Operand::Reg(a))
.src(Operand::Reg(b))
);
dst // Return %f2, %f3, %f4, etc.
}
- Per-Type Namespaces: PTX requires separate register namespaces per type:
| Type | Prefix | Example |
|---|---|---|
| .f32 | %f | %f0, %f1, %f2 |
| .f64 | %fd | %fd0, %fd1 |
| .u32 | %r | %r0, %r1 |
| .u64 | %rd | %rd0, %rd1 |
| .pred | %p | %p0, %p1 |
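The prefix table above can be sketched as a simple mapping. The enum and helper below are simplified stand-ins, not trueno-gpu's actual definitions:

```rust
// Hypothetical sketch of the per-type register namespaces;
// trueno-gpu's real PtxType enum may differ in shape.
#[derive(Clone, Copy)]
enum PtxType { F32, F64, U32, U64, Pred }

fn reg_prefix(ty: PtxType) -> &'static str {
    match ty {
        PtxType::F32 => "%f",
        PtxType::F64 => "%fd",
        PtxType::U32 => "%r",
        PtxType::U64 => "%rd",
        PtxType::Pred => "%p",
    }
}

// Render a virtual register name, e.g. (F32, 2) -> "%f2".
fn reg_name(ty: PtxType, id: u32) -> String {
    format!("{}{}", reg_prefix(ty), id)
}

fn main() {
    assert_eq!(reg_name(PtxType::F32, 2), "%f2");
    assert_eq!(reg_name(PtxType::U64, 6), "%rd6");
    println!("ok");
}
```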
- Emitted PTX: The builder emits register declarations and instructions:
.visible .entry vector_add(
.param .u64 a_ptr,
.param .u64 b_ptr,
.param .u64 c_ptr,
.param .u32 n
) {
.reg .f32 %f<3>; // Virtual registers %f0, %f1, %f2
.reg .u32 %r<5>; // Virtual registers %r0-4
.reg .u64 %rd<7>; // Virtual registers %rd0-6
.reg .pred %p<1>; // Predicate register %p0
// Instructions use virtual registers
mov.u32 %r0, %tid.x;
mov.u32 %r1, %ctaid.x;
// ...
add.rn.f32 %f2, %f0, %f1;
// ...
}
- ptxas Does the Rest: NVIDIA's ptxas compiler:
  - Builds an interference graph from virtual register liveness
  - Performs graph coloring to assign physical registers
  - Generates spill code if necessary (to .local memory)
  - Applies optimization passes
Why This Design?
1. Pragmatism (Avoid Muda)
NVIDIA has invested decades of compiler engineering into ptxas. Reimplementing graph coloring would be:
- Redundant (ptxas already does it)
- Inferior (we can't match NVIDIA's GPU-specific knowledge)
- Wasteful engineering effort (Muda in Toyota terms)
2. PTX is Designed for This
PTX (Parallel Thread Execution) is explicitly designed as a virtual ISA:
- Unlimited virtual registers
- SSA (Static Single Assignment) form
- Meant to be lowered by a backend compiler
From the PTX ISA documentation:
"PTX defines a virtual machine and ISA for general purpose parallel thread execution."
3. Focus on What Matters
Trueno focuses on:
- Algorithm correctness: Ensuring SIMD/GPU operations produce correct results
- High-level optimization: Tiling, kernel fusion, memory access patterns
- Developer experience: Safe, ergonomic Rust API
Low-level optimization (register allocation, instruction scheduling) is delegated to specialized tools.
Register Pressure Monitoring
While we don't perform graph coloring, we DO track liveness for diagnostics:
pub struct RegisterAllocator {
type_counters: HashMap<PtxType, u32>,
live_ranges: HashMap<(PtxType, u32), LiveRange>,
spill_count: usize, // Muda tracking
}
impl RegisterAllocator {
    pub fn pressure_report(&self) -> RegisterPressure {
        // Upper bound: count every tracked live range as live at once
        let max_live = self.live_ranges.len();
        RegisterPressure {
            max_live,
            spill_count: self.spill_count,
            // 256 = per-thread register limit on current architectures
            utilization: max_live as f64 / 256.0,
        }
    }
}
Why Track Pressure?
- Developer Warnings: Alert when kernels exceed 256 registers/thread
- Occupancy Estimation: High register usage reduces concurrent threads
- Performance Debugging: Identify kernels that may suffer from register spills
GPU Register Limits
| Architecture | Registers/Thread | Registers/SM |
|---|---|---|
| Volta (sm_70) | 256 | 65,536 |
| Turing (sm_75) | 256 | 65,536 |
| Ampere (sm_80) | 256 | 65,536 |
| Ada (sm_89) | 256 | 65,536 |
Occupancy Impact: If a kernel uses 64 registers/thread, an SM with 65,536 registers can run 1024 threads. If it uses 128 registers/thread, only 512 threads can run concurrently.
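The occupancy arithmetic above can be checked with a one-liner. Note that real occupancy is also bounded by other hardware limits (threads per SM, shared memory, block count), which this sketch ignores:

```rust
// Register-file bound on concurrently resident threads per SM.
// Other hardware limits (max threads/SM, shared memory) are ignored.
fn max_resident_threads(regs_per_sm: u32, regs_per_thread: u32) -> u32 {
    regs_per_sm / regs_per_thread
}

fn main() {
    // Figures from the table above: 65,536 registers per SM.
    assert_eq!(max_resident_threads(65_536, 64), 1024);
    assert_eq!(max_resident_threads(65_536, 128), 512);
    println!("ok");
}
```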
In-Place Operations for Register Reuse
For loops and accumulators, SSA-style allocation wastes registers:
// SSA style - allocates new register each iteration
for _ in 0..1000 {
let new_sum = ctx.add_f32(sum, val); // New register each time!
sum = new_sum;
}
We provide in-place operations that reuse registers:
// In-place style - reuses existing register
let acc = ctx.mov_f32_imm(0.0); // Allocate once
for _ in 0..1000 {
ctx.add_f32_inplace(acc, val); // Reuses %f0
}
Available In-Place Operations
| Operation | Use Case |
|---|---|
| add_u32_inplace(dst, imm) | Loop counters |
| add_f32_inplace(dst, src) | Accumulators |
| fma_f32_inplace(dst, a, b) | GEMM accumulation |
| max_f32_inplace(dst, src) | Online softmax |
| mul_f32_inplace(dst, src) | Scaling |
| div_f32_inplace(dst, src) | Normalization |
| shr_u32_inplace(dst, imm) | Stride halving |
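A toy model makes the register-count difference between the two styles measurable. The `Counter` below only mimics virtual-register id minting; it is not trueno-gpu's API:

```rust
// Toy id allocator standing in for virtual register allocation.
struct Counter { next: u32 }
impl Counter {
    fn alloc(&mut self) -> u32 { let id = self.next; self.next += 1; id }
}

// SSA style: every loop iteration mints a fresh accumulator register.
fn ssa_loop(iters: u32) -> u32 {
    let mut c = Counter { next: 0 };
    let mut sum = c.alloc();   // initial accumulator
    let val = c.alloc();       // loop operand
    for _ in 0..iters {
        sum = c.alloc();       // new register every iteration!
    }
    let _ = (sum, val);
    c.next                     // total virtual registers minted
}

// In-place style: the accumulator register is written in place.
fn inplace_loop(iters: u32) -> u32 {
    let mut c = Counter { next: 0 };
    let acc = c.alloc();       // allocated once
    let val = c.alloc();
    for _ in 0..iters { /* add into acc in place: no new register */ }
    let _ = (acc, val);
    c.next
}

fn main() {
    assert_eq!(ssa_loop(1000), 1002);  // 2 + one per iteration
    assert_eq!(inplace_loop(1000), 2); // constant
    println!("ok");
}
```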
Potential Future Enhancements
The current design delegates all register allocation to ptxas. Possible extensions (tracked in GitHub Issue #66):
1. Greedy Register Reuse
For kernels exceeding 256 registers, we could implement simple liveness-based reuse:
// Hypothetical future API
let allocator = RegisterAllocator::new()
.with_reuse_strategy(ReuseStrategy::Greedy);
This would reuse %r2 after its last use, reducing virtual register count.
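The core of that strategy is a free list fed by liveness information: once a register's last use has passed, its id is recycled before a new one is minted. The sketch below is hypothetical, like the `ReuseStrategy` API above:

```rust
// Greedy liveness-based reuse: dead register ids return to a free
// list and are preferred over minting fresh ids.
struct GreedyAllocator {
    next: u32,
    free: Vec<u32>,
}

impl GreedyAllocator {
    fn new() -> Self { Self { next: 0, free: Vec::new() } }

    fn alloc(&mut self) -> u32 {
        // Prefer a recycled register over a brand-new one.
        self.free.pop().unwrap_or_else(|| {
            let id = self.next;
            self.next += 1;
            id
        })
    }

    // Called once liveness analysis proves `id` has no further uses.
    fn release(&mut self, id: u32) {
        self.free.push(id);
    }
}

fn main() {
    let mut a = GreedyAllocator::new();
    let r0 = a.alloc();    // %r0
    let r1 = a.alloc();    // %r1
    a.release(r0);         // %r0's last use has passed
    let r2 = a.alloc();    // recycles %r0 instead of minting %r2
    assert_eq!((r0, r1, r2), (0, 1, 0));
    assert_eq!(a.next, 2); // only two ids ever minted
    println!("ok");
}
```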
2. ptxas Output Parsing
Parse cuobjdump --dump-resource-usage output to validate:
- Expected vs actual register usage
- Spill detection
- Occupancy calculation
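A parser for such output might look like the sketch below. The exact text emitted by ptxas -v or cuobjdump varies by CUDA version, so the "Used N registers" line format here is an assumption to validate against your toolchain:

```rust
// Extract the register count from a resource-usage line, assuming a
// "... Used <N> registers ..." format (assumption; verify against
// your CUDA toolchain's actual output).
fn parse_register_count(line: &str) -> Option<u32> {
    let idx = line.find("Used ")?;
    let rest = &line[idx + "Used ".len()..];
    let num: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    if rest[num.len()..].trim_start().starts_with("registers") {
        num.parse().ok()
    } else {
        None
    }
}

fn main() {
    let line = "ptxas info    : Used 32 registers, 360 bytes cmem[0]";
    assert_eq!(parse_register_count(line), Some(32));
    assert_eq!(parse_register_count("no match here"), None);
    println!("ok");
}
```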
3. Occupancy Calculator
Integrate NVIDIA's occupancy calculator to predict SM utilization before runtime.
Best Practices
1. Use In-Place Operations for Loops
// Good - register reuse
let i = ctx.mov_u32_imm(0);
ctx.label("loop");
// ... loop body ...
ctx.add_u32_inplace(i, 1); // Reuses %r0
ctx.branch("loop");
// Bad - register explosion
let mut i = ctx.mov_u32_imm(0);
ctx.label("loop");
// ... loop body ...
i = ctx.add_u32(i, 1); // New register each iteration!
ctx.branch("loop");
2. Limit Unroll Factors
Each unrolled iteration adds registers. Balance throughput vs pressure:
// High pressure - 8x unroll
for i in 0..8 {
let val = ctx.ld_global_f32(addr[i]);
ctx.fma_f32_inplace(acc, val, weights[i]);
}
// Lower pressure - 4x unroll (often sufficient)
for i in 0..4 {
let val = ctx.ld_global_f32(addr[i]);
ctx.fma_f32_inplace(acc, val, weights[i]);
}
3. Use Shared Memory for Large Temporaries
Instead of keeping many values in registers, stage through shared memory:
// Use shared memory tile instead of many registers
let tile = ctx.alloc_shared::<f32>(TILE_SIZE * TILE_SIZE);
4. Monitor Kernel Complexity
For complex kernels, check register pressure:
let pressure = kernel.registers.pressure_report();
if pressure.utilization > 0.5 {
eprintln!("Warning: High register pressure ({:.0}%)",
pressure.utilization * 100.0);
}
Running the Example
cargo run -p trueno-gpu --example register_allocation
This demonstrates:
- Simple kernel with low register pressure
- Complex kernel with higher pressure (unrolled dot product)
- In-place operations for register reuse
- Architectural trade-offs
References
- PTX ISA Documentation
- CUDA Occupancy Calculator
- GitHub Issue #66: Liveness-Based Register Reuse
- Example: trueno-gpu/examples/register_allocation.rs