Optimization Guide

This chapter covers performance optimization techniques used in Trueno, with a focus on PTX code generation and kernel emission.

PTX Emission Optimization

The PTX code generator minimizes heap allocations during kernel generation. In the ptx_module_emit benchmark this cut emission time from 509 ns to 415 ns, reported as a 20.9% improvement (see Performance Results).

Key Optimizations

1. Pre-allocated String Capacity

Instead of growing the output string dynamically, we estimate the final size:

// Pre-allocate with estimated size: ~100 bytes per instruction + header overhead
let estimated_size = 512 + self.instructions.len() * 100;
let mut ptx = String::with_capacity(estimated_size);

As long as the estimate is not exceeded, the buffer never has to grow, which avoids repeated reallocation and copying as the PTX output is appended.
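The effect of the size estimate can be checked by comparing the buffer capacity before and after the writes. The following is a minimal sketch, not Trueno's code: `build_preallocated` and its `bytes_per_line` parameter are hypothetical names, and the 512-byte header allowance simply mirrors the estimate above.

```rust
/// Builds a string from `lines`, pre-sizing the buffer from a per-line estimate.
/// Returns the text and whether the buffer grew past its initial capacity.
/// (Hypothetical helper for illustration; not part of Trueno.)
fn build_preallocated(lines: &[&str], bytes_per_line: usize) -> (String, bool) {
    // Same shape as the estimate above: fixed header allowance + per-line cost.
    let estimated = 512 + lines.len() * bytes_per_line;
    let mut out = String::with_capacity(estimated);
    let initial_capacity = out.capacity();

    for line in lines {
        out.push_str(line);
        out.push('\n');
    }

    // If the capacity changed, at least one reallocation happened.
    let grew = out.capacity() != initial_capacity;
    (out, grew)
}

fn main() {
    let lines = vec!["add.f32 %f2, %f0, %f1;"; 1000];
    let (ptx, grew) = build_preallocated(&lines, 100);
    println!("len = {}, grew = {}", ptx.len(), grew);
}
```

An estimate that is too high wastes some memory for the lifetime of the string; one that is too low only costs the reallocations it was meant to avoid, so erring slightly high is usually the cheaper mistake.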

2. Zero-Allocation Instruction Emission

The write_instruction() function writes directly to the output buffer instead of returning intermediate Strings:

// Before (allocates per instruction):
for instr in &self.instructions {
    ptx.push_str(&emit_instruction(instr));  // allocates String
}

// After (zero allocation):
for instr in &self.instructions {
    write_instruction(instr, &mut ptx);  // writes directly
}
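To make the difference concrete, here is a self-contained sketch of both styles. The `Instr` enum and the instruction text are hypothetical and much simpler than Trueno's real instruction type; only the function names `emit_instruction` and `write_instruction` come from the snippet above, and their bodies are assumptions.

```rust
use std::fmt::Write;

/// Hypothetical, simplified instruction type (not Trueno's real IR).
enum Instr {
    Add { dst: u32, a: u32, b: u32 },
    Ret,
}

/// Allocating style: builds a fresh String for every instruction.
fn emit_instruction(instr: &Instr) -> String {
    match instr {
        Instr::Add { dst, a, b } => format!("    add.f32 %f{dst}, %f{a}, %f{b};\n"),
        Instr::Ret => String::from("    ret;\n"),
    }
}

/// Buffer-writing style: appends to `out` without a temporary String.
fn write_instruction(instr: &Instr, out: &mut String) {
    match instr {
        Instr::Add { dst, a, b } => {
            // Formatting into a String does not fail, so the Result is ignored.
            let _ = writeln!(out, "    add.f32 %f{dst}, %f{a}, %f{b};");
        }
        Instr::Ret => out.push_str("    ret;\n"),
    }
}

fn main() {
    let instrs = [Instr::Add { dst: 2, a: 0, b: 1 }, Instr::Ret];

    let mut direct = String::with_capacity(64);
    let mut via_temp = String::new();
    for instr in &instrs {
        write_instruction(instr, &mut direct);
        via_temp.push_str(&emit_instruction(instr));
    }

    // Both styles produce identical text; only the allocation pattern differs.
    assert_eq!(direct, via_temp);
    print!("{direct}");
}
```

Note that "zero allocation" here means no per-instruction temporary. The output buffer itself can still reallocate if it runs out of capacity, which is why this pairs with the pre-allocation above.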

3. Display Implementation for VirtualReg

VirtualReg implements the Display trait, so a register can be formatted straight into the output buffer without first building an intermediate String:

impl fmt::Display for VirtualReg {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}{}", self.ty.register_prefix(), self.id)
    }
}

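// With Display in place, a register can be written with write! directly.
// When `out` is a String, this needs `use std::fmt::Write;` in scope.
write!(out, "{}", vreg)?;  // no intermediate allocation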
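A self-contained version of this pattern might look as follows. It is a sketch under assumptions: the `RegType` enum, its two variants, the `%f` / `%r` prefixes, and the `write_add` helper are illustrative stand-ins, since the real definitions are not shown in this chapter.

```rust
use std::fmt::{self, Write};

/// Hypothetical register type; the real set of variants is not shown here.
#[derive(Clone, Copy)]
enum RegType {
    F32,
    U32,
}

impl RegType {
    /// Assumed prefixes, following the common PTX naming habit.
    const fn register_prefix(self) -> &'static str {
        match self {
            RegType::F32 => "%f",
            RegType::U32 => "%r",
        }
    }
}

struct VirtualReg {
    ty: RegType,
    id: u32,
}

impl fmt::Display for VirtualReg {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Writes prefix and id into the formatter's sink; no String is built.
        write!(f, "{}{}", self.ty.register_prefix(), self.id)
    }
}

/// Hypothetical helper: appends one three-operand add to `out`.
fn write_add(out: &mut String, dst: &VirtualReg, a: &VirtualReg, b: &VirtualReg) -> fmt::Result {
    writeln!(out, "add.f32 {dst}, {a}, {b};")
}

fn main() -> fmt::Result {
    let reg = |ty, id| VirtualReg { ty, id };
    let mut out = String::with_capacity(64);
    write_add(&mut out, &reg(RegType::F32, 2), &reg(RegType::F32, 0), &reg(RegType::F32, 1))?;
    writeln!(out, "mov.u32 {}, 0;", reg(RegType::U32, 0))?;
    print!("{out}");
    Ok(())
}
```

Calling `vreg.to_string()` would still work through the same Display impl, but it allocates a new String each time, so it is best kept out of the emission loop.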

Performance Results

| Metric          | Before | After  | Improvement |
|-----------------|--------|--------|-------------|
| ptx_module_emit | 509 ns | 415 ns | -20.9%      |

Kernel Generation Performance:

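| Kernel           | Time     | Size       |
|------------------|----------|------------|
| gemm_naive_64    | 8.87 µs  | 1579 bytes |
| gemm_tiled_128   | 15.06 µs | 2626 bytes |
| gemm_tensor_core | 44.10 µs | 7759 bytes |
| gemm_wmma_fp16   | 26.44 µs | 3775 bytes |
| softmax_1024     | 10.05 µs | 1769 bytes |
| layernorm_1024   | 15.62 µs | 2788 bytes |
| attention_64_64  | 22.78 µs | 3930 bytes |
| q4k_32           | 27.67 µs | 4319 bytes |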

Throughput: 68,316 kernels/sec

Benchmarking

Run the kernel generation benchmark:

cargo run -p trueno-gpu --release --example bench_kernel_gen
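The source of bench_kernel_gen is not reproduced in this chapter. As a rough illustration of how a per-kernel time and a kernels/sec figure relate, the sketch below times a closure with std::time::Instant. The names `mean_time` and `kernels_per_sec` are hypothetical, and the closure in `main` is a stand-in for a real kernel generator.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the mean wall-clock time per call.
/// `iters` must be non-zero. (Hypothetical helper, not Trueno's benchmark.)
fn mean_time<F: FnMut() -> String>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box discourages the optimizer from discarding the result.
        black_box(f());
    }
    start.elapsed() / iters
}

/// Converts a per-kernel time into a throughput figure.
fn kernels_per_sec(per_kernel: Duration) -> f64 {
    1.0 / per_kernel.as_secs_f64()
}

fn main() {
    // Stand-in workload: a real benchmark would call a kernel generator here.
    let per_kernel = mean_time(10_000, || {
        let mut ptx = String::with_capacity(64);
        ptx.push_str(".visible .entry stub() { ret; }");
        ptx
    });
    println!(
        "{:?} per kernel, {:.0} kernels/sec",
        per_kernel,
        kernels_per_sec(per_kernel)
    );
}
```

A hand-rolled loop like this is fine for coarse numbers; for small differences such as a few tens of nanoseconds, a statistical harness gives more trustworthy results.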

General Optimization Principles

1. Minimize Allocations in Hot Paths

  • Pre-allocate collections with known sizes
  • Use &str instead of String where possible
  • Use write! to write directly to buffers

2. Use Static Strings

Many PTX components are static and can use &'static str:

pub const fn to_ptx_string(self) -> &'static str {
    match self {
        Self::F32 => ".f32",
        Self::U32 => ".u32",
        // ...
    }
}
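A complete, compilable version of this idea could look like the sketch below. The `PtxType` enum with only two variants and the `write_typed_op` helper are hypothetical; the actual type and its full variant list are elided above and are not filled in here.

```rust
/// Hypothetical two-variant type enum, for illustration only.
#[derive(Clone, Copy)]
enum PtxType {
    F32,
    U32,
}

impl PtxType {
    /// Returns a string literal baked into the binary: no heap allocation.
    pub const fn to_ptx_string(self) -> &'static str {
        match self {
            Self::F32 => ".f32",
            Self::U32 => ".u32",
        }
    }
}

// Because the function is `const`, it can also be evaluated at compile time.
const F32_SUFFIX: &str = PtxType::F32.to_ptx_string();

/// Hypothetical helper: appends an opcode with its type suffix to `out`.
fn write_typed_op(out: &mut String, opcode: &str, ty: PtxType) {
    out.push_str(opcode);
    out.push_str(ty.to_ptx_string());
}

fn main() {
    let mut out = String::with_capacity(16);
    write_typed_op(&mut out, "add", PtxType::U32);
    println!("{out} (f32 suffix is {F32_SUFFIX})");
}
```

Returning `&'static str` rather than `String` leaves the choice of allocation to the caller, who can push the slice into an existing buffer as shown.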

3. Avoid Intermediate Allocations

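Instead of building a temporary String and copying it into the output:

fn emit(prefix: &str, suffix: &str) -> String {
    format!("{}{}", prefix, suffix)  // allocates a temporary String
}
out.push_str(&emit(prefix, suffix));  // copies it into out, then drops it

Write each piece straight into the output:

fn write_to(out: &mut String, prefix: &str, suffix: &str) {
    out.push_str(prefix);
    out.push_str(suffix);  // no intermediate allocation
}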

SIMD Backend Optimization

For SIMD backend optimizations, see:

GPU Performance

For GPU-specific optimizations, see: