PTX Code Generation (trueno-gpu)

trueno-gpu generates PTX (Parallel Thread Execution) code for NVIDIA GPUs in pure Rust, so GPU kernels can be developed without LLVM, nvcc, or any other external dependency.

Philosophy

Own the Stack - Build everything from first principles for complete control, auditability, and reproducibility.

Quick Start

use trueno_gpu::ptx::{PtxKernel, PtxModule, PtxReg, PtxType};

// Create a PTX module
let module = PtxModule::new()
    .version(8, 0)      // PTX ISA 8.0
    .target("sm_70")    // Volta+
    .address_size(64);  // 64-bit addressing

// Build a kernel with the fluent builder API
let kernel = PtxKernel::new("my_kernel")
    .param(PtxType::U64, "data_ptr")
    .param(PtxType::U32, "n")
    .build(|ctx| {
        // Generate PTX instructions
        let _tid = ctx.special_reg(PtxReg::TidX);  // %tid.x
        // ... more instructions
        ctx.ret();
    });

// Emit PTX source
let ptx_source = module.add_kernel(kernel).emit();

Module Structure

A PTX module consists of:

Header: Version, target architecture, address size
Declarations: Register declarations, shared memory
Kernels: One or more entry points
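
To make the three parts concrete, the sketch below assembles the text of a minimal PTX module with plain string formatting. It does not use trueno-gpu: `emit_minimal_module` is a hypothetical helper, and the PTX it returns is hand-written to match the Quick Start settings, not output captured from the library.

```rust
/// Hypothetical helper (not part of trueno-gpu): builds the text of a minimal
/// PTX module so the header / kernel / declarations split is visible.
fn emit_minimal_module(
    version: (u32, u32),
    target: &str,
    address_size: u32,
    kernel: &str,
) -> String {
    let mut ptx = String::new();
    // Header: version, target architecture, address size.
    ptx.push_str(&format!(".version {}.{}\n", version.0, version.1));
    ptx.push_str(&format!(".target {}\n", target));
    ptx.push_str(&format!(".address_size {}\n\n", address_size));
    // Kernel: one entry point taking a pointer and a length.
    ptx.push_str(&format!(".visible .entry {}(\n", kernel));
    ptx.push_str("    .param .u64 data_ptr,\n");
    ptx.push_str("    .param .u32 n\n");
    ptx.push_str(")\n{\n");
    // Declarations: two virtual 32-bit registers, %r0 and %r1.
    ptx.push_str("    .reg .u32 %r<2>;\n");
    // Body: read the thread index, then return.
    ptx.push_str("    mov.u32 %r0, %tid.x;\n");
    ptx.push_str("    ret;\n");
    ptx.push_str("}\n");
    ptx
}
```

Calling `emit_minimal_module((8, 0), "sm_70", 64, "my_kernel")` yields text whose first three lines are the `.version`, `.target`, and `.address_size` directives, followed by a single `.entry` block.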

Version and Target

// PTX ISA 8.0 (shipped with CUDA 12.0); accepts every target listed below
.version(8, 0)

// Target compute capability
.target("sm_70")  // Volta
.target("sm_75")  // Turing
.target("sm_80")  // Ampere
.target("sm_89")  // Ada Lovelace
.target("sm_90")  // Hopper

Kernel Builder API

The KernelBuilder provides a fluent API for generating PTX instructions:

Special Registers

// Thread and block IDs
ctx.special_reg(PtxReg::TidX);    // %tid.x
ctx.special_reg(PtxReg::TidY);    // %tid.y
ctx.special_reg(PtxReg::CtaIdX);  // %ctaid.x (block ID)
ctx.special_reg(PtxReg::NtidX);   // %ntid.x (block size)
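
These registers are usually combined into a global element index. The following is a CPU-side sketch of that arithmetic in plain Rust. The function names are illustrative and are not trueno-gpu API.

```rust
/// CPU model of the usual 1-D global index: %ctaid.x * %ntid.x + %tid.x.
fn global_index_x(ctaid_x: u32, ntid_x: u32, tid_x: u32) -> u32 {
    ctaid_x * ntid_x + tid_x
}

/// Number of blocks needed to cover `n` elements with `block_size` threads
/// per block (ceiling division).
fn blocks_for(n: u32, block_size: u32) -> u32 {
    (n + block_size - 1) / block_size
}
```

For example, thread 5 of block 2 in a 256-thread block handles element `global_index_x(2, 256, 5)`, which is 517, and `blocks_for(1000, 256)` gives a 4-block grid.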

Arithmetic Operations

// Integer arithmetic
ctx.add_u32(a, b);
ctx.mul_wide_u32(a, b);     // 32x32 -> 64 bit
ctx.mad_lo_u32(a, b, c);    // a*b + c (low 32 bits)

// Floating point
ctx.add_f32(a, b);
ctx.mul_f32(a, b);
ctx.fma_f32(a, b, c);       // Fused multiply-add
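
The semantics of the less obvious operations can be modelled on the CPU with the Rust standard library. This is a sketch of what the underlying PTX instructions compute, assuming `fma_f32` maps to a round-to-nearest `fma`. The free functions below are stand-ins and not the builder methods themselves.

```rust
/// mul.wide.u32: the full 64-bit product of two 32-bit operands.
fn mul_wide_u32(a: u32, b: u32) -> u64 {
    u64::from(a) * u64::from(b)
}

/// mad.lo.u32: low 32 bits of a*b, plus c, wrapping on overflow.
fn mad_lo_u32(a: u32, b: u32, c: u32) -> u32 {
    a.wrapping_mul(b).wrapping_add(c)
}

/// fma.f32: a*b + c computed with a single rounding step.
fn fma_f32(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}
```

`mul_wide_u32` is the natural choice for turning a 32-bit index into a 64-bit byte offset, since the product cannot overflow.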

Memory Operations

// Load from global memory
let value = ctx.ld_global_f32(addr);

// Store to global memory
ctx.st_global_f32(addr, value);

// Load kernel parameters
let param = ctx.load_param_u32("param_name");
let ptr = ctx.load_param_u64("ptr_param");
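
The `addr` operand is a byte address, so an element index has to be scaled by the element size before it is added to the base pointer. A minimal sketch of that calculation for `f32` data follows. `f32_element_addr` is a hypothetical name used only for illustration.

```rust
/// Byte address of element `idx` in an f32 array starting at `base`.
/// The index is widened to 64 bits before scaling so the offset cannot wrap.
fn f32_element_addr(base: u64, idx: u32) -> u64 {
    base + u64::from(idx) * std::mem::size_of::<f32>() as u64
}
```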

Control Flow

// Predicated branch
let pred = ctx.setp_ge_u32(idx, n);  // idx >= n
ctx.branch_if(pred, "exit");

// Unconditional branch
ctx.branch("loop_start");

// Labels
ctx.label("loop_start");
ctx.label("exit");

// Return
ctx.ret();
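
The most common use of the predicated branch is a bounds guard: threads whose index falls past the end of the data jump straight to the exit label. The sketch below emulates that pattern on the CPU for a vector addition. It is an illustrative model in plain Rust, not code generated by or shipped with trueno-gpu.

```rust
/// CPU emulation of a guarded vector-add launch: one loop iteration per GPU thread.
fn launch_vector_add(a: &[f32], b: &[f32], out: &mut [f32], block_size: u32) {
    assert!(block_size > 0);
    assert!(a.len() == out.len() && b.len() == out.len());
    let n = out.len() as u32;
    let grid_size = (n + block_size - 1) / block_size;
    for ctaid_x in 0..grid_size {
        for tid_x in 0..block_size {
            let idx = ctaid_x * block_size + tid_x;
            // Equivalent of setp_ge_u32(idx, n) followed by branch_if(pred, "exit").
            if idx >= n {
                continue; // this thread skips the body and returns
            }
            let i = idx as usize;
            out[i] = a[i] + b[i];
        }
    }
}
```

With 5 elements and a block size of 4, the grid holds 8 threads and the guard makes the last 3 exit without touching memory.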

Pre-built Kernels

trueno-gpu includes optimized kernel generators:

GEMM (Matrix Multiplication)

use trueno_gpu::kernels::{GemmKernel, Kernel};

// Naive GEMM (for correctness testing)
let kernel = GemmKernel::naive(1024, 1024, 1024);

// Tiled GEMM (shared memory optimization)
let kernel = GemmKernel::tiled(1024, 1024, 1024, 32);

// Tensor Core GEMM (SM 7.0+)
let kernel = GemmKernel::tensor_core(1024, 1024, 1024);

// Generate PTX
let ptx = kernel.emit_ptx();
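
When checking a generated GEMM kernel, a CPU reference is useful to compare against. The sketch below assumes row-major storage and an (m, n, k) argument order, neither of which is stated above. `gemm_reference` and `tile_grid` are hypothetical helpers, not part of trueno-gpu.

```rust
/// Naive row-major reference: C (m x n) = A (m x k) * B (k x n).
fn gemm_reference(a: &[f32], b: &[f32], m: usize, n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    for row in 0..m {
        for col in 0..n {
            let mut acc = 0.0f32;
            for i in 0..k {
                acc += a[row * k + i] * b[i * n + col];
            }
            c[row * n + col] = acc;
        }
    }
    c
}

/// Blocks needed in (x, y) if each block computes one tile-by-tile patch of C.
fn tile_grid(m: usize, n: usize, tile: usize) -> (usize, usize) {
    ((n + tile - 1) / tile, (m + tile - 1) / tile)
}
```

Under these assumptions, a 1024 x 1024 output with a tile size of 32 needs a 32 x 32 grid of blocks.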

Softmax

use trueno_gpu::kernels::{SoftmaxKernel, Kernel};

let kernel = SoftmaxKernel::new(1024);  // Vector length
let ptx = kernel.emit_ptx();
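
A CPU reference helps when validating the kernel's output. The sketch below uses the common max-subtraction form of softmax for numerical stability. Whether `SoftmaxKernel` uses the same formulation internally is not stated here, so treat this as an independent reference and not a description of the generated PTX.

```rust
/// Numerically stable softmax: subtract the maximum before exponentiating.
fn softmax_reference(x: &[f32]) -> Vec<f32> {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

Subtracting the maximum keeps every exponent at or below zero, so large inputs such as `[1000.0, 1000.0]` produce finite results.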

Bias + Activation (Epilogue Kernel)

Fused bias addition with optional activation function, commonly used as an epilogue after GEMM:

use trueno_gpu::kernels::{BiasActivationKernel, Activation, Kernel};

// Bias only (no activation)
let kernel = BiasActivationKernel::new(4096, 256);  // n=4096, bias_size=256

// Bias + ReLU
let kernel = BiasActivationKernel::new(4096, 256).with_relu();

// Bias + GELU (Transformer default)
let kernel = BiasActivationKernel::new(4096, 256).with_gelu();

// Custom activation via builder
let kernel = BiasActivationKernel::new(4096, 256)
    .with_activation(Activation::GELU);

let ptx = kernel.emit_ptx();

Activation	Formula	Use Case
None	`x + bias`	Linear layer epilogue
ReLU	`max(0, x + bias)`	CNN layers
GELU	`(x + bias) * sigmoid(1.702 * (x + bias))` (sigmoid approximation)	Transformers

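Note: bias_size is baked into the kernel as a constant at generation time for efficiency, so a different bias length requires generating a new kernel. The kernel computes output[i] += bias[i % bias_size] and then applies the selected activation, if any.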
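
The formulas in the table and the modulo indexing in the note can be written as a CPU reference in plain Rust. `RefActivation` and `bias_activation_reference` are hypothetical names chosen for this sketch. They mirror the documented behaviour but are not the library's implementation.

```rust
/// Activation choices from the table above (reference-only type).
#[derive(Clone, Copy)]
enum RefActivation {
    None,
    Relu,
    Gelu,
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// In-place epilogue: add bias[i % bias_size], then apply the activation.
fn bias_activation_reference(output: &mut [f32], bias: &[f32], act: RefActivation) {
    assert!(!bias.is_empty());
    let bias_size = bias.len();
    for (i, out) in output.iter_mut().enumerate() {
        let v = *out + bias[i % bias_size];
        *out = match act {
            RefActivation::None => v,
            RefActivation::Relu => v.max(0.0),
            // Sigmoid approximation of GELU, as given in the table.
            RefActivation::Gelu => v * sigmoid(1.702 * v),
        };
    }
}
```

Comparing GPU output against this function element by element, within a small tolerance, is one way to check all three variants.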

# Run the example
cargo run -p trueno-gpu --example bias_activation

# Run property tests and falsification tests
cargo test -p trueno-gpu bias_activation

# Run deep bug hunt (includes BiasActivation)
cargo run -p trueno-explain --example deep_bug_hunt

Testing: BiasActivationKernel includes 22 tests covering:

Unit tests for configuration and PTX structure
Property-based tests (proptest) for randomized validation
Falsification tests verifying bounds checks, bias modulo, and activation correctness
Mutation testing: 100% of mutants caught (2 by tests, 4 by the type system)

Quantized GEMM (Q4_K, Q5_K, Q6_K)

Optimized kernels for quantized inference with GGML-compatible formats:

use trueno_gpu::kernels::{QuantizeKernel, Q5KKernel, Q6KKernel, Kernel};

// Q4_K: 4-bit quantization (144 bytes per 256 values)
let q4k = QuantizeKernel::ggml(1024, 1024, 4096);

// Q5_K: 5-bit quantization (176 bytes per 256 values) - PARITY-116
let q5k = Q5KKernel::new(1024, 1024, 4096);

// Q6_K: 6-bit quantization (210 bytes per 256 values) - PARITY-117
let q6k = Q6KKernel::new(1024, 1024, 4096);

let ptx = q5k.emit_ptx();

Format	Bits	Bytes per 256 values	Accuracy	Use Case
Q4_K	4	144	Good	Default inference
Q5_K	5	176	Better	Quality-sensitive
Q6_K	6	210	Best	Maximum accuracy
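
The block sizes in the table translate directly into storage cost and effective bits per weight. The sketch below does that arithmetic. `KQuant` is a hypothetical enum defined only for this example, with the byte counts taken from the table.

```rust
/// Values per super-block for the K-quant formats listed above.
const SUPER_BLOCK_VALUES: usize = 256;

/// Reference-only enum (not a trueno-gpu type).
#[derive(Clone, Copy)]
enum KQuant {
    Q4K,
    Q5K,
    Q6K,
}

impl KQuant {
    /// Bytes occupied by one 256-value super-block.
    fn block_bytes(self) -> usize {
        match self {
            KQuant::Q4K => 144,
            KQuant::Q5K => 176,
            KQuant::Q6K => 210,
        }
    }

    /// Effective bits per weight, including scales and other block metadata.
    fn bits_per_weight(self) -> f64 {
        (self.block_bytes() * 8) as f64 / SUPER_BLOCK_VALUES as f64
    }

    /// Storage for `values` weights; None unless `values` is a multiple of 256.
    fn storage_bytes(self, values: usize) -> Option<usize> {
        if values % SUPER_BLOCK_VALUES != 0 {
            return None;
        }
        Some(values / SUPER_BLOCK_VALUES * self.block_bytes())
    }
}
```

The effective rates come out at 4.5, 5.5, and 6.5625 bits per weight, slightly above the nominal bit widths because each block also stores its scales.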

Memory Management

use trueno_gpu::memory::{MemoryPool, PoolConfig, GpuBuffer};

// Create memory pool
let config = PoolConfig::new(1024 * 1024 * 1024);  // 1 GiB
let pool = MemoryPool::new(config);

// Allocate buffer
let buffer: GpuBuffer<f32> = GpuBuffer::new(1024);

Backend Detection

use trueno_gpu::backend::{detect_backend, Backend};

let backend = detect_backend();
println!("Using backend: {}", backend.name());
println!("Available: {}", backend.is_available());

Running Examples

# PTX quickstart - vector addition kernel
cargo run -p trueno-gpu --example ptx_quickstart

# GEMM kernel generation
cargo run -p trueno-gpu --example gemm_kernel

# Bias + Activation epilogue kernel
cargo run -p trueno-gpu --example bias_activation

# Quantized GEMM (Q5_K/Q6_K)
cargo run -p trueno-gpu --example q5k_q6k_gemm

PTX Type System

Rust Type	PTX Type	Description
`PtxType::U32`	`.u32`	32-bit unsigned
`PtxType::U64`	`.u64`	64-bit unsigned
`PtxType::S32`	`.s32`	32-bit signed
`PtxType::F32`	`.f32`	Single precision
`PtxType::F64`	`.f64`	Double precision
`PtxType::F16`	`.f16`	Half precision
`PtxType::BF16`	`.bf16`	Brain float
`PtxType::Pred`	`.pred`	Predicate (1-bit)

State Spaces

State Space	PTX	Scope	Speed
Register	`.reg`	Per-thread	Fastest
Shared	`.shared`	Per-block	Fast
Global	`.global`	Device-wide	Slow
Local	`.local`	Per-thread spill	Slow
Constant	`.const`	Device-wide (cached)	Fast
Parameter	`.param`	Kernel args	-

Best Practices

Minimize global memory access - Use shared memory for data reuse
Coalesce memory accesses - Adjacent threads access adjacent memory
Use FMA instructions - one fma_f32 replaces a separate mul and add and rounds only once
Avoid branch divergence - Keep warps executing the same path
Maximize occupancy - Balance register usage vs parallelism
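
The coalescing guideline can be illustrated with a simplified model: count how many 128-byte memory segments one 32-thread warp touches when each thread loads a single `f32`. The segment size and the function below are assumptions made for this sketch. Real hardware behaviour depends on the architecture and cache configuration.

```rust
use std::collections::BTreeSet;

/// Distinct 128-byte segments touched when thread `tid` of a 32-thread warp
/// loads the f32 at element index `tid * stride` (simplified model).
fn segments_touched(stride: u64) -> usize {
    const WARP_SIZE: u64 = 32;
    const SEGMENT_BYTES: u64 = 128;
    const ELEM_BYTES: u64 = 4;
    (0..WARP_SIZE)
        .map(|tid| tid * stride * ELEM_BYTES / SEGMENT_BYTES)
        .collect::<BTreeSet<u64>>()
        .len()
}
```

In this model a stride of 1 (adjacent threads, adjacent elements) touches one segment, while a stride of 32 touches 32 segments for the same amount of useful data.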

Feature Flags

[dependencies]
trueno-gpu = { version = "0.1", features = ["cuda"] }

default - PTX generation only (no CUDA runtime required)
cuda - Enable CUDA driver FFI for actual execution
