# GPU Batch Compression
trueno-zram supports CUDA GPU acceleration for batch compression of memory pages.
## Current Status (2026-01-06)
Important: GPU compression is currently blocked by NVIDIA PTX bug F082 (Computed Address Bug). The production architecture uses:
- Compression: CPU SIMD (AVX-512) at 20-30 GB/s with a 3.87x compression ratio
- Decompression: CPU parallel at 50+ GB/s (primary path)
GPU decompression is available, but the CPU parallel path is faster due to PCIe transfer overhead (~6 GB/s end-to-end vs 50+ GB/s on the CPU).
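To see why the GPU path loses end to end, a back-of-the-envelope model helps: each batch must cross PCIe twice (pages in, results out), so transfer time bounds the effective rate no matter how fast the kernel is. The sketch below works through that arithmetic; the bandwidth figures, batch size, and the serialized-transfer simplification are illustrative assumptions, not measurements from trueno-zram.

```rust
// Back-of-the-envelope model of end-to-end GPU throughput.
// All figures are illustrative assumptions, not measured values,
// and transfers are treated as serialized with the kernel.
fn main() {
    let batch_bytes = 10_000.0 * 4096.0; // 10,000 pages of 4 KiB
    let pcie_bw = 12.0e9;                // assumed effective PCIe bandwidth, bytes/s
    let kernel_bw = 100.0e9;             // assumed on-GPU processing rate, bytes/s

    // Each batch crosses PCIe twice: host-to-device in, device-to-host out.
    let transfer_s = 2.0 * batch_bytes / pcie_bw;
    let kernel_s = batch_bytes / kernel_bw;

    // The effective rate is dominated by transfer time, landing in the
    // single-digit GB/s range even with a fast kernel.
    let end_to_end = batch_bytes / (transfer_s + kernel_s);
    println!("end-to-end: {:.1} GB/s", end_to_end / 1e9);
}
```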
## NVIDIA F082 Bug (Computed Address Bug)
F081 (the Loaded Value Bug) was falsified on 2026-01-05: the pattern works correctly.
The actual bug is F082: addresses computed from shared-memory values cause crashes:
```
// F081 pattern - WORKS CORRECTLY (falsified):
ld.shared.u32 %r_val, [addr];              // Load from shared memory
st.global.u32 [dest], %r_val;              // Actually works!

// F082 pattern - CRASHES:
ld.shared.u32 %r_offset, [shared_addr];    // Load offset from shared memory
add.u64       %r_dest, %r_base, %r_offset; // Compute destination address
st.global.u32 [%r_dest], %r_data;          // CRASH - address derived from shared load
```
Status: the true root cause was identified via Popperian falsification. See KF-002 and ublk-batched-gpu-compression.md for details.
## When to Use GPU
GPU decompression is beneficial when:
- Large batches: 2000+ pages to decompress
- PCIe 5x rule satisfied: Computation time > 5x transfer time (see the sketch after this list)
- GPU available: CUDA-capable GPU with SM 7.0+
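As a rough illustration, these checks can be folded into a single predicate. The helper below is hypothetical (not part of the trueno-zram API); the 2000-page threshold and the 5x comparison come directly from the list above, and the timing estimates are assumed inputs.

```rust
/// Hypothetical helper, not part of the trueno-zram API: decide whether
/// GPU decompression is worth dispatching for a batch.
fn gpu_worthwhile(
    num_pages: usize,
    estimated_kernel_ns: u64,
    estimated_transfer_ns: u64,
    gpu_available: bool,
) -> bool {
    let large_batch = num_pages >= 2000;                              // "2000+ pages"
    let pcie_rule = estimated_kernel_ns > 5 * estimated_transfer_ns;  // PCIe 5x rule
    gpu_available && large_batch && pcie_rule
}
```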
## Basic Usage
```rust
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig, gpu_available};
use trueno_zram_core::{Algorithm, PAGE_SIZE};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    if gpu_available() {
        let config = GpuBatchConfig {
            device_index: 0,
            algorithm: Algorithm::Lz4,
            batch_size: 1000,
            async_dma: true,
            ring_buffer_slots: 4,
        };

        let mut compressor = GpuBatchCompressor::new(config)?;

        let pages: Vec<[u8; PAGE_SIZE]> = vec![[0u8; PAGE_SIZE]; 1000];
        let result = compressor.compress_batch(&pages)?;

        println!("Compression ratio: {:.2}x", result.compression_ratio());
    }
    Ok(())
}
```
## Configuration Options
```rust
pub struct GpuBatchConfig {
    /// CUDA device index (0 = first GPU)
    pub device_index: u32,
    /// Compression algorithm
    pub algorithm: Algorithm,
    /// Number of pages per batch
    pub batch_size: usize,
    /// Enable async DMA transfers
    pub async_dma: bool,
    /// Ring buffer slots for pipelining
    pub ring_buffer_slots: usize,
}
```
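In practice, batch_size is the main tuning knob. The sketch below sizes a config from the Supported GPUs table further down; the choice of the RTX 4090 row and the other field values are illustrative assumptions, not defaults recommended by the library.

```rust
// Illustrative configuration only; values are assumptions, with batch_size
// taken from the RTX 4090 row of the Supported GPUs table below.
let config = GpuBatchConfig {
    device_index: 0,
    algorithm: Algorithm::Lz4,
    batch_size: 14_745,   // optimal batch for RTX 4090 per the table
    async_dma: true,      // enable async DMA transfers
    ring_buffer_slots: 4, // ring buffer slots for pipelining
};
```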
## Batch Results
The BatchResult struct provides a per-batch timing breakdown:
```rust
let result = compressor.compress_batch(&pages)?;

// Timing components
println!("H2D transfer: {} ns", result.h2d_time_ns);
println!("Kernel execution: {} ns", result.kernel_time_ns);
println!("D2H transfer: {} ns", result.d2h_time_ns);
println!("Total time: {} ns", result.total_time_ns);

// Metrics
let throughput = result.throughput_bytes_per_sec(pages.len() * PAGE_SIZE);
println!("Throughput: {:.2} GB/s", throughput / 1e9);
println!("Compression ratio: {:.2}x", result.compression_ratio());
println!("PCIe 5x rule satisfied: {}", result.pcie_rule_satisfied());
```
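The timing fields also let you check the 5x rule by hand. The comparison below assumes the rule weighs kernel time against total transfer time (H2D plus D2H), matching the definition under When to Use GPU; prefer the library's pcie_rule_satisfied() for the authoritative answer.

```rust
// Manual sanity check of the PCIe 5x rule from the timing fields.
// Assumes the rule compares kernel time against total transfer time.
let transfer_ns = result.h2d_time_ns + result.d2h_time_ns;
let satisfies_5x = result.kernel_time_ns > 5 * transfer_ns;
println!(
    "kernel {} ns vs 5x transfer {} ns -> {}",
    result.kernel_time_ns,
    5 * transfer_ns,
    satisfies_5x
);
```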
## Backend Selection
Use select_backend to determine the optimal backend:
```rust
use trueno_zram_core::gpu::{select_backend, gpu_available};

let batch_size = 5000;
let has_gpu = gpu_available();

let backend = select_backend(batch_size, has_gpu);
println!("Selected backend: {:?}", backend);
// Output: Gpu (for large batches with GPU available)
```
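The exact decision logic is not documented here, but based on the guidance under When to Use GPU it behaves roughly like the sketch below. This is an illustration only, not the library's implementation; the Backend enum is redefined locally just for the example.

```rust
// Illustration of the selection heuristic, not the library's implementation.
// The 2000-page threshold follows the guidance under "When to Use GPU".
#[derive(Debug)]
enum Backend {
    Cpu,
    Gpu,
}

fn select_backend_sketch(batch_size: usize, has_gpu: bool) -> Backend {
    if has_gpu && batch_size >= 2000 {
        Backend::Gpu
    } else {
        Backend::Cpu
    }
}
```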
## Supported GPUs
| GPU | Architecture | SM | Optimal Batch |
|---|---|---|---|
| H100 | Hopper | 9.0 | 10,240 pages |
| A100 | Ampere | 8.0 | 8,192 pages |
| RTX 4090 | Ada | 8.9 | 14,745 pages |
| RTX 3090 | Ampere | 8.6 | 6,144 pages |
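To pick batch_size programmatically, the table can be encoded as a small lookup. The helper below is hypothetical and not part of trueno-zram; it simply mirrors the rows above.

```rust
/// Hypothetical helper mirroring the table above; not part of trueno-zram.
fn optimal_batch_for(gpu_name: &str) -> Option<usize> {
    match gpu_name {
        "H100" => Some(10_240),
        "A100" => Some(8_192),
        "RTX 4090" => Some(14_745),
        "RTX 3090" => Some(6_144),
        _ => None, // unknown GPU: fall back to a measured or default batch size
    }
}
```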
## Pure Rust PTX Generation
trueno-zram uses trueno-gpu for pure Rust PTX generation:
- No LLVM dependency
- No nvcc required
- Kernel code in Rust, compiled to PTX at runtime
- Warp-cooperative LZ4 compression (4 warps/block)
```rust
// The LZ4 kernel processes 4 pages per block (1 page per warp).
// It uses shared memory for hash tables and match finding, and
// cvta.shared.u64 for generic addressing.
```