LZ4 Compression

trueno-gpu provides GPU-accelerated LZ4 compression kernels for high-throughput data compression. The implementation uses a warp-cooperative architecture optimized for ZRAM-style memory page compression.

Overview

LZ4 is a fast compression algorithm that trades compression ratio for speed. It's widely used in:

ZRAM (Linux kernel memory compression)
ZFS/Btrfs (filesystem compression)
Database systems (page compression)
Real-time streaming (network compression)

trueno-gpu generates both NVIDIA PTX and WebGPU WGSL shaders from pure Rust, requiring no external toolchains.

Architecture

Warp-per-Page Strategy

Each warp (32 threads) processes one 4KB memory page cooperatively:

Block (128 threads = 4 warps)
┌────────┬────────┬────────┬────────┐
│ Warp 0 │ Warp 1 │ Warp 2 │ Warp 3 │
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │
└────────┴────────┴────────┴────────┘

Each warp executes three phases:

Cooperative Load: All 32 threads load 128 bytes each (4KB total)
Zero-Page Detection: Parallel OR reduction to detect all-zero pages
Hash-based Matching: Find LZ4 match references (future enhancement)

LZ4 Block Format

LZ4 encodes data as sequences of:

Literals: Raw uncompressed bytes
Matches: Back-references to previously seen data (offset + length)

Token format: [4-bit literal length][4-bit match length]

API Reference

CPU Reference Implementation

The CPU functions are useful for testing and validation:

use trueno_gpu::kernels::lz4::{
    lz4_compress_block,
    lz4_decompress_block,
    lz4_hash,
    PAGE_SIZE,
};

// Compress a 4KB page
let input = vec![0u8; PAGE_SIZE as usize];
let mut compressed = vec![0u8; PAGE_SIZE as usize];
let size = lz4_compress_block(&input, &mut compressed)?;

// Decompress back
let mut decompressed = vec![0u8; PAGE_SIZE as usize];
let decomp_size = lz4_decompress_block(&compressed[..size], &mut decompressed)?;
assert_eq!(decompressed, input);

GPU Kernel Generation

Generate PTX or WGSL for GPU execution:

use trueno_gpu::kernels::{Lz4WarpCompressKernel, Kernel};

// Create kernel for batch of 1000 pages
let kernel = Lz4WarpCompressKernel::new(1000);

// Get launch configuration
let (grid_x, grid_y, grid_z) = kernel.grid_dim();   // (250, 1, 1)
let (block_x, block_y, block_z) = kernel.block_dim(); // (128, 1, 1)
let shared_mem = kernel.shared_memory_bytes();        // 49152 bytes

// Generate NVIDIA PTX
let ptx = kernel.emit_ptx();

// Generate WebGPU WGSL (cross-platform)
let wgsl = kernel.emit_wgsl();

// Analyze barrier safety (PARITY-114 prevention)
let safety = kernel.analyze_barrier_safety();
assert!(safety.is_safe);

Constants

use trueno_gpu::kernels::lz4::{
    LZ4_MIN_MATCH,    // 4 - Minimum match length
    LZ4_MAX_MATCH,    // 274 - Maximum match length
    LZ4_HASH_BITS,    // 12 - Hash table index bits
    LZ4_HASH_SIZE,    // 4096 - Hash table entries
    LZ4_HASH_MULT,    // Knuth multiplicative hash constant
    LZ4_MAX_OFFSET,   // 65535 - Maximum match offset
    PAGE_SIZE,        // 4096 - Standard page size
};

Compression Ratios

Typical compression ratios for different data types:

Data Type	Input Size	Compressed	Ratio
Zero page	4096 bytes	~20 bytes	200:1
Repeated pattern	4096 bytes	~275 bytes	15:1
Text data	4096 bytes	~130 bytes	30:1
Random data	4096 bytes	~4100 bytes	<1:1

Example

Run the complete LZ4 demo:

cargo run -p trueno-gpu --example lz4_compression

Output:

╔══════════════════════════════════════════════════════════════╗
║       trueno-gpu LZ4 Compression Kernel Demo                ║
║   Pure Rust PTX/WGSL Generation - No nvcc Required          ║
╚══════════════════════════════════════════════════════════════╝

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Part 1: CPU Reference Implementation
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Zero Page (4KB):
    Original:   4096 bytes
    Compressed: 20 bytes
    Ratio:      204.8:1

Dual Backend Support

The LZ4 kernel supports both:

NVIDIA CUDA: Via pure Rust PTX generation (no nvcc)
WebGPU: Via WGSL generation (Vulkan/Metal/DX12/WebGPU)

This enables GPU-accelerated compression on any platform with GPU compute support.

Use Cases

ZRAM Memory Compression

Linux ZRAM uses LZ4 for memory page compression:

// Process 72GB of memory pages (18M pages)
let kernel = Lz4WarpCompressKernel::new(18_000_000);
let ptx = kernel.emit_ptx();
// Launch on GPU for 100+ GB/s throughput

Database Page Compression

Compress database pages before writing to storage:

// Batch compress 1000 database pages
let kernel = Lz4WarpCompressKernel::new(1000);
let (grid, _, _) = kernel.grid_dim();
// 250 blocks × 4 pages/block = 1000 pages

Real-time Streaming

Compress network data with minimal latency:

// Small batches for low latency
let kernel = Lz4WarpCompressKernel::new(64);
// Grid: (16, 1, 1) - fits in single SM

Performance Considerations

Zero-page detection: Pages that are all zeros compress to ~20 bytes (huge savings for sparse data)
Batch size: Larger batches amortize kernel launch overhead
Shared memory: 48KB per block limits occupancy
PCIe transfer: For small batches, CPU may be faster due to transfer overhead

Future Enhancements

Full hash-based match finding on GPU
Streaming compression with ring buffers
Multi-frame dictionary support
Hardware-accelerated entropy coding

Trueno - High-Performance SIMD/GPU Compute Library