Benchmarks

Running Benchmarks

```bash
# Criterion benchmarks
cargo bench --all-features

# With baseline comparison
cargo bench --all-features -- --save-baseline main

# Example benchmarks
cargo run -p trueno-zram-core --example compress_benchmark --release
```
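Beyond Criterion, throughput numbers like the ones below can be sanity-checked with a hand-rolled harness. The sketch below is illustrative only and not part of trueno-zram's benches; `throughput_gbps` is a hypothetical helper, and the memcpy workload stands in for a compress call.

```rust
use std::time::Instant;

/// Measure throughput in GB/s for a closure that processes `bytes` bytes per call.
fn throughput_gbps<F: FnMut()>(bytes: usize, iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    let secs = start.elapsed().as_secs_f64();
    (bytes as f64 * iters as f64) / secs / 1e9
}

fn main() {
    // Toy workload: copy 1000 four-KiB pages, standing in for a compress pass.
    let src = vec![0u8; 1000 * 4096];
    let mut dst = vec![0u8; src.len()];
    let gbps = throughput_gbps(src.len(), 50, || dst.copy_from_slice(&src));
    println!("memcpy throughput: {gbps:.1} GB/s");
}
```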

Results Summary

LZ4 Performance

| Backend | Compress | Decompress | Ratio |
|---------|----------|------------|-------|
| AVX-512 | 4.4 GB/s | 5.4 GB/s | 2-4x |
| AVX2 | 3.2 GB/s | 4.1 GB/s | 2-4x |
| Scalar | 3.0 GB/s | 3.8 GB/s | 2-4x |

ZSTD Performance

| Backend | Level | Compress | Decompress | Ratio |
|---------|-------|----------|------------|-------|
| AVX-512 | 1 | 11.2 GB/s | 46 GB/s | 3-5x |
| AVX-512 | 3 | 8.5 GB/s | 45 GB/s | 4-6x |
| AVX2 | 1 | 8.5 GB/s | 35 GB/s | 3-5x |

Same-Fill Performance

| Backend | Detection | Ratio |
|---------|-----------|-------|
| AVX-512 | 22 GB/s | 2048:1 |
| AVX2 | 18 GB/s | 2048:1 |
| Scalar | 12 GB/s | 2048:1 |
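Same-fill detection amounts to checking that every word of a 4 KiB page equals the first one, so a filled page can be stored as a single fill value instead of being compressed. The sketch below is a portable scalar version (hypothetical `is_same_fill`, corresponding to the Scalar row; the AVX-512 path compares 64 bytes per instruction instead):

```rust
/// Scalar same-fill check over a 4 KiB page: returns the fill word if every
/// 8-byte word equals the first, so the page can be stored as one u64.
fn is_same_fill(page: &[u8; 4096]) -> Option<u64> {
    let first = u64::from_ne_bytes(page[..8].try_into().unwrap());
    let same = page
        .chunks_exact(8)
        .all(|w| u64::from_ne_bytes(w.try_into().unwrap()) == first);
    same.then_some(first)
}

fn main() {
    let zeros = [0u8; 4096];
    assert_eq!(is_same_fill(&zeros), Some(0));

    let mut mixed = [0u8; 4096];
    mixed[100] = 1; // one differing byte defeats the fast path
    assert_eq!(is_same_fill(&mixed), None);
    println!("same-fill detection ok");
}
```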

Data Patterns

Zeros (Best Case)

Pattern: Zeros (100% same-fill)
Pages: 1000
Compression: 22 GB/s
Decompression: 46 GB/s
Ratio: 2048:1

Text (Compressible)

Pattern: Text/Code
Pages: 1000
LZ4 Compression: 4.4 GB/s
LZ4 Decompression: 5.4 GB/s
Ratio: 3.2:1

Random (Incompressible)

Pattern: Random bytes
Pages: 1000
LZ4 Compression: 1.6 GB/s
LZ4 Decompression: 32 GB/s
Ratio: 1.0:1 (pass-through)
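The pass-through behavior can be sketched as a storage decision: keep the compressed form only when it is actually smaller than the page, otherwise store the page verbatim so decompression is a plain copy. `Stored` and `store` below are illustrative names, not the crate's API.

```rust
/// Hypothetical storage decision for a 4 KiB page.
enum Stored {
    Compressed(Vec<u8>),
    Raw([u8; 4096]), // pass-through: ratio 1.0:1, decompression is a memcpy
}

fn store(page: [u8; 4096], compressed: Vec<u8>) -> Stored {
    if compressed.len() < page.len() {
        Stored::Compressed(compressed)
    } else {
        Stored::Raw(page) // incompressible: don't pay for a failed compression
    }
}

fn main() {
    // A "compressed" buffer that failed to shrink falls back to raw storage.
    match store([7u8; 4096], vec![0u8; 5000]) {
        Stored::Raw(_) => println!("pass-through"),
        Stored::Compressed(_) => unreachable!(),
    }
}
```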

GPU Benchmarks

Note (2026-01-06): GPU decompression is limited by PCIe transfer overhead. CPU parallel path (50+ GB/s) is faster for most workloads.

RTX 4090 (Validated)

| Path | Throughput | Notes |
|------|------------|-------|
| CPU Parallel | 50+ GB/s | Primary recommended path |
| GPU End-to-End | ~6 GB/s | PCIe 4.0 transfer bottleneck |
| GPU Kernel-only | ~9 GB/s | Without H2D/D2H transfers |

Recommendation

Use CPU parallel decompression for best performance. The GPU path remains useful for:

  • Future PCIe 5.0+ systems with higher bandwidth
  • Workloads where CPU is saturated
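One way to encode this recommendation is a simple path-selection heuristic. The code below is illustrative only; the threshold is derived from the measurements above (CPU parallel sustains 50+ GB/s, GPU end-to-end ~6 GB/s over PCIe 4.0), not from anything in the crate.

```rust
/// Illustrative decompression path choice: prefer the CPU parallel path
/// unless the cores are already saturated by other work.
#[derive(Debug, PartialEq)]
enum DecompressPath {
    CpuParallel,
    Gpu,
}

fn choose_path(cpu_utilization: f64) -> DecompressPath {
    // Threshold is a hypothetical tuning knob, not a crate constant.
    if cpu_utilization < 0.9 {
        DecompressPath::CpuParallel
    } else {
        DecompressPath::Gpu
    }
}

fn main() {
    println!("{:?}", choose_path(0.3));
}
```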

Latency

| Operation | P50 | P99 | P99.9 |
|-----------|-----|-----|-------|
| LZ4 compress (4KB) | 45 us | 85 us | 120 us |
| LZ4 decompress (4KB) | 38 us | 72 us | 95 us |
| Same-fill detect | 8 us | 15 us | 25 us |
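Percentile latencies like these P50/P99/P99.9 figures are commonly computed with the nearest-rank method over sorted samples. A minimal sketch (hypothetical `percentile` helper, not the project's reporting code):

```rust
/// Nearest-rank percentile over measured latencies (in microseconds):
/// sort the samples, then take the ceil(p% * n)-th smallest value.
fn percentile(samples: &mut [u64], p: f64) -> u64 {
    samples.sort_unstable();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1).min(samples.len() - 1)]
}

fn main() {
    let mut lat = vec![45u64, 38, 85, 72, 120, 95];
    println!("P50 = {} us", percentile(&mut lat, 50.0));
}
```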

Memory Usage

| Component | Memory |
|-----------|--------|
| Hash table (LZ4) | 64 KB |
| Working buffer | 16 KB |
| ZSTD context | 256 KB |

Comparison with Linux Kernel zram

Validated 2026-01-06 on RTX 4090 + AMD Threadripper 7960X (AVX-512)

Block Device I/O (fio, Direct I/O)

| Metric | Kernel ZRAM | trueno-ublk | Speedup |
|--------|-------------|-------------|---------|
| Sequential Read | 9.2 GB/s | 16.5 GB/s | 1.8x |
| Random 4K IOPS | 55K | 249K | 4.5x |
| Compression Ratio | 2.5x | 3.87x | +55% |

Compression Engine (cargo examples)

| Metric | Linux Kernel zram | trueno-zram | Speedup |
|--------|-------------------|-------------|---------|
| Compress (parallel) | 3-5 GB/s | 20-30 GB/s | 5-6x |
| Decompress (CPU) | ~10 GB/s | 50+ GB/s | 5x |
| Same-fill detection | ~8 GB/s | 22 GB/s | 2.75x |
| Compression ratio | 2.5x | 3.87x | +55% |

Architecture Difference

| Aspect | Linux Kernel | trueno-zram |
|--------|--------------|-------------|
| Threading | Single-threaded per page | Parallel (rayon) |
| SIMD | Limited | AVX-512/AVX2/NEON |
| Batch processing | No | Yes (5000+ pages) |
| GPU offload | No | Optional CUDA |
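Batch processing can be sketched with standard-library scoped threads: split the batch across workers so per-call setup is amortized over many pages. trueno-zram itself uses rayon for this; the sketch below stays dependency-free, and `compress_page` is a toy stand-in for the real codec.

```rust
/// Toy "codec" standing in for the real per-page compressor: drop zero bytes.
fn compress_page(page: &[u8]) -> Vec<u8> {
    page.iter().filter(|&&b| b != 0).copied().collect()
}

/// Compress a batch of pages on `workers` threads, preserving page order.
fn compress_batch(pages: &[Vec<u8>], workers: usize) -> Vec<Vec<u8>> {
    let chunk = ((pages.len() + workers - 1) / workers).max(1);
    let mut out = Vec::new();
    std::thread::scope(|s| {
        // Spawn one worker per contiguous chunk of pages.
        let handles: Vec<_> = pages
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|p| compress_page(p)).collect::<Vec<_>>()))
            .collect();
        // Joining in spawn order keeps results aligned with the input pages.
        for h in handles {
            out.extend(h.join().unwrap());
        }
    });
    out
}

fn main() {
    let pages: Vec<Vec<u8>> = (0..16).map(|i| vec![i as u8; 4096]).collect();
    let out = compress_batch(&pages, 4);
    println!("compressed {} pages", out.len());
}
```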

Falsification Testing (2026-01-06)

Test Configuration:
├── Hardware: AMD Threadripper 7960X, 125GB RAM, RTX 4090
├── Device: trueno-ublk 8GB device
└── Tool: fio, cargo examples

Results:
├── Sequential I/O: 16.5 GB/s ✓ (claim: 12.5 GB/s)
├── Random IOPS: 249K ✓ (claim: 228K)
├── Compression: 30.66 GB/s ✓ (claim: 20-24 GB/s)
├── Ratio: 3.87x ✓ (claim: 3.7x)
├── mlock: 272 MB ✓ (claim: >100 MB)
├── Stress test: No deadlock ✓
└── Status: 6/8 PASS (GPU claims deprecated)

Why trueno-zram is Faster

  1. SIMD vectorization: AVX-512 processes 64 bytes per instruction vs byte-by-byte
  2. Parallel compression: All CPU cores compress simultaneously via rayon
  3. Batch amortization: Setup costs spread across thousands of pages
  4. Cache efficiency: Sequential memory access patterns
  5. Zero-copy paths: Same-fill pages detected without compression attempt