
Introduction

trueno-zram is a SIMD and GPU-accelerated memory compression library for Linux systems. It provides userspace Rust implementations of LZ4 and ZSTD compression that leverage modern CPU vector instructions (AVX2, AVX-512, NEON) and optional CUDA GPU acceleration.

Production Status (2026-01-06)

MILESTONE DT-005 ACHIEVED: trueno-zram is running as system swap!

  • 8GB device active as primary swap (priority 150)
  • Validated under memory pressure with ~185MB swap utilization
  • CPU SIMD compress (20-30 GB/s) + CPU parallel decompress (48 GB/s)

DT-007 COMPLETED: Swap deadlock issue FIXED via mlock() - 211 MB daemon memory pinned.

Known Limitations:

  • GPU compression blocked by NVIDIA F082 bug (F081 was falsified, using CPU SIMD instead)
  • I/O throughput lower than kernel ZRAM (userspace overhead)

Why trueno-zram?

trueno-zram provides better compression efficiency than kernel ZRAM:

| Advantage | trueno-zram | Kernel ZRAM |
|---|---|---|
| Compression ratio | 3.87x | 2.5x |
| Space efficiency | 55% better | baseline |
| P99 latency | 16.5 µs | varies |

Trade-off: Kernel ZRAM has higher raw I/O throughput (operates entirely in kernel space). trueno-zram uses ublk which adds userspace overhead but provides:

  • Runtime SIMD dispatch: Automatically selects AVX-512, AVX2, or NEON based on CPU
  • Userspace flexibility: debugging, monitoring, custom algorithms
  • Adaptive algorithm selection: ML-driven selection based on page entropy
  • Better compression: 3.87x vs kernel’s 2.5x

Validated Performance (QA-FALSIFY-001)

| Claim | Result | Status |
|---|---|---|
| Compression ratio | 3.87x | PASS |
| SIMD compression | 20-30 GB/s | PASS |
| SIMD decompression | 48 GB/s | PASS |
| P99 latency | 16.5 µs | PASS |
| mlock (DT-007) | 211 MB | PASS |

| Falsified Claim | Actual |
|---|---|
| 1.8x vs kernel I/O | Kernel 3-13x faster |
| 228K IOPS | 123K IOPS |

All metrics independently verified via falsification testing (2026-01-06)

Part of the PAIML Ecosystem

trueno-zram is part of the “Batuta Stack”:

  • trueno - High-performance SIMD compute library
  • trueno-gpu - Pure Rust PTX generation for CUDA
  • aprender - Machine learning in pure Rust
  • certeza - Asymptotic test effectiveness framework

License

trueno-zram is dual-licensed under MIT and Apache-2.0.

Installation

From crates.io

Add trueno-zram-core to your Cargo.toml:

[dependencies]
trueno-zram-core = "0.1"

Or use cargo add:

cargo add trueno-zram-core

With CUDA Support

For GPU acceleration, enable the cuda feature:

[dependencies]
trueno-zram-core = { version = "0.1", features = ["cuda"] }

Or:

cargo add trueno-zram-core --features cuda

CUDA Requirements

  • CUDA Toolkit 12.8 or later
  • NVIDIA driver supporting CUDA 12.8
  • GPU with compute capability >= 7.0 (Volta or newer)

Feature Flags

| Feature | Description | Default |
|---|---|---|
| std | Standard library support | Yes |
| nightly | Nightly-only SIMD features | No |
| cuda | CUDA GPU acceleration | No |

System Requirements

  • OS: Linux (kernel >= 5.10 LTS)
  • CPU: x86_64 (AVX2/AVX-512) or AArch64 (NEON)
  • Rust: 1.82.0 or later (MSRV)

Verifying Installation

use trueno_zram_core::{CompressorBuilder, Algorithm};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    println!("trueno-zram installed successfully!");
    println!("SIMD backend: {:?}", compressor.backend());

    Ok(())
}

Building from Source

git clone https://github.com/paiml/trueno-zram
cd trueno-zram

# Build all crates
cargo build --release --all-features

# Run tests
cargo test --workspace --all-features

# Build with CUDA
cargo build --release --features cuda

Quick Start

Basic Compression

use trueno_zram_core::{CompressorBuilder, Algorithm, PAGE_SIZE};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a compressor with LZ4 algorithm
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    // Create a test page (4KB)
    let mut page = [0u8; PAGE_SIZE];
    page[0..100].copy_from_slice(&[0xAA; 100]);

    // Compress
    let compressed = compressor.compress(&page)?;
    println!("Original size: {} bytes", PAGE_SIZE);
    println!("Compressed size: {} bytes", compressed.data.len());
    println!("Ratio: {:.2}x", compressed.ratio());

    // Decompress
    let decompressed = compressor.decompress(&compressed)?;
    assert_eq!(page, decompressed);
    println!("Decompression verified!");

    Ok(())
}

Choosing an Algorithm

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};

// LZ4: Fastest compression, good for most workloads
let lz4 = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// ZSTD Level 1: Better ratio, still fast
let zstd_fast = CompressorBuilder::new()
    .algorithm(Algorithm::Zstd { level: 1 })
    .build()?;

// ZSTD Level 3: Best ratio for compressible data
let zstd_best = CompressorBuilder::new()
    .algorithm(Algorithm::Zstd { level: 3 })
    .build()?;

// Adaptive: Automatically selects based on entropy
let adaptive = CompressorBuilder::new()
    .algorithm(Algorithm::Adaptive)
    .build()?;
}

Compression Statistics

#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// Compress some pages
for _ in 0..100 {
    let page = [0u8; PAGE_SIZE];
    let _ = compressor.compress(&page)?;
}

// Get statistics
let stats = compressor.stats();
println!("Pages compressed: {}", stats.pages);
println!("Total input: {} bytes", stats.bytes_in);
println!("Total output: {} bytes", stats.bytes_out);
println!("Overall ratio: {:.2}x", stats.ratio());
println!("Throughput: {:.2} GB/s", stats.throughput_gbps());
}

Error Handling

use trueno_zram_core::{CompressorBuilder, Algorithm, Error, PAGE_SIZE};

fn compress_page(data: &[u8; PAGE_SIZE]) -> Result<Vec<u8>, Error> {
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    let compressed = compressor.compress(data)?;
    Ok(compressed.data)
}

fn main() {
    let page = [0u8; PAGE_SIZE];

    match compress_page(&page) {
        Ok(data) => println!("Compressed to {} bytes", data.len()),
        Err(Error::BufferTooSmall(msg)) => eprintln!("Buffer error: {msg}"),
        Err(Error::CorruptedData(msg)) => eprintln!("Corrupt data: {msg}"),
        Err(e) => eprintln!("Other error: {e}"),
    }
}

Next Steps

Examples

Running Examples

trueno-zram includes several examples to demonstrate its features:

# GPU information and backend selection
cargo run -p trueno-zram-core --example gpu_info
cargo run -p trueno-zram-core --example gpu_info --features cuda

# Compression benchmarks
cargo run -p trueno-zram-core --example compress_benchmark --release
cargo run -p trueno-zram-core --example compress_benchmark --release --features cuda

GPU Info Example

Shows GPU detection, backend selection logic, and PCIe 5x rule evaluation:

trueno-zram GPU Information
============================

1. GPU Availability
   ----------------
   GPU available: true

   CUDA Device Information:
     Device: NVIDIA GeForce RTX 4090
     Compute Capability: SM 8.9
     Memory: 24564 MB
     L2 Cache: 73728 KB
     Optimal batch: 14745 pages
     Supported: true

2. Backend Selection Logic
   -----------------------
   GPU_MIN_BATCH_SIZE: 1000 pages
   PAGE_SIZE: 4096 bytes

      Batch          No GPU        With GPU
   ------------------------------------------
         1          Scalar          Scalar
        10            Simd            Simd
       100            Simd            Simd
       500            Simd            Simd
      1000            Simd             Gpu
      5000            Simd             Gpu
     10000            Simd             Gpu

3. PCIe 5x Rule Evaluation
   -----------------------
   GPU offload beneficial when: T_cpu > 5 * (T_transfer + T_gpu)

   1K pages, PCIe 4.0, 100 GB/s GPU (4 MB): CPU preferred
   10K pages, PCIe 4.0, 100 GB/s GPU (40 MB): GPU beneficial
   100K pages, PCIe 5.0, 500 GB/s GPU (400 MB): GPU beneficial
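The rule can be checked with simple arithmetic. The sketch below is illustrative only (`gpu_beneficial` is not part of the crate's API) and conservatively assumes a full-size transfer in both directions:

```rust
/// Hypothetical helper: GPU offload pays off when CPU time exceeds
/// five times the combined transfer and GPU kernel time.
/// Throughputs are in GB/s; `bytes` is the batch size.
fn gpu_beneficial(bytes: usize, cpu_gbps: f64, pcie_gbps: f64, gpu_gbps: f64) -> bool {
    let gb = bytes as f64 / 1e9;
    let t_cpu = gb / cpu_gbps;
    // Conservative: count a full-size transfer in both directions.
    let t_transfer = 2.0 * gb / pcie_gbps;
    let t_gpu = gb / gpu_gbps;
    t_cpu > 5.0 * (t_transfer + t_gpu)
}

fn main() {
    let batch = 10_000 * 4096; // 10K pages, ~40 MB
    // A slow scalar CPU path makes offload attractive...
    println!("slow CPU: {}", gpu_beneficial(batch, 0.5, 32.0, 100.0));
    // ...while a fast SIMD path usually does not.
    println!("SIMD CPU: {}", gpu_beneficial(batch, 25.0, 32.0, 100.0));
}
```

The exact crossover depends on the assumed CPU and PCIe throughputs, which is why the evaluation above reports different verdicts for different batch sizes and PCIe generations.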

Compression Benchmark

Measures throughput across different data patterns:

trueno-zram Compression Benchmark
=================================

Pattern: Zeros (compressible)
----------------------------------------------------------------------
   Pages  Algorithm     Compress   Decompress      Ratio    Backend
     100        Lz4      22.01 GB/s      46.02 GB/s   2048.00x Avx512
    1000        Lz4      21.87 GB/s      45.91 GB/s   2048.00x Avx512

Pattern: Text (compressible)
----------------------------------------------------------------------
   Pages  Algorithm     Compress   Decompress      Ratio    Backend
     100        Lz4       4.57 GB/s       5.42 GB/s      3.21x Avx512
    1000        Lz4       4.44 GB/s       5.37 GB/s      3.21x Avx512

Pattern: Random (incompressible)
----------------------------------------------------------------------
   Pages  Algorithm     Compress   Decompress      Ratio    Backend
     100        Lz4       1.87 GB/s      43.78 GB/s      1.00x Avx512
    1000        Lz4       1.61 GB/s      31.64 GB/s      1.00x Avx512

Custom Example: Batch Processing

use trueno_zram_core::{CompressorBuilder, Algorithm, PAGE_SIZE};
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    // Generate test pages
    let pages: Vec<[u8; PAGE_SIZE]> = (0..1000)
        .map(|i| {
            let mut page = [0u8; PAGE_SIZE];
            // Create compressible pattern
            for (j, byte) in page.iter_mut().enumerate() {
                *byte = ((i + j) % 256) as u8;
            }
            page
        })
        .collect();

    // Benchmark compression
    let start = Instant::now();
    let mut total_compressed = 0;

    for page in &pages {
        let compressed = compressor.compress(page)?;
        total_compressed += compressed.data.len();
    }

    let elapsed = start.elapsed();
    let input_bytes = pages.len() * PAGE_SIZE;
    let throughput = input_bytes as f64 / elapsed.as_secs_f64() / 1e9;
    let ratio = input_bytes as f64 / total_compressed as f64;

    println!("Compressed {} pages in {:?}", pages.len(), elapsed);
    println!("Throughput: {:.2} GB/s", throughput);
    println!("Compression ratio: {:.2}x", ratio);

    Ok(())
}

Custom Example: GPU Batch Compression

use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig, gpu_available};
use trueno_zram_core::{Algorithm, PAGE_SIZE};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    if !gpu_available() {
        println!("No GPU available, skipping GPU example");
        return Ok(());
    }

    let config = GpuBatchConfig {
        device_index: 0,
        algorithm: Algorithm::Lz4,
        batch_size: 1000,
        async_dma: true,
        ring_buffer_slots: 4,
    };

    let mut compressor = GpuBatchCompressor::new(config)?;

    // Create batch of pages
    let pages: Vec<[u8; PAGE_SIZE]> = vec![[0u8; PAGE_SIZE]; 1000];

    // Compress batch
    let result = compressor.compress_batch(&pages)?;

    println!("Batch Results:");
    println!("  Pages: {}", result.pages.len());
    println!("  H2D time: {} ns", result.h2d_time_ns);
    println!("  Kernel time: {} ns", result.kernel_time_ns);
    println!("  D2H time: {} ns", result.d2h_time_ns);
    println!("  Total time: {} ns", result.total_time_ns);
    println!("  Compression ratio: {:.2}x", result.compression_ratio());
    println!("  PCIe 5x rule satisfied: {}", result.pcie_rule_satisfied());

    // Get compressor stats
    let stats = compressor.stats();
    println!("\nCompressor Stats:");
    println!("  Pages compressed: {}", stats.pages_compressed);
    println!("  Throughput: {:.2} GB/s", stats.throughput_gbps());

    Ok(())
}

Compression Algorithms

trueno-zram supports multiple compression algorithms optimized for memory page compression.

LZ4

LZ4 is a lossless compression algorithm focused on compression and decompression speed.

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};

let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;
}

Characteristics

| Metric | Value |
|---|---|
| Compression speed | 4.4 GB/s (AVX-512) |
| Decompression speed | 5.4 GB/s (AVX-512) |
| Typical ratio | 2-4x for compressible data |
| Best for | Speed-critical workloads |

When to Use LZ4

  • Real-time compression requirements
  • High-throughput memory compression
  • When CPU overhead must be minimal
  • Mixed workloads with varying compressibility

ZSTD (Zstandard)

ZSTD provides better compression ratios while maintaining good speed.

#![allow(unused)]
fn main() {
// Fast mode (level 1)
let fast = CompressorBuilder::new()
    .algorithm(Algorithm::Zstd { level: 1 })
    .build()?;

// Better compression (level 3)
let better = CompressorBuilder::new()
    .algorithm(Algorithm::Zstd { level: 3 })
    .build()?;
}

Compression Levels

| Level | Compression | Decompression | Ratio |
|---|---|---|---|
| 1 | 11.2 GB/s | 46 GB/s | Better than LZ4 |
| 3 | 8.5 GB/s | 45 GB/s | Best |

When to Use ZSTD

  • Memory-constrained systems
  • Highly compressible data (text, logs)
  • When compression ratio matters more than speed
  • Cold/archived memory pages

Adaptive Selection

The adaptive algorithm automatically selects based on page entropy:

#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Adaptive)
    .build()?;
}

Selection Logic

  1. Same-fill detection: Pages with repeated values use 2048:1 encoding
  2. Entropy analysis: Shannon entropy determines compressibility
  3. Algorithm selection:
    • Low entropy (< 4 bits): ZSTD for best ratio
    • Medium entropy (4-7 bits): LZ4 for balance
    • High entropy (> 7 bits): Pass-through (incompressible)
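The thresholds above can be sketched as follows; `shannon_entropy` and `Choice` are hypothetical names for illustration, not the crate's internal API:

```rust
#[derive(Debug, PartialEq)]
enum Choice {
    Zstd,        // low entropy: best ratio
    Lz4,         // medium entropy: speed/ratio balance
    PassThrough, // high entropy: store uncompressed
}

/// Shannon entropy in bits per byte (0.0..=8.0).
fn shannon_entropy(page: &[u8]) -> f64 {
    let mut counts = [0u32; 256];
    for &b in page {
        counts[b as usize] += 1;
    }
    let len = page.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

fn select(page: &[u8]) -> Choice {
    match shannon_entropy(page) {
        e if e < 4.0 => Choice::Zstd,
        e if e <= 7.0 => Choice::Lz4,
        _ => Choice::PassThrough,
    }
}

fn main() {
    let zeros = [0u8; 4096]; // single symbol, entropy 0.0
    assert_eq!(select(&zeros), Choice::Zstd);
    println!("zero page entropy: {:.2}", shannon_entropy(&zeros));
}
```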

Same-Fill Optimization

Pages filled with the same byte value get special 2048:1 compression:

#![allow(unused)]
fn main() {
use trueno_zram_core::samefill::{detect_same_fill, CompactSameFill};

let zero_page = [0u8; 4096];

if let Some(fill) = detect_same_fill(&zero_page) {
    let compact = CompactSameFill::new(fill);
    // Only 2 bytes needed to represent 4096-byte page!
    assert_eq!(compact.to_bytes().len(), 2);
}
}

Same-Fill Statistics

  • Zero pages: ~30-40% of typical memory
  • Same-fill pages: ~35-45% total
  • Compression ratio: 2048:1
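To get a feel for these numbers, a buffer can be scanned page by page. The plain scalar check below is a stand-in for the crate's SIMD-accelerated `detect_same_fill`, shown only to illustrate the idea:

```rust
const PAGE_SIZE: usize = 4096;

/// Scalar stand-in for SIMD detection: a page is same-fill
/// when every byte equals the first one.
fn is_same_fill(page: &[u8]) -> bool {
    let first = page[0];
    page.iter().all(|&b| b == first)
}

/// Fraction of whole pages in `buf` that are same-fill.
fn same_fill_fraction(buf: &[u8]) -> f64 {
    let total = buf.len() / PAGE_SIZE;
    let hits = buf
        .chunks_exact(PAGE_SIZE)
        .filter(|&p| is_same_fill(p))
        .count();
    hits as f64 / total as f64
}

fn main() {
    // 4 zero pages followed by 6 pages of mixed content.
    let mut buf = vec![0u8; 10 * PAGE_SIZE];
    for (i, b) in buf[4 * PAGE_SIZE..].iter_mut().enumerate() {
        *b = (i % 251) as u8; // non-repeating pattern
    }
    println!("same-fill fraction: {:.0}%", same_fill_fraction(&buf) * 100.0);
}
```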

Algorithm Comparison

| Algorithm | Compress | Decompress | Ratio | Use Case |
|---|---|---|---|---|
| LZ4 | 4.4 GB/s | 5.4 GB/s | 2-4x | General |
| ZSTD-1 | 11.2 GB/s | 46 GB/s | 3-5x | Balanced |
| ZSTD-3 | 8.5 GB/s | 45 GB/s | 4-6x | Best ratio |
| Same-fill | N/A | N/A | 2048x | Zero/repeated |
| Adaptive | Varies | Varies | Optimal | Automatic |

SIMD Acceleration

trueno-zram uses runtime CPU feature detection to select the optimal SIMD implementation.

Supported Backends

| Backend | Instruction Set | Register Width | Platforms |
|---|---|---|---|
| AVX-512 | AVX-512F/BW/VL | 512-bit | Skylake-X, Ice Lake, Zen 4 |
| AVX2 | AVX2 + FMA | 256-bit | Haswell+, Zen 1+ |
| NEON | ARM NEON | 128-bit | ARMv8-A (AArch64) |
| Scalar | None | 64-bit | All platforms |

Runtime Detection

#![allow(unused)]
fn main() {
use trueno_zram_core::simd::{detect, SimdFeatures};

let features = detect();

println!("AVX-512: {}", features.has_avx512());
println!("AVX2: {}", features.has_avx2());
println!("SSE4.2: {}", features.has_sse42());
}

Automatic Dispatch

The compressor automatically uses the best available backend:

#![allow(unused)]
fn main() {
use trueno_zram_core::CompressorBuilder;

let compressor = CompressorBuilder::new().build()?;

// Check which backend was selected
println!("Backend: {:?}", compressor.backend());
}

Performance by Backend

LZ4 Compression

| Backend | Throughput | Relative |
|---|---|---|
| AVX-512 | 4.4 GB/s | 1.45x |
| AVX2 | 3.2 GB/s | 1.05x |
| Scalar | 3.0 GB/s | 1.0x |

ZSTD Compression

| Backend | Throughput | Relative |
|---|---|---|
| AVX-512 | 11.2 GB/s | 1.40x |
| AVX2 | 8.5 GB/s | 1.06x |
| Scalar | 8.0 GB/s | 1.0x |

SIMD Optimizations

Hash Table Lookups (LZ4)

AVX-512 enables parallel hash probing for match finding:

// Scalar: Sequential probe
for offset in 0..16 {
    if hash_table[hash + offset] == pattern { ... }
}

// AVX-512: Parallel probe (16 comparisons at once)
let matches = _mm512_cmpeq_epi32(hash_values, pattern_broadcast);

Literal Copying

Wide vector moves for copying uncompressed literals:

// AVX-512: 64 bytes per iteration
_mm512_storeu_si512(dst, _mm512_loadu_si512(src));

// AVX2: 32 bytes per iteration
_mm256_storeu_si256(dst, _mm256_loadu_si256(src));

Match Extension

SIMD comparison for extending matches:

#![allow(unused)]
fn main() {
// Compare 64 bytes at once with AVX-512
let cmp = _mm512_cmpeq_epi8(src_chunk, dst_chunk);
let mask = _mm512_movepi8_mask(cmp);
let match_len = mask.trailing_ones();
}

Forcing a Backend

For testing or benchmarking, you can force a specific backend:

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, SimdBackend};

// Force scalar (no SIMD)
let scalar = CompressorBuilder::new()
    .prefer_backend(SimdBackend::Scalar)
    .build()?;

// Force AVX2 (will fail if not available)
let avx2 = CompressorBuilder::new()
    .prefer_backend(SimdBackend::Avx2)
    .build()?;
}

GPU Batch Compression

trueno-zram supports CUDA GPU acceleration for batch compression of memory pages.

Current Status (2026-01-06)

Important: GPU compression is currently blocked by NVIDIA PTX bug F082 (Computed Address Bug). The production architecture uses:

  • Compression: CPU SIMD (AVX-512) at 20-30 GB/s with 3.87x ratio
  • Decompression: CPU parallel at 50+ GB/s (primary path)

GPU decompression is available but CPU parallel path is faster due to PCIe transfer overhead (~6 GB/s end-to-end vs 50+ GB/s CPU).

NVIDIA F082 Bug (Computed Address Bug)

F081 (Loaded Value Bug) was FALSIFIED on 2026-01-05 - the pattern works correctly.

The actual bug is F082: addresses computed from shared memory values cause crashes:

// F081 pattern - WORKS CORRECTLY (falsified):
ld.shared.u32 %r_val, [addr];      // Load from shared memory
st.global.u32 [dest], %r_val;      // Actually works!

// F082 pattern - CRASHES:
ld.shared.u32 %r_offset, [shared_addr];  // Load offset from shared memory
add.u64 %r_dest, %r_base, %r_offset;     // Compute destination address
st.global.u32 [%r_dest], %r_data;        // CRASH - address derived from shared load

Status: True root cause identified via Popperian falsification. See KF-002 and ublk-batched-gpu-compression.md for details.

When to Use GPU

GPU decompression is beneficial when:

  1. Large batches: 2000+ pages to decompress
  2. PCIe 5x rule satisfied: Computation time > 5x transfer time
  3. GPU available: CUDA-capable GPU with SM 7.0+

Basic Usage

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig, gpu_available};
use trueno_zram_core::{Algorithm, PAGE_SIZE};

if gpu_available() {
    let config = GpuBatchConfig {
        device_index: 0,
        algorithm: Algorithm::Lz4,
        batch_size: 1000,
        async_dma: true,
        ring_buffer_slots: 4,
    };

    let mut compressor = GpuBatchCompressor::new(config)?;

    let pages: Vec<[u8; PAGE_SIZE]> = vec![[0u8; PAGE_SIZE]; 1000];
    let result = compressor.compress_batch(&pages)?;

    println!("Compression ratio: {:.2}x", result.compression_ratio());
}
}

Configuration Options

#![allow(unused)]
fn main() {
pub struct GpuBatchConfig {
    /// CUDA device index (0 = first GPU)
    pub device_index: u32,

    /// Compression algorithm
    pub algorithm: Algorithm,

    /// Number of pages per batch
    pub batch_size: usize,

    /// Enable async DMA transfers
    pub async_dma: bool,

    /// Ring buffer slots for pipelining
    pub ring_buffer_slots: usize,
}
}

Batch Results

The BatchResult provides timing breakdown:

#![allow(unused)]
fn main() {
let result = compressor.compress_batch(&pages)?;

// Timing components
println!("H2D transfer: {} ns", result.h2d_time_ns);
println!("Kernel execution: {} ns", result.kernel_time_ns);
println!("D2H transfer: {} ns", result.d2h_time_ns);
println!("Total time: {} ns", result.total_time_ns);

// Metrics
let throughput = result.throughput_bytes_per_sec(pages.len() * PAGE_SIZE);
println!("Throughput: {:.2} GB/s", throughput / 1e9);
println!("Compression ratio: {:.2}x", result.compression_ratio());
println!("PCIe 5x rule satisfied: {}", result.pcie_rule_satisfied());
}

Backend Selection

Use select_backend to determine optimal backend:

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{select_backend, gpu_available};

let batch_size = 5000;
let has_gpu = gpu_available();

let backend = select_backend(batch_size, has_gpu);
println!("Selected backend: {:?}", backend);
// Output: Gpu (for large batches with GPU available)
}

Supported GPUs

| GPU | Architecture | SM | Optimal Batch |
|---|---|---|---|
| H100 | Hopper | 9.0 | 10,240 pages |
| A100 | Ampere | 8.0 | 8,192 pages |
| RTX 4090 | Ada | 8.9 | 14,745 pages |
| RTX 3090 | Ampere | 8.6 | 6,144 pages |

Pure Rust PTX Generation

trueno-zram uses trueno-gpu for pure Rust PTX generation:

  • No LLVM dependency
  • No nvcc required
  • Kernel code in Rust, compiled to PTX at runtime
  • Warp-cooperative LZ4 compression (4 warps/block)

#![allow(unused)]
fn main() {
// The LZ4 kernel processes 4 pages per block (1 page per warp)
// Uses shared memory for hash tables and match finding
// cvta.shared.u64 for generic addressing
}

Same-Fill Detection

Same-fill optimization provides 2048:1 compression for pages containing a single repeated byte value.

Why Same-Fill Matters

Memory pages often contain:

  • Zero pages: ~30-40% of typical memory (uninitialized, freed)
  • Same-fill pages: ~5-10% additional (memset patterns)

Detecting and encoding these specially provides massive compression wins.

Detection

#![allow(unused)]
fn main() {
use trueno_zram_core::samefill::detect_same_fill;
use trueno_zram_core::PAGE_SIZE;

let zero_page = [0u8; PAGE_SIZE];
let pattern_page = [0xAA; PAGE_SIZE];
let mixed_page = [0u8; PAGE_SIZE];

// Zero page detected
assert!(detect_same_fill(&zero_page).is_some());

// Pattern page detected
assert!(detect_same_fill(&pattern_page).is_some());

// Mixed content not detected
let mut mixed = [0u8; PAGE_SIZE];
mixed[100] = 0xFF;
assert!(detect_same_fill(&mixed).is_none());
}

Compact Encoding

Same-fill pages compress to just 2 bytes:

#![allow(unused)]
fn main() {
use trueno_zram_core::samefill::CompactSameFill;
use trueno_zram_core::PAGE_SIZE;

let fill_value = 0u8;
let compact = CompactSameFill::new(fill_value);

// Serialize (2 bytes)
let bytes = compact.to_bytes();
assert_eq!(bytes.len(), 2);

// Deserialize
let restored = CompactSameFill::from_bytes(&bytes)?;
assert_eq!(restored.fill_value(), 0);

// Expand back to full page
let page = restored.expand();
assert_eq!(page.len(), PAGE_SIZE);
assert!(page.iter().all(|&b| b == 0));
}

Compression Ratio

| Page Type | Original | Compressed | Ratio |
|---|---|---|---|
| Zero-fill | 4096 B | 2 B | 2048:1 |
| 0xFF-fill | 4096 B | 2 B | 2048:1 |
| Any same-fill | 4096 B | 2 B | 2048:1 |

Integration with Compressor

The compressor automatically detects same-fill pages:

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm, PAGE_SIZE};

let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

let zero_page = [0u8; PAGE_SIZE];
let compressed = compressor.compress(&zero_page)?;

// Same-fill pages get special encoding
println!("Compressed size: {} bytes", compressed.data.len());
// Output: ~20 bytes (LZ4 minimal encoding for same-fill)
}

Performance

Same-fill detection is extremely fast:

#![allow(unused)]
fn main() {
// SIMD-accelerated comparison
// AVX-512: Check 64 bytes per iteration
// AVX2: Check 32 bytes per iteration
// Typical: <100ns for 4KB page
}

Memory Statistics

On typical systems:

| Memory Type | Same-Fill % |
|---|---|
| Idle desktop | 60-70% |
| Active workload | 35-45% |
| Database server | 25-35% |
| Compilation | 40-50% |

CompressorBuilder API

The CompressorBuilder provides a fluent API for configuring compression.

Basic Usage

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};

let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;
}

Builder Methods

algorithm(algo: Algorithm)

Sets the compression algorithm:

#![allow(unused)]
fn main() {
// LZ4 (fastest)
.algorithm(Algorithm::Lz4)

// ZSTD with compression level
.algorithm(Algorithm::Zstd { level: 1 })
.algorithm(Algorithm::Zstd { level: 3 })

// Adaptive (auto-select based on entropy)
.algorithm(Algorithm::Adaptive)
}

prefer_backend(backend: SimdBackend)

Forces a specific SIMD backend:

#![allow(unused)]
fn main() {
use trueno_zram_core::SimdBackend;

// Force scalar (no SIMD)
.prefer_backend(SimdBackend::Scalar)

// Force AVX2
.prefer_backend(SimdBackend::Avx2)

// Force AVX-512
.prefer_backend(SimdBackend::Avx512)
}

build()

Creates the compressor:

#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?; // Returns Result<Compressor, Error>
}

Compressor Methods

compress(&self, page: &[u8; PAGE_SIZE]) -> Result<CompressedPage>

Compresses a single page:

#![allow(unused)]
fn main() {
let page = [0u8; PAGE_SIZE];
let compressed = compressor.compress(&page)?;

println!("Size: {} bytes", compressed.data.len());
println!("Ratio: {:.2}x", compressed.ratio());
}

decompress(&self, page: &CompressedPage) -> Result<[u8; PAGE_SIZE]>

Decompresses a page:

#![allow(unused)]
fn main() {
let decompressed = compressor.decompress(&compressed)?;
assert_eq!(page, decompressed);
}

stats(&self) -> CompressionStats

Returns compression statistics:

#![allow(unused)]
fn main() {
let stats = compressor.stats();

println!("Pages: {}", stats.pages);
println!("Bytes in: {}", stats.bytes_in);
println!("Bytes out: {}", stats.bytes_out);
println!("Ratio: {:.2}x", stats.ratio());
println!("Compress time: {} ns", stats.compress_time_ns);
println!("Decompress time: {} ns", stats.decompress_time_ns);
println!("Throughput: {:.2} GB/s", stats.throughput_gbps());
}

reset_stats(&mut self)

Resets statistics counters:

#![allow(unused)]
fn main() {
compressor.reset_stats();
}

backend(&self) -> SimdBackend

Returns the active SIMD backend:

#![allow(unused)]
fn main() {
println!("Backend: {:?}", compressor.backend());
// Output: Avx512, Avx2, Neon, or Scalar
}

CompressedPage

The compressed page structure:

#![allow(unused)]
fn main() {
pub struct CompressedPage {
    /// Compressed data
    pub data: Vec<u8>,

    /// Original size (always PAGE_SIZE)
    pub original_size: usize,

    /// Algorithm used
    pub algorithm: Algorithm,
}

impl CompressedPage {
    /// Compression ratio
    pub fn ratio(&self) -> f64;

    /// Bytes saved
    pub fn bytes_saved(&self) -> usize;

    /// Check if actually compressed
    pub fn is_compressed(&self) -> bool;
}
}

Error Handling

#![allow(unused)]
fn main() {
use trueno_zram_core::Error;

match compressor.compress(&page) {
    Ok(compressed) => { /* success */ }
    Err(Error::BufferTooSmall(msg)) => { /* buffer issue */ }
    Err(Error::CorruptedData(msg)) => { /* corrupt input */ }
    Err(Error::InvalidInput(msg)) => { /* invalid params */ }
    Err(e) => { /* other error */ }
}
}

GPU Batch API

The GPU batch API provides high-throughput compression for large page batches.

GpuBatchConfig

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::GpuBatchConfig;
use trueno_zram_core::Algorithm;

let config = GpuBatchConfig {
    device_index: 0,        // CUDA device (0 = first GPU)
    algorithm: Algorithm::Lz4,
    batch_size: 1000,       // Pages per batch
    async_dma: true,        // Enable async transfers
    ring_buffer_slots: 4,   // Pipeline depth
};
}

GpuBatchCompressor

Creation

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig};

let config = GpuBatchConfig::default();
let mut compressor = GpuBatchCompressor::new(config)?;
}

Batch Compression

#![allow(unused)]
fn main() {
let pages: Vec<[u8; PAGE_SIZE]> = vec![[0u8; PAGE_SIZE]; 1000];
let result = compressor.compress_batch(&pages)?;
}

Statistics

#![allow(unused)]
fn main() {
let stats = compressor.stats();

println!("Pages compressed: {}", stats.pages_compressed);
println!("Input bytes: {}", stats.total_bytes_in);
println!("Output bytes: {}", stats.total_bytes_out);
println!("Time: {} ns", stats.total_time_ns);
println!("Ratio: {:.2}x", stats.compression_ratio());
println!("Throughput: {:.2} GB/s", stats.throughput_gbps());
}

Configuration Access

#![allow(unused)]
fn main() {
let config = compressor.config();
println!("Batch size: {}", config.batch_size);
println!("Async DMA: {}", config.async_dma);
}

BatchResult

#![allow(unused)]
fn main() {
pub struct BatchResult {
    /// Compressed pages
    pub pages: Vec<CompressedPage>,

    /// Host-to-device transfer time (ns)
    pub h2d_time_ns: u64,

    /// Kernel execution time (ns)
    pub kernel_time_ns: u64,

    /// Device-to-host transfer time (ns)
    pub d2h_time_ns: u64,

    /// Total wall clock time (ns)
    pub total_time_ns: u64,
}
}

Methods

#![allow(unused)]
fn main() {
// Throughput in bytes/second
let throughput = result.throughput_bytes_per_sec(input_bytes);

// Compression ratio
let ratio = result.compression_ratio();

// Check PCIe 5x rule
let beneficial = result.pcie_rule_satisfied();
}

Helper Functions

gpu_available()

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::gpu_available;

if gpu_available() {
    println!("CUDA GPU detected");
}
}

select_backend()

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{select_backend, BackendSelection};

let backend = select_backend(batch_size, gpu_available());

match backend {
    BackendSelection::Gpu => { /* use GPU */ }
    BackendSelection::Simd => { /* use CPU SIMD */ }
    BackendSelection::Scalar => { /* use scalar */ }
}
}

meets_pcie_rule()

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::meets_pcie_rule;

let pages = 10000;
let pcie_bandwidth = 64.0;  // GB/s (PCIe 5.0)
let gpu_throughput = 500.0; // GB/s

if meets_pcie_rule(pages, pcie_bandwidth, gpu_throughput) {
    println!("GPU offload beneficial");
}
}

GpuDeviceInfo

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::GpuDeviceInfo;

let info = GpuDeviceInfo {
    index: 0,
    name: "RTX 4090".to_string(),
    total_memory: 24 * 1024 * 1024 * 1024,
    l2_cache_size: 72 * 1024 * 1024,
    compute_capability: (8, 9),
    backend: GpuBackend::Cuda,
};

println!("Optimal batch: {} pages", info.optimal_batch_size());
println!("Supported: {}", info.is_supported());
}

Kernel Compatibility API

The compat module provides sysfs interface compatibility with kernel zram.

SysfsInterface

Emulates the kernel zram sysfs interface:

#![allow(unused)]
fn main() {
use trueno_zram_core::compat::SysfsInterface;

let mut interface = SysfsInterface::new();

// Configure like kernel zram
interface.write_attr("disksize", "4G")?;
interface.write_attr("comp_algorithm", "lz4")?;
interface.write_attr("mem_limit", "2G")?;

// Read attributes
let disksize = interface.read_attr("disksize")?;
let algorithm = interface.read_attr("comp_algorithm")?;
}

Supported Attributes

| Attribute | Read | Write | Description |
|---|---|---|---|
| disksize | Yes | Yes | Disk size in bytes |
| comp_algorithm | Yes | Yes | Compression algorithm |
| mem_limit | Yes | Yes | Memory limit |
| mem_used_max | Yes | No | Peak memory usage |
| mem_used_total | Yes | No | Current memory usage |
| orig_data_size | Yes | No | Original data size |
| compr_data_size | Yes | No | Compressed data size |
| num_reads | Yes | No | Read count |
| num_writes | Yes | No | Write count |
| invalid_io | Yes | No | Invalid I/O count |
| notify_free | Yes | No | Free notifications |
| reset | No | Yes | Reset device |
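Size attributes accept human-readable strings such as "4G" or "512M", which must be parsed into byte counts. The parser below is a hypothetical stand-in for whatever the interface uses internally, shown only to illustrate the accepted format:

```rust
/// Illustrative parser for human-readable sizes ("4G", "512M", "4096").
/// Hypothetical helper, not the crate's actual implementation.
fn parse_size(s: &str) -> Option<u64> {
    let s = s.trim();
    let (num, mult) = match s.bytes().last()? {
        b'K' | b'k' => (&s[..s.len() - 1], 1u64 << 10),
        b'M' | b'm' => (&s[..s.len() - 1], 1u64 << 20),
        b'G' | b'g' => (&s[..s.len() - 1], 1u64 << 30),
        _ => (s, 1), // bare number: already in bytes
    };
    num.parse::<u64>().ok().map(|n| n * mult)
}

fn main() {
    assert_eq!(parse_size("4G"), Some(4u64 << 30));
    assert_eq!(parse_size("512M"), Some(512u64 << 20));
    assert_eq!(parse_size("4096"), Some(4096));
    println!("size parsing ok");
}
```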

Algorithm Support

#![allow(unused)]
fn main() {
use trueno_zram_core::compat::{ZramAlgorithm, is_algorithm_supported};

// Check algorithm support
assert!(is_algorithm_supported("lz4"));
assert!(is_algorithm_supported("zstd"));
assert!(is_algorithm_supported("lzo"));  // Compatibility alias

// Parse algorithm
let algo = "lz4".parse::<ZramAlgorithm>()?;
}

Supported Algorithms

| Name | Alias | Description |
|---|---|---|
| lz4 | - | LZ4 fast compression |
| lz4hc | - | LZ4 high compression |
| zstd | - | Zstandard |
| lzo | lzo-rle | LZO (mapped to LZ4) |
| 842 | deflate | 842/deflate (mapped to ZSTD) |

Statistics

MmStat

Memory statistics (like /sys/block/zram0/mm_stat):

#![allow(unused)]
fn main() {
use trueno_zram_core::compat::{MmStat, SysfsInterface};

let interface = SysfsInterface::new();
let stat: MmStat = interface.mm_stat();

println!("Original size: {} bytes", stat.orig_data_size);
println!("Compressed size: {} bytes", stat.compr_data_size);
println!("Memory used: {} bytes", stat.mem_used_total);
println!("Memory limit: {} bytes", stat.mem_limit);
println!("Memory max: {} bytes", stat.mem_used_max);
println!("Same pages: {}", stat.same_pages);
println!("Pages compacted: {}", stat.pages_compacted);
println!("Huge pages: {}", stat.huge_pages);
}

IoStat

I/O statistics (like /sys/block/zram0/io_stat):

#![allow(unused)]
fn main() {
use trueno_zram_core::compat::{IoStat, SysfsInterface};

let interface = SysfsInterface::new();
let stat: IoStat = interface.io_stat();

println!("Reads: {}", stat.num_reads);
println!("Writes: {}", stat.num_writes);
println!("Invalid I/O: {}", stat.invalid_io);
println!("Notify free: {}", stat.notify_free);
}

Device Reset

#![allow(unused)]
fn main() {
// Reset all statistics and data
interface.write_attr("reset", "1")?;
}

Integration Example

#![allow(unused)]
fn main() {
use trueno_zram_core::compat::SysfsInterface;
use trueno_zram_core::PAGE_SIZE;

let mut interface = SysfsInterface::new();

// Configure
interface.write_attr("disksize", "1G")?;
interface.write_attr("comp_algorithm", "lz4")?;

// Simulate writes
let page = [0xAA; PAGE_SIZE];
interface.write_page(0, &page)?;
interface.write_page(1, &page)?;

// Check statistics
let mm = interface.mm_stat();
println!("Compression ratio: {:.2}x",
    mm.orig_data_size as f64 / mm.compr_data_size as f64);
}

Benchmarks

Running Benchmarks

# Criterion benchmarks
cargo bench --all-features

# With baseline comparison
cargo bench --all-features -- --save-baseline main

# Example benchmarks
cargo run -p trueno-zram-core --example compress_benchmark --release

Results Summary

LZ4 Performance

Backend | Compress | Decompress | Ratio
AVX-512 | 4.4 GB/s | 5.4 GB/s | 2-4x
AVX2 | 3.2 GB/s | 4.1 GB/s | 2-4x
Scalar | 3.0 GB/s | 3.8 GB/s | 2-4x

ZSTD Performance

Backend | Level | Compress | Decompress | Ratio
AVX-512 | 1 | 11.2 GB/s | 46 GB/s | 3-5x
AVX-512 | 3 | 8.5 GB/s | 45 GB/s | 4-6x
AVX2 | 1 | 8.5 GB/s | 35 GB/s | 3-5x

Same-Fill Performance

Backend | Detection | Ratio
AVX-512 | 22 GB/s | 2048:1
AVX2 | 18 GB/s | 2048:1
Scalar | 12 GB/s | 2048:1
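Same-fill detection checks whether a page is one machine word repeated; the SIMD backends widen the comparison to 32 or 64 bytes per instruction. A scalar sketch of the check:

```rust
const PAGE_SIZE: usize = 4096;

/// Scalar same-fill check: a 4 KB page qualifies when it is a
/// single 8-byte word repeated 512 times; returns that word.
fn same_fill_word(page: &[u8; PAGE_SIZE]) -> Option<u64> {
    let first = u64::from_ne_bytes(page[..8].try_into().unwrap());
    for chunk in page.chunks_exact(8) {
        if u64::from_ne_bytes(chunk.try_into().unwrap()) != first {
            return None;
        }
    }
    Some(first)
}
```

A qualifying page can then be stored as just the fill word plus metadata instead of compressed data.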

Data Patterns

Zeros (Best Case)

Pattern: Zeros (100% same-fill)
Pages: 1000
Compression: 22 GB/s
Decompression: 46 GB/s
Ratio: 2048:1

Text (Compressible)

Pattern: Text/Code
Pages: 1000
LZ4 Compression: 4.4 GB/s
LZ4 Decompression: 5.4 GB/s
Ratio: 3.2:1

Random (Incompressible)

Pattern: Random bytes
Pages: 1000
LZ4 Compression: 1.6 GB/s
LZ4 Decompression: 32 GB/s
Ratio: 1.0:1 (pass-through)
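The 1.0:1 ratio for random data reflects a pass-through rule: when compression would not shrink a page, the raw bytes are stored instead, so the ratio never drops below 1.0. A sketch of that decision with hypothetical types:

```rust
const PAGE_SIZE: usize = 4096;

/// Hypothetical storage decision for illustration.
#[derive(Debug)]
enum Stored {
    Compressed(usize), // compressed length in bytes
    Raw,               // pass-through: page stored uncompressed
}

/// Keep the compressed form only when it actually saves space.
fn choose_storage(compressed_len: usize) -> Stored {
    if compressed_len < PAGE_SIZE {
        Stored::Compressed(compressed_len)
    } else {
        Stored::Raw
    }
}
```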

GPU Benchmarks

Note (2026-01-06): GPU decompression is limited by PCIe transfer overhead. CPU parallel path (50+ GB/s) is faster for most workloads.

RTX 4090 (Validated)

Path | Throughput | Notes
CPU Parallel | 50+ GB/s | Primary recommended path
GPU End-to-End | ~6 GB/s | PCIe 4.0 transfer bottleneck
GPU Kernel-only | ~9 GB/s | Without H2D/D2H transfers

Recommendation

Use CPU parallel decompression for best performance. The GPU remains useful for:

  • Future PCIe 5.0+ systems with higher bandwidth
  • Workloads where CPU is saturated

Latency

Operation | P50 | P99 | P99.9
LZ4 compress (4KB) | 45 µs | 85 µs | 120 µs
LZ4 decompress (4KB) | 38 µs | 72 µs | 95 µs
Same-fill detect | 8 µs | 15 µs | 25 µs

Memory Usage

Component | Memory
Hash table (LZ4) | 64 KB
Working buffer | 16 KB
ZSTD context | 256 KB

Comparison with Linux Kernel zram

Validated 2026-01-06 on RTX 4090 + AMD Threadripper 7960X (AVX-512)

Block Device I/O (fio, Direct I/O)

Metric | Kernel ZRAM | trueno-ublk | Speedup
Sequential Read | 9.2 GB/s | 16.5 GB/s | 1.8x
Random 4K IOPS | 55K | 249K | 4.5x
Compression Ratio | 2.5x | 3.87x | +55%

Compression Engine (cargo examples)

Metric | Linux Kernel zram | trueno-zram | Speedup
Compress (parallel) | 3-5 GB/s | 20-30 GB/s | 5-6x
Decompress (CPU) | ~10 GB/s | 50+ GB/s | 5x
Same-fill detection | ~8 GB/s | 22 GB/s | 2.75x
Compression ratio | 2.5x | 3.87x | +55%

Architecture Difference

Aspect | Linux Kernel | trueno-zram
Threading | Single-threaded per page | Parallel (rayon)
SIMD | Limited | AVX-512/AVX2/NEON
Batch processing | No | Yes (5000+ pages)
GPU offload | No | Optional CUDA

Falsification Testing (2026-01-06)

Test Configuration:
├── Hardware: AMD Threadripper 7960X, 125GB RAM, RTX 4090
├── Device: trueno-ublk 8GB device
└── Tool: fio, cargo examples

Results:
├── Sequential I/O: 16.5 GB/s ✓ (claim: 12.5 GB/s)
├── Random IOPS: 249K ✓ (claim: 228K)
├── Compression: 30.66 GB/s ✓ (claim: 20-24 GB/s)
├── Ratio: 3.87x ✓ (claim: 3.7x)
├── mlock: 272 MB ✓ (claim: >100 MB)
├── Stress test: No deadlock ✓
└── Status: 6/8 PASS (GPU claims deprecated)

Why trueno-zram is Faster

  1. SIMD vectorization: AVX-512 processes 64 bytes per instruction vs byte-by-byte
  2. Parallel compression: All CPU cores compress simultaneously via rayon
  3. Batch amortization: Setup costs spread across thousands of pages
  4. Cache efficiency: Sequential memory access patterns
  5. Zero-copy paths: Same-fill pages detected without compression attempt

Tuning Guide

Algorithm Selection

LZ4 vs ZSTD

Workload | Recommended | Reason
Real-time | LZ4 | Lowest latency
Memory-constrained | ZSTD | Better ratio
Mixed content | Adaptive | Auto-selects
Highly compressible | ZSTD-3 | Best ratio

Code Example

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};

// For latency-sensitive workloads
let fast = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// For memory-constrained systems
let compact = CompressorBuilder::new()
    .algorithm(Algorithm::Zstd { level: 3 })
    .build()?;

// For unknown workloads
let adaptive = CompressorBuilder::new()
    .algorithm(Algorithm::Adaptive)
    .build()?;
}

SIMD Backend Selection

The library auto-detects the best backend, but you can force one:

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, SimdBackend};

// Force AVX2 (e.g., for testing)
let compressor = CompressorBuilder::new()
    .prefer_backend(SimdBackend::Avx2)
    .build()?;
}

GPU Batch Sizing

Optimal Batch Size

GPU | L2 Cache | Optimal Batch
H100 | 50 MB | 10,240 pages
A100 | 40 MB | 8,192 pages
RTX 4090 | 72 MB | 14,745 pages
RTX 3090 | 6 MB | 1,200 pages

Calculation

#![allow(unused)]
fn main() {
fn optimal_batch_size(l2_cache_bytes: usize) -> usize {
    // Each page needs ~3KB working memory
    // Target ~60% L2 cache utilization (consistent with the batch table)
    (l2_cache_bytes * 60 / 100) / (3 * 1024)
}
}
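As a cross-check, the per-GPU batch sizes in the table follow from ~3 KB of working memory per page at roughly 60% L2 utilization; the 60% figure is inferred from the table values themselves:

```rust
/// Batch size from L2 cache capacity: ~3 KB working memory per
/// page, targeting ~60% cache utilization. The 60% target is an
/// inference from the published per-GPU batch sizes.
fn batch_for_l2(l2_cache_bytes: usize) -> usize {
    (l2_cache_bytes * 60 / 100) / (3 * 1024)
}
```

For the H100's 50 MB L2 this gives 10,240 pages; for the A100's 40 MB, 8,192.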

Async DMA

Enable async DMA for overlapping transfers:

#![allow(unused)]
fn main() {
let config = GpuBatchConfig {
    async_dma: true,
    ring_buffer_slots: 4,  // Pipeline depth
    ..Default::default()
};
}

Benefits:

  • Overlaps H2D, compute, D2H
  • 20-30% throughput improvement
  • Higher GPU utilization

Memory Configuration

Working Memory

#![allow(unused)]
fn main() {
// Per-thread memory usage
const LZ4_HASH_TABLE: usize = 64 * 1024;   // 64 KB
const ZSTD_CONTEXT: usize = 256 * 1024;     // 256 KB
const WORKING_BUFFER: usize = 16 * 1024;    // 16 KB
}

Reducing Memory

For memory-constrained systems:

#![allow(unused)]
fn main() {
// Use LZ4 (smaller context)
.algorithm(Algorithm::Lz4)

// Smaller batch sizes
let config = GpuBatchConfig {
    batch_size: 100,
    ..Default::default()
};
}

Monitoring Performance

#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// Process pages...

let stats = compressor.stats();

// Check throughput
if stats.throughput_gbps() < 3.0 {
    println!("Warning: Low throughput, check CPU affinity");
}

// Check ratio
if stats.ratio() < 1.5 {
    println!("Warning: Low compression, data may be incompressible");
}
}

CPU Affinity

For best performance, pin compression threads to physical cores:

# Linux: Pin to cores 0-3
taskset -c 0-3 ./my_app

# Check NUMA topology
numactl --hardware

Kernel Parameters

For zram integration:

# Increase zram size
echo $((8 * 1024 * 1024 * 1024)) > /sys/block/zram0/disksize

# Set compression streams
echo 4 > /sys/block/zram0/max_comp_streams

# Enable writeback
echo 1 > /sys/block/zram0/writeback

PCIe 5x Rule

The PCIe 5x rule determines when GPU offload is beneficial for compression.

The Rule

GPU beneficial when: T_compute > 5 × T_transfer

Where:

  • T_compute = CPU computation time
  • T_transfer = PCIe transfer time (H2D + D2H)

Why 5x?

GPU offload has overhead:

  1. H2D transfer: Copy data to GPU
  2. Kernel launch: ~5-10us overhead
  3. D2H transfer: Copy results back
  4. Synchronization: Wait for completion

The 5x factor accounts for these overheads and ensures that GPU offload yields a net benefit.

Calculation

#![allow(unused)]
fn main() {
use trueno_zram_core::PAGE_SIZE;

fn check_gpu_benefit(
    pages: usize,
    pcie_bandwidth_gbps: f64,
    gpu_throughput_gbps: f64,
) -> bool {
    let data_bytes = pages * PAGE_SIZE;

    // Transfer time (round trip)
    let transfer_time = 2.0 * data_bytes as f64 / (pcie_bandwidth_gbps * 1e9);

    // GPU compute time
    let gpu_time = data_bytes as f64 / (gpu_throughput_gbps * 1e9);

    // CPU compute time (assume 4 GB/s baseline)
    let cpu_time = data_bytes as f64 / (4e9);

    // GPU beneficial if saves time
    cpu_time > (transfer_time + gpu_time) * 1.2  // 20% margin
}
}
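The transfer-time column in the tables below follows directly from the round-trip formula above. A self-contained version of that arithmetic (the tables' GPU-time column is consistent with a ~100 GB/s kernel rate, which is an inference from the figures, not a documented spec):

```rust
const PAGE_SIZE: usize = 4096;

/// Round-trip PCIe time (H2D + D2H) in seconds for `pages` 4 KB pages.
fn transfer_time_s(pages: usize, pcie_gbps: f64) -> f64 {
    2.0 * (pages * PAGE_SIZE) as f64 / (pcie_gbps * 1e9)
}

/// One-way compute (or copy) time in seconds at a given throughput.
fn compute_time_s(pages: usize, gbps: f64) -> f64 {
    (pages * PAGE_SIZE) as f64 / (gbps * 1e9)
}
```

At PCIe 4.0 x16 (25 GB/s), 10,000 pages take about 3.28 ms round trip, matching the ~3.2 ms entry in the table.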

Examples

PCIe 4.0 x16 (25 GB/s)

Batch | Data Size | Transfer | GPU Time | Beneficial?
100 | 400 KB | 32 µs | 4 µs | No
1,000 | 4 MB | 320 µs | 40 µs | Marginal
10,000 | 40 MB | 3.2 ms | 400 µs | Yes
100,000 | 400 MB | 32 ms | 4 ms | Yes

PCIe 5.0 x16 (64 GB/s)

Batch | Data Size | Transfer | GPU Time | Beneficial?
1,000 | 4 MB | 125 µs | 40 µs | No
10,000 | 40 MB | 1.25 ms | 400 µs | Yes
100,000 | 400 MB | 12.5 ms | 4 ms | Yes

Checking in Code

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig};

let mut compressor = GpuBatchCompressor::new(config)?;
let result = compressor.compress_batch(&pages)?;

if result.pcie_rule_satisfied() {
    println!("GPU offload was beneficial");
    println!("Kernel time: {} ns", result.kernel_time_ns);
    println!("Transfer time: {} ns",
        result.h2d_time_ns + result.d2h_time_ns);
} else {
    println!("Consider using CPU for this batch size");
}
}

Backend Selection

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{select_backend, BackendSelection, gpu_available};

let batch_size = 5000;
let has_gpu = gpu_available();

match select_backend(batch_size, has_gpu) {
    BackendSelection::Gpu => {
        // Use GPU batch compression
    }
    BackendSelection::Simd => {
        // Use CPU SIMD compression
    }
    BackendSelection::Scalar => {
        // Use scalar compression
    }
}
}

Optimization Tips

  1. Batch larger: Combine small batches into larger ones
  2. Use async DMA: Overlap transfers with computation
  3. Profile first: Measure actual transfer times
  4. Consider hybrid: Use CPU for small batches, GPU for large

Hardware-Specific Thresholds

GPU | PCIe | Min Beneficial Batch
H100 | 5.0 x16 | 5,000 pages
A100 | 4.0 x16 | 8,000 pages
RTX 4090 | 4.0 x16 | 10,000 pages
RTX 3090 | 4.0 x16 | 12,000 pages

Design Overview

Architecture

┌─────────────────────────────────────────────────────────┐
│                    Public API                           │
│  CompressorBuilder, Algorithm, CompressedPage           │
├─────────────────────────────────────────────────────────┤
│                 Algorithm Selection                      │
│  ┌─────────┐  ┌─────────┐  ┌──────────┐  ┌─────────┐   │
│  │   LZ4   │  │  ZSTD   │  │ Adaptive │  │Samefill │   │
│  └────┬────┘  └────┬────┘  └────┬─────┘  └────┬────┘   │
├───────┼────────────┼────────────┼─────────────┼────────┤
│       │     SIMD Dispatch       │             │         │
│  ┌────▼────┐  ┌────▼────┐  ┌───▼───┐  ┌─────▼─────┐   │
│  │ AVX-512 │  │  AVX2   │  │ NEON  │  │  Scalar   │   │
│  └─────────┘  └─────────┘  └───────┘  └───────────┘   │
├─────────────────────────────────────────────────────────┤
│                   GPU Backend (Optional)                 │
│  ┌─────────────────────────────────────────────────┐   │
│  │  CUDA Batch Compressor (trueno-gpu PTX)         │   │
│  │  ├── H2D Transfer                               │   │
│  │  ├── Warp-Cooperative LZ4 Kernel                │   │
│  │  └── D2H Transfer                               │   │
│  └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘

Crate Structure

trueno-zram/
├── crates/
│   ├── trueno-zram-core/     # Core compression library
│   │   ├── src/
│   │   │   ├── lib.rs        # Public API
│   │   │   ├── error.rs      # Error types
│   │   │   ├── page.rs       # CompressedPage
│   │   │   ├── lz4/          # LZ4 implementation
│   │   │   ├── zstd/         # ZSTD implementation
│   │   │   ├── gpu/          # GPU batch compression
│   │   │   ├── simd/         # SIMD detection/dispatch
│   │   │   ├── samefill.rs   # Same-fill detection
│   │   │   ├── compat.rs     # Kernel compatibility
│   │   │   └── benchmark.rs  # Benchmarking utilities
│   │   └── examples/
│   ├── trueno-zram-adaptive/ # ML-driven selection
│   ├── trueno-zram-generator/# systemd integration
│   └── trueno-zram-cli/      # CLI tool
└── bins/
    └── trueno-ublk/          # ublk daemon

Key Design Decisions

1. Runtime SIMD Dispatch

CPU features are detected at runtime, not compile time:

#![allow(unused)]
fn main() {
// Detection happens once at startup
let features = simd::detect();

// Dispatch based on available features
if features.has_avx512() {
    lz4::avx512::compress(input, output)
} else if features.has_avx2() {
    lz4::avx2::compress(input, output)
} else {
    lz4::scalar::compress(input, output)
}
}

2. Page-Based Compression

All compression operates on fixed 4KB pages:

#![allow(unused)]
fn main() {
pub const PAGE_SIZE: usize = 4096;

// This is enforced at the type level
pub fn compress(page: &[u8; PAGE_SIZE]) -> Result<CompressedPage>;
}

3. Builder Pattern

Configuration via builder pattern:

#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .prefer_backend(SimdBackend::Avx512)
    .build()?;
}

4. Trait-Based Abstraction

The PageCompressor trait enables polymorphism:

#![allow(unused)]
fn main() {
pub trait PageCompressor {
    fn compress(&self, page: &[u8; PAGE_SIZE]) -> Result<CompressedPage>;
    fn decompress(&self, page: &CompressedPage) -> Result<[u8; PAGE_SIZE]>;
}
}

5. Zero-Copy Where Possible

Minimize allocations in hot paths:

#![allow(unused)]
fn main() {
// Output buffer passed in, not allocated
fn compress_into(input: &[u8], output: &mut [u8]) -> Result<usize>;
}

6. No Panics in Library Code

All errors are returned as Result:

#![allow(unused)]
#![deny(clippy::panic)]
#![deny(clippy::unwrap_used)]
fn main() {
}

Dependencies

Crate | Purpose
thiserror | Error derive macros
cudarc | CUDA driver bindings
rayon | Parallel iteration
trueno-gpu | Pure Rust PTX generation

Feature Flags

Flag | Default | Description
std | Yes | Standard library
nightly | No | Nightly SIMD features
cuda | No | CUDA GPU support

SIMD Dispatch

Overview

trueno-zram uses runtime CPU feature detection to select the optimal SIMD implementation.

Detection

#![allow(unused)]
fn main() {
// src/simd/detect.rs

pub fn detect() -> SimdFeatures {
    #[cfg(target_arch = "x86_64")]
    {
        SimdFeatures {
            avx512: is_x86_feature_detected!("avx512f")
                && is_x86_feature_detected!("avx512bw")
                && is_x86_feature_detected!("avx512vl"),
            avx2: is_x86_feature_detected!("avx2"),
            sse42: is_x86_feature_detected!("sse4.2"),
        }
    }

    #[cfg(target_arch = "aarch64")]
    {
        SimdFeatures {
            neon: true, // Always available on AArch64
        }
    }
}
}

Dispatch Pattern

#![allow(unused)]
fn main() {
// src/simd/dispatch.rs

pub fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
    let features = detect();

    #[cfg(target_arch = "x86_64")]
    {
        if features.has_avx512() {
            return unsafe { avx512::compress(input, output) };
        }
        if features.has_avx2() {
            return unsafe { avx2::compress(input, output) };
        }
    }

    #[cfg(target_arch = "aarch64")]
    {
        if features.has_neon() {
            return unsafe { neon::compress(input, output) };
        }
    }

    scalar::compress(input, output)
}
}

Implementation Structure

Each algorithm has separate implementations:

lz4/
├── mod.rs          # Public API, dispatch
├── compress.rs     # Core algorithm logic
├── decompress.rs   # Decompression
├── avx512.rs       # AVX-512 specialization
├── avx2.rs         # AVX2 specialization
├── neon.rs         # ARM NEON specialization
└── scalar.rs       # Fallback implementation

AVX-512 Implementation

#![allow(unused)]
fn main() {
// lz4/avx512.rs

#[target_feature(enable = "avx512f,avx512bw,avx512vl")]
pub unsafe fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
    // 64-byte hash table lookups
    let hash_chunk = _mm512_loadu_si512(input.as_ptr());

    // Parallel match finding
    let matches = _mm512_cmpeq_epi32(hash_chunk, pattern);

    // Wide literal copies
    _mm512_storeu_si512(output.as_mut_ptr(), data);

    // ...
}
}

AVX2 Implementation

#![allow(unused)]
fn main() {
// lz4/avx2.rs

#[target_feature(enable = "avx2")]
pub unsafe fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
    // 32-byte operations
    let chunk = _mm256_loadu_si256(input.as_ptr());

    // Match extension
    let cmp = _mm256_cmpeq_epi8(src, dst);
    let mask = _mm256_movemask_epi8(cmp);

    // ...
}
}

NEON Implementation

#![allow(unused)]
fn main() {
// lz4/neon.rs

#[cfg(target_arch = "aarch64")]
pub unsafe fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
    use std::arch::aarch64::*;

    // 16-byte operations
    let chunk = vld1q_u8(input.as_ptr());

    // Parallel comparison
    let cmp = vceqq_u8(src, dst);

    // ...
}
}

Benchmarking Dispatch

#![allow(unused)]
fn main() {
use trueno_zram_core::simd::detect;
use trueno_zram_core::PAGE_SIZE;

fn benchmark_all_backends() {
    let features = detect();
    let input = [0xAAu8; PAGE_SIZE];
    let mut output = [0u8; PAGE_SIZE * 2];

    // Benchmark available backends
    if features.has_avx512() {
        bench("AVX-512", || avx512::compress(&input, &mut output));
    }
    if features.has_avx2() {
        bench("AVX2", || avx2::compress(&input, &mut output));
    }
    bench("Scalar", || scalar::compress(&input, &mut output));
}
}

Compile-Time Optimization

For maximum performance, enable target features:

# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]

Or for specific features:

rustflags = ["-C", "target-feature=+avx2,+avx512f,+avx512bw"]

GPU Pipeline

Overview

The GPU pipeline uses CUDA for batch compression with async DMA.

┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐
│   Host     │───▶│    H2D     │───▶│   Kernel   │───▶│    D2H     │
│   Memory   │    │  Transfer  │    │  Execution │    │  Transfer  │
└────────────┘    └────────────┘    └────────────┘    └────────────┘
      ▲                                                      │
      └──────────────────────────────────────────────────────┘

Pipeline Stages

1. Host-to-Device Transfer

#![allow(unused)]
fn main() {
fn transfer_to_device(&self, pages: &[[u8; PAGE_SIZE]]) -> Result<CudaSlice<u8>> {
    // Flatten pages into contiguous buffer
    let total_bytes = pages.len() * PAGE_SIZE;
    let mut flat_data = Vec::with_capacity(total_bytes);
    for page in pages {
        flat_data.extend_from_slice(page);
    }

    // Async copy to device
    let device_buffer = self.stream.clone_htod(&flat_data)?;
    Ok(device_buffer)
}
}

2. Kernel Execution

#![allow(unused)]
fn main() {
fn execute_kernel(&self, input: &CudaSlice<u8>, batch_size: u32) -> Result<CudaSlice<u8>> {
    // Allocate output buffer
    let mut output = self.stream.alloc_zeros::<u8>(batch_size as usize * PAGE_SIZE)?;
    let mut sizes = self.stream.alloc_zeros::<u32>(batch_size as usize)?;

    // Launch kernel
    // Grid: ceil(batch_size / 4) blocks
    // Block: 128 threads (4 warps)
    unsafe {
        self.stream
            .launch_builder(&self.kernel_fn)
            .arg(&input)
            .arg(&mut output)
            .arg(&mut sizes)
            .arg(&batch_size)
            .launch(cfg)?;
    }

    self.stream.synchronize()?;
    Ok(output)
}
}

3. Device-to-Host Transfer

#![allow(unused)]
fn main() {
fn transfer_from_device(&self, data: CudaSlice<u8>) -> Result<Vec<CompressedPage>> {
    let output = self.stream.clone_dtoh(&data)?;

    // Convert to CompressedPage structures
    // ...
}
}

Warp-Cooperative Kernel

The LZ4 kernel uses warp-cooperative compression:

Block (128 threads)
├── Warp 0 (32 threads) → Page 0
├── Warp 1 (32 threads) → Page 1
├── Warp 2 (32 threads) → Page 2
└── Warp 3 (32 threads) → Page 3

Shared Memory Layout

Shared Memory (48 KB per block)
├── Warp 0: 12 KB (hash table + working)
├── Warp 1: 12 KB
├── Warp 2: 12 KB
└── Warp 3: 12 KB

PTX Generation

Using trueno-gpu for pure Rust PTX:

#![allow(unused)]
fn main() {
use trueno_gpu::kernels::lz4::Lz4WarpCompressKernel;
use trueno_gpu::kernels::Kernel;

let kernel = Lz4WarpCompressKernel::new(65536);
let ptx_string = kernel.emit_ptx();

// Load PTX into CUDA module
let ptx = Ptx::from(ptx_string);
let module = context.load_module(ptx)?;
}

Async DMA Ring Buffer

#![allow(unused)]
fn main() {
struct AsyncPipeline {
    slots: Vec<PipelineSlot>,
    head: usize,
    tail: usize,
}

struct PipelineSlot {
    input_buffer: CudaSlice<u8>,
    output_buffer: CudaSlice<u8>,
    event: CudaEvent,
    state: SlotState,
}

enum SlotState {
    Free,
    H2DInProgress,
    KernelInProgress,
    D2HInProgress,
    Complete,
}
}

Pipelining

Time ──────────────────────────────────────────────▶

Slot 0: [H2D][Kernel][D2H]
Slot 1:      [H2D][Kernel][D2H]
Slot 2:           [H2D][Kernel][D2H]
Slot 3:                [H2D][Kernel][D2H]

Performance Optimization

1. Pinned Memory

#![allow(unused)]
fn main() {
// Use pinned (page-locked) memory for faster transfers
let pinned_buffer = cuda_malloc_host(size)?;
}

2. Stream Overlap

#![allow(unused)]
fn main() {
// Use separate streams for H2D, compute, D2H
let h2d_stream = context.create_stream()?;
let compute_stream = context.create_stream()?;
let d2h_stream = context.create_stream()?;
}

3. Kernel Occupancy

#![allow(unused)]
fn main() {
// Optimal: 4 warps per block = 128 threads
// Shared memory: 48 KB (12 KB per warp)
// Registers: ~32 per thread
}

Error Handling

#![allow(unused)]
fn main() {
match kernel_result {
    Ok(output) => process_output(output),
    Err(CudaError::IllegalAddress) => {
        // Memory access violation
        fallback_to_cpu()?
    }
    Err(CudaError::LaunchFailed) => {
        // Kernel launch failed
        fallback_to_cpu()?
    }
    Err(e) => Err(Error::GpuNotAvailable(e.to_string())),
}
}

trueno-ublk Overview

trueno-ublk is a GPU-accelerated ZRAM replacement that uses the Linux ublk interface to provide a high-performance compressed block device in userspace.

Production Status (2026-01-06)

MILESTONE DT-005 ACHIEVED: trueno-ublk is running as system swap!

  • 8GB device active as primary swap (priority 150)
  • CPU SIMD compression at 20-30 GB/s with 3.87x ratio
  • CPU parallel decompression at 50+ GB/s

DT-007 COMPLETED: Swap deadlock issue FIXED via mlock() - daemon memory pinned.

Known Limitations:

  • Docker cannot isolate ublk devices (host kernel resources)

What is ublk?

ublk (userspace block device) is a Linux kernel interface that allows implementing block devices entirely in userspace. It uses io_uring for efficient I/O handling, avoiding the overhead of kernel context switches.

Architecture

┌─────────────────────────────────────────────────────────┐
│                    Applications                          │
├─────────────────────────────────────────────────────────┤
│                    Filesystem                            │
├─────────────────────────────────────────────────────────┤
│                /dev/ublkbN block device                  │
├─────────────────────────────────────────────────────────┤
│                    Linux Kernel                          │
│                   (ublk driver)                          │
├─────────────────────────────────────────────────────────┤
│                    io_uring                              │
├─────────────────────────────────────────────────────────┤
│                  trueno-ublk                             │
│  ┌───────────────────────────────────────────────────┐  │
│  │              Entropy Router                        │  │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────────────┐   │  │
│  │  │  GPU    │  │  SIMD   │  │     Scalar      │   │  │
│  │  │ (batch) │  │ (AVX2)  │  │ (incompressible)│   │  │
│  │  └─────────┘  └─────────┘  └─────────────────┘   │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │          trueno-zram-core (LZ4/ZSTD)              │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

Features

SIMD-Accelerated Compression

Uses trueno-zram-core for vectorized LZ4/ZSTD compression:

  • AVX2 (256-bit) on modern x86_64
  • AVX-512 (512-bit) on supported CPUs
  • NEON (128-bit) on ARM64

GPU Batch Compression

Offloads compression to CUDA GPUs when beneficial:

  • Warp-cooperative LZ4 kernel
  • PCIe 5x rule evaluation
  • Async DMA for overlap

Entropy-Based Routing

Analyzes data entropy to choose the optimal compression path:

  • Low entropy (< 4.0 bits): GPU batch compression
  • Medium entropy (4-7 bits): SIMD compression
  • High entropy (> 7.0 bits): Scalar or skip compression

Zero-Page Deduplication

Automatically detects and deduplicates all-zero pages, achieving 2048:1 compression for sparse data.

zram-Compatible Statistics

Exports statistics in the same format as kernel zram, enabling drop-in monitoring compatibility.

CLI Usage

# Create a 1TB compressed RAM disk
trueno-ublk create -s 1T -a lz4 --gpu

# List devices
trueno-ublk list

# Show statistics
trueno-ublk stat /dev/ublkb0

# Interactive dashboard
trueno-ublk top

# Analyze data entropy
trueno-ublk entropy /dev/ublkb0

Requirements

  • Linux kernel 6.0+ with ublk support
  • CAP_SYS_ADMIN capability (for device creation)
  • Optional: NVIDIA GPU with CUDA support

Block Device API

trueno-ublk provides a BlockDevice type for creating compressed block devices in pure Rust, without requiring kernel privileges.

Basic Usage

#![allow(unused)]
fn main() {
use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};

// Create a compressor
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// Create a 1GB block device
let mut device = BlockDevice::new(1 << 30, compressor);

// Write data (must be page-aligned)
let data = vec![0xAB; PAGE_SIZE];
device.write(0, &data)?;

// Read back
let mut buf = vec![0u8; PAGE_SIZE];
device.read(0, &mut buf)?;
assert_eq!(data, buf);
}

Page Size

All I/O operations must be aligned to PAGE_SIZE (4096 bytes):

#![allow(unused)]
fn main() {
use trueno_zram_core::PAGE_SIZE;

// Write at page boundaries
device.write(0, &data)?;                    // Page 0
device.write(PAGE_SIZE as u64, &data)?;     // Page 1
device.write(2 * PAGE_SIZE as u64, &data)?; // Page 2
}
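Because all I/O is page-granular, an offset and length must both be multiples of PAGE_SIZE. A hypothetical validation helper (not the crate's API) that maps a request onto the page indices it covers:

```rust
const PAGE_SIZE: u64 = 4096;

/// Validate page alignment and return the half-open page-index
/// range [first, last) covered by an I/O request.
/// Hypothetical helper for illustration, not the crate's API.
fn page_range(offset: u64, len: u64) -> Result<(u64, u64), String> {
    if offset % PAGE_SIZE != 0 || len % PAGE_SIZE != 0 {
        return Err(format!("unaligned I/O: offset={offset} len={len}"));
    }
    let first = offset / PAGE_SIZE;
    Ok((first, first + len / PAGE_SIZE))
}
```

A 4 KB write at offset 8192 touches exactly page 2; any unaligned request is rejected.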

Statistics

Track compression performance with BlockDeviceStats:

#![allow(unused)]
fn main() {
let stats = device.stats();

println!("Pages stored:      {}", stats.pages_stored);
println!("Bytes written:     {}", stats.bytes_written);
println!("Bytes compressed:  {}", stats.bytes_compressed);
println!("Compression ratio: {:.2}x", stats.compression_ratio());
println!("Zero pages:        {}", stats.zero_pages);
println!("GPU pages:         {}", stats.gpu_pages);
println!("SIMD pages:        {}", stats.simd_pages);
println!("Scalar pages:      {}", stats.scalar_pages);
}

Entropy Threshold

Configure entropy-based routing with with_entropy_threshold:

#![allow(unused)]
fn main() {
// Use threshold of 7.0 bits/byte
let device = BlockDevice::with_entropy_threshold(
    1 << 30,         // 1GB
    compressor,
    7.0,             // Entropy threshold
);
}

Data with entropy above this threshold is routed to the scalar path (assumed incompressible).

Discard Operation

Free pages that are no longer needed:

#![allow(unused)]
fn main() {
// Discard page 0
device.discard(0, PAGE_SIZE as u64)?;

// Reading discarded pages returns zeros
let mut buf = vec![0xFFu8; PAGE_SIZE];
device.read(0, &mut buf)?;
assert!(buf.iter().all(|&b| b == 0));
}

UblkDevice (Kernel Interface)

For creating actual block devices visible to the system, use UblkDevice:

#![allow(unused)]
fn main() {
use trueno_ublk::{UblkDevice, DeviceConfig};
use trueno_zram_core::Algorithm;

let config = DeviceConfig {
    size: 1 << 40,  // 1TB
    algorithm: Algorithm::Lz4,
    streams: 4,
    gpu_enabled: true,
    mem_limit: Some(8 << 30),  // 8GB RAM limit
    backing_dev: None,
    writeback_limit: None,
    entropy_skip_threshold: 7.5,
    gpu_batch_size: 1024,
};

// Requires CAP_SYS_ADMIN
let device = UblkDevice::create(config)?;
println!("Created device: /dev/ublkb{}", device.id());
}

DeviceStats

The DeviceStats struct provides zram-compatible statistics:

#![allow(unused)]
fn main() {
let stats = device.stats();

// mm_stat fields (zram compatible)
println!("Original data:    {} bytes", stats.orig_data_size);
println!("Compressed data:  {} bytes", stats.compr_data_size);
println!("Memory used:      {} bytes", stats.mem_used_total);
println!("Same pages:       {}", stats.same_pages);
println!("Huge pages:       {}", stats.huge_pages);

// io_stat fields
println!("Failed reads:     {}", stats.failed_reads);
println!("Failed writes:    {}", stats.failed_writes);

// trueno-ublk extensions
println!("GPU pages:        {}", stats.gpu_pages);
println!("SIMD pages:       {}", stats.simd_pages);
println!("Throughput:       {:.2} GB/s", stats.throughput_gbps);
println!("Avg entropy:      {:.2} bits", stats.avg_entropy);
println!("SIMD backend:     {}", stats.simd_backend);
}

Entropy Routing

trueno-ublk uses Shannon entropy analysis to route data to the optimal compression backend.

Shannon Entropy

Shannon entropy measures the information density of data in bits per byte. The formula is:

H(X) = -Σ p(x) * log2(p(x))

Where p(x) is the probability of each byte value occurring.

Entropy (bits/byte) | Data Type | Compressibility
0.0 | All same value | Extremely high
1.0-3.0 | Simple patterns | Very high
4.0-6.0 | Text, code | Good
7.0-7.5 | Compressed data | Poor
8.0 | Random/encrypted | None
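The values in this table can be reproduced by applying the formula directly to a byte histogram; the production path presumably vectorizes the counting, but a straightforward sketch is:

```rust
/// Shannon entropy in bits per byte: H(X) = -sum p(x) * log2(p(x)).
fn shannon_entropy(data: &[u8]) -> f64 {
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}
```

An all-same-value page scores 0.0 and a uniform byte distribution scores 8.0, matching the table's endpoints.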

Routing Strategy

trueno-ublk routes pages based on their entropy:

┌──────────────────────────────────────────────────────────┐
│                    Incoming Page                          │
│                         │                                 │
│                    Calculate Entropy                      │
│                         │                                 │
│           ┌─────────────┼─────────────┐                  │
│           ▼             ▼             ▼                  │
│      entropy < 4.0  4.0 ≤ e ≤ 7.0  entropy > 7.0        │
│           │             │             │                  │
│           ▼             ▼             ▼                  │
│      GPU Batch      SIMD Path    Scalar/Skip            │
│    (highly comp.)  (normal data)  (incompress.)         │
└──────────────────────────────────────────────────────────┘

GPU Batch Path (entropy < 4.0)

  • Used for highly compressible data
  • Benefits from GPU parallelism
  • Batches multiple pages for efficiency
  • Best for: zeros, repeating patterns, sparse data

SIMD Path (4.0 ≤ entropy ≤ 7.0)

  • Used for typical data
  • AVX2/AVX-512/NEON acceleration
  • Single-page processing
  • Best for: text, code, structured data

Scalar/Skip Path (entropy > 7.0)

  • Used for incompressible data
  • Avoids wasting CPU cycles
  • May store uncompressed
  • Best for: encrypted data, already-compressed media
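The three paths reduce to a threshold match on the measured entropy. A sketch using the thresholds from the diagram above; `Route` is a stand-in enum for illustration:

```rust
#[derive(Debug, PartialEq)]
enum Route {
    GpuBatch, // entropy < 4.0: highly compressible
    Simd,     // 4.0 ..= 7.0: typical data
    Scalar,   // entropy > 7.0: likely incompressible
}

/// Route a page by Shannon entropy (bits/byte), using the
/// thresholds from the routing diagram.
fn route_page(entropy: f64) -> Route {
    if entropy < 4.0 {
        Route::GpuBatch
    } else if entropy <= 7.0 {
        Route::Simd
    } else {
        Route::Scalar
    }
}
```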

Configuration

Set the entropy threshold when creating a device:

#![allow(unused)]
fn main() {
use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder};

let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// A lower threshold skips compression sooner: pages with entropy
// above it are routed to the scalar/skip path
let device = BlockDevice::with_entropy_threshold(
    1 << 30,    // Size
    compressor,
    6.5,        // Custom threshold (default is 7.0)
);
}

Monitoring Entropy

Use the CLI to analyze device entropy:

# Show entropy distribution
trueno-ublk entropy /dev/ublkb0

# Output:
# Entropy Distribution:
#   0.0-2.0 bits: ████████████████████ 45% (GPU)
#   2.0-4.0 bits: ████████ 18% (GPU)
#   4.0-6.0 bits: ██████████ 22% (SIMD)
#   6.0-7.0 bits: ████ 10% (SIMD)
#   7.0-8.0 bits: ██ 5% (Scalar)
#
# Average entropy: 3.2 bits/byte
# Recommended threshold: 7.0
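The distribution above can be reproduced by bucketing per-page entropies into the same five ranges the CLI prints. A minimal sketch (the function name is hypothetical; only the bucket boundaries come from the output above):

```rust
/// Count pages per entropy bucket: [0-2), [2-4), [4-6), [6-7), [7-8].
fn entropy_histogram(entropies: &[f64]) -> [usize; 5] {
    let mut buckets = [0usize; 5];
    for &e in entropies {
        let idx = if e < 2.0 {
            0
        } else if e < 4.0 {
            1
        } else if e < 6.0 {
            2
        } else if e < 7.0 {
            3
        } else {
            4
        };
        buckets[idx] += 1;
    }
    buckets
}

fn main() {
    // Per-page entropies in bits/byte.
    let pages = [0.0, 1.5, 3.2, 4.8, 6.5, 7.9];
    let h = entropy_histogram(&pages);
    println!("{h:?}"); // [2, 1, 1, 1, 1]
}
```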

Statistics

Track routing decisions via stats:

#![allow(unused)]
fn main() {
let stats = device.stats();

println!("Routing Statistics:");
println!("  GPU pages:    {} ({:.1}%)",
    stats.gpu_pages,
    100.0 * stats.gpu_pages as f64 / stats.pages_stored as f64);
println!("  SIMD pages:   {} ({:.1}%)",
    stats.simd_pages,
    100.0 * stats.simd_pages as f64 / stats.pages_stored as f64);
println!("  Scalar pages: {} ({:.1}%)",
    stats.scalar_pages,
    100.0 * stats.scalar_pages as f64 / stats.pages_stored as f64);
println!("  Zero pages:   {} ({:.1}%)",
    stats.zero_pages,
    100.0 * stats.zero_pages as f64 / stats.pages_stored as f64);
}

Zero-Page Optimization

All-zero pages receive special handling regardless of entropy:

#![allow(unused)]
fn main() {
// Zero pages are detected and deduplicated
let zeros = vec![0u8; PAGE_SIZE];
device.write(0, &zeros)?;
device.write(PAGE_SIZE as u64, &zeros)?;
device.write(2 * PAGE_SIZE as u64, &zeros)?;

let stats = device.stats();
// All three pages share the same zero-page representation
assert_eq!(stats.zero_pages, 3);
}

This achieves effective 2048:1 compression for sparse data like freshly-allocated memory.
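Zero-page detection is a cheap pre-check that runs before any compression. The scalar sketch below illustrates the idea; the library's same-fill scan is SIMD-accelerated (the changelog quotes ~22 GB/s), and the 2048:1 figure is consistent with a 4096-byte page collapsing to roughly two bytes of metadata:

```rust
const PAGE_SIZE: usize = 4096;

/// Returns true if every byte in the page is zero.
/// Scalar sketch; the library's same-fill detection uses SIMD.
fn is_zero_page(page: &[u8]) -> bool {
    page.len() == PAGE_SIZE && page.iter().all(|&b| b == 0)
}

fn main() {
    let zeros = vec![0u8; PAGE_SIZE];
    let mut dirty = vec![0u8; PAGE_SIZE];
    dirty[PAGE_SIZE - 1] = 1; // a single nonzero byte disqualifies the page
    assert!(is_zero_page(&zeros));
    assert!(!is_zero_page(&dirty));
    println!("zero-page checks passed");
}
```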

Visualization & Observability

trueno-ublk v3.17.0 integrates with the renacer visualization framework for real-time monitoring, benchmarking, and distributed tracing.

Overview

The visualization system (VIZ-001/002/003/004) provides:

  • Real-time TUI dashboard - Monitor throughput, IOPS, and tier distribution
  • JSON/HTML reports - Export benchmark results for analysis
  • OTLP integration - Distributed tracing to Jaeger/Tempo

CLI Flags

Real-time Visualization (VIZ-002)

# Launch TUI dashboard (requires foreground mode)
sudo trueno-ublk create --size 8G --backend tiered \
    --visualize \
    --foreground

Benchmark Reports (VIZ-003)

# Text output (default)
trueno-ublk benchmark --size 4G --format text

# JSON output (machine-readable)
trueno-ublk benchmark --size 4G --format json > results.json

# HTML report with charts
trueno-ublk benchmark --size 4G --format html -o report.html

# Include ML anomaly detection
trueno-ublk benchmark --size 4G --format json --ml-anomaly

OTLP Tracing (VIZ-004)

# Export traces to Jaeger
sudo trueno-ublk create --size 8G \
    --otlp-endpoint http://localhost:4317 \
    --otlp-service-name trueno-ublk

JSON Schema

The benchmark JSON output follows the trueno-renacer-v1 schema:

{
  "version": "3.17.0",
  "format": "trueno-renacer-v1",
  "benchmark": {
    "workload": "sequential",
    "duration_sec": 60,
    "size_bytes": 4294967296,
    "backend": "tiered",
    "iterations": 3
  },
  "metrics": {
    "throughput_gbps": 7.9,
    "iops": 666000,
    "compression_ratio": 2.8,
    "same_fill_pages": 1048576,
    "tier_distribution": {
      "kernel_zram": 0.65,
      "simd_zstd": 0.25,
      "nvme_direct": 0.05,
      "same_fill": 0.05
    },
    "entropy_histogram": {
      "p50": 4.2,
      "p75": 5.8,
      "p90": 6.9,
      "p95": 7.3,
      "p99": 7.8
    }
  },
  "ml_analysis": {
    "anomalies": [],
    "clusters": 3,
    "silhouette_score": 0.82
  }
}

TruenoCollector API

For programmatic access, use TruenoCollector, which implements renacer’s Collector trait:

#![allow(unused)]
fn main() {
use trueno_ublk::visualize::TruenoCollector;
use renacer::visualize::collectors::{Collector, MetricValue};
use std::sync::Arc;

// Create collector from TieredPageStore
let collector = TruenoCollector::new(Arc::clone(&store));

// Collect metrics
let metrics = collector.collect()?;

// Access individual metrics
if let Some(MetricValue::Gauge(throughput)) = metrics.get("throughput_gbps") {
    println!("Throughput: {:.1} GB/s", throughput);
}
}

Metrics Reference

Metric               | Type    | Description
throughput_gbps      | Gauge   | Current I/O throughput in GB/s
iops                 | Rate    | Operations per second
compression_ratio    | Gauge   | Overall compression ratio
pages_total          | Counter | Total pages stored
same_fill_pages      | Counter | Pages detected as same-fill
tier_kernel_zram_pct | Gauge   | % of pages in kernel ZRAM tier
tier_simd_pct        | Gauge   | % of pages in SIMD ZSTD tier
tier_skip_pct        | Gauge   | % of pages skipping compression
tier_samefill_pct    | Gauge   | % of pages detected as same-fill
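The metric names above map naturally onto a collector that returns a name-to-value map. The sketch below uses stand-in types modeled loosely on renacer's Collector and MetricValue; the real trait signatures and the store type may differ:

```rust
use std::collections::HashMap;

/// Stand-ins for renacer's metric types (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum MetricValue {
    Gauge(f64),
    Counter(u64),
}

/// Minimal collector trait in the style of renacer's Collector.
trait Collector {
    fn collect(&self) -> HashMap<&'static str, MetricValue>;
}

/// A fake page store exposing a few metrics from the table above.
struct FakeStore {
    pages_total: u64,
    same_fill_pages: u64,
    compression_ratio: f64,
}

impl Collector for FakeStore {
    fn collect(&self) -> HashMap<&'static str, MetricValue> {
        HashMap::from([
            ("pages_total", MetricValue::Counter(self.pages_total)),
            ("same_fill_pages", MetricValue::Counter(self.same_fill_pages)),
            ("compression_ratio", MetricValue::Gauge(self.compression_ratio)),
        ])
    }
}

fn main() {
    let store = FakeStore {
        pages_total: 1024,
        same_fill_pages: 256,
        compression_ratio: 3.87,
    };
    let metrics = store.collect();
    if let Some(MetricValue::Gauge(r)) = metrics.get("compression_ratio") {
        println!("ratio: {r:.2}x");
    }
}
```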

Dashboard Panels

When using --visualize, the TUI displays:

Panel            | Description
Tier Heatmap     | Real-time routing decisions
Throughput Gauge | Current GB/s with history
IOPS Counter     | Operations per second
Entropy Timeline | Data compressibility over time
Ratio Trend      | Compression ratio history

Example: Benchmark Workflow

# 1. Run benchmark with JSON output
trueno-ublk benchmark --size 4G --format json \
    --workload mixed --iterations 5 > bench.json

# 2. Generate HTML report
trueno-ublk benchmark --size 4G --format html \
    --workload mixed -o benchmark-report.html

# 3. Analyze with jq
jq '.metrics.throughput_gbps' bench.json

# 4. Compare algorithms
trueno-ublk benchmark --size 4G --backend memory --format json > lz4.json
trueno-ublk benchmark --size 4G --backend tiered --format json > tiered.json

Integration with Jaeger

For distributed tracing:

# Start Jaeger (all-in-one)
docker run -d --name jaeger \
    -p 4317:4317 \
    -p 16686:16686 \
    jaegertracing/all-in-one:latest

# Create device with OTLP tracing
sudo trueno-ublk create --size 8G \
    --backend tiered \
    --otlp-endpoint http://localhost:4317 \
    --otlp-service-name trueno-swap

# View traces at http://localhost:16686

trueno-ublk Examples

trueno-ublk includes several examples demonstrating its features.

Running Examples

# Basic block device usage
cargo run --example block_device -p trueno-ublk

# Compression statistics comparison
cargo run --example compression_stats -p trueno-ublk

# Entropy-based routing demonstration
cargo run --example entropy_routing -p trueno-ublk

# v3.17.0 Examples
# ----------------

# Visualization demo (VIZ-001/002/003/004)
cargo run --example visualization_demo -p trueno-ublk

# ZSTD vs LZ4 performance comparison (use --release for accurate benchmarks)
cargo run --example zstd_vs_lz4 -p trueno-ublk --release

# Tiered storage architecture demo (KERN-001/002/003)
cargo run --example tiered_storage -p trueno-ublk

# Batched compression benchmark
cargo run --example batched_benchmark -p trueno-ublk --release

Example: Basic Block Device

Demonstrates creating a compressed block device, writing data, and reading it back.

use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};

fn main() -> anyhow::Result<()> {
    // Create an LZ4 compressor
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    // Create a 64MB block device
    let mut device = BlockDevice::new(64 * 1024 * 1024, compressor);

    // Write different patterns
    let compressible = vec![0xAA; PAGE_SIZE];
    device.write(0, &compressible)?;

    let zeros = vec![0u8; PAGE_SIZE];
    device.write(PAGE_SIZE as u64, &zeros)?;

    // Read back and verify
    let mut buf = vec![0u8; PAGE_SIZE];
    device.read(0, &mut buf)?;
    assert_eq!(buf, compressible);

    // Check statistics
    let stats = device.stats();
    println!("Compression ratio: {:.2}x", stats.compression_ratio());

    Ok(())
}

Output:

trueno-ublk Block Device Example
=================================

Created LZ4 compressor with SIMD backend
Created block device: 64 MB

Writing test patterns...
  Page 0: Highly compressible (all 0xAA)
  Page 1: Zero page
  Page 2: Sequential data
  Page 3: Pseudo-random data

Reading back and verifying...
  Page 0: OK
  Page 1: OK
  Page 2: OK
  Page 3: OK

Device Statistics:
  Pages stored:      4
  Bytes written:     16 KB
  Bytes compressed:  0 KB
  Compression ratio: 27.86x
  Zero pages:        1

All tests passed!

Example: Compression Statistics

Compares LZ4 and ZSTD compression on various data patterns.

use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};

fn main() -> anyhow::Result<()> {
    let test_data = vec![
        ("All zeros", vec![0u8; PAGE_SIZE]),
        ("Repeating pattern", (0..PAGE_SIZE).map(|i| (i % 4) as u8).collect()),
        ("High entropy", (0..PAGE_SIZE).map(|i| ((i * 17 + 31) % 256) as u8).collect()),
    ];

    for algorithm in [Algorithm::Lz4, Algorithm::Zstd { level: 3 }] {
        let compressor = CompressorBuilder::new()
            .algorithm(algorithm)
            .build()?;

        let mut device = BlockDevice::new(64 * 1024 * 1024, compressor);

        for (i, (_, data)) in test_data.iter().enumerate() {
            device.write((i * PAGE_SIZE) as u64, data)?;
        }

        let stats = device.stats();
        println!("{:?}: {:.2}x compression", algorithm, stats.compression_ratio());
    }

    Ok(())
}

Output:

Algorithm: Lz4
------------------------------------------------------------
  Results:
    Compression ratio: 27.25x
    Space savings:     96.3%

Algorithm: Zstd { level: 3 }
------------------------------------------------------------
  Results:
    Compression ratio: 1.50x
    Space savings:     33.3%

Example: Entropy Routing

Shows how data is routed to different backends based on entropy.

use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};

fn main() -> anyhow::Result<()> {
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    // Set entropy threshold to 7.0 bits/byte
    let mut device = BlockDevice::with_entropy_threshold(
        64 * 1024 * 1024,
        compressor,
        7.0,
    );

    // Low entropy - routed to GPU
    let low_entropy = vec![0u8; PAGE_SIZE];
    device.write(0, &low_entropy)?;

    // Medium entropy - routed to SIMD
    let medium: Vec<u8> = (0..PAGE_SIZE).map(|i| (i % 16) as u8).collect();
    device.write(PAGE_SIZE as u64, &medium)?;

    // High entropy - routed to scalar
    let high: Vec<u8> = (0..PAGE_SIZE).map(|i| ((i * 17 + 31) % 256) as u8).collect();
    device.write(2 * PAGE_SIZE as u64, &high)?;

    let stats = device.stats();
    println!("GPU pages:    {}", stats.gpu_pages);
    println!("SIMD pages:   {}", stats.simd_pages);
    println!("Scalar pages: {}", stats.scalar_pages);
    println!("Zero pages:   {}", stats.zero_pages);

    Ok(())
}

Output:

trueno-ublk Entropy Routing Example
===================================

Routing Statistics:
  GPU pages:    1 (low entropy, highly compressible)
  SIMD pages:   1 (medium entropy, normal data)
  Scalar pages: 3 (high entropy, incompressible)
  Zero pages:   1 (all zeros, deduplicated)

Total compression ratio: 2.88x

CLI Examples

Create a Device

# Create 1TB device with LZ4 and GPU acceleration
trueno-ublk create -s 1T -a lz4 --gpu

# Create with memory limit
trueno-ublk create -s 512G -a zstd --mem-limit 8G

# Create with custom entropy threshold
trueno-ublk create -s 256G -a lz4 --entropy-threshold 6.5

Monitor Devices

# List all devices
trueno-ublk list

# Show detailed stats
trueno-ublk stat /dev/ublkb0

# Interactive dashboard
trueno-ublk top

Manage Devices

# Reset statistics
trueno-ublk reset /dev/ublkb0

# Compact memory
trueno-ublk compact /dev/ublkb0

# Set runtime parameters
trueno-ublk set /dev/ublkb0 --mem-limit 16G

Contributing

Development Setup

# Clone the repository
git clone https://github.com/paiml/trueno-zram
cd trueno-zram

# Build
cargo build --all-features

# Run tests
cargo test --workspace --all-features

# Run with CUDA
cargo test --workspace --features cuda

Code Style

  • Format with cargo fmt
  • Lint with cargo clippy --all-features -- -D warnings
  • No panics in library code
  • All public items must be documented

Testing

Unit Tests

cargo test --workspace --all-features

Coverage

cargo llvm-cov --workspace --all-features

Target: 95% line coverage.

Mutation Testing

cargo mutants --package trueno-zram-core

Target: 80% mutation score.

Quality Gates

Before submitting a PR:

  1. Formatting: cargo fmt --check
  2. Linting: cargo clippy --all-features -- -D warnings
  3. Tests: cargo test --workspace --all-features
  4. Documentation: cargo doc --no-deps
  5. Coverage: >= 95%

Commit Messages

Follow conventional commits:

feat: Add new compression algorithm
fix: Handle edge case in decompression
perf: Optimize hash table lookup
docs: Update API documentation
test: Add property-based tests
refactor: Simplify SIMD dispatch

Always reference the work item:

feat: Add GPU batch compression (Refs ZRAM-001)

Pull Request Process

  1. Fork the repository
  2. Create a feature branch
  3. Make changes with tests
  4. Run quality gates
  5. Submit PR with description
  6. Address review feedback

Architecture

See Design Overview for architecture decisions.

Adding a New Algorithm

  1. Create module in src/algorithms/
  2. Implement PageCompressor trait
  3. Add SIMD implementations
  4. Add to Algorithm enum
  5. Write tests and benchmarks
  6. Update documentation

Adding SIMD Backend

  1. Add detection in src/simd/detect.rs
  2. Create implementation file (e.g., avx512.rs)
  3. Add dispatch in src/simd/dispatch.rs
  4. Write correctness tests
  5. Run benchmarks

Reporting Issues

Use GitHub Issues with:

  • Clear description
  • Reproduction steps
  • Expected vs actual behavior
  • System info (CPU, OS, Rust version)

License

Contributions are licensed under MIT OR Apache-2.0.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.1.0 - 2025-01-04

Added

  • Core compression library (trueno-zram-core)

    • LZ4 compression with AVX2/AVX-512/NEON acceleration
    • ZSTD compression with SIMD optimization
    • Runtime CPU feature detection and dispatch
    • Same-fill detection for 2048:1 zero page compression
  • GPU batch compression

    • CUDA support via cudarc
    • Pure Rust PTX generation via trueno-gpu
    • Warp-cooperative LZ4 kernel (4 warps/block)
    • PCIe 5x rule evaluation
    • Async DMA ring buffer support
  • Kernel compatibility

    • Sysfs interface compatible with kernel zram
    • Algorithm compatibility layer
    • Statistics matching kernel format
  • Benchmarking utilities

    • Criterion benchmarks
    • Example programs for testing
    • Performance measurement infrastructure

Performance

  • LZ4 compression: 4.4 GB/s (AVX-512)
  • LZ4 decompression: 5.4 GB/s (AVX-512)
  • ZSTD compression: 11.2 GB/s (AVX-512)
  • ZSTD decompression: 46 GB/s (AVX-512)
  • Same-fill detection: 22 GB/s

Infrastructure

  • Published to crates.io as trueno-zram-core
  • 461 tests passing
  • 94% test coverage
  • Full documentation with mdBook

0.2.0 - 2026-01-06

Added

  • DT-007: Swap Deadlock Prevention

    • mlock() integration via duende-mlock crate
    • Daemon memory pinning prevents swap deadlock
    • Works in both foreground and background modes
  • trueno-ublk daemon

    • ublk-based block device for kernel bypass
    • Hybrid CPU/GPU architecture
    • 12.5 GB/s sequential read (fio verified)
    • 228K IOPS random 4K read

Performance Improvements

  • Sequential I/O: 16.5 GB/s (1.8x vs kernel ZRAM)
  • Random 4K IOPS: 249K (4.5x vs kernel ZRAM)
  • Compression ratio: 3.87x (+55% vs kernel ZRAM)
  • CPU parallel decompression: 50+ GB/s

Fixed

  • Background mode mlock (DT-007e: mlock called after fork)
  • Clippy warnings in GPU batch compression module

3.17.0 - 2026-01-07

Added

  • VIZ-001: TruenoCollector - Renacer visualization integration

    • Implements renacer::Collector trait for metrics collection
    • Feeds throughput, IOPS, tier distribution to visualization framework
  • VIZ-002: --visualize flag - Real-time TUI dashboard

    • Tier heatmap, throughput gauge, entropy timeline
    • Requires --foreground mode
  • VIZ-003: Benchmark reports - JSON/HTML export

    • trueno-ublk benchmark --format json|html|text
    • trueno-renacer-v1 schema for ML pipelines
    • Self-contained HTML reports with tier distribution charts
  • VIZ-004: OTLP integration - Distributed tracing

    • --otlp-endpoint and --otlp-service-name flags
    • Export traces to Jaeger/Tempo
  • Release Verification Matrix (docs/release_qa_checklist.md)

    • Falsification-first QA protocol
    • Performance thresholds: >7.2 GB/s zero-page, >550K IOPS

Performance

  • ZSTD Recommendation: ZSTD-1 is 3x faster than LZ4 on AVX-512
    • Compress: 15.4 GiB/s (vs 5.2 GiB/s LZ4)
    • Decompress: ~10 GiB/s (vs ~1.5 GiB/s LZ4)
    • Usage: --algorithm zstd

Documentation

  • New book chapter: Visualization & Observability
  • Updated kernel-zram-parity roadmap (all items COMPLETE)
  • New examples:
    • visualization_demo - TruenoCollector metrics demo
    • zstd_vs_lz4 - Algorithm performance comparison
    • tiered_storage - Kernel-cooperative architecture demo

Unreleased

Planned

  • trueno-zram-adaptive: ML-driven algorithm selection
  • trueno-zram-generator: systemd integration
  • trueno-zram-cli: zramctl replacement