Introduction

trueno-zram is a SIMD and GPU-accelerated memory compression library for Linux systems. It provides userspace Rust implementations of LZ4 and ZSTD compression that leverage modern CPU vector instructions (AVX2, AVX-512, NEON) and optional CUDA GPU acceleration.

Production Status (2026-01-06)

MILESTONE DT-005 ACHIEVED: trueno-zram is running as system swap!

8GB device active as primary swap (priority 150)
Validated under memory pressure with ~185MB swap utilization
CPU SIMD compress (20-30 GB/s) + CPU parallel decompress (48 GB/s)

DT-007 COMPLETED: Swap deadlock issue FIXED via mlock() - 211 MB daemon memory pinned.

Known Limitations:

GPU compression blocked by NVIDIA F082 bug (F081 was falsified, using CPU SIMD instead)
I/O throughput lower than kernel ZRAM (userspace overhead)

Why trueno-zram?

trueno-zram provides better compression efficiency than kernel ZRAM:

Advantage	trueno-zram	Kernel ZRAM
Compression Ratio	3.87x	2.5x
Space Efficiency	55% better	baseline
P99 Latency	16.5 µs	varies

Trade-off: Kernel ZRAM has higher raw I/O throughput (operates entirely in kernel space). trueno-zram uses ublk which adds userspace overhead but provides:

Runtime SIMD dispatch: Automatically selects AVX-512, AVX2, or NEON based on CPU
Userspace flexibility: debugging, monitoring, custom algorithms
Adaptive algorithm selection: ML-driven selection based on page entropy
Better compression: 3.87x vs kernel’s 2.5x

Validated Performance (QA-FALSIFY-001)

Claim	Result	Status
Compression ratio	3.87x	PASS
SIMD compression	20-30 GB/s	PASS
SIMD decompression	48 GB/s	PASS
P99 latency	16.5 µs	PASS
mlock (DT-007)	211 MB	PASS

~~Falsified Claim~~	Actual
~~1.8x vs kernel I/O~~	Kernel 3-13x faster
~~228K IOPS~~	123K IOPS

All metrics independently verified via falsification testing (2026-01-06)

Part of the PAIML Ecosystem

trueno-zram is part of the “Batuta Stack”:

trueno - High-performance SIMD compute library
trueno-gpu - Pure Rust PTX generation for CUDA
aprender - Machine learning in pure Rust
certeza - Asymptotic test effectiveness framework

License

trueno-zram is dual-licensed under MIT and Apache-2.0.

Installation

From crates.io

Add trueno-zram-core to your Cargo.toml:

[dependencies]
trueno-zram-core = "0.1"

Or use cargo add:

cargo add trueno-zram-core

With CUDA Support

For GPU acceleration, enable the cuda feature:

[dependencies]
trueno-zram-core = { version = "0.1", features = ["cuda"] }

Or:

cargo add trueno-zram-core --features cuda

CUDA Requirements

CUDA Toolkit 12.8 or later
NVIDIA driver supporting CUDA 12.8
GPU with compute capability >= 7.0 (Volta or newer)

Feature Flags

Feature	Description	Default
`std`	Standard library support	Yes
`nightly`	Nightly-only SIMD features	No
`cuda`	CUDA GPU acceleration	No

System Requirements

OS: Linux (kernel >= 5.10 LTS)
CPU: x86_64 (AVX2/AVX-512) or AArch64 (NEON)
Rust: 1.82.0 or later (MSRV)

Verifying Installation

use trueno_zram_core::{CompressorBuilder, Algorithm};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    println!("trueno-zram installed successfully!");
    println!("SIMD backend: {:?}", compressor.backend());

    Ok(())
}

Building from Source

git clone https://github.com/paiml/trueno-zram
cd trueno-zram

# Build all crates
cargo build --release --all-features

# Run tests
cargo test --workspace --all-features

# Build with CUDA
cargo build --release --features cuda

Quick Start

Basic Compression

use trueno_zram_core::{CompressorBuilder, Algorithm, PAGE_SIZE};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a compressor with LZ4 algorithm
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    // Create a test page (4KB)
    let mut page = [0u8; PAGE_SIZE];
    page[0..100].copy_from_slice(&[0xAA; 100]);

    // Compress
    let compressed = compressor.compress(&page)?;
    println!("Original size: {} bytes", PAGE_SIZE);
    println!("Compressed size: {} bytes", compressed.data.len());
    println!("Ratio: {:.2}x", compressed.ratio());

    // Decompress
    let decompressed = compressor.decompress(&compressed)?;
    assert_eq!(page, decompressed);
    println!("Decompression verified!");

    Ok(())
}

Choosing an Algorithm

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};

// LZ4: Fastest compression, good for most workloads
let lz4 = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// ZSTD Level 1: Better ratio, still fast
let zstd_fast = CompressorBuilder::new()
    .algorithm(Algorithm::Zstd { level: 1 })
    .build()?;

// ZSTD Level 3: Best ratio for compressible data
let zstd_best = CompressorBuilder::new()
    .algorithm(Algorithm::Zstd { level: 3 })
    .build()?;

// Adaptive: Automatically selects based on entropy
let adaptive = CompressorBuilder::new()
    .algorithm(Algorithm::Adaptive)
    .build()?;
}

Compression Statistics

#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// Compress some pages
for _ in 0..100 {
    let page = [0u8; PAGE_SIZE];
    let _ = compressor.compress(&page)?;
}

// Get statistics
let stats = compressor.stats();
println!("Pages compressed: {}", stats.pages);
println!("Total input: {} bytes", stats.bytes_in);
println!("Total output: {} bytes", stats.bytes_out);
println!("Overall ratio: {:.2}x", stats.ratio());
println!("Throughput: {:.2} GB/s", stats.throughput_gbps());
}

Error Handling

use trueno_zram_core::{CompressorBuilder, Algorithm, Error};

fn compress_page(data: &[u8; PAGE_SIZE]) -> Result<Vec<u8>, Error> {
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    let compressed = compressor.compress(data)?;
    Ok(compressed.data)
}

fn main() {
    let page = [0u8; PAGE_SIZE];

    match compress_page(&page) {
        Ok(data) => println!("Compressed to {} bytes", data.len()),
        Err(Error::BufferTooSmall(msg)) => eprintln!("Buffer error: {msg}"),
        Err(Error::CorruptedData(msg)) => eprintln!("Corrupt data: {msg}"),
        Err(e) => eprintln!("Other error: {e}"),
    }
}

Next Steps

Learn about GPU Batch Compression
Explore SIMD Acceleration
See more Examples

Examples

Running Examples

trueno-zram includes several examples to demonstrate its features:

# GPU information and backend selection
cargo run -p trueno-zram-core --example gpu_info
cargo run -p trueno-zram-core --example gpu_info --features cuda

# Compression benchmarks
cargo run -p trueno-zram-core --example compress_benchmark --release
cargo run -p trueno-zram-core --example compress_benchmark --release --features cuda

GPU Info Example

Shows GPU detection, backend selection logic, and PCIe 5x rule evaluation:

trueno-zram GPU Information
============================

1. GPU Availability
   ----------------
   GPU available: true

   CUDA Device Information:
     Device: NVIDIA GeForce RTX 4090
     Compute Capability: SM 8.9
     Memory: 24564 MB
     L2 Cache: 73728 KB
     Optimal batch: 14745 pages
     Supported: true

2. Backend Selection Logic
   -----------------------
   GPU_MIN_BATCH_SIZE: 1000 pages
   PAGE_SIZE: 4096 bytes

      Batch          No GPU        With GPU
   ------------------------------------------
         1          Scalar          Scalar
        10            Simd            Simd
       100            Simd            Simd
       500            Simd            Simd
      1000            Simd             Gpu
      5000            Simd             Gpu
     10000            Simd             Gpu

3. PCIe 5x Rule Evaluation
   -----------------------
   GPU offload beneficial when: T_cpu > 5 * (T_transfer + T_gpu)

   1K pages, PCIe 4.0, 100 GB/s GPU (4 MB): CPU preferred
   10K pages, PCIe 4.0, 100 GB/s GPU (40 MB): GPU beneficial
   100K pages, PCIe 5.0, 500 GB/s GPU (400 MB): GPU beneficial

Compression Benchmark

Measures throughput across different data patterns:

trueno-zram Compression Benchmark
=================================

Pattern: Zeros (compressible)
----------------------------------------------------------------------
   Pages  Algorithm     Compress   Decompress      Ratio    Backend
     100        Lz4      22.01 GB/s      46.02 GB/s   2048.00x Avx512
    1000        Lz4      21.87 GB/s      45.91 GB/s   2048.00x Avx512

Pattern: Text (compressible)
----------------------------------------------------------------------
   Pages  Algorithm     Compress   Decompress      Ratio    Backend
     100        Lz4       4.57 GB/s       5.42 GB/s      3.21x Avx512
    1000        Lz4       4.44 GB/s       5.37 GB/s      3.21x Avx512

Pattern: Random (incompressible)
----------------------------------------------------------------------
   Pages  Algorithm     Compress   Decompress      Ratio    Backend
     100        Lz4       1.87 GB/s      43.78 GB/s      1.00x Avx512
    1000        Lz4       1.61 GB/s      31.64 GB/s      1.00x Avx512

Custom Example: Batch Processing

use trueno_zram_core::{CompressorBuilder, Algorithm, PAGE_SIZE};
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    // Generate test pages
    let pages: Vec<[u8; PAGE_SIZE]> = (0..1000)
        .map(|i| {
            let mut page = [0u8; PAGE_SIZE];
            // Create compressible pattern
            for (j, byte) in page.iter_mut().enumerate() {
                *byte = ((i + j) % 256) as u8;
            }
            page
        })
        .collect();

    // Benchmark compression
    let start = Instant::now();
    let mut total_compressed = 0;

    for page in &pages {
        let compressed = compressor.compress(page)?;
        total_compressed += compressed.data.len();
    }

    let elapsed = start.elapsed();
    let input_bytes = pages.len() * PAGE_SIZE;
    let throughput = input_bytes as f64 / elapsed.as_secs_f64() / 1e9;
    let ratio = input_bytes as f64 / total_compressed as f64;

    println!("Compressed {} pages in {:?}", pages.len(), elapsed);
    println!("Throughput: {:.2} GB/s", throughput);
    println!("Compression ratio: {:.2}x", ratio);

    Ok(())
}

Custom Example: GPU Batch Compression

use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig, gpu_available};
use trueno_zram_core::{Algorithm, PAGE_SIZE};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    if !gpu_available() {
        println!("No GPU available, skipping GPU example");
        return Ok(());
    }

    let config = GpuBatchConfig {
        device_index: 0,
        algorithm: Algorithm::Lz4,
        batch_size: 1000,
        async_dma: true,
        ring_buffer_slots: 4,
    };

    let mut compressor = GpuBatchCompressor::new(config)?;

    // Create batch of pages
    let pages: Vec<[u8; PAGE_SIZE]> = vec![[0u8; PAGE_SIZE]; 1000];

    // Compress batch
    let result = compressor.compress_batch(&pages)?;

    println!("Batch Results:");
    println!("  Pages: {}", result.pages.len());
    println!("  H2D time: {} ns", result.h2d_time_ns);
    println!("  Kernel time: {} ns", result.kernel_time_ns);
    println!("  D2H time: {} ns", result.d2h_time_ns);
    println!("  Total time: {} ns", result.total_time_ns);
    println!("  Compression ratio: {:.2}x", result.compression_ratio());
    println!("  PCIe 5x rule satisfied: {}", result.pcie_rule_satisfied());

    // Get compressor stats
    let stats = compressor.stats();
    println!("\nCompressor Stats:");
    println!("  Pages compressed: {}", stats.pages_compressed);
    println!("  Throughput: {:.2} GB/s", stats.throughput_gbps());

    Ok(())
}

Compression Algorithms

trueno-zram supports multiple compression algorithms optimized for memory page compression.

LZ4

LZ4 is a lossless compression algorithm focused on compression and decompression speed.

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};

let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;
}

Characteristics

Metric	Value
Compression speed	4.4 GB/s (AVX-512)
Decompression speed	5.4 GB/s (AVX-512)
Typical ratio	2-4x for compressible data
Best for	Speed-critical workloads

When to Use LZ4

Real-time compression requirements
High-throughput memory compression
When CPU overhead must be minimal
Mixed workloads with varying compressibility

ZSTD (Zstandard)

ZSTD provides better compression ratios while maintaining good speed.

#![allow(unused)]
fn main() {
// Fast mode (level 1)
let fast = CompressorBuilder::new()
    .algorithm(Algorithm::Zstd { level: 1 })
    .build()?;

// Better compression (level 3)
let better = CompressorBuilder::new()
    .algorithm(Algorithm::Zstd { level: 3 })
    .build()?;
}

Compression Levels

Level	Compression	Decompression	Ratio
1	11.2 GB/s	46 GB/s	Better than LZ4
3	8.5 GB/s	45 GB/s	Best

When to Use ZSTD

Memory-constrained systems
Highly compressible data (text, logs)
When compression ratio matters more than speed
Cold/archived memory pages

Adaptive Selection

The adaptive algorithm automatically selects based on page entropy:

#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Adaptive)
    .build()?;
}

Selection Logic

Same-fill detection: Pages with repeated values use 2048:1 encoding
Entropy analysis: Shannon entropy determines compressibility
Algorithm selection:
- Low entropy (< 4 bits): ZSTD for best ratio
- Medium entropy (4-7 bits): LZ4 for balance
- High entropy (> 7 bits): Pass-through (incompressible)

Same-Fill Optimization

Pages filled with the same byte value get special 2048:1 compression:

#![allow(unused)]
fn main() {
use trueno_zram_core::samefill::{detect_same_fill, CompactSameFill};

let zero_page = [0u8; 4096];

if let Some(fill) = detect_same_fill(&zero_page) {
    let compact = CompactSameFill::new(fill);
    // Only 2 bytes needed to represent 4096-byte page!
    assert_eq!(compact.to_bytes().len(), 2);
}
}

Same-Fill Statistics

Zero pages: ~30-40% of typical memory
Same-fill pages: ~35-45% total
Compression ratio: 2048:1

Algorithm Comparison

Algorithm	Compress	Decompress	Ratio	Use Case
LZ4	4.4 GB/s	5.4 GB/s	2-4x	General
ZSTD-1	11.2 GB/s	46 GB/s	3-5x	Balanced
ZSTD-3	8.5 GB/s	45 GB/s	4-6x	Best ratio
Same-fill	N/A	N/A	2048x	Zero/repeated
Adaptive	Varies	Varies	Optimal	Automatic

SIMD Acceleration

trueno-zram uses runtime CPU feature detection to select the optimal SIMD implementation.

Supported Backends

Backend	Instruction Set	Register Width	Platforms
AVX-512	AVX-512F/BW/VL	512-bit	Skylake-X, Ice Lake, Zen 4
AVX2	AVX2 + FMA	256-bit	Haswell+, Zen 1+
NEON	ARM NEON	128-bit	ARMv8-A (AArch64)
Scalar	None	64-bit	All platforms

Runtime Detection

#![allow(unused)]
fn main() {
use trueno_zram_core::simd::{detect, SimdFeatures};

let features = detect();

println!("AVX-512: {}", features.has_avx512());
println!("AVX2: {}", features.has_avx2());
println!("SSE4.2: {}", features.has_sse42());
}

Automatic Dispatch

The compressor automatically uses the best available backend:

#![allow(unused)]
fn main() {
use trueno_zram_core::CompressorBuilder;

let compressor = CompressorBuilder::new().build()?;

// Check which backend was selected
println!("Backend: {:?}", compressor.backend());
}

Performance by Backend

LZ4 Compression

Backend	Throughput	Relative
AVX-512	4.4 GB/s	1.45x
AVX2	3.2 GB/s	1.05x
Scalar	3.0 GB/s	1.0x

ZSTD Compression

Backend	Throughput	Relative
AVX-512	11.2 GB/s	1.40x
AVX2	8.5 GB/s	1.06x
Scalar	8.0 GB/s	1.0x

SIMD Optimizations

Hash Table Lookups (LZ4)

AVX-512 enables parallel hash probing for match finding:

// Scalar: Sequential probe
for offset in 0..16 {
    if hash_table[hash + offset] == pattern { ... }
}

// AVX-512: Parallel probe (16 comparisons at once)
let matches = _mm512_cmpeq_epi32(hash_values, pattern_broadcast);

Literal Copying

Wide vector moves for copying uncompressed literals:

// AVX-512: 64 bytes per iteration
_mm512_storeu_si512(dst, _mm512_loadu_si512(src));

// AVX2: 32 bytes per iteration
_mm256_storeu_si256(dst, _mm256_loadu_si256(src));

Match Extension

SIMD comparison for extending matches:

#![allow(unused)]
fn main() {
// Compare 64 bytes at once with AVX-512
let cmp = _mm512_cmpeq_epi8(src_chunk, dst_chunk);
let mask = _mm512_movepi8_mask(cmp);
let match_len = mask.trailing_ones();
}

Forcing a Backend

For testing or benchmarking, you can force a specific backend:

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, SimdBackend};

// Force scalar (no SIMD)
let scalar = CompressorBuilder::new()
    .prefer_backend(SimdBackend::Scalar)
    .build()?;

// Force AVX2 (will fail if not available)
let avx2 = CompressorBuilder::new()
    .prefer_backend(SimdBackend::Avx2)
    .build()?;
}

GPU Batch Compression

trueno-zram supports CUDA GPU acceleration for batch compression of memory pages.

Current Status (2026-01-06)

Important: GPU compression is currently blocked by NVIDIA PTX bug F082 (Computed Address Bug). The production architecture uses:

Compression: CPU SIMD (AVX-512) at 20-30 GB/s with 3.87x ratio
Decompression: CPU parallel at 50+ GB/s (primary path)

GPU decompression is available but CPU parallel path is faster due to PCIe transfer overhead (~6 GB/s end-to-end vs 50+ GB/s CPU).

NVIDIA F082 Bug (Computed Address Bug)

F081 (Loaded Value Bug) was FALSIFIED on 2026-01-05 - the pattern works correctly.

The actual bug is F082: addresses computed from shared memory values cause crashes:

// F081 pattern - WORKS CORRECTLY (falsified):
ld.shared.u32 %r_val, [addr];      // Load from shared memory
st.global.u32 [dest], %r_val;      // Actually works!

// F082 pattern - CRASHES:
ld.shared.u32 %r_offset, [shared_addr];  // Load offset from shared memory
add.u64 %r_dest, %r_base, %r_offset;     // Compute destination address
st.global.u32 [%r_dest], %r_data;        // CRASH - address derived from shared load

Status: True root cause identified via Popperian falsification. See KF-002 and ublk-batched-gpu-compression.md for details.

When to Use GPU

GPU decompression is beneficial when:

Large batches: 2000+ pages to decompress
PCIe 5x rule satisfied: Computation time > 5x transfer time
GPU available: CUDA-capable GPU with SM 7.0+

Basic Usage

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig, gpu_available};
use trueno_zram_core::{Algorithm, PAGE_SIZE};

if gpu_available() {
    let config = GpuBatchConfig {
        device_index: 0,
        algorithm: Algorithm::Lz4,
        batch_size: 1000,
        async_dma: true,
        ring_buffer_slots: 4,
    };

    let mut compressor = GpuBatchCompressor::new(config)?;

    let pages: Vec<[u8; PAGE_SIZE]> = vec![[0u8; PAGE_SIZE]; 1000];
    let result = compressor.compress_batch(&pages)?;

    println!("Compression ratio: {:.2}x", result.compression_ratio());
}
}

Configuration Options

#![allow(unused)]
fn main() {
pub struct GpuBatchConfig {
    /// CUDA device index (0 = first GPU)
    pub device_index: u32,

    /// Compression algorithm
    pub algorithm: Algorithm,

    /// Number of pages per batch
    pub batch_size: usize,

    /// Enable async DMA transfers
    pub async_dma: bool,

    /// Ring buffer slots for pipelining
    pub ring_buffer_slots: usize,
}
}

Batch Results

The BatchResult provides timing breakdown:

#![allow(unused)]
fn main() {
let result = compressor.compress_batch(&pages)?;

// Timing components
println!("H2D transfer: {} ns", result.h2d_time_ns);
println!("Kernel execution: {} ns", result.kernel_time_ns);
println!("D2H transfer: {} ns", result.d2h_time_ns);
println!("Total time: {} ns", result.total_time_ns);

// Metrics
let throughput = result.throughput_bytes_per_sec(pages.len() * PAGE_SIZE);
println!("Throughput: {:.2} GB/s", throughput / 1e9);
println!("Compression ratio: {:.2}x", result.compression_ratio());
println!("PCIe 5x rule satisfied: {}", result.pcie_rule_satisfied());
}

Backend Selection

Use select_backend to determine optimal backend:

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{select_backend, gpu_available};

let batch_size = 5000;
let has_gpu = gpu_available();

let backend = select_backend(batch_size, has_gpu);
println!("Selected backend: {:?}", backend);
// Output: Gpu (for large batches with GPU available)
}

Supported GPUs

GPU	Architecture	SM	Optimal Batch
H100	Hopper	9.0	10,240 pages
A100	Ampere	8.0	8,192 pages
RTX 4090	Ada	8.9	14,745 pages
RTX 3090	Ampere	8.6	6,144 pages

Pure Rust PTX Generation

trueno-zram uses trueno-gpu for pure Rust PTX generation:

No LLVM dependency
No nvcc required
Kernel code in Rust, compiled to PTX at runtime
Warp-cooperative LZ4 compression (4 warps/block)

#![allow(unused)]
fn main() {
// The LZ4 kernel processes 4 pages per block (1 page per warp)
// Uses shared memory for hash tables and match finding
// cvta.shared.u64 for generic addressing
}

Same-Fill Detection

Same-fill optimization provides 2048:1 compression for pages containing a single repeated byte value.

Why Same-Fill Matters

Memory pages often contain:

Zero pages: ~30-40% of typical memory (uninitialized, freed)
Same-fill pages: ~5-10% additional (memset patterns)

Detecting and encoding these specially provides massive compression wins.

Detection

#![allow(unused)]
fn main() {
use trueno_zram_core::samefill::detect_same_fill;
use trueno_zram_core::PAGE_SIZE;

let zero_page = [0u8; PAGE_SIZE];
let pattern_page = [0xAA; PAGE_SIZE];
let mixed_page = [0u8; PAGE_SIZE];

// Zero page detected
assert!(detect_same_fill(&zero_page).is_some());

// Pattern page detected
assert!(detect_same_fill(&pattern_page).is_some());

// Mixed content not detected
let mut mixed = [0u8; PAGE_SIZE];
mixed[100] = 0xFF;
assert!(detect_same_fill(&mixed).is_none());
}

Compact Encoding

Same-fill pages compress to just 2 bytes:

#![allow(unused)]
fn main() {
use trueno_zram_core::samefill::CompactSameFill;

let fill_value = 0u8;
let compact = CompactSameFill::new(fill_value);

// Serialize (2 bytes)
let bytes = compact.to_bytes();
assert_eq!(bytes.len(), 2);

// Deserialize
let restored = CompactSameFill::from_bytes(&bytes)?;
assert_eq!(restored.fill_value(), 0);

// Expand back to full page
let page = restored.expand();
assert_eq!(page.len(), PAGE_SIZE);
assert!(page.iter().all(|&b| b == 0));
}

Compression Ratio

Page Type	Original	Compressed	Ratio
Zero-fill	4096 B	2 B	2048:1
0xFF-fill	4096 B	2 B	2048:1
Any same-fill	4096 B	2 B	2048:1

Integration with Compressor

The compressor automatically detects same-fill pages:

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};

let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

let zero_page = [0u8; PAGE_SIZE];
let compressed = compressor.compress(&zero_page)?;

// Same-fill pages get special encoding
println!("Compressed size: {} bytes", compressed.data.len());
// Output: ~20 bytes (LZ4 minimal encoding for same-fill)
}

Performance

Same-fill detection is extremely fast:

#![allow(unused)]
fn main() {
// SIMD-accelerated comparison
// AVX-512: Check 64 bytes per iteration
// AVX2: Check 32 bytes per iteration
// Typical: <100ns for 4KB page
}

Memory Statistics

On typical systems:

Memory Type	Same-Fill %
Idle desktop	60-70%
Active workload	35-45%
Database server	25-35%
Compilation	40-50%

CompressorBuilder API

The CompressorBuilder provides a fluent API for configuring compression.

Basic Usage

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};

let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;
}

Builder Methods

`algorithm(algo: Algorithm)`

Sets the compression algorithm:

#![allow(unused)]
fn main() {
// LZ4 (fastest)
.algorithm(Algorithm::Lz4)

// ZSTD with compression level
.algorithm(Algorithm::Zstd { level: 1 })
.algorithm(Algorithm::Zstd { level: 3 })

// Adaptive (auto-select based on entropy)
.algorithm(Algorithm::Adaptive)
}

`prefer_backend(backend: SimdBackend)`

Forces a specific SIMD backend:

#![allow(unused)]
fn main() {
use trueno_zram_core::SimdBackend;

// Force scalar (no SIMD)
.prefer_backend(SimdBackend::Scalar)

// Force AVX2
.prefer_backend(SimdBackend::Avx2)

// Force AVX-512
.prefer_backend(SimdBackend::Avx512)
}

`build()`

Creates the compressor:

#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?; // Returns Result<Compressor, Error>
}

Compressor Methods

`compress(&self, page: &[u8; PAGE_SIZE]) -> Result<CompressedPage>`

Compresses a single page:

#![allow(unused)]
fn main() {
let page = [0u8; PAGE_SIZE];
let compressed = compressor.compress(&page)?;

println!("Size: {} bytes", compressed.data.len());
println!("Ratio: {:.2}x", compressed.ratio());
}

`decompress(&self, page: &CompressedPage) -> Result<[u8; PAGE_SIZE]>`

Decompresses a page:

#![allow(unused)]
fn main() {
let decompressed = compressor.decompress(&compressed)?;
assert_eq!(page, decompressed);
}

`stats(&self) -> CompressionStats`

Returns compression statistics:

#![allow(unused)]
fn main() {
let stats = compressor.stats();

println!("Pages: {}", stats.pages);
println!("Bytes in: {}", stats.bytes_in);
println!("Bytes out: {}", stats.bytes_out);
println!("Ratio: {:.2}x", stats.ratio());
println!("Compress time: {} ns", stats.compress_time_ns);
println!("Decompress time: {} ns", stats.decompress_time_ns);
println!("Throughput: {:.2} GB/s", stats.throughput_gbps());
}

`reset_stats(&mut self)`

Resets statistics counters:

#![allow(unused)]
fn main() {
compressor.reset_stats();
}

`backend(&self) -> SimdBackend`

Returns the active SIMD backend:

#![allow(unused)]
fn main() {
println!("Backend: {:?}", compressor.backend());
// Output: Avx512, Avx2, Neon, or Scalar
}

CompressedPage

The compressed page structure:

#![allow(unused)]
fn main() {
pub struct CompressedPage {
    /// Compressed data
    pub data: Vec<u8>,

    /// Original size (always PAGE_SIZE)
    pub original_size: usize,

    /// Algorithm used
    pub algorithm: Algorithm,
}

impl CompressedPage {
    /// Compression ratio
    pub fn ratio(&self) -> f64;

    /// Bytes saved
    pub fn bytes_saved(&self) -> usize;

    /// Check if actually compressed
    pub fn is_compressed(&self) -> bool;
}
}

Error Handling

#![allow(unused)]
fn main() {
use trueno_zram_core::Error;

match compressor.compress(&page) {
    Ok(compressed) => { /* success */ }
    Err(Error::BufferTooSmall(msg)) => { /* buffer issue */ }
    Err(Error::CorruptedData(msg)) => { /* corrupt input */ }
    Err(Error::InvalidInput(msg)) => { /* invalid params */ }
    Err(e) => { /* other error */ }
}
}

GPU Batch API

The GPU batch API provides high-throughput compression for large page batches.

GpuBatchConfig

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::GpuBatchConfig;
use trueno_zram_core::Algorithm;

let config = GpuBatchConfig {
    device_index: 0,        // CUDA device (0 = first GPU)
    algorithm: Algorithm::Lz4,
    batch_size: 1000,       // Pages per batch
    async_dma: true,        // Enable async transfers
    ring_buffer_slots: 4,   // Pipeline depth
};
}

GpuBatchCompressor

Creation

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig};

let config = GpuBatchConfig::default();
let mut compressor = GpuBatchCompressor::new(config)?;
}

Batch Compression

#![allow(unused)]
fn main() {
let pages: Vec<[u8; PAGE_SIZE]> = vec![[0u8; PAGE_SIZE]; 1000];
let result = compressor.compress_batch(&pages)?;
}

Statistics

#![allow(unused)]
fn main() {
let stats = compressor.stats();

println!("Pages compressed: {}", stats.pages_compressed);
println!("Input bytes: {}", stats.total_bytes_in);
println!("Output bytes: {}", stats.total_bytes_out);
println!("Time: {} ns", stats.total_time_ns);
println!("Ratio: {:.2}x", stats.compression_ratio());
println!("Throughput: {:.2} GB/s", stats.throughput_gbps());
}

Configuration Access

#![allow(unused)]
fn main() {
let config = compressor.config();
println!("Batch size: {}", config.batch_size);
println!("Async DMA: {}", config.async_dma);
}

BatchResult

#![allow(unused)]
fn main() {
pub struct BatchResult {
    /// Compressed pages
    pub pages: Vec<CompressedPage>,

    /// Host-to-device transfer time (ns)
    pub h2d_time_ns: u64,

    /// Kernel execution time (ns)
    pub kernel_time_ns: u64,

    /// Device-to-host transfer time (ns)
    pub d2h_time_ns: u64,

    /// Total wall clock time (ns)
    pub total_time_ns: u64,
}
}

Methods

#![allow(unused)]
fn main() {
// Throughput in bytes/second
let throughput = result.throughput_bytes_per_sec(input_bytes);

// Compression ratio
let ratio = result.compression_ratio();

// Check PCIe 5x rule
let beneficial = result.pcie_rule_satisfied();
}

Helper Functions

`gpu_available()`

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::gpu_available;

if gpu_available() {
    println!("CUDA GPU detected");
}
}

`select_backend()`

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{select_backend, BackendSelection};

let backend = select_backend(batch_size, gpu_available());

match backend {
    BackendSelection::Gpu => { /* use GPU */ }
    BackendSelection::Simd => { /* use CPU SIMD */ }
    BackendSelection::Scalar => { /* use scalar */ }
}
}

`meets_pcie_rule()`

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::meets_pcie_rule;

let pages = 10000;
let pcie_bandwidth = 64.0;  // GB/s (PCIe 5.0)
let gpu_throughput = 500.0; // GB/s

if meets_pcie_rule(pages, pcie_bandwidth, gpu_throughput) {
    println!("GPU offload beneficial");
}
}

GpuDeviceInfo

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::GpuDeviceInfo;

let info = GpuDeviceInfo {
    index: 0,
    name: "RTX 4090".to_string(),
    total_memory: 24 * 1024 * 1024 * 1024,
    l2_cache_size: 72 * 1024 * 1024,
    compute_capability: (8, 9),
    backend: GpuBackend::Cuda,
};

println!("Optimal batch: {} pages", info.optimal_batch_size());
println!("Supported: {}", info.is_supported());
}

Kernel Compatibility API

The compat module provides sysfs interface compatibility with kernel zram.

SysfsInterface

Emulates the kernel zram sysfs interface:

#![allow(unused)]
fn main() {
use trueno_zram_core::compat::SysfsInterface;

let mut interface = SysfsInterface::new();

// Configure like kernel zram
interface.write_attr("disksize", "4G")?;
interface.write_attr("comp_algorithm", "lz4")?;
interface.write_attr("mem_limit", "2G")?;

// Read attributes
let disksize = interface.read_attr("disksize")?;
let algorithm = interface.read_attr("comp_algorithm")?;
}

Supported Attributes

Attribute	Read	Write	Description
`disksize`	Yes	Yes	Disk size in bytes
`comp_algorithm`	Yes	Yes	Compression algorithm
`mem_limit`	Yes	Yes	Memory limit
`mem_used_max`	Yes	No	Peak memory usage
`mem_used_total`	Yes	No	Current memory usage
`orig_data_size`	Yes	No	Original data size
`compr_data_size`	Yes	No	Compressed data size
`num_reads`	Yes	No	Read count
`num_writes`	Yes	No	Write count
`invalid_io`	Yes	No	Invalid I/O count
`notify_free`	Yes	No	Free notifications
`reset`	No	Yes	Reset device

Algorithm Support

#![allow(unused)]
fn main() {
use trueno_zram_core::compat::{ZramAlgorithm, is_algorithm_supported};

// Check algorithm support
assert!(is_algorithm_supported("lz4"));
assert!(is_algorithm_supported("zstd"));
assert!(is_algorithm_supported("lzo"));  // Compatibility alias

// Parse algorithm
let algo = "lz4".parse::<ZramAlgorithm>()?;
}

Supported Algorithms

Name	Alias	Description
`lz4`	-	LZ4 fast compression
`lz4hc`	-	LZ4 high compression
`zstd`	-	Zstandard
`lzo`	`lzo-rle`	LZO (mapped to LZ4)
`842`	`deflate`	842/deflate (mapped to ZSTD)

Statistics

MmStat

Memory statistics (like /sys/block/zram0/mm_stat):

#![allow(unused)]
fn main() {
use trueno_zram_core::compat::MmStat;

let stat = interface.mm_stat();

println!("Original size: {} bytes", stat.orig_data_size);
println!("Compressed size: {} bytes", stat.compr_data_size);
println!("Memory used: {} bytes", stat.mem_used_total);
println!("Memory limit: {} bytes", stat.mem_limit);
println!("Memory max: {} bytes", stat.mem_used_max);
println!("Same pages: {}", stat.same_pages);
println!("Pages stored: {}", stat.pages_compacted);
println!("Huge pages: {}", stat.huge_pages);
}

IoStat

I/O statistics (like /sys/block/zram0/io_stat):

#![allow(unused)]
fn main() {
use trueno_zram_core::compat::IoStat;

let stat = interface.io_stat();

println!("Reads: {}", stat.num_reads);
println!("Writes: {}", stat.num_writes);
println!("Invalid I/O: {}", stat.invalid_io);
println!("Notify free: {}", stat.notify_free);
}

Device Reset

#![allow(unused)]
fn main() {
// Reset all statistics and data
interface.write_attr("reset", "1")?;
}

Integration Example

#![allow(unused)]
fn main() {
use trueno_zram_core::compat::SysfsInterface;
use trueno_zram_core::PAGE_SIZE;

let mut interface = SysfsInterface::new();

// Configure
interface.write_attr("disksize", "1G")?;
interface.write_attr("comp_algorithm", "lz4")?;

// Simulate writes
let page = [0xAA; PAGE_SIZE];
interface.write_page(0, &page)?;
interface.write_page(1, &page)?;

// Check statistics
let mm = interface.mm_stat();
println!("Compression ratio: {:.2}x",
    mm.orig_data_size as f64 / mm.compr_data_size as f64);
}

Benchmarks

Running Benchmarks

# Criterion benchmarks
cargo bench --all-features

# With baseline comparison
cargo bench --all-features -- --save-baseline main

# Example benchmarks
cargo run -p trueno-zram-core --example compress_benchmark --release

Results Summary

LZ4 Performance

Backend	Compress	Decompress	Ratio
AVX-512	4.4 GB/s	5.4 GB/s	2-4x
AVX2	3.2 GB/s	4.1 GB/s	2-4x
Scalar	3.0 GB/s	3.8 GB/s	2-4x

ZSTD Performance

Backend	Level	Compress	Decompress	Ratio
AVX-512	1	11.2 GB/s	46 GB/s	3-5x
AVX-512	3	8.5 GB/s	45 GB/s	4-6x
AVX2	1	8.5 GB/s	35 GB/s	3-5x

Same-Fill Performance

Backend	Detection	Ratio
AVX-512	22 GB/s	2048:1
AVX2	18 GB/s	2048:1
Scalar	12 GB/s	2048:1

Data Patterns

Zeros (Best Case)

Pattern: Zeros (100% same-fill)
Pages: 1000
Compression: 22 GB/s
Decompression: 46 GB/s
Ratio: 2048:1

Text (Compressible)

Pattern: Text/Code
Pages: 1000
LZ4 Compression: 4.4 GB/s
LZ4 Decompression: 5.4 GB/s
Ratio: 3.2:1

Random (Incompressible)

Pattern: Random bytes
Pages: 1000
LZ4 Compression: 1.6 GB/s
LZ4 Decompression: 32 GB/s
Ratio: 1.0:1 (pass-through)

GPU Benchmarks

Note (2026-01-06): GPU decompression is limited by PCIe transfer overhead. CPU parallel path (50+ GB/s) is faster for most workloads.

RTX 4090 (Validated)

Path	Throughput	Notes
CPU Parallel	50+ GB/s	Primary recommended path
GPU End-to-End	~6 GB/s	PCIe 4.0 transfer bottleneck
GPU Kernel-only	~9 GB/s	Without H2D/D2H transfers

Recommendation

Use CPU parallel decompression for best performance. GPU useful for:

Future PCIe 5.0+ systems with higher bandwidth
Workloads where CPU is saturated

Latency

Operation	P50	P99	P99.9
LZ4 compress (4KB)	45us	85us	120us
LZ4 decompress (4KB)	38us	72us	95us
Same-fill detect	8us	15us	25us

Memory Usage

Component	Memory
Hash table (LZ4)	64 KB
Working buffer	16 KB
ZSTD context	256 KB

Comparison with Linux Kernel zram

Validated 2026-01-06 on RTX 4090 + AMD Threadripper 7960X (AVX-512)

Block Device I/O (fio, Direct I/O)

Metric	Kernel ZRAM	trueno-ublk	Speedup
Sequential Read	9.2 GB/s	16.5 GB/s	1.8x
Random 4K IOPS	55K	249K	4.5x
Compression Ratio	2.5x	3.87x	+55%

Compression Engine (cargo examples)

Metric	Linux Kernel zram	trueno-zram	Speedup
Compress (parallel)	3-5 GB/s	20-30 GB/s	5-6x
Decompress (CPU)	~10 GB/s	50+ GB/s	5x
Same-fill detection	~8 GB/s	22 GB/s	2.75x
Compression ratio	2.5x	3.87x	+55%

Architecture Difference

Aspect	Linux Kernel	trueno-zram
Threading	Single-threaded per page	Parallel (rayon)
SIMD	Limited	AVX-512/AVX2/NEON
Batch processing	No	Yes (5000+ pages)
GPU offload	No	Optional CUDA

Falsification Testing (2026-01-06)

Test Configuration:
├── Hardware: AMD Threadripper 7960X, 125GB RAM, RTX 4090
├── Device: trueno-ublk 8GB device
├── Tool: fio, cargo examples

Results:
├── Sequential I/O: 16.5 GB/s ✓ (claim: 12.5 GB/s)
├── Random IOPS: 249K ✓ (claim: 228K)
├── Compression: 30.66 GB/s ✓ (claim: 20-24 GB/s)
├── Ratio: 3.87x ✓ (claim: 3.7x)
├── mlock: 272 MB ✓ (claim: >100 MB)
├── Stress test: No deadlock ✓
└── Status: 6/8 PASS (GPU claims deprecated)

Why trueno-zram is Faster

SIMD vectorization: AVX-512 processes 64 bytes per instruction vs byte-by-byte
Parallel compression: All CPU cores compress simultaneously via rayon
Batch amortization: Setup costs spread across thousands of pages
Cache efficiency: Sequential memory access patterns
Zero-copy paths: Same-fill pages detected without compression attempt

Tuning Guide

Algorithm Selection

LZ4 vs ZSTD

Workload	Recommended	Reason
Real-time	LZ4	Lowest latency
Memory-constrained	ZSTD	Better ratio
Mixed content	Adaptive	Auto-selects
Highly compressible	ZSTD-3	Best ratio

Code Example

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};

// For latency-sensitive workloads
let fast = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// For memory-constrained systems
let compact = CompressorBuilder::new()
    .algorithm(Algorithm::Zstd { level: 3 })
    .build()?;

// For unknown workloads
let adaptive = CompressorBuilder::new()
    .algorithm(Algorithm::Adaptive)
    .build()?;
}

SIMD Backend Selection

The library auto-detects the best backend, but you can force one:

#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, SimdBackend};

// Force AVX2 (e.g., for testing)
let compressor = CompressorBuilder::new()
    .prefer_backend(SimdBackend::Avx2)
    .build()?;
}

GPU Batch Sizing

Optimal Batch Size

GPU	L2 Cache	Optimal Batch
H100	50 MB	10,240 pages
A100	40 MB	8,192 pages
RTX 4090	72 MB	14,745 pages
RTX 3090	6 MB	1,200 pages

Calculation

#![allow(unused)]
fn main() {
fn optimal_batch_size(l2_cache_bytes: usize) -> usize {
    // Each page needs ~3KB working memory
    // Target 80% L2 cache utilization
    (l2_cache_bytes * 80 / 100) / (3 * 1024)
}
}

Async DMA

Enable async DMA for overlapping transfers:

#![allow(unused)]
fn main() {
let config = GpuBatchConfig {
    async_dma: true,
    ring_buffer_slots: 4,  // Pipeline depth
    ..Default::default()
};
}

Benefits:

Overlaps H2D, compute, D2H
20-30% throughput improvement
Higher GPU utilization

Memory Configuration

Working Memory

#![allow(unused)]
fn main() {
// Per-thread memory usage
const LZ4_HASH_TABLE: usize = 64 * 1024;   // 64 KB
const ZSTD_CONTEXT: usize = 256 * 1024;     // 256 KB
const WORKING_BUFFER: usize = 16 * 1024;    // 16 KB
}

Reducing Memory

For memory-constrained systems:

#![allow(unused)]
fn main() {
// Use LZ4 (smaller context)
.algorithm(Algorithm::Lz4)

// Smaller batch sizes
let config = GpuBatchConfig {
    batch_size: 100,
    ..Default::default()
};
}

Monitoring Performance

#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// Process pages...

let stats = compressor.stats();

// Check throughput
if stats.throughput_gbps() < 3.0 {
    println!("Warning: Low throughput, check CPU affinity");
}

// Check ratio
if stats.ratio() < 1.5 {
    println!("Warning: Low compression, data may be incompressible");
}
}

CPU Affinity

For best performance, pin compression threads to physical cores:

# Linux: Pin to cores 0-3
taskset -c 0-3 ./my_app

# Check NUMA topology
numactl --hardware

Kernel Parameters

For zram integration:

# Increase zram size
echo $((8 * 1024 * 1024 * 1024)) > /sys/block/zram0/disksize

# Set compression streams
echo 4 > /sys/block/zram0/max_comp_streams

# Enable writeback
echo 1 > /sys/block/zram0/writeback

PCIe 5x Rule

The PCIe 5x rule determines when GPU offload is beneficial for compression.

The Rule

GPU beneficial when: T_compute > 5 × T_transfer

Where:

T_compute = CPU computation time
T_transfer = PCIe transfer time (H2D + D2H)

Why 5x?

GPU offload has overhead:

H2D transfer: Copy data to GPU
Kernel launch: ~5-10us overhead
D2H transfer: Copy results back
Synchronization: Wait for completion

The 5x factor accounts for these overheads and ensures GPU provides net benefit.

Calculation

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::meets_pcie_rule;
use trueno_zram_core::PAGE_SIZE;

fn check_gpu_benefit(
    pages: usize,
    pcie_bandwidth_gbps: f64,
    gpu_throughput_gbps: f64,
) -> bool {
    let data_bytes = pages * PAGE_SIZE;

    // Transfer time (round trip)
    let transfer_time = 2.0 * data_bytes as f64 / (pcie_bandwidth_gbps * 1e9);

    // GPU compute time
    let gpu_time = data_bytes as f64 / (gpu_throughput_gbps * 1e9);

    // CPU compute time (assume 4 GB/s baseline)
    let cpu_time = data_bytes as f64 / (4e9);

    // GPU beneficial if saves time
    cpu_time > (transfer_time + gpu_time) * 1.2  // 20% margin
}
}

Examples

PCIe 4.0 x16 (25 GB/s)

Batch	Data Size	Transfer	GPU Time	Beneficial?
100	400 KB	32 us	4 us	No
1,000	4 MB	320 us	40 us	Marginal
10,000	40 MB	3.2 ms	400 us	Yes
100,000	400 MB	32 ms	4 ms	Yes

PCIe 5.0 x16 (64 GB/s)

Batch	Data Size	Transfer	GPU Time	Beneficial?
1,000	4 MB	125 us	40 us	No
10,000	40 MB	1.25 ms	400 us	Yes
100,000	400 MB	12.5 ms	4 ms	Yes

Checking in Code

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig};

let mut compressor = GpuBatchCompressor::new(config)?;
let result = compressor.compress_batch(&pages)?;

if result.pcie_rule_satisfied() {
    println!("GPU offload was beneficial");
    println!("Kernel time: {} ns", result.kernel_time_ns);
    println!("Transfer time: {} ns",
        result.h2d_time_ns + result.d2h_time_ns);
} else {
    println!("Consider using CPU for this batch size");
}
}

Backend Selection

#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{select_backend, BackendSelection, gpu_available};

let batch_size = 5000;
let has_gpu = gpu_available();

match select_backend(batch_size, has_gpu) {
    BackendSelection::Gpu => {
        // Use GPU batch compression
    }
    BackendSelection::Simd => {
        // Use CPU SIMD compression
    }
    BackendSelection::Scalar => {
        // Use scalar compression
    }
}
}

Optimization Tips

Batch larger: Combine small batches into larger ones
Use async DMA: Overlap transfers with computation
Profile first: Measure actual transfer times
Consider hybrid: Use CPU for small batches, GPU for large

Hardware-Specific Thresholds

GPU	PCIe	Min Beneficial Batch
H100	5.0 x16	5,000 pages
A100	4.0 x16	8,000 pages
RTX 4090	4.0 x16	10,000 pages
RTX 3090	4.0 x16	12,000 pages

Design Overview

Architecture

┌─────────────────────────────────────────────────────────┐
│                    Public API                           │
│  CompressorBuilder, Algorithm, CompressedPage           │
├─────────────────────────────────────────────────────────┤
│                 Algorithm Selection                      │
│  ┌─────────┐  ┌─────────┐  ┌──────────┐  ┌─────────┐   │
│  │   LZ4   │  │  ZSTD   │  │ Adaptive │  │Samefill │   │
│  └────┬────┘  └────┬────┘  └────┬─────┘  └────┬────┘   │
├───────┼────────────┼────────────┼─────────────┼────────┤
│       │     SIMD Dispatch       │             │         │
│  ┌────▼────┐  ┌────▼────┐  ┌───▼───┐  ┌─────▼─────┐   │
│  │ AVX-512 │  │  AVX2   │  │ NEON  │  │  Scalar   │   │
│  └─────────┘  └─────────┘  └───────┘  └───────────┘   │
├─────────────────────────────────────────────────────────┤
│                   GPU Backend (Optional)                 │
│  ┌─────────────────────────────────────────────────┐   │
│  │  CUDA Batch Compressor (trueno-gpu PTX)         │   │
│  │  ├── H2D Transfer                               │   │
│  │  ├── Warp-Cooperative LZ4 Kernel                │   │
│  │  └── D2H Transfer                               │   │
│  └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘

Crate Structure

trueno-zram/
├── crates/
│   ├── trueno-zram-core/     # Core compression library
│   │   ├── src/
│   │   │   ├── lib.rs        # Public API
│   │   │   ├── error.rs      # Error types
│   │   │   ├── page.rs       # CompressedPage
│   │   │   ├── lz4/          # LZ4 implementation
│   │   │   ├── zstd/         # ZSTD implementation
│   │   │   ├── gpu/          # GPU batch compression
│   │   │   ├── simd/         # SIMD detection/dispatch
│   │   │   ├── samefill.rs   # Same-fill detection
│   │   │   ├── compat.rs     # Kernel compatibility
│   │   │   └── benchmark.rs  # Benchmarking utilities
│   │   └── examples/
│   ├── trueno-zram-adaptive/ # ML-driven selection
│   ├── trueno-zram-generator/# systemd integration
│   └── trueno-zram-cli/      # CLI tool
└── bins/
    └── trueno-ublk/          # ublk daemon

Key Design Decisions

1. Runtime SIMD Dispatch

CPU features are detected at runtime, not compile time:

#![allow(unused)]
fn main() {
// Detection happens once at startup
let features = simd::detect();

// Dispatch based on available features
if features.has_avx512() {
    lz4::avx512::compress(input, output)
} else if features.has_avx2() {
    lz4::avx2::compress(input, output)
} else {
    lz4::scalar::compress(input, output)
}
}

2. Page-Based Compression

All compression operates on fixed 4KB pages:

#![allow(unused)]
fn main() {
pub const PAGE_SIZE: usize = 4096;

// This is enforced at the type level
pub fn compress(page: &[u8; PAGE_SIZE]) -> Result<CompressedPage>;
}

3. Builder Pattern

Configuration via builder pattern:

#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .prefer_backend(SimdBackend::Avx512)
    .build()?;
}

4. Trait-Based Abstraction

The PageCompressor trait enables polymorphism:

#![allow(unused)]
fn main() {
pub trait PageCompressor {
    fn compress(&self, page: &[u8; PAGE_SIZE]) -> Result<CompressedPage>;
    fn decompress(&self, page: &CompressedPage) -> Result<[u8; PAGE_SIZE]>;
}
}

5. Zero-Copy Where Possible

Minimize allocations in hot paths:

#![allow(unused)]
fn main() {
// Output buffer passed in, not allocated
fn compress_into(input: &[u8], output: &mut [u8]) -> Result<usize>;
}

6. No Panics in Library Code

All errors are returned as Result:

#![allow(unused)]
#![deny(clippy::panic)]
#![deny(clippy::unwrap_used)]
fn main() {
}

Dependencies

Crate	Purpose
`thiserror`	Error derive macros
`cudarc`	CUDA driver bindings
`rayon`	Parallel iteration
`trueno-gpu`	Pure Rust PTX generation

Feature Flags

Flag	Default	Description
`std`	Yes	Standard library
`nightly`	No	Nightly SIMD features
`cuda`	No	CUDA GPU support

SIMD Dispatch

Overview

trueno-zram uses runtime CPU feature detection to select the optimal SIMD implementation.

Detection

#![allow(unused)]
fn main() {
// src/simd/detect.rs

pub fn detect() -> SimdFeatures {
    #[cfg(target_arch = "x86_64")]
    {
        SimdFeatures {
            avx512: is_x86_feature_detected!("avx512f")
                && is_x86_feature_detected!("avx512bw")
                && is_x86_feature_detected!("avx512vl"),
            avx2: is_x86_feature_detected!("avx2"),
            sse42: is_x86_feature_detected!("sse4.2"),
        }
    }

    #[cfg(target_arch = "aarch64")]
    {
        SimdFeatures {
            neon: true, // Always available on AArch64
        }
    }
}
}

Dispatch Pattern

#![allow(unused)]
fn main() {
// src/simd/dispatch.rs

pub fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
    let features = detect();

    #[cfg(target_arch = "x86_64")]
    {
        if features.has_avx512() {
            return unsafe { avx512::compress(input, output) };
        }
        if features.has_avx2() {
            return unsafe { avx2::compress(input, output) };
        }
    }

    #[cfg(target_arch = "aarch64")]
    {
        if features.has_neon() {
            return unsafe { neon::compress(input, output) };
        }
    }

    scalar::compress(input, output)
}
}

Implementation Structure

Each algorithm has separate implementations:

lz4/
├── mod.rs          # Public API, dispatch
├── compress.rs     # Core algorithm logic
├── decompress.rs   # Decompression
├── avx512.rs       # AVX-512 specialization
├── avx2.rs         # AVX2 specialization
├── neon.rs         # ARM NEON specialization
└── scalar.rs       # Fallback implementation

AVX-512 Implementation

#![allow(unused)]
fn main() {
// lz4/avx512.rs

#[target_feature(enable = "avx512f,avx512bw,avx512vl")]
pub unsafe fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
    // 64-byte hash table lookups
    let hash_chunk = _mm512_loadu_si512(input.as_ptr());

    // Parallel match finding
    let matches = _mm512_cmpeq_epi32(hash_chunk, pattern);

    // Wide literal copies
    _mm512_storeu_si512(output.as_mut_ptr(), data);

    // ...
}
}

AVX2 Implementation

#![allow(unused)]
fn main() {
// lz4/avx2.rs

#[target_feature(enable = "avx2")]
pub unsafe fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
    // 32-byte operations
    let chunk = _mm256_loadu_si256(input.as_ptr());

    // Match extension
    let cmp = _mm256_cmpeq_epi8(src, dst);
    let mask = _mm256_movemask_epi8(cmp);

    // ...
}
}

NEON Implementation

#![allow(unused)]
fn main() {
// lz4/neon.rs

#[cfg(target_arch = "aarch64")]
pub unsafe fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
    use std::arch::aarch64::*;

    // 16-byte operations
    let chunk = vld1q_u8(input.as_ptr());

    // Parallel comparison
    let cmp = vceqq_u8(src, dst);

    // ...
}
}

Benchmarking Dispatch

#![allow(unused)]
fn main() {
use trueno_zram_core::simd::{detect, SimdBackend};

fn benchmark_all_backends() {
    let features = detect();
    let input = [0xAA; PAGE_SIZE];
    let mut output = [0u8; PAGE_SIZE * 2];

    // Benchmark available backends
    if features.has_avx512() {
        bench("AVX-512", || avx512::compress(&input, &mut output));
    }
    if features.has_avx2() {
        bench("AVX2", || avx2::compress(&input, &mut output));
    }
    bench("Scalar", || scalar::compress(&input, &mut output));
}
}

Compile-Time Optimization

For maximum performance, enable target features:

# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]

Or for specific features:

rustflags = ["-C", "target-feature=+avx2,+avx512f,+avx512bw"]

GPU Pipeline

Overview

The GPU pipeline uses CUDA for batch compression with async DMA.

┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐
│   Host     │───▶│    H2D     │───▶│   Kernel   │───▶│    D2H     │
│   Memory   │    │  Transfer  │    │  Execution │    │  Transfer  │
└────────────┘    └────────────┘    └────────────┘    └────────────┘
      ▲                                                      │
      └──────────────────────────────────────────────────────┘

Pipeline Stages

1. Host-to-Device Transfer

#![allow(unused)]
fn main() {
fn transfer_to_device(&self, pages: &[[u8; PAGE_SIZE]]) -> Result<CudaSlice<u8>> {
    // Flatten pages into contiguous buffer
    let total_bytes = pages.len() * PAGE_SIZE;
    let mut flat_data = Vec::with_capacity(total_bytes);
    for page in pages {
        flat_data.extend_from_slice(page);
    }

    // Async copy to device
    let device_buffer = self.stream.clone_htod(&flat_data)?;
    Ok(device_buffer)
}
}

2. Kernel Execution

#![allow(unused)]
fn main() {
fn execute_kernel(&self, input: &CudaSlice<u8>, batch_size: u32) -> Result<CudaSlice<u8>> {
    // Allocate output buffer
    let mut output = self.stream.alloc_zeros::<u8>(batch_size as usize * PAGE_SIZE)?;
    let mut sizes = self.stream.alloc_zeros::<u32>(batch_size as usize)?;

    // Launch kernel
    // Grid: ceil(batch_size / 4) blocks
    // Block: 128 threads (4 warps)
    unsafe {
        self.stream
            .launch_builder(&self.kernel_fn)
            .arg(&input)
            .arg(&mut output)
            .arg(&mut sizes)
            .arg(&batch_size)
            .launch(cfg)?;
    }

    self.stream.synchronize()?;
    Ok(output)
}
}

3. Device-to-Host Transfer

#![allow(unused)]
fn main() {
fn transfer_from_device(&self, data: CudaSlice<u8>) -> Result<Vec<CompressedPage>> {
    let output = self.stream.clone_dtoh(&data)?;

    // Convert to CompressedPage structures
    // ...
}
}

Warp-Cooperative Kernel

The LZ4 kernel uses warp-cooperative compression:

Block (128 threads)
├── Warp 0 (32 threads) → Page 0
├── Warp 1 (32 threads) → Page 1
├── Warp 2 (32 threads) → Page 2
└── Warp 3 (32 threads) → Page 3

Shared Memory Layout

Shared Memory (48 KB per block)
├── Warp 0: 12 KB (hash table + working)
├── Warp 1: 12 KB
├── Warp 2: 12 KB
└── Warp 3: 12 KB

PTX Generation

Using trueno-gpu for pure Rust PTX:

#![allow(unused)]
fn main() {
use trueno_gpu::kernels::lz4::Lz4WarpCompressKernel;
use trueno_gpu::kernels::Kernel;

let kernel = Lz4WarpCompressKernel::new(65536);
let ptx_string = kernel.emit_ptx();

// Load PTX into CUDA module
let ptx = Ptx::from(ptx_string);
let module = context.load_module(ptx)?;
}

Async DMA Ring Buffer

#![allow(unused)]
fn main() {
struct AsyncPipeline {
    slots: Vec<PipelineSlot>,
    head: usize,
    tail: usize,
}

struct PipelineSlot {
    input_buffer: CudaSlice<u8>,
    output_buffer: CudaSlice<u8>,
    event: CudaEvent,
    state: SlotState,
}

enum SlotState {
    Free,
    H2DInProgress,
    KernelInProgress,
    D2HInProgress,
    Complete,
}
}

Pipelining

Time ──────────────────────────────────────────────▶

Slot 0: [H2D][Kernel][D2H]
Slot 1:      [H2D][Kernel][D2H]
Slot 2:           [H2D][Kernel][D2H]
Slot 3:                [H2D][Kernel][D2H]

Performance Optimization

1. Pinned Memory

#![allow(unused)]
fn main() {
// Use pinned (page-locked) memory for faster transfers
let pinned_buffer = cuda_malloc_host(size)?;
}

2. Stream Overlap

#![allow(unused)]
fn main() {
// Use separate streams for H2D, compute, D2H
let h2d_stream = context.create_stream()?;
let compute_stream = context.create_stream()?;
let d2h_stream = context.create_stream()?;
}

3. Kernel Occupancy

#![allow(unused)]
fn main() {
// Optimal: 4 warps per block = 128 threads
// Shared memory: 48 KB (12 KB per warp)
// Registers: ~32 per thread
}

Error Handling

#![allow(unused)]
fn main() {
match kernel_result {
    Ok(output) => process_output(output),
    Err(CudaError::IllegalAddress) => {
        // Memory access violation
        fallback_to_cpu()?
    }
    Err(CudaError::LaunchFailed) => {
        // Kernel launch failed
        fallback_to_cpu()?
    }
    Err(e) => Err(Error::GpuNotAvailable(e.to_string())),
}
}

trueno-ublk Overview

trueno-ublk is a GPU-accelerated ZRAM replacement that uses the Linux ublk interface to provide a high-performance compressed block device in userspace.

Production Status (2026-01-06)

MILESTONE DT-005 ACHIEVED: trueno-ublk is running as system swap!

8GB device active as primary swap (priority 150)
CPU SIMD compression at 20-30 GB/s with 3.87x ratio
CPU parallel decompression at 50+ GB/s

DT-007 COMPLETED: Swap deadlock issue FIXED via mlock() - daemon memory pinned.

Known Limitations:

Docker cannot isolate ublk devices (host kernel resources)

What is ublk?

ublk (userspace block device) is a Linux kernel interface that allows implementing block devices entirely in userspace. It uses io_uring for efficient I/O handling, avoiding the overhead of kernel context switches.

Architecture

┌─────────────────────────────────────────────────────────┐
│                    Applications                          │
├─────────────────────────────────────────────────────────┤
│                    Filesystem                            │
├─────────────────────────────────────────────────────────┤
│                /dev/ublkbN block device                  │
├─────────────────────────────────────────────────────────┤
│                    Linux Kernel                          │
│                   (ublk driver)                          │
├─────────────────────────────────────────────────────────┤
│                    io_uring                              │
├─────────────────────────────────────────────────────────┤
│                  trueno-ublk                             │
│  ┌───────────────────────────────────────────────────┐  │
│  │              Entropy Router                        │  │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────────────┐   │  │
│  │  │  GPU    │  │  SIMD   │  │     Scalar      │   │  │
│  │  │ (batch) │  │ (AVX2)  │  │ (incompressible)│   │  │
│  │  └─────────┘  └─────────┘  └─────────────────┘   │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │          trueno-zram-core (LZ4/ZSTD)              │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

Features

SIMD-Accelerated Compression

Uses trueno-zram-core for vectorized LZ4/ZSTD compression:

AVX2 (256-bit) on modern x86_64
AVX-512 (512-bit) on supported CPUs
NEON (128-bit) on ARM64

GPU Batch Compression

Offloads compression to CUDA GPUs when beneficial:

Warp-cooperative LZ4 kernel
PCIe 5x rule evaluation
Async DMA for overlap

Entropy-Based Routing

Analyzes data entropy to choose the optimal compression path:

Low entropy (< 4.0 bits): GPU batch compression
Medium entropy (4-7 bits): SIMD compression
High entropy (> 7.0 bits): Scalar or skip compression

Zero-Page Deduplication

Automatically detects and deduplicates all-zero pages, achieving 2048:1 compression for sparse data.

zram-Compatible Statistics

Exports statistics in the same format as kernel zram, enabling drop-in monitoring compatibility.

CLI Usage

# Create a 1TB compressed RAM disk
trueno-ublk create -s 1T -a lz4 --gpu

# List devices
trueno-ublk list

# Show statistics
trueno-ublk stat /dev/ublkb0

# Interactive dashboard
trueno-ublk top

# Analyze data entropy
trueno-ublk entropy /dev/ublkb0

Requirements

Linux kernel 6.0+ with ublk support
CAP_SYS_ADMIN capability (for device creation)
Optional: NVIDIA GPU with CUDA support

Block Device API

trueno-ublk provides a BlockDevice type for creating compressed block devices in pure Rust, without requiring kernel privileges.

Basic Usage

#![allow(unused)]
fn main() {
use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};

// Create a compressor
let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// Create a 1GB block device
let mut device = BlockDevice::new(1 << 30, compressor);

// Write data (must be page-aligned)
let data = vec![0xAB; PAGE_SIZE];
device.write(0, &data)?;

// Read back
let mut buf = vec![0u8; PAGE_SIZE];
device.read(0, &mut buf)?;
assert_eq!(data, buf);
}

Page Size

All I/O operations must be aligned to PAGE_SIZE (4096 bytes):

#![allow(unused)]
fn main() {
use trueno_zram_core::PAGE_SIZE;

// Write at page boundaries
device.write(0, &data)?;                    // Page 0
device.write(PAGE_SIZE as u64, &data)?;     // Page 1
device.write(2 * PAGE_SIZE as u64, &data)?; // Page 2
}

Statistics

Track compression performance with BlockDeviceStats:

#![allow(unused)]
fn main() {
let stats = device.stats();

println!("Pages stored:      {}", stats.pages_stored);
println!("Bytes written:     {}", stats.bytes_written);
println!("Bytes compressed:  {}", stats.bytes_compressed);
println!("Compression ratio: {:.2}x", stats.compression_ratio());
println!("Zero pages:        {}", stats.zero_pages);
println!("GPU pages:         {}", stats.gpu_pages);
println!("SIMD pages:        {}", stats.simd_pages);
println!("Scalar pages:      {}", stats.scalar_pages);
}

Entropy Threshold

Configure entropy-based routing with with_entropy_threshold:

#![allow(unused)]
fn main() {
// Use threshold of 7.0 bits/byte
let device = BlockDevice::with_entropy_threshold(
    1 << 30,         // 1GB
    compressor,
    7.0,             // Entropy threshold
);
}

Data with entropy above this threshold is routed to the scalar path (assumed incompressible).

Discard Operation

Free pages that are no longer needed:

#![allow(unused)]
fn main() {
// Discard page 0
device.discard(0, PAGE_SIZE as u64)?;

// Reading discarded pages returns zeros
let mut buf = vec![0xFFu8; PAGE_SIZE];
device.read(0, &mut buf)?;
assert!(buf.iter().all(|&b| b == 0));
}

UblkDevice (Kernel Interface)

For creating actual block devices visible to the system, use UblkDevice:

#![allow(unused)]
fn main() {
use trueno_ublk::{UblkDevice, DeviceConfig};
use trueno_zram_core::Algorithm;

let config = DeviceConfig {
    size: 1 << 40,  // 1TB
    algorithm: Algorithm::Lz4,
    streams: 4,
    gpu_enabled: true,
    mem_limit: Some(8 << 30),  // 8GB RAM limit
    backing_dev: None,
    writeback_limit: None,
    entropy_skip_threshold: 7.5,
    gpu_batch_size: 1024,
};

// Requires CAP_SYS_ADMIN
let device = UblkDevice::create(config)?;
println!("Created device: /dev/ublkb{}", device.id());
}

DeviceStats

The DeviceStats struct provides zram-compatible statistics:

#![allow(unused)]
fn main() {
let stats = device.stats();

// mm_stat fields (zram compatible)
println!("Original data:    {} bytes", stats.orig_data_size);
println!("Compressed data:  {} bytes", stats.compr_data_size);
println!("Memory used:      {} bytes", stats.mem_used_total);
println!("Same pages:       {}", stats.same_pages);
println!("Huge pages:       {}", stats.huge_pages);

// io_stat fields
println!("Failed reads:     {}", stats.failed_reads);
println!("Failed writes:    {}", stats.failed_writes);

// trueno-ublk extensions
println!("GPU pages:        {}", stats.gpu_pages);
println!("SIMD pages:       {}", stats.simd_pages);
println!("Throughput:       {:.2} GB/s", stats.throughput_gbps);
println!("Avg entropy:      {:.2} bits", stats.avg_entropy);
println!("SIMD backend:     {}", stats.simd_backend);
}

Entropy Routing

trueno-ublk uses Shannon entropy analysis to route data to the optimal compression backend.

Shannon Entropy

Shannon entropy measures the information density of data in bits per byte. The formula is:

H(X) = -Σ p(x) * log2(p(x))

Where p(x) is the probability of each byte value occurring.

Entropy (bits/byte)	Data Type	Compressibility
0.0	All same value	Extremely high
1.0-3.0	Simple patterns	Very high
4.0-6.0	Text, code	Good
7.0-7.5	Compressed data	Poor
8.0	Random/encrypted	None

Routing Strategy

trueno-ublk routes pages based on their entropy:

┌──────────────────────────────────────────────────────────┐
│                    Incoming Page                          │
│                         │                                 │
│                    Calculate Entropy                      │
│                         │                                 │
│           ┌─────────────┼─────────────┐                  │
│           ▼             ▼             ▼                  │
│      entropy < 4.0  4.0 ≤ e ≤ 7.0  entropy > 7.0        │
│           │             │             │                  │
│           ▼             ▼             ▼                  │
│      GPU Batch      SIMD Path    Scalar/Skip            │
│    (highly comp.)  (normal data)  (incompress.)         │
└──────────────────────────────────────────────────────────┘

GPU Batch Path (entropy < 4.0)

Used for highly compressible data
Benefits from GPU parallelism
Batches multiple pages for efficiency
Best for: zeros, repeating patterns, sparse data

SIMD Path (4.0 ≤ entropy ≤ 7.0)

Used for typical data
AVX2/AVX-512/NEON acceleration
Single-page processing
Best for: text, code, structured data

Scalar/Skip Path (entropy > 7.0)

Used for incompressible data
Avoids wasting CPU cycles
May store uncompressed
Best for: encrypted data, already-compressed media

Configuration

Set the entropy threshold when creating a device:

#![allow(unused)]
fn main() {
use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder};

let compressor = CompressorBuilder::new()
    .algorithm(Algorithm::Lz4)
    .build()?;

// Lower threshold = more aggressive SIMD usage
let device = BlockDevice::with_entropy_threshold(
    1 << 30,    // Size
    compressor,
    6.5,        // Custom threshold (default is 7.0)
);
}

Monitoring Entropy

Use the CLI to analyze device entropy:

# Show entropy distribution
trueno-ublk entropy /dev/ublkb0

# Output:
# Entropy Distribution:
#   0.0-2.0 bits: ████████████████████ 45% (GPU)
#   2.0-4.0 bits: ████████ 18% (GPU)
#   4.0-6.0 bits: ██████████ 22% (SIMD)
#   6.0-7.0 bits: ████ 10% (SIMD)
#   7.0-8.0 bits: ██ 5% (Scalar)
#
# Average entropy: 3.2 bits/byte
# Recommended threshold: 7.0

Statistics

Track routing decisions via stats:

#![allow(unused)]
fn main() {
let stats = device.stats();

println!("Routing Statistics:");
println!("  GPU pages:    {} ({:.1}%)",
    stats.gpu_pages,
    100.0 * stats.gpu_pages as f64 / stats.pages_stored as f64);
println!("  SIMD pages:   {} ({:.1}%)",
    stats.simd_pages,
    100.0 * stats.simd_pages as f64 / stats.pages_stored as f64);
println!("  Scalar pages: {} ({:.1}%)",
    stats.scalar_pages,
    100.0 * stats.scalar_pages as f64 / stats.pages_stored as f64);
println!("  Zero pages:   {} ({:.1}%)",
    stats.zero_pages,
    100.0 * stats.zero_pages as f64 / stats.pages_stored as f64);
}

Zero-Page Optimization

All-zero pages receive special handling regardless of entropy:

#![allow(unused)]
fn main() {
// Zero pages are detected and deduplicated
let zeros = vec![0u8; PAGE_SIZE];
device.write(0, &zeros)?;
device.write(PAGE_SIZE as u64, &zeros)?;
device.write(2 * PAGE_SIZE as u64, &zeros)?;

let stats = device.stats();
// All three pages share the same zero-page representation
assert_eq!(stats.zero_pages, 3);
}

This achieves effective 2048:1 compression for sparse data like freshly-allocated memory.

Visualization & Observability

trueno-ublk v3.17.0 integrates with the renacer visualization framework for real-time monitoring, benchmarking, and distributed tracing.

Overview

The visualization system (VIZ-001/002/003/004) provides:

Real-time TUI dashboard - Monitor throughput, IOPS, and tier distribution
JSON/HTML reports - Export benchmark results for analysis
OTLP integration - Distributed tracing to Jaeger/Tempo

CLI Flags

Real-time Visualization (VIZ-002)

# Launch TUI dashboard (requires foreground mode)
sudo trueno-ublk create --size 8G --backend tiered \
    --visualize \
    --foreground

Benchmark Reports (VIZ-003)

# Text output (default)
trueno-ublk benchmark --size 4G --format text

# JSON output (machine-readable)
trueno-ublk benchmark --size 4G --format json > results.json

# HTML report with charts
trueno-ublk benchmark --size 4G --format html -o report.html

# Include ML anomaly detection
trueno-ublk benchmark --size 4G --format json --ml-anomaly

OTLP Tracing (VIZ-004)

# Export traces to Jaeger
sudo trueno-ublk create --size 8G \
    --otlp-endpoint http://localhost:4317 \
    --otlp-service-name trueno-ublk

JSON Schema

The benchmark JSON output follows the trueno-renacer-v1 schema:

{
  "version": "3.17.0",
  "format": "trueno-renacer-v1",
  "benchmark": {
    "workload": "sequential",
    "duration_sec": 60,
    "size_bytes": 4294967296,
    "backend": "tiered",
    "iterations": 3
  },
  "metrics": {
    "throughput_gbps": 7.9,
    "iops": 666000,
    "compression_ratio": 2.8,
    "same_fill_pages": 1048576,
    "tier_distribution": {
      "kernel_zram": 0.65,
      "simd_zstd": 0.25,
      "nvme_direct": 0.05,
      "same_fill": 0.05
    },
    "entropy_histogram": {
      "p50": 4.2,
      "p75": 5.8,
      "p90": 6.9,
      "p95": 7.3,
      "p99": 7.8
    }
  },
  "ml_analysis": {
    "anomalies": [],
    "clusters": 3,
    "silhouette_score": 0.82
  }
}

TruenoCollector API

For programmatic access, use TruenoCollector which implements renacer’s Collector trait:

#![allow(unused)]
fn main() {
use trueno_ublk::visualize::TruenoCollector;
use renacer::visualize::collectors::{Collector, MetricValue};
use std::sync::Arc;

// Create collector from TieredPageStore
let collector = TruenoCollector::new(Arc::clone(&store));

// Collect metrics
let metrics = collector.collect()?;

// Access individual metrics
if let Some(MetricValue::Gauge(throughput)) = metrics.get("throughput_gbps") {
    println!("Throughput: {:.1} GB/s", throughput);
}
}

Metrics Reference

Metric	Type	Description
`throughput_gbps`	Gauge	Current I/O throughput in GB/s
`iops`	Rate	Operations per second
`compression_ratio`	Gauge	Overall compression ratio
`pages_total`	Counter	Total pages stored
`same_fill_pages`	Counter	Pages detected as same-fill
`tier_kernel_zram_pct`	Gauge	% of pages in kernel ZRAM tier
`tier_simd_pct`	Gauge	% of pages in SIMD ZSTD tier
`tier_skip_pct`	Gauge	% of pages skipping compression
`tier_samefill_pct`	Gauge	% of pages detected as same-fill

Dashboard Panels

When using --visualize, the TUI displays:

Panel	Description
Tier Heatmap	Real-time routing decisions
Throughput Gauge	Current GB/s with history
IOPS Counter	Operations per second
Entropy Timeline	Data compressibility over time
Ratio Trend	Compression ratio history

Example: Benchmark Workflow

# 1. Run benchmark with JSON output
trueno-ublk benchmark --size 4G --format json \
    --workload mixed --iterations 5 > bench.json

# 2. Generate HTML report
trueno-ublk benchmark --size 4G --format html \
    --workload mixed -o benchmark-report.html

# 3. Analyze with jq
cat bench.json | jq '.metrics.throughput_gbps'

# 4. Compare algorithms
trueno-ublk benchmark --size 4G --backend memory --format json > lz4.json
trueno-ublk benchmark --size 4G --backend tiered --format json > tiered.json

Integration with Jaeger

For distributed tracing:

# Start Jaeger (all-in-one)
docker run -d --name jaeger \
    -p 4317:4317 \
    -p 16686:16686 \
    jaegertracing/all-in-one:latest

# Create device with OTLP tracing
sudo trueno-ublk create --size 8G \
    --backend tiered \
    --otlp-endpoint http://localhost:4317 \
    --otlp-service-name trueno-swap

# View traces at http://localhost:16686

trueno-ublk Examples

trueno-ublk includes several examples demonstrating its features.

Running Examples

# Basic block device usage
cargo run --example block_device -p trueno-ublk

# Compression statistics comparison
cargo run --example compression_stats -p trueno-ublk

# Entropy-based routing demonstration
cargo run --example entropy_routing -p trueno-ublk

# v3.17.0 Examples
# ----------------

# Visualization demo (VIZ-001/002/003/004)
cargo run --example visualization_demo -p trueno-ublk

# ZSTD vs LZ4 performance comparison (use --release for accurate benchmarks)
cargo run --example zstd_vs_lz4 -p trueno-ublk --release

# Tiered storage architecture demo (KERN-001/002/003)
cargo run --example tiered_storage -p trueno-ublk

# Batched compression benchmark
cargo run --example batched_benchmark -p trueno-ublk --release

Example: Basic Block Device

Demonstrates creating a compressed block device, writing data, and reading it back.

use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};

fn main() -> anyhow::Result<()> {
    // Create an LZ4 compressor
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    // Create a 64MB block device
    let mut device = BlockDevice::new(64 * 1024 * 1024, compressor);

    // Write different patterns
    let compressible = vec![0xAA; PAGE_SIZE];
    device.write(0, &compressible)?;

    let zeros = vec![0u8; PAGE_SIZE];
    device.write(PAGE_SIZE as u64, &zeros)?;

    // Read back and verify
    let mut buf = vec![0u8; PAGE_SIZE];
    device.read(0, &mut buf)?;
    assert_eq!(buf, compressible);

    // Check statistics
    let stats = device.stats();
    println!("Compression ratio: {:.2}x", stats.compression_ratio());

    Ok(())
}

Output:

trueno-ublk Block Device Example
=================================

Created LZ4 compressor with SIMD backend
Created block device: 64 MB

Writing test patterns...
  Page 0: Highly compressible (all 0xAA)
  Page 1: Zero page
  Page 2: Sequential data
  Page 3: Pseudo-random data

Reading back and verifying...
  Page 0: OK
  Page 1: OK
  Page 2: OK
  Page 3: OK

Device Statistics:
  Pages stored:      4
  Bytes written:     16 KB
  Bytes compressed:  0 KB
  Compression ratio: 27.86x
  Zero pages:        1

All tests passed!

Example: Compression Statistics

Compares LZ4 and ZSTD compression on various data patterns.

use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};

fn main() -> anyhow::Result<()> {
    let test_data = vec![
        ("All zeros", vec![0u8; PAGE_SIZE]),
        ("Repeating pattern", (0..PAGE_SIZE).map(|i| (i % 4) as u8).collect()),
        ("High entropy", (0..PAGE_SIZE).map(|i| ((i * 17 + 31) % 256) as u8).collect()),
    ];

    for algorithm in [Algorithm::Lz4, Algorithm::Zstd { level: 3 }] {
        let compressor = CompressorBuilder::new()
            .algorithm(algorithm)
            .build()?;

        let mut device = BlockDevice::new(64 * 1024 * 1024, compressor);

        for (i, (_, data)) in test_data.iter().enumerate() {
            device.write((i * PAGE_SIZE) as u64, data)?;
        }

        let stats = device.stats();
        println!("{:?}: {:.2}x compression", algorithm, stats.compression_ratio());
    }

    Ok(())
}

Output:

Algorithm: Lz4
------------------------------------------------------------
  Results:
    Compression ratio: 27.25x
    Space savings:     96.3%

Algorithm: Zstd { level: 3 }
------------------------------------------------------------
  Results:
    Compression ratio: 1.50x
    Space savings:     33.3%

Example: Entropy Routing

Shows how data is routed to different backends based on entropy.

use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};

fn main() -> anyhow::Result<()> {
    let compressor = CompressorBuilder::new()
        .algorithm(Algorithm::Lz4)
        .build()?;

    // Set entropy threshold to 7.0 bits/byte
    let mut device = BlockDevice::with_entropy_threshold(
        64 * 1024 * 1024,
        compressor,
        7.0,
    );

    // Low entropy - routed to GPU
    let low_entropy = vec![0u8; PAGE_SIZE];
    device.write(0, &low_entropy)?;

    // Medium entropy - routed to SIMD
    let medium: Vec<u8> = (0..PAGE_SIZE).map(|i| (i % 16) as u8).collect();
    device.write(PAGE_SIZE as u64, &medium)?;

    // High entropy - routed to scalar
    let high: Vec<u8> = (0..PAGE_SIZE).map(|i| ((i * 17 + 31) % 256) as u8).collect();
    device.write(2 * PAGE_SIZE as u64, &high)?;

    let stats = device.stats();
    println!("GPU pages:    {}", stats.gpu_pages);
    println!("SIMD pages:   {}", stats.simd_pages);
    println!("Scalar pages: {}", stats.scalar_pages);
    println!("Zero pages:   {}", stats.zero_pages);

    Ok(())
}

Output:

trueno-ublk Entropy Routing Example
===================================

Routing Statistics:
  GPU pages:    1 (low entropy, highly compressible)
  SIMD pages:   1 (medium entropy, normal data)
  Scalar pages: 3 (high entropy, incompressible)
  Zero pages:   1 (all zeros, deduplicated)

Total compression ratio: 2.88x

CLI Examples

Create a Device

# Create 1TB device with LZ4 and GPU acceleration
trueno-ublk create -s 1T -a lz4 --gpu

# Create with memory limit
trueno-ublk create -s 512G -a zstd --mem-limit 8G

# Create with custom entropy threshold
trueno-ublk create -s 256G -a lz4 --entropy-threshold 6.5

Monitor Devices

# List all devices
trueno-ublk list

# Show detailed stats
trueno-ublk stat /dev/ublkb0

# Interactive dashboard
trueno-ublk top

Manage Devices

# Reset statistics
trueno-ublk reset /dev/ublkb0

# Compact memory
trueno-ublk compact /dev/ublkb0

# Set runtime parameters
trueno-ublk set /dev/ublkb0 --mem-limit 16G

Contributing

Development Setup

# Clone the repository
git clone https://github.com/paiml/trueno-zram
cd trueno-zram

# Build
cargo build --all-features

# Run tests
cargo test --workspace --all-features

# Run with CUDA
cargo test --workspace --features cuda

Code Style

Format with cargo fmt
Lint with cargo clippy --all-features -- -D warnings
No panics in library code
All public items must be documented

Testing

Unit Tests

cargo test --workspace --all-features

Coverage

cargo llvm-cov --workspace --all-features

Target: 95% line coverage.

Mutation Testing

cargo mutants --package trueno-zram-core

Target: 80% mutation score.

Quality Gates

Before submitting a PR:

Formatting: cargo fmt --check
Linting: cargo clippy --all-features -- -D warnings
Tests: cargo test --workspace --all-features
Documentation: cargo doc --no-deps
Coverage: >= 95%

Commit Messages

Follow conventional commits:

feat: Add new compression algorithm
fix: Handle edge case in decompression
perf: Optimize hash table lookup
docs: Update API documentation
test: Add property-based tests
refactor: Simplify SIMD dispatch

Always reference the work item:

feat: Add GPU batch compression (Refs ZRAM-001)

Pull Request Process

Fork the repository
Create a feature branch
Make changes with tests
Run quality gates
Submit PR with description
Address review feedback

Architecture

See Design Overview for architecture decisions.

Adding a New Algorithm

Create module in src/algorithms/
Implement PageCompressor trait
Add SIMD implementations
Add to Algorithm enum
Write tests and benchmarks
Update documentation

Adding SIMD Backend

Add detection in src/simd/detect.rs
Create implementation file (e.g., avx512.rs)
Add dispatch in src/simd/dispatch.rs
Write correctness tests
Run benchmarks

Reporting Issues

Use GitHub Issues with:

Clear description
Reproduction steps
Expected vs actual behavior
System info (CPU, OS, Rust version)

License

Contributions are licensed under MIT OR Apache-2.0.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.1.0 - 2025-01-04

Added

Core compression library (trueno-zram-core)
- LZ4 compression with AVX2/AVX-512/NEON acceleration
- ZSTD compression with SIMD optimization
- Runtime CPU feature detection and dispatch
- Same-fill detection for 2048:1 zero page compression
GPU batch compression
- CUDA support via cudarc
- Pure Rust PTX generation via trueno-gpu
- Warp-cooperative LZ4 kernel (4 warps/block)
- PCIe 5x rule evaluation
- Async DMA ring buffer support
Kernel compatibility
- Sysfs interface compatible with kernel zram
- Algorithm compatibility layer
- Statistics matching kernel format
Benchmarking utilities
- Criterion benchmarks
- Example programs for testing
- Performance measurement infrastructure

Performance

LZ4 compression: 4.4 GB/s (AVX-512)
LZ4 decompression: 5.4 GB/s (AVX-512)
ZSTD compression: 11.2 GB/s (AVX-512)
ZSTD decompression: 46 GB/s (AVX-512)
Same-fill detection: 22 GB/s

Infrastructure

Published to crates.io as trueno-zram-core
461 tests passing
94% test coverage
Full documentation with mdBook

0.2.0 - 2026-01-06

Added

DT-007: Swap Deadlock Prevention
- mlock() integration via duende-mlock crate
- Daemon memory pinning prevents swap deadlock
- Works in both foreground and background modes
trueno-ublk daemon
- ublk-based block device for kernel bypass
- Hybrid CPU/GPU architecture
- 12.5 GB/s sequential read (fio verified)
- 228K IOPS random 4K read

Performance Improvements

Sequential I/O: 16.5 GB/s (1.8x vs kernel ZRAM)
Random 4K IOPS: 249K (4.5x vs kernel ZRAM)
Compression ratio: 3.87x (+55% vs kernel ZRAM)
CPU parallel decompression: 50+ GB/s

Fixed

Background mode mlock (DT-007e: mlock called after fork)
Clippy warnings in GPU batch compression module

[3.17.0] - 2026-01-07

Added

VIZ-001: TruenoCollector - Renacer visualization integration
- Implements renacer::Collector trait for metrics collection
- Feeds throughput, IOPS, tier distribution to visualization framework
VIZ-002: --visualize flag - Real-time TUI dashboard
- Tier heatmap, throughput gauge, entropy timeline
- Requires --foreground mode
VIZ-003: Benchmark reports - JSON/HTML export
- trueno-ublk benchmark --format json|html|text
- trueno-renacer-v1 schema for ML pipelines
- Self-contained HTML reports with tier distribution charts
VIZ-004: OTLP integration - Distributed tracing
- --otlp-endpoint and --otlp-service-name flags
- Export traces to Jaeger/Tempo
Release Verification Matrix (docs/release_qa_checklist.md)
- Falsification-first QA protocol
- Performance thresholds: >7.2 GB/s zero-page, >550K IOPS

Performance

ZSTD Recommendation: ZSTD-1 is 3x faster than LZ4 on AVX-512
- Compress: 15.4 GiB/s (vs 5.2 GiB/s LZ4)
- Decompress: ~10 GiB/s (vs ~1.5 GiB/s LZ4)
- Usage: --algorithm zstd

Documentation

New book chapter: Visualization & Observability
Updated kernel-zram-parity roadmap (all items COMPLETE)
New examples:
- visualization_demo - TruenoCollector metrics demo
- zstd_vs_lz4 - Algorithm performance comparison
- tiered_storage - Kernel-cooperative architecture demo

Unreleased

Planned

trueno-zram-adaptive: ML-driven algorithm selection
trueno-zram-generator: systemd integration
trueno-zram-cli: zramctl replacement

Keyboard shortcuts

trueno-zram