Introduction
trueno-zram is a SIMD and GPU-accelerated memory compression library for Linux systems. It provides userspace Rust implementations of LZ4 and ZSTD compression that leverage modern CPU vector instructions (AVX2, AVX-512, NEON) and optional CUDA GPU acceleration.
Production Status (2026-01-06)
MILESTONE DT-005 ACHIEVED: trueno-zram is running as system swap!
- 8GB device active as primary swap (priority 150)
- Validated under memory pressure with ~185MB swap utilization
- CPU SIMD compress (20-30 GB/s) + CPU parallel decompress (48 GB/s)
DT-007 COMPLETED: Swap deadlock issue FIXED via mlock() - 211 MB daemon memory pinned.
Known Limitations:
- GPU compression blocked by NVIDIA F082 bug (F081 was falsified, using CPU SIMD instead)
- I/O throughput lower than kernel ZRAM (userspace overhead)
Why trueno-zram?
trueno-zram provides better compression efficiency than kernel ZRAM:
| Advantage | trueno-zram | Kernel ZRAM |
|---|---|---|
| Compression Ratio | 3.87x | 2.5x |
| Space Efficiency | 55% better | baseline |
| P99 Latency | 16.5 µs | varies |
Trade-off: Kernel ZRAM has higher raw I/O throughput (operates entirely in kernel space). trueno-zram uses ublk which adds userspace overhead but provides:
- Runtime SIMD dispatch: Automatically selects AVX-512, AVX2, or NEON based on CPU
- Userspace flexibility: debugging, monitoring, custom algorithms
- Adaptive algorithm selection: ML-driven selection based on page entropy
- Better compression: 3.87x vs kernel’s 2.5x
Validated Performance (QA-FALSIFY-001)
| Claim | Result | Status |
|---|---|---|
| Compression ratio | 3.87x | PASS |
| SIMD compression | 20-30 GB/s | PASS |
| SIMD decompression | 48 GB/s | PASS |
| P99 latency | 16.5 µs | PASS |
| mlock (DT-007) | 211 MB | PASS |
| Actual | |
|---|---|
| Kernel 3-13x faster | |
| 123K IOPS |
All metrics independently verified via falsification testing (2026-01-06)
Part of the PAIML Ecosystem
trueno-zram is part of the “Batuta Stack”:
- trueno - High-performance SIMD compute library
- trueno-gpu - Pure Rust PTX generation for CUDA
- aprender - Machine learning in pure Rust
- certeza - Asymptotic test effectiveness framework
License
trueno-zram is dual-licensed under MIT and Apache-2.0.
Installation
From crates.io
Add trueno-zram-core to your Cargo.toml:
[dependencies]
trueno-zram-core = "0.1"
Or use cargo add:
cargo add trueno-zram-core
With CUDA Support
For GPU acceleration, enable the cuda feature:
[dependencies]
trueno-zram-core = { version = "0.1", features = ["cuda"] }
Or:
cargo add trueno-zram-core --features cuda
CUDA Requirements
- CUDA Toolkit 12.8 or later
- NVIDIA driver supporting CUDA 12.8
- GPU with compute capability >= 7.0 (Volta or newer)
Feature Flags
| Feature | Description | Default |
|---|---|---|
std | Standard library support | Yes |
nightly | Nightly-only SIMD features | No |
cuda | CUDA GPU acceleration | No |
System Requirements
- OS: Linux (kernel >= 5.10 LTS)
- CPU: x86_64 (AVX2/AVX-512) or AArch64 (NEON)
- Rust: 1.82.0 or later (MSRV)
Verifying Installation
use trueno_zram_core::{CompressorBuilder, Algorithm};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
println!("trueno-zram installed successfully!");
println!("SIMD backend: {:?}", compressor.backend());
Ok(())
}
Building from Source
git clone https://github.com/paiml/trueno-zram
cd trueno-zram
# Build all crates
cargo build --release --all-features
# Run tests
cargo test --workspace --all-features
# Build with CUDA
cargo build --release --features cuda
Quick Start
Basic Compression
use trueno_zram_core::{CompressorBuilder, Algorithm, PAGE_SIZE};
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create a compressor with LZ4 algorithm
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// Create a test page (4KB)
let mut page = [0u8; PAGE_SIZE];
page[0..100].copy_from_slice(&[0xAA; 100]);
// Compress
let compressed = compressor.compress(&page)?;
println!("Original size: {} bytes", PAGE_SIZE);
println!("Compressed size: {} bytes", compressed.data.len());
println!("Ratio: {:.2}x", compressed.ratio());
// Decompress
let decompressed = compressor.decompress(&compressed)?;
assert_eq!(page, decompressed);
println!("Decompression verified!");
Ok(())
}
Choosing an Algorithm
#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};
// LZ4: Fastest compression, good for most workloads
let lz4 = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// ZSTD Level 1: Better ratio, still fast
let zstd_fast = CompressorBuilder::new()
.algorithm(Algorithm::Zstd { level: 1 })
.build()?;
// ZSTD Level 3: Best ratio for compressible data
let zstd_best = CompressorBuilder::new()
.algorithm(Algorithm::Zstd { level: 3 })
.build()?;
// Adaptive: Automatically selects based on entropy
let adaptive = CompressorBuilder::new()
.algorithm(Algorithm::Adaptive)
.build()?;
}
Compression Statistics
#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// Compress some pages
for _ in 0..100 {
let page = [0u8; PAGE_SIZE];
let _ = compressor.compress(&page)?;
}
// Get statistics
let stats = compressor.stats();
println!("Pages compressed: {}", stats.pages);
println!("Total input: {} bytes", stats.bytes_in);
println!("Total output: {} bytes", stats.bytes_out);
println!("Overall ratio: {:.2}x", stats.ratio());
println!("Throughput: {:.2} GB/s", stats.throughput_gbps());
}
Error Handling
use trueno_zram_core::{CompressorBuilder, Algorithm, Error};
fn compress_page(data: &[u8; PAGE_SIZE]) -> Result<Vec<u8>, Error> {
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
let compressed = compressor.compress(data)?;
Ok(compressed.data)
}
fn main() {
let page = [0u8; PAGE_SIZE];
match compress_page(&page) {
Ok(data) => println!("Compressed to {} bytes", data.len()),
Err(Error::BufferTooSmall(msg)) => eprintln!("Buffer error: {msg}"),
Err(Error::CorruptedData(msg)) => eprintln!("Corrupt data: {msg}"),
Err(e) => eprintln!("Other error: {e}"),
}
}
Next Steps
- Learn about GPU Batch Compression
- Explore SIMD Acceleration
- See more Examples
Examples
Running Examples
trueno-zram includes several examples to demonstrate its features:
# GPU information and backend selection
cargo run -p trueno-zram-core --example gpu_info
cargo run -p trueno-zram-core --example gpu_info --features cuda
# Compression benchmarks
cargo run -p trueno-zram-core --example compress_benchmark --release
cargo run -p trueno-zram-core --example compress_benchmark --release --features cuda
GPU Info Example
Shows GPU detection, backend selection logic, and PCIe 5x rule evaluation:
trueno-zram GPU Information
============================
1. GPU Availability
----------------
GPU available: true
CUDA Device Information:
Device: NVIDIA GeForce RTX 4090
Compute Capability: SM 8.9
Memory: 24564 MB
L2 Cache: 73728 KB
Optimal batch: 14745 pages
Supported: true
2. Backend Selection Logic
-----------------------
GPU_MIN_BATCH_SIZE: 1000 pages
PAGE_SIZE: 4096 bytes
Batch No GPU With GPU
------------------------------------------
1 Scalar Scalar
10 Simd Simd
100 Simd Simd
500 Simd Simd
1000 Simd Gpu
5000 Simd Gpu
10000 Simd Gpu
3. PCIe 5x Rule Evaluation
-----------------------
GPU offload beneficial when: T_cpu > 5 * (T_transfer + T_gpu)
1K pages, PCIe 4.0, 100 GB/s GPU (4 MB): CPU preferred
10K pages, PCIe 4.0, 100 GB/s GPU (40 MB): GPU beneficial
100K pages, PCIe 5.0, 500 GB/s GPU (400 MB): GPU beneficial
Compression Benchmark
Measures throughput across different data patterns:
trueno-zram Compression Benchmark
=================================
Pattern: Zeros (compressible)
----------------------------------------------------------------------
Pages Algorithm Compress Decompress Ratio Backend
100 Lz4 22.01 GB/s 46.02 GB/s 2048.00x Avx512
1000 Lz4 21.87 GB/s 45.91 GB/s 2048.00x Avx512
Pattern: Text (compressible)
----------------------------------------------------------------------
Pages Algorithm Compress Decompress Ratio Backend
100 Lz4 4.57 GB/s 5.42 GB/s 3.21x Avx512
1000 Lz4 4.44 GB/s 5.37 GB/s 3.21x Avx512
Pattern: Random (incompressible)
----------------------------------------------------------------------
Pages Algorithm Compress Decompress Ratio Backend
100 Lz4 1.87 GB/s 43.78 GB/s 1.00x Avx512
1000 Lz4 1.61 GB/s 31.64 GB/s 1.00x Avx512
Custom Example: Batch Processing
use trueno_zram_core::{CompressorBuilder, Algorithm, PAGE_SIZE};
use std::time::Instant;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// Generate test pages
let pages: Vec<[u8; PAGE_SIZE]> = (0..1000)
.map(|i| {
let mut page = [0u8; PAGE_SIZE];
// Create compressible pattern
for (j, byte) in page.iter_mut().enumerate() {
*byte = ((i + j) % 256) as u8;
}
page
})
.collect();
// Benchmark compression
let start = Instant::now();
let mut total_compressed = 0;
for page in &pages {
let compressed = compressor.compress(page)?;
total_compressed += compressed.data.len();
}
let elapsed = start.elapsed();
let input_bytes = pages.len() * PAGE_SIZE;
let throughput = input_bytes as f64 / elapsed.as_secs_f64() / 1e9;
let ratio = input_bytes as f64 / total_compressed as f64;
println!("Compressed {} pages in {:?}", pages.len(), elapsed);
println!("Throughput: {:.2} GB/s", throughput);
println!("Compression ratio: {:.2}x", ratio);
Ok(())
}
Custom Example: GPU Batch Compression
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig, gpu_available};
use trueno_zram_core::{Algorithm, PAGE_SIZE};
fn main() -> Result<(), Box<dyn std::error::Error>> {
if !gpu_available() {
println!("No GPU available, skipping GPU example");
return Ok(());
}
let config = GpuBatchConfig {
device_index: 0,
algorithm: Algorithm::Lz4,
batch_size: 1000,
async_dma: true,
ring_buffer_slots: 4,
};
let mut compressor = GpuBatchCompressor::new(config)?;
// Create batch of pages
let pages: Vec<[u8; PAGE_SIZE]> = vec![[0u8; PAGE_SIZE]; 1000];
// Compress batch
let result = compressor.compress_batch(&pages)?;
println!("Batch Results:");
println!(" Pages: {}", result.pages.len());
println!(" H2D time: {} ns", result.h2d_time_ns);
println!(" Kernel time: {} ns", result.kernel_time_ns);
println!(" D2H time: {} ns", result.d2h_time_ns);
println!(" Total time: {} ns", result.total_time_ns);
println!(" Compression ratio: {:.2}x", result.compression_ratio());
println!(" PCIe 5x rule satisfied: {}", result.pcie_rule_satisfied());
// Get compressor stats
let stats = compressor.stats();
println!("\nCompressor Stats:");
println!(" Pages compressed: {}", stats.pages_compressed);
println!(" Throughput: {:.2} GB/s", stats.throughput_gbps());
Ok(())
}
Compression Algorithms
trueno-zram supports multiple compression algorithms optimized for memory page compression.
LZ4
LZ4 is a lossless compression algorithm focused on compression and decompression speed.
#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
}
Characteristics
| Metric | Value |
|---|---|
| Compression speed | 4.4 GB/s (AVX-512) |
| Decompression speed | 5.4 GB/s (AVX-512) |
| Typical ratio | 2-4x for compressible data |
| Best for | Speed-critical workloads |
When to Use LZ4
- Real-time compression requirements
- High-throughput memory compression
- When CPU overhead must be minimal
- Mixed workloads with varying compressibility
ZSTD (Zstandard)
ZSTD provides better compression ratios while maintaining good speed.
#![allow(unused)]
fn main() {
// Fast mode (level 1)
let fast = CompressorBuilder::new()
.algorithm(Algorithm::Zstd { level: 1 })
.build()?;
// Better compression (level 3)
let better = CompressorBuilder::new()
.algorithm(Algorithm::Zstd { level: 3 })
.build()?;
}
Compression Levels
| Level | Compression | Decompression | Ratio |
|---|---|---|---|
| 1 | 11.2 GB/s | 46 GB/s | Better than LZ4 |
| 3 | 8.5 GB/s | 45 GB/s | Best |
When to Use ZSTD
- Memory-constrained systems
- Highly compressible data (text, logs)
- When compression ratio matters more than speed
- Cold/archived memory pages
Adaptive Selection
The adaptive algorithm automatically selects based on page entropy:
#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Adaptive)
.build()?;
}
Selection Logic
- Same-fill detection: Pages with repeated values use 2048:1 encoding
- Entropy analysis: Shannon entropy determines compressibility
- Algorithm selection:
- Low entropy (< 4 bits): ZSTD for best ratio
- Medium entropy (4-7 bits): LZ4 for balance
- High entropy (> 7 bits): Pass-through (incompressible)
Same-Fill Optimization
Pages filled with the same byte value get special 2048:1 compression:
#![allow(unused)]
fn main() {
use trueno_zram_core::samefill::{detect_same_fill, CompactSameFill};
let zero_page = [0u8; 4096];
if let Some(fill) = detect_same_fill(&zero_page) {
let compact = CompactSameFill::new(fill);
// Only 2 bytes needed to represent 4096-byte page!
assert_eq!(compact.to_bytes().len(), 2);
}
}
Same-Fill Statistics
- Zero pages: ~30-40% of typical memory
- Same-fill pages: ~35-45% total
- Compression ratio: 2048:1
Algorithm Comparison
| Algorithm | Compress | Decompress | Ratio | Use Case |
|---|---|---|---|---|
| LZ4 | 4.4 GB/s | 5.4 GB/s | 2-4x | General |
| ZSTD-1 | 11.2 GB/s | 46 GB/s | 3-5x | Balanced |
| ZSTD-3 | 8.5 GB/s | 45 GB/s | 4-6x | Best ratio |
| Same-fill | N/A | N/A | 2048x | Zero/repeated |
| Adaptive | Varies | Varies | Optimal | Automatic |
SIMD Acceleration
trueno-zram uses runtime CPU feature detection to select the optimal SIMD implementation.
Supported Backends
| Backend | Instruction Set | Register Width | Platforms |
|---|---|---|---|
| AVX-512 | AVX-512F/BW/VL | 512-bit | Skylake-X, Ice Lake, Zen 4 |
| AVX2 | AVX2 + FMA | 256-bit | Haswell+, Zen 1+ |
| NEON | ARM NEON | 128-bit | ARMv8-A (AArch64) |
| Scalar | None | 64-bit | All platforms |
Runtime Detection
#![allow(unused)]
fn main() {
use trueno_zram_core::simd::{detect, SimdFeatures};
let features = detect();
println!("AVX-512: {}", features.has_avx512());
println!("AVX2: {}", features.has_avx2());
println!("SSE4.2: {}", features.has_sse42());
}
Automatic Dispatch
The compressor automatically uses the best available backend:
#![allow(unused)]
fn main() {
use trueno_zram_core::CompressorBuilder;
let compressor = CompressorBuilder::new().build()?;
// Check which backend was selected
println!("Backend: {:?}", compressor.backend());
}
Performance by Backend
LZ4 Compression
| Backend | Throughput | Relative |
|---|---|---|
| AVX-512 | 4.4 GB/s | 1.45x |
| AVX2 | 3.2 GB/s | 1.05x |
| Scalar | 3.0 GB/s | 1.0x |
ZSTD Compression
| Backend | Throughput | Relative |
|---|---|---|
| AVX-512 | 11.2 GB/s | 1.40x |
| AVX2 | 8.5 GB/s | 1.06x |
| Scalar | 8.0 GB/s | 1.0x |
SIMD Optimizations
Hash Table Lookups (LZ4)
AVX-512 enables parallel hash probing for match finding:
// Scalar: Sequential probe
for offset in 0..16 {
if hash_table[hash + offset] == pattern { ... }
}
// AVX-512: Parallel probe (16 comparisons at once)
let matches = _mm512_cmpeq_epi32(hash_values, pattern_broadcast);
Literal Copying
Wide vector moves for copying uncompressed literals:
// AVX-512: 64 bytes per iteration
_mm512_storeu_si512(dst, _mm512_loadu_si512(src));
// AVX2: 32 bytes per iteration
_mm256_storeu_si256(dst, _mm256_loadu_si256(src));
Match Extension
SIMD comparison for extending matches:
#![allow(unused)]
fn main() {
// Compare 64 bytes at once with AVX-512
let cmp = _mm512_cmpeq_epi8(src_chunk, dst_chunk);
let mask = _mm512_movepi8_mask(cmp);
let match_len = mask.trailing_ones();
}
Forcing a Backend
For testing or benchmarking, you can force a specific backend:
#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, SimdBackend};
// Force scalar (no SIMD)
let scalar = CompressorBuilder::new()
.prefer_backend(SimdBackend::Scalar)
.build()?;
// Force AVX2 (will fail if not available)
let avx2 = CompressorBuilder::new()
.prefer_backend(SimdBackend::Avx2)
.build()?;
}
GPU Batch Compression
trueno-zram supports CUDA GPU acceleration for batch compression of memory pages.
Current Status (2026-01-06)
Important: GPU compression is currently blocked by NVIDIA PTX bug F082 (Computed Address Bug). The production architecture uses:
- Compression: CPU SIMD (AVX-512) at 20-30 GB/s with 3.87x ratio
- Decompression: CPU parallel at 50+ GB/s (primary path)
GPU decompression is available but CPU parallel path is faster due to PCIe transfer overhead (~6 GB/s end-to-end vs 50+ GB/s CPU).
NVIDIA F082 Bug (Computed Address Bug)
F081 (Loaded Value Bug) was FALSIFIED on 2026-01-05 - the pattern works correctly.
The actual bug is F082: addresses computed from shared memory values cause crashes:
// F081 pattern - WORKS CORRECTLY (falsified):
ld.shared.u32 %r_val, [addr]; // Load from shared memory
st.global.u32 [dest], %r_val; // Actually works!
// F082 pattern - CRASHES:
ld.shared.u32 %r_offset, [shared_addr]; // Load offset from shared memory
add.u64 %r_dest, %r_base, %r_offset; // Compute destination address
st.global.u32 [%r_dest], %r_data; // CRASH - address derived from shared load
Status: True root cause identified via Popperian falsification. See KF-002 and ublk-batched-gpu-compression.md for details.
When to Use GPU
GPU decompression is beneficial when:
- Large batches: 2000+ pages to decompress
- PCIe 5x rule satisfied: Computation time > 5x transfer time
- GPU available: CUDA-capable GPU with SM 7.0+
Basic Usage
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig, gpu_available};
use trueno_zram_core::{Algorithm, PAGE_SIZE};
if gpu_available() {
let config = GpuBatchConfig {
device_index: 0,
algorithm: Algorithm::Lz4,
batch_size: 1000,
async_dma: true,
ring_buffer_slots: 4,
};
let mut compressor = GpuBatchCompressor::new(config)?;
let pages: Vec<[u8; PAGE_SIZE]> = vec![[0u8; PAGE_SIZE]; 1000];
let result = compressor.compress_batch(&pages)?;
println!("Compression ratio: {:.2}x", result.compression_ratio());
}
}
Configuration Options
#![allow(unused)]
fn main() {
pub struct GpuBatchConfig {
/// CUDA device index (0 = first GPU)
pub device_index: u32,
/// Compression algorithm
pub algorithm: Algorithm,
/// Number of pages per batch
pub batch_size: usize,
/// Enable async DMA transfers
pub async_dma: bool,
/// Ring buffer slots for pipelining
pub ring_buffer_slots: usize,
}
}
Batch Results
The BatchResult provides timing breakdown:
#![allow(unused)]
fn main() {
let result = compressor.compress_batch(&pages)?;
// Timing components
println!("H2D transfer: {} ns", result.h2d_time_ns);
println!("Kernel execution: {} ns", result.kernel_time_ns);
println!("D2H transfer: {} ns", result.d2h_time_ns);
println!("Total time: {} ns", result.total_time_ns);
// Metrics
let throughput = result.throughput_bytes_per_sec(pages.len() * PAGE_SIZE);
println!("Throughput: {:.2} GB/s", throughput / 1e9);
println!("Compression ratio: {:.2}x", result.compression_ratio());
println!("PCIe 5x rule satisfied: {}", result.pcie_rule_satisfied());
}
Backend Selection
Use select_backend to determine optimal backend:
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{select_backend, gpu_available};
let batch_size = 5000;
let has_gpu = gpu_available();
let backend = select_backend(batch_size, has_gpu);
println!("Selected backend: {:?}", backend);
// Output: Gpu (for large batches with GPU available)
}
Supported GPUs
| GPU | Architecture | SM | Optimal Batch |
|---|---|---|---|
| H100 | Hopper | 9.0 | 10,240 pages |
| A100 | Ampere | 8.0 | 8,192 pages |
| RTX 4090 | Ada | 8.9 | 14,745 pages |
| RTX 3090 | Ampere | 8.6 | 6,144 pages |
Pure Rust PTX Generation
trueno-zram uses trueno-gpu for pure Rust PTX generation:
- No LLVM dependency
- No nvcc required
- Kernel code in Rust, compiled to PTX at runtime
- Warp-cooperative LZ4 compression (4 warps/block)
#![allow(unused)]
fn main() {
// The LZ4 kernel processes 4 pages per block (1 page per warp)
// Uses shared memory for hash tables and match finding
// cvta.shared.u64 for generic addressing
}
Same-Fill Detection
Same-fill optimization provides 2048:1 compression for pages containing a single repeated byte value.
Why Same-Fill Matters
Memory pages often contain:
- Zero pages: ~30-40% of typical memory (uninitialized, freed)
- Same-fill pages: ~5-10% additional (memset patterns)
Detecting and encoding these specially provides massive compression wins.
Detection
#![allow(unused)]
fn main() {
use trueno_zram_core::samefill::detect_same_fill;
use trueno_zram_core::PAGE_SIZE;
let zero_page = [0u8; PAGE_SIZE];
let pattern_page = [0xAA; PAGE_SIZE];
let mixed_page = [0u8; PAGE_SIZE];
// Zero page detected
assert!(detect_same_fill(&zero_page).is_some());
// Pattern page detected
assert!(detect_same_fill(&pattern_page).is_some());
// Mixed content not detected
let mut mixed = [0u8; PAGE_SIZE];
mixed[100] = 0xFF;
assert!(detect_same_fill(&mixed).is_none());
}
Compact Encoding
Same-fill pages compress to just 2 bytes:
#![allow(unused)]
fn main() {
use trueno_zram_core::samefill::CompactSameFill;
let fill_value = 0u8;
let compact = CompactSameFill::new(fill_value);
// Serialize (2 bytes)
let bytes = compact.to_bytes();
assert_eq!(bytes.len(), 2);
// Deserialize
let restored = CompactSameFill::from_bytes(&bytes)?;
assert_eq!(restored.fill_value(), 0);
// Expand back to full page
let page = restored.expand();
assert_eq!(page.len(), PAGE_SIZE);
assert!(page.iter().all(|&b| b == 0));
}
Compression Ratio
| Page Type | Original | Compressed | Ratio |
|---|---|---|---|
| Zero-fill | 4096 B | 2 B | 2048:1 |
| 0xFF-fill | 4096 B | 2 B | 2048:1 |
| Any same-fill | 4096 B | 2 B | 2048:1 |
Integration with Compressor
The compressor automatically detects same-fill pages:
#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
let zero_page = [0u8; PAGE_SIZE];
let compressed = compressor.compress(&zero_page)?;
// Same-fill pages get special encoding
println!("Compressed size: {} bytes", compressed.data.len());
// Output: ~20 bytes (LZ4 minimal encoding for same-fill)
}
Performance
Same-fill detection is extremely fast:
#![allow(unused)]
fn main() {
// SIMD-accelerated comparison
// AVX-512: Check 64 bytes per iteration
// AVX2: Check 32 bytes per iteration
// Typical: <100ns for 4KB page
}
Memory Statistics
On typical systems:
| Memory Type | Same-Fill % |
|---|---|
| Idle desktop | 60-70% |
| Active workload | 35-45% |
| Database server | 25-35% |
| Compilation | 40-50% |
CompressorBuilder API
The CompressorBuilder provides a fluent API for configuring compression.
Basic Usage
#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
}
Builder Methods
algorithm(algo: Algorithm)
Sets the compression algorithm:
#![allow(unused)]
fn main() {
// LZ4 (fastest)
.algorithm(Algorithm::Lz4)
// ZSTD with compression level
.algorithm(Algorithm::Zstd { level: 1 })
.algorithm(Algorithm::Zstd { level: 3 })
// Adaptive (auto-select based on entropy)
.algorithm(Algorithm::Adaptive)
}
prefer_backend(backend: SimdBackend)
Forces a specific SIMD backend:
#![allow(unused)]
fn main() {
use trueno_zram_core::SimdBackend;
// Force scalar (no SIMD)
.prefer_backend(SimdBackend::Scalar)
// Force AVX2
.prefer_backend(SimdBackend::Avx2)
// Force AVX-512
.prefer_backend(SimdBackend::Avx512)
}
build()
Creates the compressor:
#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?; // Returns Result<Compressor, Error>
}
Compressor Methods
compress(&self, page: &[u8; PAGE_SIZE]) -> Result<CompressedPage>
Compresses a single page:
#![allow(unused)]
fn main() {
let page = [0u8; PAGE_SIZE];
let compressed = compressor.compress(&page)?;
println!("Size: {} bytes", compressed.data.len());
println!("Ratio: {:.2}x", compressed.ratio());
}
decompress(&self, page: &CompressedPage) -> Result<[u8; PAGE_SIZE]>
Decompresses a page:
#![allow(unused)]
fn main() {
let decompressed = compressor.decompress(&compressed)?;
assert_eq!(page, decompressed);
}
stats(&self) -> CompressionStats
Returns compression statistics:
#![allow(unused)]
fn main() {
let stats = compressor.stats();
println!("Pages: {}", stats.pages);
println!("Bytes in: {}", stats.bytes_in);
println!("Bytes out: {}", stats.bytes_out);
println!("Ratio: {:.2}x", stats.ratio());
println!("Compress time: {} ns", stats.compress_time_ns);
println!("Decompress time: {} ns", stats.decompress_time_ns);
println!("Throughput: {:.2} GB/s", stats.throughput_gbps());
}
reset_stats(&mut self)
Resets statistics counters:
#![allow(unused)]
fn main() {
compressor.reset_stats();
}
backend(&self) -> SimdBackend
Returns the active SIMD backend:
#![allow(unused)]
fn main() {
println!("Backend: {:?}", compressor.backend());
// Output: Avx512, Avx2, Neon, or Scalar
}
CompressedPage
The compressed page structure:
#![allow(unused)]
fn main() {
pub struct CompressedPage {
/// Compressed data
pub data: Vec<u8>,
/// Original size (always PAGE_SIZE)
pub original_size: usize,
/// Algorithm used
pub algorithm: Algorithm,
}
impl CompressedPage {
/// Compression ratio
pub fn ratio(&self) -> f64;
/// Bytes saved
pub fn bytes_saved(&self) -> usize;
/// Check if actually compressed
pub fn is_compressed(&self) -> bool;
}
}
Error Handling
#![allow(unused)]
fn main() {
use trueno_zram_core::Error;
match compressor.compress(&page) {
Ok(compressed) => { /* success */ }
Err(Error::BufferTooSmall(msg)) => { /* buffer issue */ }
Err(Error::CorruptedData(msg)) => { /* corrupt input */ }
Err(Error::InvalidInput(msg)) => { /* invalid params */ }
Err(e) => { /* other error */ }
}
}
GPU Batch API
The GPU batch API provides high-throughput compression for large page batches.
GpuBatchConfig
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::GpuBatchConfig;
use trueno_zram_core::Algorithm;
let config = GpuBatchConfig {
device_index: 0, // CUDA device (0 = first GPU)
algorithm: Algorithm::Lz4,
batch_size: 1000, // Pages per batch
async_dma: true, // Enable async transfers
ring_buffer_slots: 4, // Pipeline depth
};
}
GpuBatchCompressor
Creation
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig};
let config = GpuBatchConfig::default();
let mut compressor = GpuBatchCompressor::new(config)?;
}
Batch Compression
#![allow(unused)]
fn main() {
let pages: Vec<[u8; PAGE_SIZE]> = vec![[0u8; PAGE_SIZE]; 1000];
let result = compressor.compress_batch(&pages)?;
}
Statistics
#![allow(unused)]
fn main() {
let stats = compressor.stats();
println!("Pages compressed: {}", stats.pages_compressed);
println!("Input bytes: {}", stats.total_bytes_in);
println!("Output bytes: {}", stats.total_bytes_out);
println!("Time: {} ns", stats.total_time_ns);
println!("Ratio: {:.2}x", stats.compression_ratio());
println!("Throughput: {:.2} GB/s", stats.throughput_gbps());
}
Configuration Access
#![allow(unused)]
fn main() {
let config = compressor.config();
println!("Batch size: {}", config.batch_size);
println!("Async DMA: {}", config.async_dma);
}
BatchResult
#![allow(unused)]
fn main() {
pub struct BatchResult {
/// Compressed pages
pub pages: Vec<CompressedPage>,
/// Host-to-device transfer time (ns)
pub h2d_time_ns: u64,
/// Kernel execution time (ns)
pub kernel_time_ns: u64,
/// Device-to-host transfer time (ns)
pub d2h_time_ns: u64,
/// Total wall clock time (ns)
pub total_time_ns: u64,
}
}
Methods
#![allow(unused)]
fn main() {
// Throughput in bytes/second
let throughput = result.throughput_bytes_per_sec(input_bytes);
// Compression ratio
let ratio = result.compression_ratio();
// Check PCIe 5x rule
let beneficial = result.pcie_rule_satisfied();
}
Helper Functions
gpu_available()
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::gpu_available;
if gpu_available() {
println!("CUDA GPU detected");
}
}
select_backend()
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{select_backend, BackendSelection};
let backend = select_backend(batch_size, gpu_available());
match backend {
BackendSelection::Gpu => { /* use GPU */ }
BackendSelection::Simd => { /* use CPU SIMD */ }
BackendSelection::Scalar => { /* use scalar */ }
}
}
meets_pcie_rule()
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::meets_pcie_rule;
let pages = 10000;
let pcie_bandwidth = 64.0; // GB/s (PCIe 5.0)
let gpu_throughput = 500.0; // GB/s
if meets_pcie_rule(pages, pcie_bandwidth, gpu_throughput) {
println!("GPU offload beneficial");
}
}
GpuDeviceInfo
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::GpuDeviceInfo;
let info = GpuDeviceInfo {
index: 0,
name: "RTX 4090".to_string(),
total_memory: 24 * 1024 * 1024 * 1024,
l2_cache_size: 72 * 1024 * 1024,
compute_capability: (8, 9),
backend: GpuBackend::Cuda,
};
println!("Optimal batch: {} pages", info.optimal_batch_size());
println!("Supported: {}", info.is_supported());
}
Kernel Compatibility API
The compat module provides sysfs interface compatibility with kernel zram.
SysfsInterface
Emulates the kernel zram sysfs interface:
#![allow(unused)]
fn main() {
use trueno_zram_core::compat::SysfsInterface;
let mut interface = SysfsInterface::new();
// Configure like kernel zram
interface.write_attr("disksize", "4G")?;
interface.write_attr("comp_algorithm", "lz4")?;
interface.write_attr("mem_limit", "2G")?;
// Read attributes
let disksize = interface.read_attr("disksize")?;
let algorithm = interface.read_attr("comp_algorithm")?;
}
Supported Attributes
| Attribute | Read | Write | Description |
|---|---|---|---|
disksize | Yes | Yes | Disk size in bytes |
comp_algorithm | Yes | Yes | Compression algorithm |
mem_limit | Yes | Yes | Memory limit |
mem_used_max | Yes | No | Peak memory usage |
mem_used_total | Yes | No | Current memory usage |
orig_data_size | Yes | No | Original data size |
compr_data_size | Yes | No | Compressed data size |
num_reads | Yes | No | Read count |
num_writes | Yes | No | Write count |
invalid_io | Yes | No | Invalid I/O count |
notify_free | Yes | No | Free notifications |
reset | No | Yes | Reset device |
Algorithm Support
#![allow(unused)]
fn main() {
use trueno_zram_core::compat::{ZramAlgorithm, is_algorithm_supported};
// Check algorithm support
assert!(is_algorithm_supported("lz4"));
assert!(is_algorithm_supported("zstd"));
assert!(is_algorithm_supported("lzo")); // Compatibility alias
// Parse algorithm
let algo = "lz4".parse::<ZramAlgorithm>()?;
}
Supported Algorithms
| Name | Alias | Description |
|---|---|---|
lz4 | - | LZ4 fast compression |
lz4hc | - | LZ4 high compression |
zstd | - | Zstandard |
lzo | lzo-rle | LZO (mapped to LZ4) |
842 | deflate | 842/deflate (mapped to ZSTD) |
Statistics
MmStat
Memory statistics (like /sys/block/zram0/mm_stat):
#![allow(unused)]
fn main() {
use trueno_zram_core::compat::MmStat;
let stat = interface.mm_stat();
println!("Original size: {} bytes", stat.orig_data_size);
println!("Compressed size: {} bytes", stat.compr_data_size);
println!("Memory used: {} bytes", stat.mem_used_total);
println!("Memory limit: {} bytes", stat.mem_limit);
println!("Memory max: {} bytes", stat.mem_used_max);
println!("Same pages: {}", stat.same_pages);
println!("Pages stored: {}", stat.pages_compacted);
println!("Huge pages: {}", stat.huge_pages);
}
IoStat
I/O statistics (like /sys/block/zram0/io_stat):
#![allow(unused)]
fn main() {
use trueno_zram_core::compat::IoStat;
let stat = interface.io_stat();
println!("Reads: {}", stat.num_reads);
println!("Writes: {}", stat.num_writes);
println!("Invalid I/O: {}", stat.invalid_io);
println!("Notify free: {}", stat.notify_free);
}
Device Reset
#![allow(unused)]
fn main() {
// Reset all statistics and data
interface.write_attr("reset", "1")?;
}
Integration Example
#![allow(unused)]
fn main() {
use trueno_zram_core::compat::SysfsInterface;
use trueno_zram_core::PAGE_SIZE;
let mut interface = SysfsInterface::new();
// Configure
interface.write_attr("disksize", "1G")?;
interface.write_attr("comp_algorithm", "lz4")?;
// Simulate writes
let page = [0xAA; PAGE_SIZE];
interface.write_page(0, &page)?;
interface.write_page(1, &page)?;
// Check statistics
let mm = interface.mm_stat();
println!("Compression ratio: {:.2}x",
mm.orig_data_size as f64 / mm.compr_data_size as f64);
}
Benchmarks
Running Benchmarks
# Criterion benchmarks
cargo bench --all-features
# With baseline comparison
cargo bench --all-features -- --save-baseline main
# Example benchmarks
cargo run -p trueno-zram-core --example compress_benchmark --release
Results Summary
LZ4 Performance
| Backend | Compress | Decompress | Ratio |
|---|---|---|---|
| AVX-512 | 4.4 GB/s | 5.4 GB/s | 2-4x |
| AVX2 | 3.2 GB/s | 4.1 GB/s | 2-4x |
| Scalar | 3.0 GB/s | 3.8 GB/s | 2-4x |
ZSTD Performance
| Backend | Level | Compress | Decompress | Ratio |
|---|---|---|---|---|
| AVX-512 | 1 | 11.2 GB/s | 46 GB/s | 3-5x |
| AVX-512 | 3 | 8.5 GB/s | 45 GB/s | 4-6x |
| AVX2 | 1 | 8.5 GB/s | 35 GB/s | 3-5x |
Same-Fill Performance
| Backend | Detection | Ratio |
|---|---|---|
| AVX-512 | 22 GB/s | 2048:1 |
| AVX2 | 18 GB/s | 2048:1 |
| Scalar | 12 GB/s | 2048:1 |
Data Patterns
Zeros (Best Case)
Pattern: Zeros (100% same-fill)
Pages: 1000
Compression: 22 GB/s
Decompression: 46 GB/s
Ratio: 2048:1
Text (Compressible)
Pattern: Text/Code
Pages: 1000
LZ4 Compression: 4.4 GB/s
LZ4 Decompression: 5.4 GB/s
Ratio: 3.2:1
Random (Incompressible)
Pattern: Random bytes
Pages: 1000
LZ4 Compression: 1.6 GB/s
LZ4 Decompression: 32 GB/s
Ratio: 1.0:1 (pass-through)
GPU Benchmarks
Note (2026-01-06): GPU decompression is limited by PCIe transfer overhead. CPU parallel path (50+ GB/s) is faster for most workloads.
RTX 4090 (Validated)
| Path | Throughput | Notes |
|---|---|---|
| CPU Parallel | 50+ GB/s | Primary recommended path |
| GPU End-to-End | ~6 GB/s | PCIe 4.0 transfer bottleneck |
| GPU Kernel-only | ~9 GB/s | Without H2D/D2H transfers |
Recommendation
Use CPU parallel decompression for best performance. GPU useful for:
- Future PCIe 5.0+ systems with higher bandwidth
- Workloads where CPU is saturated
Latency
| Operation | P50 | P99 | P99.9 |
|---|---|---|---|
| LZ4 compress (4KB) | 45us | 85us | 120us |
| LZ4 decompress (4KB) | 38us | 72us | 95us |
| Same-fill detect | 8us | 15us | 25us |
Memory Usage
| Component | Memory |
|---|---|
| Hash table (LZ4) | 64 KB |
| Working buffer | 16 KB |
| ZSTD context | 256 KB |
Comparison with Linux Kernel zram
Validated 2026-01-06 on RTX 4090 + AMD Threadripper 7960X (AVX-512)
Block Device I/O (fio, Direct I/O)
| Metric | Kernel ZRAM | trueno-ublk | Speedup |
|---|---|---|---|
| Sequential Read | 9.2 GB/s | 16.5 GB/s | 1.8x |
| Random 4K IOPS | 55K | 249K | 4.5x |
| Compression Ratio | 2.5x | 3.87x | +55% |
Compression Engine (cargo examples)
| Metric | Linux Kernel zram | trueno-zram | Speedup |
|---|---|---|---|
| Compress (parallel) | 3-5 GB/s | 20-30 GB/s | 5-6x |
| Decompress (CPU) | ~10 GB/s | 50+ GB/s | 5x |
| Same-fill detection | ~8 GB/s | 22 GB/s | 2.75x |
| Compression ratio | 2.5x | 3.87x | +55% |
Architecture Difference
| Aspect | Linux Kernel | trueno-zram |
|---|---|---|
| Threading | Single-threaded per page | Parallel (rayon) |
| SIMD | Limited | AVX-512/AVX2/NEON |
| Batch processing | No | Yes (5000+ pages) |
| GPU offload | No | Optional CUDA |
Falsification Testing (2026-01-06)
Test Configuration:
├── Hardware: AMD Threadripper 7960X, 125GB RAM, RTX 4090
├── Device: trueno-ublk 8GB device
├── Tool: fio, cargo examples
Results:
├── Sequential I/O: 16.5 GB/s ✓ (claim: 12.5 GB/s)
├── Random IOPS: 249K ✓ (claim: 228K)
├── Compression: 30.66 GB/s ✓ (claim: 20-24 GB/s)
├── Ratio: 3.87x ✓ (claim: 3.7x)
├── mlock: 272 MB ✓ (claim: >100 MB)
├── Stress test: No deadlock ✓
└── Status: 6/8 PASS (GPU claims deprecated)
Why trueno-zram is Faster
- SIMD vectorization: AVX-512 processes 64 bytes per instruction vs byte-by-byte
- Parallel compression: All CPU cores compress simultaneously via rayon
- Batch amortization: Setup costs spread across thousands of pages
- Cache efficiency: Sequential memory access patterns
- Zero-copy paths: Same-fill pages detected without compression attempt
Tuning Guide
Algorithm Selection
LZ4 vs ZSTD
| Workload | Recommended | Reason |
|---|---|---|
| Real-time | LZ4 | Lowest latency |
| Memory-constrained | ZSTD | Better ratio |
| Mixed content | Adaptive | Auto-selects |
| Highly compressible | ZSTD-3 | Best ratio |
Code Example
#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};
// For latency-sensitive workloads
let fast = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// For memory-constrained systems
let compact = CompressorBuilder::new()
.algorithm(Algorithm::Zstd { level: 3 })
.build()?;
// For unknown workloads
let adaptive = CompressorBuilder::new()
.algorithm(Algorithm::Adaptive)
.build()?;
}
SIMD Backend Selection
The library auto-detects the best backend, but you can force one:
#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, SimdBackend};
// Force AVX2 (e.g., for testing)
let compressor = CompressorBuilder::new()
.prefer_backend(SimdBackend::Avx2)
.build()?;
}
GPU Batch Sizing
Optimal Batch Size
| GPU | L2 Cache | Optimal Batch |
|---|---|---|
| H100 | 50 MB | 10,240 pages |
| A100 | 40 MB | 8,192 pages |
| RTX 4090 | 72 MB | 14,745 pages |
| RTX 3090 | 6 MB | 1,200 pages |
Calculation
#![allow(unused)]
fn main() {
fn optimal_batch_size(l2_cache_bytes: usize) -> usize {
// Each page needs ~3KB working memory
// Target 80% L2 cache utilization
(l2_cache_bytes * 80 / 100) / (3 * 1024)
}
}
Async DMA
Enable async DMA for overlapping transfers:
#![allow(unused)]
fn main() {
let config = GpuBatchConfig {
async_dma: true,
ring_buffer_slots: 4, // Pipeline depth
..Default::default()
};
}
Benefits:
- Overlaps H2D, compute, D2H
- 20-30% throughput improvement
- Higher GPU utilization
Memory Configuration
Working Memory
#![allow(unused)]
fn main() {
// Per-thread memory usage
const LZ4_HASH_TABLE: usize = 64 * 1024; // 64 KB
const ZSTD_CONTEXT: usize = 256 * 1024; // 256 KB
const WORKING_BUFFER: usize = 16 * 1024; // 16 KB
}
Reducing Memory
For memory-constrained systems:
#![allow(unused)]
fn main() {
// Use LZ4 (smaller context)
.algorithm(Algorithm::Lz4)
// Smaller batch sizes
let config = GpuBatchConfig {
batch_size: 100,
..Default::default()
};
}
Monitoring Performance
#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// Process pages...
let stats = compressor.stats();
// Check throughput
if stats.throughput_gbps() < 3.0 {
println!("Warning: Low throughput, check CPU affinity");
}
// Check ratio
if stats.ratio() < 1.5 {
println!("Warning: Low compression, data may be incompressible");
}
}
CPU Affinity
For best performance, pin compression threads to physical cores:
# Linux: Pin to cores 0-3
taskset -c 0-3 ./my_app
# Check NUMA topology
numactl --hardware
Kernel Parameters
For zram integration:
# Increase zram size
echo $((8 * 1024 * 1024 * 1024)) > /sys/block/zram0/disksize
# Set compression streams
echo 4 > /sys/block/zram0/max_comp_streams
# Enable writeback
echo 1 > /sys/block/zram0/writeback
PCIe 5x Rule
The PCIe 5x rule determines when GPU offload is beneficial for compression.
The Rule
GPU beneficial when: T_compute > 5 × T_transfer
Where:
T_compute= CPU computation timeT_transfer= PCIe transfer time (H2D + D2H)
Why 5x?
GPU offload has overhead:
- H2D transfer: Copy data to GPU
- Kernel launch: ~5-10us overhead
- D2H transfer: Copy results back
- Synchronization: Wait for completion
The 5x factor accounts for these overheads and ensures GPU provides net benefit.
Calculation
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::meets_pcie_rule;
use trueno_zram_core::PAGE_SIZE;
fn check_gpu_benefit(
pages: usize,
pcie_bandwidth_gbps: f64,
gpu_throughput_gbps: f64,
) -> bool {
let data_bytes = pages * PAGE_SIZE;
// Transfer time (round trip)
let transfer_time = 2.0 * data_bytes as f64 / (pcie_bandwidth_gbps * 1e9);
// GPU compute time
let gpu_time = data_bytes as f64 / (gpu_throughput_gbps * 1e9);
// CPU compute time (assume 4 GB/s baseline)
let cpu_time = data_bytes as f64 / (4e9);
// GPU beneficial if saves time
cpu_time > (transfer_time + gpu_time) * 1.2 // 20% margin
}
}
Examples
PCIe 4.0 x16 (25 GB/s)
| Batch | Data Size | Transfer | GPU Time | Beneficial? |
|---|---|---|---|---|
| 100 | 400 KB | 32 us | 4 us | No |
| 1,000 | 4 MB | 320 us | 40 us | Marginal |
| 10,000 | 40 MB | 3.2 ms | 400 us | Yes |
| 100,000 | 400 MB | 32 ms | 4 ms | Yes |
PCIe 5.0 x16 (64 GB/s)
| Batch | Data Size | Transfer | GPU Time | Beneficial? |
|---|---|---|---|---|
| 1,000 | 4 MB | 125 us | 40 us | No |
| 10,000 | 40 MB | 1.25 ms | 400 us | Yes |
| 100,000 | 400 MB | 12.5 ms | 4 ms | Yes |
Checking in Code
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig};
let mut compressor = GpuBatchCompressor::new(config)?;
let result = compressor.compress_batch(&pages)?;
if result.pcie_rule_satisfied() {
println!("GPU offload was beneficial");
println!("Kernel time: {} ns", result.kernel_time_ns);
println!("Transfer time: {} ns",
result.h2d_time_ns + result.d2h_time_ns);
} else {
println!("Consider using CPU for this batch size");
}
}
Backend Selection
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{select_backend, BackendSelection, gpu_available};
let batch_size = 5000;
let has_gpu = gpu_available();
match select_backend(batch_size, has_gpu) {
BackendSelection::Gpu => {
// Use GPU batch compression
}
BackendSelection::Simd => {
// Use CPU SIMD compression
}
BackendSelection::Scalar => {
// Use scalar compression
}
}
}
Optimization Tips
- Batch larger: Combine small batches into larger ones
- Use async DMA: Overlap transfers with computation
- Profile first: Measure actual transfer times
- Consider hybrid: Use CPU for small batches, GPU for large
Hardware-Specific Thresholds
| GPU | PCIe | Min Beneficial Batch |
|---|---|---|
| H100 | 5.0 x16 | 5,000 pages |
| A100 | 4.0 x16 | 8,000 pages |
| RTX 4090 | 4.0 x16 | 10,000 pages |
| RTX 3090 | 4.0 x16 | 12,000 pages |
Design Overview
Architecture
┌─────────────────────────────────────────────────────────┐
│ Public API │
│ CompressorBuilder, Algorithm, CompressedPage │
├─────────────────────────────────────────────────────────┤
│ Algorithm Selection │
│ ┌─────────┐ ┌─────────┐ ┌──────────┐ ┌─────────┐ │
│ │ LZ4 │ │ ZSTD │ │ Adaptive │ │Samefill │ │
│ └────┬────┘ └────┬────┘ └────┬─────┘ └────┬────┘ │
├───────┼────────────┼────────────┼─────────────┼────────┤
│ │ SIMD Dispatch │ │ │
│ ┌────▼────┐ ┌────▼────┐ ┌───▼───┐ ┌─────▼─────┐ │
│ │ AVX-512 │ │ AVX2 │ │ NEON │ │ Scalar │ │
│ └─────────┘ └─────────┘ └───────┘ └───────────┘ │
├─────────────────────────────────────────────────────────┤
│ GPU Backend (Optional) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ CUDA Batch Compressor (trueno-gpu PTX) │ │
│ │ ├── H2D Transfer │ │
│ │ ├── Warp-Cooperative LZ4 Kernel │ │
│ │ └── D2H Transfer │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Crate Structure
trueno-zram/
├── crates/
│ ├── trueno-zram-core/ # Core compression library
│ │ ├── src/
│ │ │ ├── lib.rs # Public API
│ │ │ ├── error.rs # Error types
│ │ │ ├── page.rs # CompressedPage
│ │ │ ├── lz4/ # LZ4 implementation
│ │ │ ├── zstd/ # ZSTD implementation
│ │ │ ├── gpu/ # GPU batch compression
│ │ │ ├── simd/ # SIMD detection/dispatch
│ │ │ ├── samefill.rs # Same-fill detection
│ │ │ ├── compat.rs # Kernel compatibility
│ │ │ └── benchmark.rs # Benchmarking utilities
│ │ └── examples/
│ ├── trueno-zram-adaptive/ # ML-driven selection
│ ├── trueno-zram-generator/# systemd integration
│ └── trueno-zram-cli/ # CLI tool
└── bins/
└── trueno-ublk/ # ublk daemon
Key Design Decisions
1. Runtime SIMD Dispatch
CPU features are detected at runtime, not compile time:
#![allow(unused)]
fn main() {
// Detection happens once at startup
let features = simd::detect();
// Dispatch based on available features
if features.has_avx512() {
lz4::avx512::compress(input, output)
} else if features.has_avx2() {
lz4::avx2::compress(input, output)
} else {
lz4::scalar::compress(input, output)
}
}
2. Page-Based Compression
All compression operates on fixed 4KB pages:
#![allow(unused)]
fn main() {
pub const PAGE_SIZE: usize = 4096;
// This is enforced at the type level
pub fn compress(page: &[u8; PAGE_SIZE]) -> Result<CompressedPage>;
}
3. Builder Pattern
Configuration via builder pattern:
#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.prefer_backend(SimdBackend::Avx512)
.build()?;
}
4. Trait-Based Abstraction
The PageCompressor trait enables polymorphism:
#![allow(unused)]
fn main() {
pub trait PageCompressor {
fn compress(&self, page: &[u8; PAGE_SIZE]) -> Result<CompressedPage>;
fn decompress(&self, page: &CompressedPage) -> Result<[u8; PAGE_SIZE]>;
}
}
5. Zero-Copy Where Possible
Minimize allocations in hot paths:
#![allow(unused)]
fn main() {
// Output buffer passed in, not allocated
fn compress_into(input: &[u8], output: &mut [u8]) -> Result<usize>;
}
6. No Panics in Library Code
All errors are returned as Result:
#![allow(unused)]
#![deny(clippy::panic)]
#![deny(clippy::unwrap_used)]
fn main() {
}
Dependencies
| Crate | Purpose |
|---|---|
thiserror | Error derive macros |
cudarc | CUDA driver bindings |
rayon | Parallel iteration |
trueno-gpu | Pure Rust PTX generation |
Feature Flags
| Flag | Default | Description |
|---|---|---|
std | Yes | Standard library |
nightly | No | Nightly SIMD features |
cuda | No | CUDA GPU support |
SIMD Dispatch
Overview
trueno-zram uses runtime CPU feature detection to select the optimal SIMD implementation.
Detection
#![allow(unused)]
fn main() {
// src/simd/detect.rs
pub fn detect() -> SimdFeatures {
#[cfg(target_arch = "x86_64")]
{
SimdFeatures {
avx512: is_x86_feature_detected!("avx512f")
&& is_x86_feature_detected!("avx512bw")
&& is_x86_feature_detected!("avx512vl"),
avx2: is_x86_feature_detected!("avx2"),
sse42: is_x86_feature_detected!("sse4.2"),
}
}
#[cfg(target_arch = "aarch64")]
{
SimdFeatures {
neon: true, // Always available on AArch64
}
}
}
}
Dispatch Pattern
#![allow(unused)]
fn main() {
// src/simd/dispatch.rs
pub fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
let features = detect();
#[cfg(target_arch = "x86_64")]
{
if features.has_avx512() {
return unsafe { avx512::compress(input, output) };
}
if features.has_avx2() {
return unsafe { avx2::compress(input, output) };
}
}
#[cfg(target_arch = "aarch64")]
{
if features.has_neon() {
return unsafe { neon::compress(input, output) };
}
}
scalar::compress(input, output)
}
}
Implementation Structure
Each algorithm has separate implementations:
lz4/
├── mod.rs # Public API, dispatch
├── compress.rs # Core algorithm logic
├── decompress.rs # Decompression
├── avx512.rs # AVX-512 specialization
├── avx2.rs # AVX2 specialization
├── neon.rs # ARM NEON specialization
└── scalar.rs # Fallback implementation
AVX-512 Implementation
#![allow(unused)]
fn main() {
// lz4/avx512.rs
#[target_feature(enable = "avx512f,avx512bw,avx512vl")]
pub unsafe fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
// 64-byte hash table lookups
let hash_chunk = _mm512_loadu_si512(input.as_ptr());
// Parallel match finding
let matches = _mm512_cmpeq_epi32(hash_chunk, pattern);
// Wide literal copies
_mm512_storeu_si512(output.as_mut_ptr(), data);
// ...
}
}
AVX2 Implementation
#![allow(unused)]
fn main() {
// lz4/avx2.rs
#[target_feature(enable = "avx2")]
pub unsafe fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
// 32-byte operations
let chunk = _mm256_loadu_si256(input.as_ptr());
// Match extension
let cmp = _mm256_cmpeq_epi8(src, dst);
let mask = _mm256_movemask_epi8(cmp);
// ...
}
}
NEON Implementation
#![allow(unused)]
fn main() {
// lz4/neon.rs
#[cfg(target_arch = "aarch64")]
pub unsafe fn compress(input: &[u8], output: &mut [u8]) -> Result<usize> {
use std::arch::aarch64::*;
// 16-byte operations
let chunk = vld1q_u8(input.as_ptr());
// Parallel comparison
let cmp = vceqq_u8(src, dst);
// ...
}
}
Benchmarking Dispatch
#![allow(unused)]
fn main() {
use trueno_zram_core::simd::{detect, SimdBackend};
fn benchmark_all_backends() {
let features = detect();
let input = [0xAA; PAGE_SIZE];
let mut output = [0u8; PAGE_SIZE * 2];
// Benchmark available backends
if features.has_avx512() {
bench("AVX-512", || avx512::compress(&input, &mut output));
}
if features.has_avx2() {
bench("AVX2", || avx2::compress(&input, &mut output));
}
bench("Scalar", || scalar::compress(&input, &mut output));
}
}
Compile-Time Optimization
For maximum performance, enable target features:
# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
Or for specific features:
rustflags = ["-C", "target-feature=+avx2,+avx512f,+avx512bw"]
GPU Pipeline
Overview
The GPU pipeline uses CUDA for batch compression with async DMA.
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ Host │───▶│ H2D │───▶│ Kernel │───▶│ D2H │
│ Memory │ │ Transfer │ │ Execution │ │ Transfer │
└────────────┘ └────────────┘ └────────────┘ └────────────┘
▲ │
└──────────────────────────────────────────────────────┘
Pipeline Stages
1. Host-to-Device Transfer
#![allow(unused)]
fn main() {
fn transfer_to_device(&self, pages: &[[u8; PAGE_SIZE]]) -> Result<CudaSlice<u8>> {
// Flatten pages into contiguous buffer
let total_bytes = pages.len() * PAGE_SIZE;
let mut flat_data = Vec::with_capacity(total_bytes);
for page in pages {
flat_data.extend_from_slice(page);
}
// Async copy to device
let device_buffer = self.stream.clone_htod(&flat_data)?;
Ok(device_buffer)
}
}
2. Kernel Execution
#![allow(unused)]
fn main() {
fn execute_kernel(&self, input: &CudaSlice<u8>, batch_size: u32) -> Result<CudaSlice<u8>> {
// Allocate output buffer
let mut output = self.stream.alloc_zeros::<u8>(batch_size as usize * PAGE_SIZE)?;
let mut sizes = self.stream.alloc_zeros::<u32>(batch_size as usize)?;
// Launch kernel
// Grid: ceil(batch_size / 4) blocks
// Block: 128 threads (4 warps)
unsafe {
self.stream
.launch_builder(&self.kernel_fn)
.arg(&input)
.arg(&mut output)
.arg(&mut sizes)
.arg(&batch_size)
.launch(cfg)?;
}
self.stream.synchronize()?;
Ok(output)
}
}
3. Device-to-Host Transfer
#![allow(unused)]
fn main() {
fn transfer_from_device(&self, data: CudaSlice<u8>) -> Result<Vec<CompressedPage>> {
let output = self.stream.clone_dtoh(&data)?;
// Convert to CompressedPage structures
// ...
}
}
Warp-Cooperative Kernel
The LZ4 kernel uses warp-cooperative compression:
Block (128 threads)
├── Warp 0 (32 threads) → Page 0
├── Warp 1 (32 threads) → Page 1
├── Warp 2 (32 threads) → Page 2
└── Warp 3 (32 threads) → Page 3
Shared Memory Layout
Shared Memory (48 KB per block)
├── Warp 0: 12 KB (hash table + working)
├── Warp 1: 12 KB
├── Warp 2: 12 KB
└── Warp 3: 12 KB
PTX Generation
Using trueno-gpu for pure Rust PTX:
#![allow(unused)]
fn main() {
use trueno_gpu::kernels::lz4::Lz4WarpCompressKernel;
use trueno_gpu::kernels::Kernel;
let kernel = Lz4WarpCompressKernel::new(65536);
let ptx_string = kernel.emit_ptx();
// Load PTX into CUDA module
let ptx = Ptx::from(ptx_string);
let module = context.load_module(ptx)?;
}
Async DMA Ring Buffer
#![allow(unused)]
fn main() {
struct AsyncPipeline {
slots: Vec<PipelineSlot>,
head: usize,
tail: usize,
}
struct PipelineSlot {
input_buffer: CudaSlice<u8>,
output_buffer: CudaSlice<u8>,
event: CudaEvent,
state: SlotState,
}
enum SlotState {
Free,
H2DInProgress,
KernelInProgress,
D2HInProgress,
Complete,
}
}
Pipelining
Time ──────────────────────────────────────────────▶
Slot 0: [H2D][Kernel][D2H]
Slot 1: [H2D][Kernel][D2H]
Slot 2: [H2D][Kernel][D2H]
Slot 3: [H2D][Kernel][D2H]
Performance Optimization
1. Pinned Memory
#![allow(unused)]
fn main() {
// Use pinned (page-locked) memory for faster transfers
let pinned_buffer = cuda_malloc_host(size)?;
}
2. Stream Overlap
#![allow(unused)]
fn main() {
// Use separate streams for H2D, compute, D2H
let h2d_stream = context.create_stream()?;
let compute_stream = context.create_stream()?;
let d2h_stream = context.create_stream()?;
}
3. Kernel Occupancy
#![allow(unused)]
fn main() {
// Optimal: 4 warps per block = 128 threads
// Shared memory: 48 KB (12 KB per warp)
// Registers: ~32 per thread
}
Error Handling
#![allow(unused)]
fn main() {
match kernel_result {
Ok(output) => process_output(output),
Err(CudaError::IllegalAddress) => {
// Memory access violation
fallback_to_cpu()?
}
Err(CudaError::LaunchFailed) => {
// Kernel launch failed
fallback_to_cpu()?
}
Err(e) => Err(Error::GpuNotAvailable(e.to_string())),
}
}
trueno-ublk Overview
trueno-ublk is a GPU-accelerated ZRAM replacement that uses the Linux ublk interface to provide a high-performance compressed block device in userspace.
Production Status (2026-01-06)
MILESTONE DT-005 ACHIEVED: trueno-ublk is running as system swap!
- 8GB device active as primary swap (priority 150)
- CPU SIMD compression at 20-30 GB/s with 3.87x ratio
- CPU parallel decompression at 50+ GB/s
DT-007 COMPLETED: Swap deadlock issue FIXED via mlock() - daemon memory pinned.
Known Limitations:
- Docker cannot isolate ublk devices (host kernel resources)
What is ublk?
ublk (userspace block device) is a Linux kernel interface that allows implementing block devices entirely in userspace. It uses io_uring for efficient I/O handling, avoiding the overhead of kernel context switches.
Architecture
┌─────────────────────────────────────────────────────────┐
│ Applications │
├─────────────────────────────────────────────────────────┤
│ Filesystem │
├─────────────────────────────────────────────────────────┤
│ /dev/ublkbN block device │
├─────────────────────────────────────────────────────────┤
│ Linux Kernel │
│ (ublk driver) │
├─────────────────────────────────────────────────────────┤
│ io_uring │
├─────────────────────────────────────────────────────────┤
│ trueno-ublk │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Entropy Router │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │ │
│ │ │ GPU │ │ SIMD │ │ Scalar │ │ │
│ │ │ (batch) │ │ (AVX2) │ │ (incompressible)│ │ │
│ │ └─────────┘ └─────────┘ └─────────────────┘ │ │
│ └───────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ trueno-zram-core (LZ4/ZSTD) │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Features
SIMD-Accelerated Compression
Uses trueno-zram-core for vectorized LZ4/ZSTD compression:
- AVX2 (256-bit) on modern x86_64
- AVX-512 (512-bit) on supported CPUs
- NEON (128-bit) on ARM64
GPU Batch Compression
Offloads compression to CUDA GPUs when beneficial:
- Warp-cooperative LZ4 kernel
- PCIe 5x rule evaluation
- Async DMA for overlap
Entropy-Based Routing
Analyzes data entropy to choose the optimal compression path:
- Low entropy (< 4.0 bits): GPU batch compression
- Medium entropy (4-7 bits): SIMD compression
- High entropy (> 7.0 bits): Scalar or skip compression
Zero-Page Deduplication
Automatically detects and deduplicates all-zero pages, achieving 2048:1 compression for sparse data.
zram-Compatible Statistics
Exports statistics in the same format as kernel zram, enabling drop-in monitoring compatibility.
CLI Usage
# Create a 1TB compressed RAM disk
trueno-ublk create -s 1T -a lz4 --gpu
# List devices
trueno-ublk list
# Show statistics
trueno-ublk stat /dev/ublkb0
# Interactive dashboard
trueno-ublk top
# Analyze data entropy
trueno-ublk entropy /dev/ublkb0
Requirements
- Linux kernel 6.0+ with ublk support
- CAP_SYS_ADMIN capability (for device creation)
- Optional: NVIDIA GPU with CUDA support
Block Device API
trueno-ublk provides a BlockDevice type for creating compressed block devices in pure Rust, without requiring kernel privileges.
Basic Usage
#![allow(unused)]
fn main() {
use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};
// Create a compressor
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// Create a 1GB block device
let mut device = BlockDevice::new(1 << 30, compressor);
// Write data (must be page-aligned)
let data = vec![0xAB; PAGE_SIZE];
device.write(0, &data)?;
// Read back
let mut buf = vec![0u8; PAGE_SIZE];
device.read(0, &mut buf)?;
assert_eq!(data, buf);
}
Page Size
All I/O operations must be aligned to PAGE_SIZE (4096 bytes):
#![allow(unused)]
fn main() {
use trueno_zram_core::PAGE_SIZE;
// Write at page boundaries
device.write(0, &data)?; // Page 0
device.write(PAGE_SIZE as u64, &data)?; // Page 1
device.write(2 * PAGE_SIZE as u64, &data)?; // Page 2
}
Statistics
Track compression performance with BlockDeviceStats:
#![allow(unused)]
fn main() {
let stats = device.stats();
println!("Pages stored: {}", stats.pages_stored);
println!("Bytes written: {}", stats.bytes_written);
println!("Bytes compressed: {}", stats.bytes_compressed);
println!("Compression ratio: {:.2}x", stats.compression_ratio());
println!("Zero pages: {}", stats.zero_pages);
println!("GPU pages: {}", stats.gpu_pages);
println!("SIMD pages: {}", stats.simd_pages);
println!("Scalar pages: {}", stats.scalar_pages);
}
Entropy Threshold
Configure entropy-based routing with with_entropy_threshold:
#![allow(unused)]
fn main() {
// Use threshold of 7.0 bits/byte
let device = BlockDevice::with_entropy_threshold(
1 << 30, // 1GB
compressor,
7.0, // Entropy threshold
);
}
Data with entropy above this threshold is routed to the scalar path (assumed incompressible).
Discard Operation
Free pages that are no longer needed:
#![allow(unused)]
fn main() {
// Discard page 0
device.discard(0, PAGE_SIZE as u64)?;
// Reading discarded pages returns zeros
let mut buf = vec![0xFFu8; PAGE_SIZE];
device.read(0, &mut buf)?;
assert!(buf.iter().all(|&b| b == 0));
}
UblkDevice (Kernel Interface)
For creating actual block devices visible to the system, use UblkDevice:
#![allow(unused)]
fn main() {
use trueno_ublk::{UblkDevice, DeviceConfig};
use trueno_zram_core::Algorithm;
let config = DeviceConfig {
size: 1 << 40, // 1TB
algorithm: Algorithm::Lz4,
streams: 4,
gpu_enabled: true,
mem_limit: Some(8 << 30), // 8GB RAM limit
backing_dev: None,
writeback_limit: None,
entropy_skip_threshold: 7.5,
gpu_batch_size: 1024,
};
// Requires CAP_SYS_ADMIN
let device = UblkDevice::create(config)?;
println!("Created device: /dev/ublkb{}", device.id());
}
DeviceStats
The DeviceStats struct provides zram-compatible statistics:
#![allow(unused)]
fn main() {
let stats = device.stats();
// mm_stat fields (zram compatible)
println!("Original data: {} bytes", stats.orig_data_size);
println!("Compressed data: {} bytes", stats.compr_data_size);
println!("Memory used: {} bytes", stats.mem_used_total);
println!("Same pages: {}", stats.same_pages);
println!("Huge pages: {}", stats.huge_pages);
// io_stat fields
println!("Failed reads: {}", stats.failed_reads);
println!("Failed writes: {}", stats.failed_writes);
// trueno-ublk extensions
println!("GPU pages: {}", stats.gpu_pages);
println!("SIMD pages: {}", stats.simd_pages);
println!("Throughput: {:.2} GB/s", stats.throughput_gbps);
println!("Avg entropy: {:.2} bits", stats.avg_entropy);
println!("SIMD backend: {}", stats.simd_backend);
}
Entropy Routing
trueno-ublk uses Shannon entropy analysis to route data to the optimal compression backend.
Shannon Entropy
Shannon entropy measures the information density of data in bits per byte. The formula is:
H(X) = -Σ p(x) * log2(p(x))
Where p(x) is the probability of each byte value occurring.
| Entropy (bits/byte) | Data Type | Compressibility |
|---|---|---|
| 0.0 | All same value | Extremely high |
| 1.0-3.0 | Simple patterns | Very high |
| 4.0-6.0 | Text, code | Good |
| 7.0-7.5 | Compressed data | Poor |
| 8.0 | Random/encrypted | None |
Routing Strategy
trueno-ublk routes pages based on their entropy:
┌──────────────────────────────────────────────────────────┐
│ Incoming Page │
│ │ │
│ Calculate Entropy │
│ │ │
│ ┌─────────────┼─────────────┐ │
│ ▼ ▼ ▼ │
│ entropy < 4.0 4.0 ≤ e ≤ 7.0 entropy > 7.0 │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ GPU Batch SIMD Path Scalar/Skip │
│ (highly comp.) (normal data) (incompress.) │
└──────────────────────────────────────────────────────────┘
GPU Batch Path (entropy < 4.0)
- Used for highly compressible data
- Benefits from GPU parallelism
- Batches multiple pages for efficiency
- Best for: zeros, repeating patterns, sparse data
SIMD Path (4.0 ≤ entropy ≤ 7.0)
- Used for typical data
- AVX2/AVX-512/NEON acceleration
- Single-page processing
- Best for: text, code, structured data
Scalar/Skip Path (entropy > 7.0)
- Used for incompressible data
- Avoids wasting CPU cycles
- May store uncompressed
- Best for: encrypted data, already-compressed media
Configuration
Set the entropy threshold when creating a device:
#![allow(unused)]
fn main() {
use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder};
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// Lower threshold = more aggressive SIMD usage
let device = BlockDevice::with_entropy_threshold(
1 << 30, // Size
compressor,
6.5, // Custom threshold (default is 7.0)
);
}
Monitoring Entropy
Use the CLI to analyze device entropy:
# Show entropy distribution
trueno-ublk entropy /dev/ublkb0
# Output:
# Entropy Distribution:
# 0.0-2.0 bits: ████████████████████ 45% (GPU)
# 2.0-4.0 bits: ████████ 18% (GPU)
# 4.0-6.0 bits: ██████████ 22% (SIMD)
# 6.0-7.0 bits: ████ 10% (SIMD)
# 7.0-8.0 bits: ██ 5% (Scalar)
#
# Average entropy: 3.2 bits/byte
# Recommended threshold: 7.0
Statistics
Track routing decisions via stats:
#![allow(unused)]
fn main() {
let stats = device.stats();
println!("Routing Statistics:");
println!(" GPU pages: {} ({:.1}%)",
stats.gpu_pages,
100.0 * stats.gpu_pages as f64 / stats.pages_stored as f64);
println!(" SIMD pages: {} ({:.1}%)",
stats.simd_pages,
100.0 * stats.simd_pages as f64 / stats.pages_stored as f64);
println!(" Scalar pages: {} ({:.1}%)",
stats.scalar_pages,
100.0 * stats.scalar_pages as f64 / stats.pages_stored as f64);
println!(" Zero pages: {} ({:.1}%)",
stats.zero_pages,
100.0 * stats.zero_pages as f64 / stats.pages_stored as f64);
}
Zero-Page Optimization
All-zero pages receive special handling regardless of entropy:
#![allow(unused)]
fn main() {
// Zero pages are detected and deduplicated
let zeros = vec![0u8; PAGE_SIZE];
device.write(0, &zeros)?;
device.write(PAGE_SIZE as u64, &zeros)?;
device.write(2 * PAGE_SIZE as u64, &zeros)?;
let stats = device.stats();
// All three pages share the same zero-page representation
assert_eq!(stats.zero_pages, 3);
}
This achieves effective 2048:1 compression for sparse data like freshly-allocated memory.
Visualization & Observability
trueno-ublk v3.17.0 integrates with the renacer visualization framework for real-time monitoring, benchmarking, and distributed tracing.
Overview
The visualization system (VIZ-001/002/003/004) provides:
- Real-time TUI dashboard - Monitor throughput, IOPS, and tier distribution
- JSON/HTML reports - Export benchmark results for analysis
- OTLP integration - Distributed tracing to Jaeger/Tempo
CLI Flags
Real-time Visualization (VIZ-002)
# Launch TUI dashboard (requires foreground mode)
sudo trueno-ublk create --size 8G --backend tiered \
--visualize \
--foreground
Benchmark Reports (VIZ-003)
# Text output (default)
trueno-ublk benchmark --size 4G --format text
# JSON output (machine-readable)
trueno-ublk benchmark --size 4G --format json > results.json
# HTML report with charts
trueno-ublk benchmark --size 4G --format html -o report.html
# Include ML anomaly detection
trueno-ublk benchmark --size 4G --format json --ml-anomaly
OTLP Tracing (VIZ-004)
# Export traces to Jaeger
sudo trueno-ublk create --size 8G \
--otlp-endpoint http://localhost:4317 \
--otlp-service-name trueno-ublk
JSON Schema
The benchmark JSON output follows the trueno-renacer-v1 schema:
{
"version": "3.17.0",
"format": "trueno-renacer-v1",
"benchmark": {
"workload": "sequential",
"duration_sec": 60,
"size_bytes": 4294967296,
"backend": "tiered",
"iterations": 3
},
"metrics": {
"throughput_gbps": 7.9,
"iops": 666000,
"compression_ratio": 2.8,
"same_fill_pages": 1048576,
"tier_distribution": {
"kernel_zram": 0.65,
"simd_zstd": 0.25,
"nvme_direct": 0.05,
"same_fill": 0.05
},
"entropy_histogram": {
"p50": 4.2,
"p75": 5.8,
"p90": 6.9,
"p95": 7.3,
"p99": 7.8
}
},
"ml_analysis": {
"anomalies": [],
"clusters": 3,
"silhouette_score": 0.82
}
}
TruenoCollector API
For programmatic access, use TruenoCollector which implements renacer’s Collector trait:
#![allow(unused)]
fn main() {
use trueno_ublk::visualize::TruenoCollector;
use renacer::visualize::collectors::{Collector, MetricValue};
use std::sync::Arc;
// Create collector from TieredPageStore
let collector = TruenoCollector::new(Arc::clone(&store));
// Collect metrics
let metrics = collector.collect()?;
// Access individual metrics
if let Some(MetricValue::Gauge(throughput)) = metrics.get("throughput_gbps") {
println!("Throughput: {:.1} GB/s", throughput);
}
}
Metrics Reference
| Metric | Type | Description |
|---|---|---|
throughput_gbps | Gauge | Current I/O throughput in GB/s |
iops | Rate | Operations per second |
compression_ratio | Gauge | Overall compression ratio |
pages_total | Counter | Total pages stored |
same_fill_pages | Counter | Pages detected as same-fill |
tier_kernel_zram_pct | Gauge | % of pages in kernel ZRAM tier |
tier_simd_pct | Gauge | % of pages in SIMD ZSTD tier |
tier_skip_pct | Gauge | % of pages skipping compression |
tier_samefill_pct | Gauge | % of pages detected as same-fill |
Dashboard Panels
When using --visualize, the TUI displays:
| Panel | Description |
|---|---|
| Tier Heatmap | Real-time routing decisions |
| Throughput Gauge | Current GB/s with history |
| IOPS Counter | Operations per second |
| Entropy Timeline | Data compressibility over time |
| Ratio Trend | Compression ratio history |
Example: Benchmark Workflow
# 1. Run benchmark with JSON output
trueno-ublk benchmark --size 4G --format json \
--workload mixed --iterations 5 > bench.json
# 2. Generate HTML report
trueno-ublk benchmark --size 4G --format html \
--workload mixed -o benchmark-report.html
# 3. Analyze with jq
cat bench.json | jq '.metrics.throughput_gbps'
# 4. Compare algorithms
trueno-ublk benchmark --size 4G --backend memory --format json > lz4.json
trueno-ublk benchmark --size 4G --backend tiered --format json > tiered.json
Integration with Jaeger
For distributed tracing:
# Start Jaeger (all-in-one)
docker run -d --name jaeger \
-p 4317:4317 \
-p 16686:16686 \
jaegertracing/all-in-one:latest
# Create device with OTLP tracing
sudo trueno-ublk create --size 8G \
--backend tiered \
--otlp-endpoint http://localhost:4317 \
--otlp-service-name trueno-swap
# View traces at http://localhost:16686
trueno-ublk Examples
trueno-ublk includes several examples demonstrating its features.
Running Examples
# Basic block device usage
cargo run --example block_device -p trueno-ublk
# Compression statistics comparison
cargo run --example compression_stats -p trueno-ublk
# Entropy-based routing demonstration
cargo run --example entropy_routing -p trueno-ublk
# v3.17.0 Examples
# ----------------
# Visualization demo (VIZ-001/002/003/004)
cargo run --example visualization_demo -p trueno-ublk
# ZSTD vs LZ4 performance comparison (use --release for accurate benchmarks)
cargo run --example zstd_vs_lz4 -p trueno-ublk --release
# Tiered storage architecture demo (KERN-001/002/003)
cargo run --example tiered_storage -p trueno-ublk
# Batched compression benchmark
cargo run --example batched_benchmark -p trueno-ublk --release
Example: Basic Block Device
Demonstrates creating a compressed block device, writing data, and reading it back.
use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};
fn main() -> anyhow::Result<()> {
// Create an LZ4 compressor
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// Create a 64MB block device
let mut device = BlockDevice::new(64 * 1024 * 1024, compressor);
// Write different patterns
let compressible = vec![0xAA; PAGE_SIZE];
device.write(0, &compressible)?;
let zeros = vec![0u8; PAGE_SIZE];
device.write(PAGE_SIZE as u64, &zeros)?;
// Read back and verify
let mut buf = vec![0u8; PAGE_SIZE];
device.read(0, &mut buf)?;
assert_eq!(buf, compressible);
// Check statistics
let stats = device.stats();
println!("Compression ratio: {:.2}x", stats.compression_ratio());
Ok(())
}
Output:
trueno-ublk Block Device Example
=================================
Created LZ4 compressor with SIMD backend
Created block device: 64 MB
Writing test patterns...
Page 0: Highly compressible (all 0xAA)
Page 1: Zero page
Page 2: Sequential data
Page 3: Pseudo-random data
Reading back and verifying...
Page 0: OK
Page 1: OK
Page 2: OK
Page 3: OK
Device Statistics:
Pages stored: 4
Bytes written: 16 KB
Bytes compressed: 0 KB
Compression ratio: 27.86x
Zero pages: 1
All tests passed!
Example: Compression Statistics
Compares LZ4 and ZSTD compression on various data patterns.
use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};
fn main() -> anyhow::Result<()> {
let test_data = vec![
("All zeros", vec![0u8; PAGE_SIZE]),
("Repeating pattern", (0..PAGE_SIZE).map(|i| (i % 4) as u8).collect()),
("High entropy", (0..PAGE_SIZE).map(|i| ((i * 17 + 31) % 256) as u8).collect()),
];
for algorithm in [Algorithm::Lz4, Algorithm::Zstd { level: 3 }] {
let compressor = CompressorBuilder::new()
.algorithm(algorithm)
.build()?;
let mut device = BlockDevice::new(64 * 1024 * 1024, compressor);
for (i, (_, data)) in test_data.iter().enumerate() {
device.write((i * PAGE_SIZE) as u64, data)?;
}
let stats = device.stats();
println!("{:?}: {:.2}x compression", algorithm, stats.compression_ratio());
}
Ok(())
}
Output:
Algorithm: Lz4
------------------------------------------------------------
Results:
Compression ratio: 27.25x
Space savings: 96.3%
Algorithm: Zstd { level: 3 }
------------------------------------------------------------
Results:
Compression ratio: 1.50x
Space savings: 33.3%
Example: Entropy Routing
Shows how data is routed to different backends based on entropy.
use trueno_ublk::BlockDevice;
use trueno_zram_core::{Algorithm, CompressorBuilder, PAGE_SIZE};
fn main() -> anyhow::Result<()> {
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// Set entropy threshold to 7.0 bits/byte
let mut device = BlockDevice::with_entropy_threshold(
64 * 1024 * 1024,
compressor,
7.0,
);
// Low entropy - routed to GPU
let low_entropy = vec![0u8; PAGE_SIZE];
device.write(0, &low_entropy)?;
// Medium entropy - routed to SIMD
let medium: Vec<u8> = (0..PAGE_SIZE).map(|i| (i % 16) as u8).collect();
device.write(PAGE_SIZE as u64, &medium)?;
// High entropy - routed to scalar
let high: Vec<u8> = (0..PAGE_SIZE).map(|i| ((i * 17 + 31) % 256) as u8).collect();
device.write(2 * PAGE_SIZE as u64, &high)?;
let stats = device.stats();
println!("GPU pages: {}", stats.gpu_pages);
println!("SIMD pages: {}", stats.simd_pages);
println!("Scalar pages: {}", stats.scalar_pages);
println!("Zero pages: {}", stats.zero_pages);
Ok(())
}
Output:
trueno-ublk Entropy Routing Example
===================================
Routing Statistics:
GPU pages: 1 (low entropy, highly compressible)
SIMD pages: 1 (medium entropy, normal data)
Scalar pages: 3 (high entropy, incompressible)
Zero pages: 1 (all zeros, deduplicated)
Total compression ratio: 2.88x
CLI Examples
Create a Device
# Create 1TB device with LZ4 and GPU acceleration
trueno-ublk create -s 1T -a lz4 --gpu
# Create with memory limit
trueno-ublk create -s 512G -a zstd --mem-limit 8G
# Create with custom entropy threshold
trueno-ublk create -s 256G -a lz4 --entropy-threshold 6.5
Monitor Devices
# List all devices
trueno-ublk list
# Show detailed stats
trueno-ublk stat /dev/ublkb0
# Interactive dashboard
trueno-ublk top
Manage Devices
# Reset statistics
trueno-ublk reset /dev/ublkb0
# Compact memory
trueno-ublk compact /dev/ublkb0
# Set runtime parameters
trueno-ublk set /dev/ublkb0 --mem-limit 16G
Contributing
Development Setup
# Clone the repository
git clone https://github.com/paiml/trueno-zram
cd trueno-zram
# Build
cargo build --all-features
# Run tests
cargo test --workspace --all-features
# Run with CUDA
cargo test --workspace --features cuda
Code Style
- Format with
cargo fmt - Lint with
cargo clippy --all-features -- -D warnings - No panics in library code
- All public items must be documented
Testing
Unit Tests
cargo test --workspace --all-features
Coverage
cargo llvm-cov --workspace --all-features
Target: 95% line coverage.
Mutation Testing
cargo mutants --package trueno-zram-core
Target: 80% mutation score.
Quality Gates
Before submitting a PR:
- Formatting:
cargo fmt --check - Linting:
cargo clippy --all-features -- -D warnings - Tests:
cargo test --workspace --all-features - Documentation:
cargo doc --no-deps - Coverage: >= 95%
Commit Messages
Follow conventional commits:
feat: Add new compression algorithm
fix: Handle edge case in decompression
perf: Optimize hash table lookup
docs: Update API documentation
test: Add property-based tests
refactor: Simplify SIMD dispatch
Always reference the work item:
feat: Add GPU batch compression (Refs ZRAM-001)
Pull Request Process
- Fork the repository
- Create a feature branch
- Make changes with tests
- Run quality gates
- Submit PR with description
- Address review feedback
Architecture
See Design Overview for architecture decisions.
Adding a New Algorithm
- Create module in
src/algorithms/ - Implement
PageCompressortrait - Add SIMD implementations
- Add to
Algorithmenum - Write tests and benchmarks
- Update documentation
Adding SIMD Backend
- Add detection in
src/simd/detect.rs - Create implementation file (e.g.,
avx512.rs) - Add dispatch in
src/simd/dispatch.rs - Write correctness tests
- Run benchmarks
Reporting Issues
Use GitHub Issues with:
- Clear description
- Reproduction steps
- Expected vs actual behavior
- System info (CPU, OS, Rust version)
License
Contributions are licensed under MIT OR Apache-2.0.
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.1.0 - 2025-01-04
Added
-
Core compression library (
trueno-zram-core)- LZ4 compression with AVX2/AVX-512/NEON acceleration
- ZSTD compression with SIMD optimization
- Runtime CPU feature detection and dispatch
- Same-fill detection for 2048:1 zero page compression
-
GPU batch compression
- CUDA support via cudarc
- Pure Rust PTX generation via trueno-gpu
- Warp-cooperative LZ4 kernel (4 warps/block)
- PCIe 5x rule evaluation
- Async DMA ring buffer support
-
Kernel compatibility
- Sysfs interface compatible with kernel zram
- Algorithm compatibility layer
- Statistics matching kernel format
-
Benchmarking utilities
- Criterion benchmarks
- Example programs for testing
- Performance measurement infrastructure
Performance
- LZ4 compression: 4.4 GB/s (AVX-512)
- LZ4 decompression: 5.4 GB/s (AVX-512)
- ZSTD compression: 11.2 GB/s (AVX-512)
- ZSTD decompression: 46 GB/s (AVX-512)
- Same-fill detection: 22 GB/s
Infrastructure
- Published to crates.io as
trueno-zram-core - 461 tests passing
- 94% test coverage
- Full documentation with mdBook
0.2.0 - 2026-01-06
Added
-
DT-007: Swap Deadlock Prevention
- mlock() integration via duende-mlock crate
- Daemon memory pinning prevents swap deadlock
- Works in both foreground and background modes
-
trueno-ublk daemon
- ublk-based block device for kernel bypass
- Hybrid CPU/GPU architecture
- 12.5 GB/s sequential read (fio verified)
- 228K IOPS random 4K read
Performance Improvements
- Sequential I/O: 16.5 GB/s (1.8x vs kernel ZRAM)
- Random 4K IOPS: 249K (4.5x vs kernel ZRAM)
- Compression ratio: 3.87x (+55% vs kernel ZRAM)
- CPU parallel decompression: 50+ GB/s
Fixed
- Background mode mlock (DT-007e: mlock called after fork)
- Clippy warnings in GPU batch compression module
[3.17.0] - 2026-01-07
Added
-
VIZ-001: TruenoCollector - Renacer visualization integration
- Implements
renacer::Collectortrait for metrics collection - Feeds throughput, IOPS, tier distribution to visualization framework
- Implements
-
VIZ-002:
--visualizeflag - Real-time TUI dashboard- Tier heatmap, throughput gauge, entropy timeline
- Requires
--foregroundmode
-
VIZ-003: Benchmark reports - JSON/HTML export
trueno-ublk benchmark --format json|html|texttrueno-renacer-v1schema for ML pipelines- Self-contained HTML reports with tier distribution charts
-
VIZ-004: OTLP integration - Distributed tracing
--otlp-endpointand--otlp-service-nameflags- Export traces to Jaeger/Tempo
-
Release Verification Matrix (
docs/release_qa_checklist.md)- Falsification-first QA protocol
- Performance thresholds: >7.2 GB/s zero-page, >550K IOPS
Performance
- ZSTD Recommendation: ZSTD-1 is 3x faster than LZ4 on AVX-512
- Compress: 15.4 GiB/s (vs 5.2 GiB/s LZ4)
- Decompress: ~10 GiB/s (vs ~1.5 GiB/s LZ4)
- Usage:
--algorithm zstd
Documentation
- New book chapter: Visualization & Observability
- Updated kernel-zram-parity roadmap (all items COMPLETE)
- New examples:
visualization_demo- TruenoCollector metrics demozstd_vs_lz4- Algorithm performance comparisontiered_storage- Kernel-cooperative architecture demo
Unreleased
Planned
- trueno-zram-adaptive: ML-driven algorithm selection
- trueno-zram-generator: systemd integration
- trueno-zram-cli: zramctl replacement