Tuning Guide
Algorithm Selection
LZ4 vs ZSTD
| Workload | Recommended | Reason |
|---|---|---|
| Real-time | LZ4 | Lowest latency |
| Memory-constrained | ZSTD | Better ratio |
| Mixed content | Adaptive | Auto-selects |
| Highly compressible | ZSTD-3 | Best ratio |
Code Example
#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, Algorithm};
// For latency-sensitive workloads
let fast = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// For memory-constrained systems
let compact = CompressorBuilder::new()
.algorithm(Algorithm::Zstd { level: 3 })
.build()?;
// For unknown workloads
let adaptive = CompressorBuilder::new()
.algorithm(Algorithm::Adaptive)
.build()?;
}
SIMD Backend Selection
The library auto-detects the best backend, but you can force one:
#![allow(unused)]
fn main() {
use trueno_zram_core::{CompressorBuilder, SimdBackend};
// Force AVX2 (e.g., for testing)
let compressor = CompressorBuilder::new()
.prefer_backend(SimdBackend::Avx2)
.build()?;
}
GPU Batch Sizing
Optimal Batch Size
| GPU | L2 Cache | Optimal Batch |
|---|---|---|
| H100 | 50 MB | 10,240 pages |
| A100 | 40 MB | 8,192 pages |
| RTX 4090 | 72 MB | 14,745 pages |
| RTX 3090 | 6 MB | 1,200 pages |
Calculation
#![allow(unused)]
fn main() {
fn optimal_batch_size(l2_cache_bytes: usize) -> usize {
// Each page needs ~3KB working memory
// Target 80% L2 cache utilization
(l2_cache_bytes * 80 / 100) / (3 * 1024)
}
}
Async DMA
Enable async DMA for overlapping transfers:
#![allow(unused)]
fn main() {
let config = GpuBatchConfig {
async_dma: true,
ring_buffer_slots: 4, // Pipeline depth
..Default::default()
};
}
Benefits:
- Overlaps H2D, compute, D2H
- 20-30% throughput improvement
- Higher GPU utilization
Memory Configuration
Working Memory
#![allow(unused)]
fn main() {
// Per-thread memory usage
const LZ4_HASH_TABLE: usize = 64 * 1024; // 64 KB
const ZSTD_CONTEXT: usize = 256 * 1024; // 256 KB
const WORKING_BUFFER: usize = 16 * 1024; // 16 KB
}
Reducing Memory
For memory-constrained systems:
#![allow(unused)]
fn main() {
// Use LZ4 (smaller context)
.algorithm(Algorithm::Lz4)
// Smaller batch sizes
let config = GpuBatchConfig {
batch_size: 100,
..Default::default()
};
}
Monitoring Performance
#![allow(unused)]
fn main() {
let compressor = CompressorBuilder::new()
.algorithm(Algorithm::Lz4)
.build()?;
// Process pages...
let stats = compressor.stats();
// Check throughput
if stats.throughput_gbps() < 3.0 {
println!("Warning: Low throughput, check CPU affinity");
}
// Check ratio
if stats.ratio() < 1.5 {
println!("Warning: Low compression, data may be incompressible");
}
}
CPU Affinity
For best performance, pin compression threads to physical cores:
# Linux: Pin to cores 0-3
taskset -c 0-3 ./my_app
# Check NUMA topology
numactl --hardware
Kernel Parameters
For zram integration:
# Increase zram size
echo $((8 * 1024 * 1024 * 1024)) > /sys/block/zram0/disksize
# Set compression streams
echo 4 > /sys/block/zram0/max_comp_streams
# Enable writeback
echo 1 > /sys/block/zram0/writeback