PCIe 5x Rule
The PCIe 5x rule determines when GPU offload is beneficial for compression.
The Rule
GPU beneficial when: T_compute > 5 × T_transfer
Where:
T_compute= CPU computation timeT_transfer= PCIe transfer time (H2D + D2H)
Why 5x?
GPU offload has overhead:
- H2D transfer: Copy data to GPU
- Kernel launch: ~5-10us overhead
- D2H transfer: Copy results back
- Synchronization: Wait for completion
The 5x factor accounts for these overheads and ensures GPU provides net benefit.
Calculation
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::meets_pcie_rule;
use trueno_zram_core::PAGE_SIZE;
fn check_gpu_benefit(
pages: usize,
pcie_bandwidth_gbps: f64,
gpu_throughput_gbps: f64,
) -> bool {
let data_bytes = pages * PAGE_SIZE;
// Transfer time (round trip)
let transfer_time = 2.0 * data_bytes as f64 / (pcie_bandwidth_gbps * 1e9);
// GPU compute time
let gpu_time = data_bytes as f64 / (gpu_throughput_gbps * 1e9);
// CPU compute time (assume 4 GB/s baseline)
let cpu_time = data_bytes as f64 / (4e9);
// GPU beneficial if saves time
cpu_time > (transfer_time + gpu_time) * 1.2 // 20% margin
}
}
Examples
PCIe 4.0 x16 (25 GB/s)
| Batch | Data Size | Transfer | GPU Time | Beneficial? |
|---|---|---|---|---|
| 100 | 400 KB | 32 us | 4 us | No |
| 1,000 | 4 MB | 320 us | 40 us | Marginal |
| 10,000 | 40 MB | 3.2 ms | 400 us | Yes |
| 100,000 | 400 MB | 32 ms | 4 ms | Yes |
PCIe 5.0 x16 (64 GB/s)
| Batch | Data Size | Transfer | GPU Time | Beneficial? |
|---|---|---|---|---|
| 1,000 | 4 MB | 125 us | 40 us | No |
| 10,000 | 40 MB | 1.25 ms | 400 us | Yes |
| 100,000 | 400 MB | 12.5 ms | 4 ms | Yes |
Checking in Code
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{GpuBatchCompressor, GpuBatchConfig};
let mut compressor = GpuBatchCompressor::new(config)?;
let result = compressor.compress_batch(&pages)?;
if result.pcie_rule_satisfied() {
println!("GPU offload was beneficial");
println!("Kernel time: {} ns", result.kernel_time_ns);
println!("Transfer time: {} ns",
result.h2d_time_ns + result.d2h_time_ns);
} else {
println!("Consider using CPU for this batch size");
}
}
Backend Selection
#![allow(unused)]
fn main() {
use trueno_zram_core::gpu::{select_backend, BackendSelection, gpu_available};
let batch_size = 5000;
let has_gpu = gpu_available();
match select_backend(batch_size, has_gpu) {
BackendSelection::Gpu => {
// Use GPU batch compression
}
BackendSelection::Simd => {
// Use CPU SIMD compression
}
BackendSelection::Scalar => {
// Use scalar compression
}
}
}
Optimization Tips
- Batch larger: Combine small batches into larger ones
- Use async DMA: Overlap transfers with computation
- Profile first: Measure actual transfer times
- Consider hybrid: Use CPU for small batches, GPU for large
Hardware-Specific Thresholds
| GPU | PCIe | Min Beneficial Batch |
|---|---|---|
| H100 | 5.0 x16 | 5,000 pages |
| A100 | 4.0 x16 | 8,000 pages |
| RTX 4090 | 4.0 x16 | 10,000 pages |
| RTX 3090 | 4.0 x16 | 12,000 pages |