GPU Performance
This chapter presents empirical GPU performance findings from benchmarking on NVIDIA RTX 4090, documenting when GPU acceleration provides value versus SIMD.
Executive Summary
- Date: 2025-11-23
- Hardware: NVIDIA GeForce RTX 4090 (24GB VRAM)
- Driver: 570.195.03
- Platform: Linux 6.8.0-87-generic
- Software: Trueno v0.7.0, wgpu v27.0.1
Key Findings
- ✅ GPU wins for matrix operations: 81x speedup on 1000×1000 matrix multiplication
- ❌ GPU fails for vector operations: 2000x+ slower than SIMD due to 3.5ms fixed overhead
- 🚀 SIMD vastly superior for vector ops: Zero transfer overhead, 200-400% speedup
- 💡 Hybrid approach recommended: Use SIMD by default, GPU only for matmul >500×500
GPU Transfer Overhead
Fixed Overhead Breakdown
Empirically measured per-operation costs:
| Component | Time | Description |
|---|---|---|
| Buffer creation | ~0.5 ms | Allocate GPU-side memory |
| CPU→GPU transfer | ~1.5 ms | PCIe bandwidth limitation |
| Kernel dispatch | ~0.3 ms | GPU scheduling overhead |
| GPU→CPU readback | ~1.2 ms | PCIe bandwidth limitation |
| Total | ~3.5 ms | Minimum per operation |
Implications for Different Workload Sizes
| Size | Data Volume | Overhead Impact | GPU Viable? |
|---|---|---|---|
| 1K | 4 KB | 875 µs/KB | ❌ Never competitive |
| 10K | 40 KB | 87.5 µs/KB | ❌ Still dominated by overhead |
| 100K | 400 KB | 8.75 µs/KB | ⚠️ Marginal for complex ops |
| 1M | 4 MB | 0.875 µs/KB | ✅ Good amortization |
Rule of thumb: GPU only becomes competitive when compute time >> 3.5ms.
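The figures in the table above follow from simple arithmetic. A minimal sketch reproducing the "Overhead Impact" column (the 3.5 ms figure comes from the overhead table; the helper name is illustrative, not Trueno API):

```rust
/// Fixed GPU dispatch overhead measured above, in microseconds.
const GPU_OVERHEAD_US: f64 = 3_500.0;

/// Overhead impact in µs per KB of f32 payload (1 KB = 1000 bytes here,
/// matching the table's "1K elements = 4 KB" convention).
fn overhead_per_kb(n_elems: usize) -> f64 {
    let kb = (n_elems as f64 * 4.0) / 1000.0;
    GPU_OVERHEAD_US / kb
}

fn main() {
    // 1K elements => 4 KB => 875 µs/KB, as in the table.
    println!("{}", overhead_per_kb(1_000));     // 875
    // 1M elements => 4 MB => 0.875 µs/KB: good amortization.
    println!("{}", overhead_per_kb(1_000_000)); // 0.875
}
```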
Matrix Multiplication (GPU Excels)
Matrix multiplication has O(n³) complexity, which overwhelms the fixed 3.5ms overhead at large scales.
Benchmark Results
| Size | GPU Time | Scalar Time | Speedup | GPU Throughput | Scalar Throughput |
|---|---|---|---|---|---|
| 100×100 | 4.14 ms | 530.8 µs | 0.13x ❌ | 241.7 Melem/s | 1.88 Gelem/s |
| 500×500 | 4.59 ms | 77.4 ms | 16.9x ✅ | 27.2 Gelem/s | 1.61 Gelem/s |
| 1000×1000 | 7.84 ms | 638.7 ms | 81.5x ✅ | 127.6 Gelem/s | 1.57 Gelem/s |
Why GPU Wins for Matrix Multiplication
Compute complexity dominates transfer cost:
- 100×100: 1M operations → 531µs scalar → GPU overhead too high
- 500×500: 125M operations → 77ms scalar → GPU wins at 4.6ms
- 1000×1000: 1B operations → 639ms scalar → GPU wins at 7.8ms
Threshold: GPU becomes competitive at >500×500 (250,000 elements).
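The operation counts above follow directly from a reference implementation. A minimal scalar matmul sketch with an explicit multiply-add counter (illustrative only, not Trueno's implementation):

```rust
/// Naive row-major n×n matrix multiply: exactly n³ multiply-adds.
fn matmul_naive(a: &[f32], b: &[f32], n: usize) -> (Vec<f32>, u64) {
    let mut c = vec![0.0f32; n * n];
    let mut ops: u64 = 0;
    for i in 0..n {
        for j in 0..n {
            let mut acc = 0.0f32;
            for k in 0..n {
                acc += a[i * n + k] * b[k * n + j];
                ops += 1;
            }
            c[i * n + j] = acc;
        }
    }
    (c, ops)
}

fn main() {
    let n = 100;
    let a = vec![1.0f32; n * n];
    let b = vec![1.0f32; n * n];
    let (c, ops) = matmul_naive(&a, &b, n);
    // 100×100 => 1,000,000 multiply-adds, matching "1M operations" above.
    println!("ops = {}, c[0] = {}", ops, c[0]); // ops = 1000000, c[0] = 100
}
```

Doubling n multiplies the operation count by 8 while the transfer volume only grows 4x, which is why the fixed overhead eventually vanishes in the noise.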
Vector Operations (GPU Fails)
Simple vector operations are dominated by the 3.5ms fixed transfer overhead.
Vector Addition Results
| Size | GPU Time | Scalar Time | Speedup | GPU Throughput | Scalar Throughput |
|---|---|---|---|---|---|
| 1K | 3.26 ms | 71.0 ns | 0.00002x ❌ | 306.4 Kelem/s | 14.09 Gelem/s |
| 10K | 3.44 ms | 819.0 ns | 0.0002x ❌ | 2.91 Melem/s | 12.21 Gelem/s |
| 100K | 3.51 ms | 10.06 µs | 0.003x ❌ | 28.45 Melem/s | 9.94 Gelem/s |
| 1M | 5.98 ms | 96.5 µs | 0.016x ❌ | 167.3 Melem/s | 10.37 Gelem/s |
Dot Product Results
| Size | GPU Time | Scalar Time | Speedup |
|---|---|---|---|
| 1K | 3.45 ms | 567.4 ns | 0.0002x ❌ |
| 10K | 3.32 ms | 6.30 µs | 0.002x ❌ |
| 100K | 4.81 ms | 63.2 µs | 0.013x ❌ |
| 1M | 6.25 ms | 614.1 µs | 0.098x ❌ |
Key finding: Even at 1M elements, GPU is still ~10x slower than scalar (6.25 ms vs 614.1 µs) due to transfer overhead. Reduction overhead compounds the problem.
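A rough linear cost model fits the vector-add numbers above: total time ≈ fixed overhead + bytes moved / effective PCIe bandwidth. A sketch with constants fitted by eye to these tables (the 5 GB/s effective bandwidth is an assumption inferred from the measurements, not a measured value):

```rust
/// Rough GPU cost model: fixed dispatch overhead plus PCIe transfer time.
/// Constants fitted by eye to the benchmark tables above.
const OVERHEAD_MS: f64 = 3.5;
const EFFECTIVE_PCIE_GB_S: f64 = 5.0; // far below the 1,008 GB/s on-device bandwidth

/// Estimated wall time (ms) for a vector add on n f32 elements, assuming
/// two inputs uploaded and one output downloaded => 12n bytes moved.
fn gpu_vec_add_estimate_ms(n: usize) -> f64 {
    let bytes = (n as f64) * 12.0;
    OVERHEAD_MS + bytes / (EFFECTIVE_PCIE_GB_S * 1e6) // GB/s => bytes per ms
}

fn main() {
    // Measured 1M-element vector add: 5.98 ms; this model predicts ~5.9 ms.
    println!("{:.2}", gpu_vec_add_estimate_ms(1_000_000));
}
```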
Activation Functions
Activation functions are more compute-intensive than simple vector operations, but still suffer from transfer overhead.
ReLU (Simple Operation)
| Size | GPU Time | Scalar Time | Speedup |
|---|---|---|---|
| 10K | 3.49 ms | 559.9 ns | 0.0002x ❌ |
| 100K | 3.75 ms | 6.37 µs | 0.002x ❌ |
| 1M | 6.03 ms | 67.1 µs | 0.011x ❌ |
Sigmoid (Transcendental)
| Size | GPU Time | Scalar Time | Speedup |
|---|---|---|---|
| 10K | 3.64 ms | 20.99 µs | 0.006x ❌ |
| 100K | 3.75 ms | 207.4 µs | 0.055x ❌ |
| 1M | 5.81 ms | 3.18 ms | 0.55x ❌ |
GELU (Very Compute-Heavy)
| Size | GPU Time | Scalar Time | Speedup |
|---|---|---|---|
| 10K | 3.60 ms | 101.2 µs | 0.028x ❌ |
| 100K | 3.72 ms | 327.0 µs | 0.088x ❌ |
| 1M | 5.81 ms | 3.19 ms | 0.55x ❌ |
Key finding: Even compute-heavy operations like GELU and sigmoid are slower on GPU due to transfer overhead. At 1M elements, GPU reaches only about half of scalar throughput (0.55x).
Softmax (Multi-Pass Algorithm)
| Size | GPU Time | Scalar Time | Speedup |
|---|---|---|---|
| 10K | 16.75 ms | 29.2 µs | 0.002x ❌ |
| 100K | 16.26 ms | 292.3 µs | 0.018x ❌ |
| 1M | 22.79 ms | 3.01 ms | 0.13x ❌ |
Why softmax is even worse: Multi-pass algorithms require 3 GPU dispatches (max, exp, sum), compounding transfer overhead to ~10ms base cost.
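The three passes map one-to-one onto the three GPU dispatches. A CPU reference sketch of the numerically stable multi-pass algorithm (illustrative, not Trueno's shader code):

```rust
/// Numerically stable softmax via three passes, mirroring the three
/// GPU dispatches: (1) max, (2) exp + sum, (3) normalize.
fn softmax(x: &[f32]) -> Vec<f32> {
    // Pass 1: global max (subtracted for numerical stability).
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // Pass 2: exponentiate shifted values and accumulate the sum.
    let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    // Pass 3: normalize into probabilities.
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let y = softmax(&[1.0, 2.0, 3.0]);
    let total: f32 = y.iter().sum();
    println!("{:?} sums to {}", y, total);
}
```

Each pass depends on a global result of the previous one (the max, then the sum), so a GPU cannot fuse them into a single dispatch without cross-workgroup synchronization; hence three round-trips.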
SIMD vs GPU Comparison
Golden traces from Renacer v0.6.2 show SIMD baseline performance:
SIMD Performance (SSE2)
From golden_traces/performance_demo_summary.txt:
| Operation | Size | Scalar | SSE2 | Speedup | Runtime | Syscalls |
|---|---|---|---|---|---|---|
| Dot Product | 10K | 6.26µs | 1.55µs | 303% | 1.507ms | 138 |
| Sum Reduction | 10K | 7.12µs | 1.69µs | 320% | 1.507ms | 138 |
| Max Finding | 10K | 4.19µs | 1.06µs | 297% | 1.507ms | 138 |
| Element-wise Add | 10K | 1.44µs | 1.10µs | 30% | 1.507ms | 138 |
| Element-wise Mul | 10K | 1.10µs | 1.10µs | 0% | 1.507ms | 138 |
Head-to-Head Comparison
| Operation | Size | SIMD (SSE2) | GPU (RTX 4090) | Winner |
|---|---|---|---|---|
| Dot Product | 10K | 1.55µs | 3,324µs | SIMD 2144x faster |
| Vector Add | 10K | 1.10µs | 3,439µs | SIMD 3127x faster |
| Vector Add | 1M | 96.5µs | 5,978µs | SIMD 62x faster |
| Matrix Mul | 1000×1000 | 638.7ms | 7.84ms | GPU 81x faster |
Key Insights
- ✅ SIMD dominates for vector operations at ALL sizes due to zero overhead
- ✅ GPU wins for matrix operations (O(n³) complexity) at large scales
- 💡 Hybrid approach: Use SIMD by default, GPU only for matmul >500×500
Current GPU Thresholds in Trueno
Based on empirical findings, Trueno uses these thresholds:
```rust
// src/vector.rs:1316
const GPU_THRESHOLD: usize = usize::MAX; // GPU DISABLED - 2-800x slower

// src/matrix.rs:268
const GPU_THRESHOLD: usize = 500; // Empirical: 2x at 500×500, 9.6x at 1000×1000
```
Rationale:
- Vector operations: Transfer overhead will always dominate → GPU disabled
- Matrix operations: O(n³) complexity amortizes overhead → GPU at 500×500
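The two thresholds combine into a simple dispatch rule. A sketch of the resulting selection logic (the enum and function names are illustrative, not Trueno's actual API; whether the 500×500 boundary is inclusive is an assumption here):

```rust
#[derive(Debug, PartialEq)]
enum Backend {
    Simd,
    Gpu,
}

// Constants mirror the thresholds quoted above.
const VECTOR_GPU_THRESHOLD: usize = usize::MAX; // GPU disabled for vectors
const MATRIX_GPU_THRESHOLD: usize = 500; // per-dimension matrix threshold

fn select_vector_backend(len: usize) -> Backend {
    // len can never reach usize::MAX, so vectors always take the SIMD path.
    if len >= VECTOR_GPU_THRESHOLD { Backend::Gpu } else { Backend::Simd }
}

fn select_matrix_backend(rows: usize, cols: usize) -> Backend {
    if rows >= MATRIX_GPU_THRESHOLD && cols >= MATRIX_GPU_THRESHOLD {
        Backend::Gpu
    } else {
        Backend::Simd
    }
}

fn main() {
    println!("{:?}", select_vector_backend(1_000_000)); // Simd
    println!("{:?}", select_matrix_backend(1000, 1000)); // Gpu
}
```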
When to Use GPU
Use GPU when all of these conditions are met:
- Operation complexity: O(n²) or higher (matrix multiplication, convolution)
- Data size: >500×500 elements for matrix ops
- Compute time: Operation takes >10ms on CPU
- Batch processing: Multiple operations can be batched (future v2.0 API)
GPU is NOT recommended for:
- ❌ Vector operations (add, mul, dot, reduce) - use SIMD
- ❌ Activation functions (relu, sigmoid, tanh) - use SIMD
- ❌ Small matrices (<500×500) - overhead dominates
- ❌ Single operations - transfer overhead too high
GPU Tiled Reduction ✅ (v0.10.1)
Status: Validated on Metal (AMD Radeon Pro W5700X, Mac Pro 7,1)
The tiled reduction shader provides efficient GPU-based sum, max, and min operations using 16x16 workgroup tiles with two-phase reduction.
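The two-phase structure can be sketched on the CPU: phase 1 reduces each tile to a partial result (what the 16×16 workgroups do in parallel on the GPU), and phase 2 reduces the partials. A sequential illustrative sketch, not the shader itself (the 256-element tile assumes one value per 16×16 workgroup invocation):

```rust
/// Two-phase sum reduction: phase 1 reduces fixed-size tiles to partials,
/// phase 2 reduces the partials in a second, much smaller step.
fn tiled_sum(data: &[f32], tile: usize) -> f32 {
    // Phase 1: one partial sum per tile (each tile maps to a GPU workgroup).
    let partials: Vec<f32> = data
        .chunks(tile)
        .map(|c| c.iter().sum::<f32>())
        .collect();
    // Phase 2: reduce the partials (on the GPU, a second dispatch).
    partials.iter().sum()
}

fn main() {
    let data: Vec<f32> = (1..=1000).map(|i| i as f32).collect();
    // 16×16 workgroup => 256 elements per tile.
    println!("{}", tiled_sum(&data, 256)); // 500500
}
```

A side benefit of tiling is numerical: summing within tiles before combining partials accumulates less rounding error than one long sequential sum.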
Metal Benchmark Results (2026-01-03)
| Operation | Size | GPU Tiled | Scalar CPU | GPU Throughput |
|---|---|---|---|---|
| Sum | 1M | 8.25ms | 0.92ms | 121 Melem/s |
| Sum | 10M | 67.2ms | 9.46ms | 149 Melem/s |
| Sum | 32M | 215ms | 30.7ms | 149 Melem/s |
| Max | 1M | 8.3ms | 0.22ms | 120 Melem/s |
| Max | 10M | 67ms | 3.25ms | 150 Melem/s |
| Max | 32M | 215ms | 10.7ms | 149 Melem/s |
| Min | 1M | 8.28ms | 0.22ms | 121 Melem/s |
| Min | 10M | 67.2ms | 3.26ms | 149 Melem/s |
| Min | 32M | 215ms | 10.7ms | 149 Melem/s |
Key Findings
- Consistent ~150 Melem/s throughput across all sizes on GPU
- ~8ms baseline overhead from CPU→GPU transfer
- CPU is 7-37x faster for standalone reductions (expected for O(n) ops)
- GPU wins for O(n³) operations like matmul, but loses for O(n) reductions
When GPU Tiled Reduction is Optimal
✅ Use GPU reduction when:
- Data is already resident on GPU (no transfer cost)
- Reduction is part of larger GPU compute pipeline
- Latency hiding in async GPU workloads
❌ Prefer SIMD when:
- Data starts on CPU (transfer overhead dominates)
- Standalone reduction operation
- Low-latency required
Metal Buffer Limits
| Limit | Value | Max f32 Elements |
|---|---|---|
| Buffer binding | 128 MB | ~32M elements |
| Total buffer | 256 MB | ~64M elements |
CUDA PTX Validation ✅ (v0.10.1)
Status: Validated on NVIDIA GeForce RTX 4090 (Ada Lovelace, sm_89)
The trueno-gpu PTX code generation has been validated on real CUDA hardware, confirming JIT compilation and execution correctness.
RTX 4090 Validation Results (2026-01-03)
| Kernel | PTX Size | Lines | Status |
|---|---|---|---|
| gemm_naive_64 | 1.6 KB | 66 | ✅ PASS |
| gemm_tiled_128 | 2.6 KB | 104 | ✅ PASS |
| gemm_tensor_core | 7.8 KB | 273 | ✅ PASS |
| gemm_wmma_fp16 | 3.8 KB | 128 | ✅ PASS |
| softmax_1024 | 1.8 KB | 59 | ✅ PASS |
| layernorm_1024 | 2.8 KB | 94 | ✅ PASS |
| attention_64_64 | 3.9 KB | 146 | ✅ PASS |
| q4k_32 | 4.3 KB | 158 | ✅ PASS |
Kernel Generation Throughput
68,015 kernels/sec measured via bench_kernel_gen example.
| Kernel Type | Generation Time | Size |
|---|---|---|
| gemm_naive | 9.11 µs | 1.6 KB |
| gemm_tiled | 15.01 µs | 2.6 KB |
| gemm_tensor_core | 44.33 µs | 7.8 KB |
| attention | 23.00 µs | 3.9 KB |
| q4k_quantized | 28.43 µs | 4.3 KB |
Execution Verification
Simple Attention CUDA kernel verified with numerical accuracy:
- GPU execution: 134µs (16x16 sequence)
- Max difference: 2.98e-8 (vs CPU reference)
- Status: PASS
PTX Features Validated
- ✅ FMA fusion (mul+add → fma.rn.f32)
- ✅ F16 conversion (cvt.rn.f16.f32)
- ✅ Shared memory (smem with .align)
- ✅ WMMA Tensor Core ops
- ✅ Q4K quantization (4-bit dequantize)
- ✅ Tree reduction patterns
- ✅ Predicated execution (@%p bra)
Running CUDA Examples
```bash
# CUDA monitoring (device info, memory stats)
cargo run --example cuda_monitor --features cuda --release

# PTX generation benchmarks
cargo run --example bench_kernel_gen --features cuda --release

# Simple attention execution
cargo run --example simple_attention_cuda --features cuda --release

# Quantized GEMM PTX
cargo run --example q4k_gemm --features cuda --release
```
Example Usage
```rust
use trueno::backends::gpu::GpuBackend;

fn main() -> Result<(), String> {
    let mut gpu = GpuBackend::new();

    // Create 1000x1000 matrix
    let data: Vec<f32> = vec![1.0; 1_000_000];

    // GPU tiled sum reduction
    let sum = gpu.tiled_sum_2d_gpu(&data, 1000, 1000)?;
    println!("Sum: {}", sum); // 1000000.0

    // GPU tiled max/min
    let max = gpu.tiled_max_2d_gpu(&data, 1000, 1000)?;
    let min = gpu.tiled_min_2d_gpu(&data, 1000, 1000)?;

    Ok(())
}
```

```bash
# Run the demonstration
cargo run --example gpu_tiled_reduction --features gpu --release
```
Benchmark Execution
```bash
# Run tiled reduction benchmarks
cargo bench --features gpu --bench gpu_reduction
```
Async Batch API ✅ (v0.3.0 - AVAILABLE NOW)
Status: Fully implemented and tested (previously documented as "Future v2.0")
The async batch API solves the transfer overhead problem by queuing multiple operations and executing them in a single batch, amortizing the 3.5ms overhead across all operations.
Transfer Overhead Reduction
Traditional Synchronous API (current default):
```rust
// ❌ 3 operations = 3 × 3.5ms = 10.5ms overhead
let a = gpu.vec_add(&input1, &input2)?; // Upload → Compute → Download
let b = gpu.scale(&a, 2.0)?;            // Upload → Compute → Download
let c = gpu.relu(&b)?;                  // Upload → Compute → Download
// Total: 6 GPU transfers (3 uploads + 3 downloads)
```
Async Batch API (recommended for chained operations):
```rust
use trueno::backends::gpu::{GpuDevice, GpuCommandBatch};

// ✅ 3 operations = 1 × 3.5ms = 3.5ms overhead
let device = GpuDevice::new()?;
let mut batch = GpuCommandBatch::new(device);

// Queue operations (no GPU execution yet!)
let input = batch.upload(&[1.0, 2.0, -3.0, 4.0]);
let a = batch.add(input, other);
let b = batch.scale(a, 2.0);
let c = batch.relu(b);

// Execute entire batch in one GPU round-trip
batch.execute().await?;

// Read final result
let result = batch.read(c).await?;
// Total: 2 GPU transfers (1 upload + 1 download)
```
Performance Benefits
| Metric | Traditional API | Batch API | Improvement |
|---|---|---|---|
| GPU Transfers | 6 (3↑ + 3↓) | 2 (1↑ + 1↓) | 3x fewer |
| Overhead | 3 × 3.5ms = 10.5ms | 1 × 3.5ms = 3.5ms | 3x reduction |
| Expected Speedup | Baseline | 1.5-2x faster | For GPU-bound workloads |
When to Use Batch API
✅ Use batch API when:
- Chaining multiple GPU operations (>2 ops)
- Processing large workloads where GPU is beneficial (matmul >500×500)
- Amortizing transfer overhead is critical
❌ Stick with traditional API when:
- Single operation only
- Interactive/real-time workloads requiring immediate results
- Workloads small enough that SIMD is faster anyway
Complete Example
See examples/gpu_batch_demo.rs for three comprehensive demonstrations:
- Single Operation - Baseline batch API usage
- Batched Operations - ReLU → Scale → Add pipeline
- ML Pipeline - `y = ReLU(x * W + b)` simulation
```bash
# Run the demonstration
cargo run --example gpu_batch_demo --features gpu --release
```
Implementation Details
- Location: `src/backends/gpu/batch.rs` (1,008 lines)
- Tests: 8 comprehensive tests (all passing)
- Operations: relu, scale, add, mul, dot
- API: Fully async with tokio integration
- Safety: Type-safe buffer IDs prevent invalid operations
Future Enhancements (v0.4.0+)
While the batch API is complete, future improvements may include:
- Automatic optimization: Detect operation chains and auto-batch
- More operations: Expand beyond current 5 operations (relu, scale, add, mul, dot)
- Graph optimization: Reorder operations for maximum efficiency
- Multi-GPU: Distribute batches across multiple GPUs
- Persistent buffers: Reuse buffers across multiple batch executions
Hardware Details
```text
GPU: NVIDIA GeForce RTX 4090
├─ Architecture: Ada Lovelace
├─ CUDA Cores: 16,384
├─ Memory: 24GB GDDR6X
├─ Memory Bandwidth: 1,008 GB/s
├─ Boost Clock: 2.52 GHz
└─ TDP: 450W

Driver: 570.195.03
Platform: Linux 6.8.0-87-generic (x86_64)
```
Validation and Testing
Quality Gates
- ✅ All 13 GPU operations benchmarked
- ✅ 4 size ranges tested per operation
- ✅ Statistical significance (10 samples, CV <5%)
- ✅ Comparison against scalar baseline
- ✅ Clippy: Zero warnings
- ✅ Coverage: 90.40% (≥90% threshold)
- ✅ GPU initialization verified
- ✅ Correctness tests pass
Golden Trace Integration
Performance budgets established via renacer.toml:
```toml
[performance.budgets]
# SIMD operations should complete in <2ms with <200 syscalls
backend_detection = { max_time_ms = 2.0, max_syscalls = 200 }
matrix_operations = { max_time_ms = 2.0, max_syscalls = 200 }
activation_functions = { max_time_ms = 2.0, max_syscalls = 200 }
```
Validation tests in tests/golden_trace_validation.rs ensure SIMD performance doesn't regress.
Recommendations
Immediate Actions
- Use SIMD by default for all vector operations
- Reserve GPU for matrix operations >500×500
- Document transfer overhead prominently in API docs
- Educate users that GPU is not always faster
Future Enhancements (v2.0)
- Async batch API to amortize transfer overhead (shipped in v0.3.0; see Async Batch API above)
- Persistent GPU buffers for frequently-used data
- Hybrid CPU/GPU scheduling with overlap
- Profile-guided optimization for dynamic thresholds
References
- Full benchmark report: `docs/gpu-benchmark-report-2025-11-23.md`
- Golden traces: `golden_traces/` directory
- Golden trace analysis: `golden_traces/ANALYSIS.md`
- SIMD performance: `golden_traces/performance_demo_summary.txt`
- Renacer configuration: `renacer.toml`
- GPU bug fix: Commit b5ca0af (missing `device.poll()` in wgpu v27)
WebGPU for WASM (v0.7.3)
Trueno v0.7.3 introduces the gpu-wasm feature enabling GPU compute in browsers via WebGPU.
Feature Flag
```toml
[target.'cfg(target_arch = "wasm32")'.dependencies]
trueno = { version = "0.7.3", features = ["gpu-wasm"] }
```
Platform Differences
| Platform | Sync API | Async API | Runtime |
|---|---|---|---|
| Native | ✅ GpuDevice::new() | ✅ new_async() | pollster |
| WASM | ❌ (can't block) | ✅ new_async() | wasm-bindgen-futures |
Async-First Design
All GPU operations now have async variants (*_async) that work on both native and WASM:
```rust
// Works on all platforms
let device = GpuDevice::new_async().await?;
device.matmul_async(&a, &b, &mut result, m, k, n).await?;
device.relu_async(&input, &mut output).await?;
```
Runtime Detection
```rust
use trueno::backends::gpu::runtime;

if runtime::sync_available() {
    // Native: can use sync APIs
    let device = GpuDevice::new()?;
} else {
    // WASM: must use async
    let device = GpuDevice::new_async().await?;
}
```
Real-World Example: trueno-viz
trueno-viz demonstrates browser-based GPU compute with Trueno:
- WebGPU-accelerated matrix operations
- WASM-compiled Rust for client-side processing
- Interactive visualizations with GPU compute
See GPU Backend Architecture for complete WebGPU documentation.
Next Steps
- Backend Comparison - Detailed SIMD vs GPU trade-offs
- Benchmarks Overview - Complete benchmark methodology
- Optimization Guide - How to choose the right backend
- Profiling - Using Renacer for performance analysis