Backend Comparison
This chapter compares Trueno's three execution backends (Scalar, SIMD, GPU) across different operation types and workload sizes, providing guidance on when to use each.
Backend Overview
| Backend | Availability | Typical Speedup | Best Use Case |
|---|---|---|---|
| Scalar | All platforms | 1x (baseline) | Small workloads, reference implementation |
| SIMD | x86_64 (SSE2+), ARM (NEON), WASM | 2-4x | Most operations, <1M elements |
| GPU | Vulkan/Metal/DX12 systems | 10-80x | Large matrix ops (>500×500) |
Decision Matrix
Use this table to choose the optimal backend for your workload (a small dispatch sketch follows the table):
| Operation Type | Size Range | Recommended Backend | Expected Speedup |
|---|---|---|---|
| Vector Add/Mul | Any | SIMD | 1.1-1.3x |
| Dot Product | Any | SIMD | 3-4x |
| Matrix Mul | <500×500 | SIMD | 2-4x |
| Matrix Mul | 500×500-1000×1000 | GPU | 16-81x |
| Matrix Mul | >1000×1000 | GPU | 80x+ |
| Activations (ReLU, Sigmoid) | Any | SIMD | 1.2-7x |
| Reductions (Sum, Max) | Any | SIMD | 3-4x |
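To make the decision matrix concrete, here is a minimal dispatch sketch in Rust. The names (Backend, Op, choose_backend) are hypothetical and not part of Trueno's API; the helper only encodes the thresholds from the table above.

```rust
/// Hypothetical helper that encodes the decision matrix above.
/// Trueno selects backends internally; this only illustrates the rules.
#[derive(Debug, PartialEq)]
enum Backend {
    Simd,
    Gpu,
}

enum Op {
    ElementWise,           // add, mul, scale
    Reduction,             // dot, sum, max
    Activation,            // relu, sigmoid, tanh
    MatMul { dim: usize }, // square dim × dim matrix multiplication
}

fn choose_backend(op: &Op) -> Backend {
    match op {
        // Vector-shaped work never amortizes the ~3.5 ms GPU transfer cost.
        Op::ElementWise | Op::Reduction | Op::Activation => Backend::Simd,
        // O(n³) matrix multiplication pays off on the GPU from ~500×500 upward.
        Op::MatMul { dim } if *dim >= 500 => Backend::Gpu,
        Op::MatMul { .. } => Backend::Simd,
    }
}

fn main() {
    assert_eq!(choose_backend(&Op::ElementWise), Backend::Simd);
    assert_eq!(choose_backend(&Op::Reduction), Backend::Simd);
    assert_eq!(choose_backend(&Op::MatMul { dim: 100 }), Backend::Simd);
    assert_eq!(choose_backend(&Op::MatMul { dim: 1000 }), Backend::Gpu);
}
```

The point of the sketch is how few rules the matrix boils down to: SIMD everywhere except large matrix multiplication.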
Scalar Backend
Characteristics
- Pros:
  - Zero overhead
  - Simple, maintainable code
  - Predictable performance
  - Works everywhere
- Cons:
  - No parallelism
  - Slowest for compute-heavy operations
When to Use Scalar
- Reference implementation for correctness testing (see the sketch after this list)
- Platforms without SIMD support (rare)
- Debugging (simpler code paths)
- Very small workloads (<100 elements) where SIMD overhead dominates
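As an example of the reference-implementation role, a plain scalar dot product is easy to audit and makes a convenient oracle when validating the SIMD and GPU paths. The function below is an illustrative sketch, not Trueno's internal kernel.

```rust
/// Plain scalar dot product: slow, but trivially auditable.
/// Illustrative sketch; not Trueno's internal reference kernel.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = vec![1.0_f32, 2.0, 3.0];
    let b = vec![4.0_f32, 5.0, 6.0];
    // An optimized backend's result should match this within a small
    // tolerance, e.g. |fast - scalar| <= 1e-5 * |scalar|.
    assert_eq!(dot_scalar(&a, &b), 32.0);
}
```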
Performance
| Operation | Size | Time | Throughput |
|---|---|---|---|
| Vector Add | 10K | 819 ns | 12.21 Gelem/s |
| Dot Product | 10K | 6.30 µs | 1.59 Gelem/s |
| Matrix Mul | 1000×1000 | 638.7 ms | 1.57 Gelem/s |
SIMD Backend
Characteristics
- Pros:
  - Zero transfer overhead
  - 2-4x speedup for most operations
  - Low latency (<10µs for typical ops)
  - Works on all modern CPUs
- Cons:
  - Limited parallelism (4-8 elements)
  - Complex implementation
  - Platform-specific code
SIMD Instruction Sets
| ISA | Register Width | Elements (f32) | Availability |
|---|---|---|---|
| SSE2 | 128-bit | 4 | All x86_64 CPUs |
| AVX | 256-bit | 8 | Intel Sandy Bridge+ (2011+) |
| AVX2 | 256-bit + FMA | 8 | Intel Haswell+ (2013+) |
| AVX-512 | 512-bit | 16 | Intel Skylake-X+ (2017+), AMD Zen 4+ (2022+) |
| NEON | 128-bit | 4 | All ARM64 CPUs |
| SIMD128 | 128-bit | 4 | Modern browsers (WASM) |
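Beyond the SSE2/NEON baseline, the wider x86_64 instruction sets are normally probed at runtime rather than assumed at compile time. The sketch below uses the standard library's is_x86_feature_detected! macro; it illustrates the general approach and is not Trueno's actual detection code.

```rust
/// Pick the widest usable x86_64 instruction set at runtime.
/// Illustrative sketch; not Trueno's detection logic.
#[cfg(target_arch = "x86_64")]
fn best_isa() -> &'static str {
    if is_x86_feature_detected!("avx512f") {
        "AVX-512"
    } else if is_x86_feature_detected!("avx2") {
        "AVX2"
    } else if is_x86_feature_detected!("avx") {
        "AVX"
    } else {
        "SSE2" // guaranteed baseline on all x86_64 CPUs
    }
}

#[cfg(target_arch = "x86_64")]
fn main() {
    println!("dispatching to {}", best_isa());
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```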
SIMD Performance (SSE2)
From golden traces (golden_traces/performance_demo_summary.txt):
| Operation | Size | Scalar | SIMD (SSE2) | Speedup | Runtime | Syscalls |
|---|---|---|---|---|---|---|
| Dot Product | 10K | 6.26µs | 1.55µs | 4.0x ✅ | 1.507ms | 138 |
| Sum Reduction | 10K | 7.12µs | 1.69µs | 4.2x ✅ | 1.507ms | 138 |
| Max Finding | 10K | 4.19µs | 1.06µs | 4.0x ✅ | 1.507ms | 138 |
| Element-wise Add | 10K | 1.44µs | 1.10µs | 1.3x | 1.507ms | 138 |
| Element-wise Mul | 10K | 1.10µs | 1.10µs | 1.0x | 1.507ms | 138 |
Why SIMD Excels
Zero overhead architecture:
- No data transfer (operates directly on CPU cache)
- No synchronization (single-threaded execution)
- Immediate execution (no queuing or dispatch)
Optimal for:
- ✅ Reduction operations (dot, sum, max): Parallel accumulation (sketched below)
- ✅ Compute-intensive ops (tanh, sigmoid): Amortizes instruction overhead
- ⚠️ Memory-bound ops (add, mul): Limited by RAM bandwidth, not compute
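To make the parallel-accumulation point concrete, the sketch below carries four partial sums in one 128-bit SSE2 register and reduces them once at the end. It illustrates the technique only; it is not Trueno's kernel.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// SSE2 dot product carrying four partial sums per loop iteration.
/// Illustrative sketch only; not Trueno's actual kernel.
#[cfg(target_arch = "x86_64")]
fn dot_sse2(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 4;
    // SAFETY: SSE2 is part of the x86_64 baseline and all loads stay in bounds.
    let mut sum = unsafe {
        let mut acc = _mm_setzero_ps(); // four parallel accumulators
        for i in 0..chunks {
            let va = _mm_loadu_ps(a.as_ptr().add(i * 4));
            let vb = _mm_loadu_ps(b.as_ptr().add(i * 4));
            acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
        }
        // Horizontal reduction of the four lanes into one scalar.
        let mut lanes = [0.0f32; 4];
        _mm_storeu_ps(lanes.as_mut_ptr(), acc);
        lanes.iter().sum::<f32>()
    };
    // Scalar tail for lengths that are not a multiple of four.
    for i in chunks * 4..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

#[cfg(target_arch = "x86_64")]
fn main() {
    let a: Vec<f32> = (0..10).map(|i| i as f32).collect();
    let b = vec![2.0_f32; 10];
    assert_eq!(dot_sse2(&a, &b), 90.0); // 2 * (0 + 1 + ... + 9)
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```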
GPU Backend
Characteristics
- Pros:
  - Massive parallelism (thousands of cores)
  - 80x+ speedup for large matrix operations
  - Excellent for O(n³) algorithms
- Cons:
  - 3.5ms fixed overhead per operation
  - Requires PCIe transfer (CPU↔GPU)
  - Only beneficial for large workloads
  - Not always available
GPU Transfer Overhead
Critical limitation: Every GPU operation incurs a ~3.5 ms fixed cost:
| Component | Time | Description |
|---|---|---|
| Buffer creation | 0.5 ms | Allocate GPU-side memory |
| CPU→GPU transfer | 1.5 ms | PCIe bandwidth limitation |
| Kernel dispatch | 0.3 ms | GPU scheduling |
| GPU→CPU readback | 1.2 ms | PCIe bandwidth limitation |
| Total | 3.5 ms | Minimum per operation |
GPU Performance (RTX 4090)
Vector operations (❌ GPU fails):
| Operation | Size | GPU Time | SIMD Time | Verdict |
|---|---|---|---|---|
| Vector Add | 10K | 3.44 ms | 1.10 µs | SIMD 3127x faster |
| Dot Product | 10K | 3.32 ms | 1.55 µs | SIMD 2144x faster |
| ReLU | 1M | 6.03 ms | 67.1 µs | SIMD 90x faster |
| Sigmoid | 1M | 5.81 ms | 3.18 ms | SIMD 1.8x faster |
Matrix operations (✅ GPU wins):
| Size | GPU Time | Scalar Time | Speedup |
|---|---|---|---|
| 100×100 | 4.14 ms | 530.8 µs | 0.13x ❌ |
| 500×500 | 4.59 ms | 77.4 ms | 16.9x ✅ |
| 1000×1000 | 7.84 ms | 638.7 ms | 81.5x ✅ |
Why GPU Fails for Vector Operations
Transfer overhead dominates:
- 10K vector add: 1.1µs compute vs 3500µs transfer → 3182x overhead
- 1M vector add: 96.5µs compute vs 3500µs transfer → 36x overhead
Even compute-heavy ops suffer:
- 1M sigmoid: 3.18ms compute vs 3.5ms transfer → Barely competitive
Why GPU Wins for Matrix Operations
O(n³) complexity overwhelms transfer cost:
- 500×500 matmul: 125M ops → 77ms scalar → GPU wins at 4.6ms (16.9x speedup)
- 1000×1000 matmul: 1B ops → 639ms scalar → GPU wins at 7.8ms (81x speedup)
GPU becomes competitive when: compute_time_scalar > 10 × transfer_overhead (see the helper sketched below)
For matrix multiplication:
- 500×500: 77ms compute >> 3.5ms transfer → GPU wins
- 100×100: 531µs compute << 3.5ms transfer → GPU loses
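That break-even rule is small enough to write down directly. The helper below is hypothetical (not Trueno's dispatcher); the 3.5 ms overhead and the 10× safety factor come from the measurements above.

```rust
/// Rough GPU break-even check based on the measured ~3.5 ms transfer overhead.
/// Hypothetical helper; not Trueno's dispatcher.
fn gpu_worthwhile(estimated_scalar_ms: f64) -> bool {
    const TRANSFER_OVERHEAD_MS: f64 = 3.5;
    const SAFETY_FACTOR: f64 = 10.0;
    estimated_scalar_ms > SAFETY_FACTOR * TRANSFER_OVERHEAD_MS
}

fn main() {
    assert!(gpu_worthwhile(77.4));  // 500×500 matmul: ~77 ms scalar → GPU wins
    assert!(!gpu_worthwhile(0.53)); // 100×100 matmul: ~0.5 ms scalar → GPU loses
}
```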
Backend Comparison by Operation Type
Element-Wise Operations (add, mul, scale)
| Backend | Typical Time (10K) | Speedup vs Scalar | Verdict |
|---|---|---|---|
| Scalar | 800 ns | 1.0x | Baseline |
| SIMD | 600 ns | 1.3x | ✅ Use SIMD |
| GPU | 3400 µs | 0.0002x | ❌ Never use GPU |
Recommendation: Always use SIMD. Memory-bound, but SIMD has zero overhead.
Reduction Operations (dot, sum, max)
| Backend | Typical Time (10K) | Speedup vs Scalar | Verdict |
|---|---|---|---|
| Scalar | 6.3 µs | 1.0x | Baseline |
| SIMD | 1.5 µs | 4.0x | ✅ Use SIMD |
| GPU | 3320 µs | 0.002x | ❌ Never use GPU |
Recommendation: Always use SIMD. Excellent parallel accumulation, zero overhead.
Activation Functions (relu, sigmoid, tanh)
| Backend | Typical Time (1M) | Speedup vs Scalar | Verdict |
|---|---|---|---|
| Scalar (ReLU) | 67.1 µs | 1.0x | Baseline |
| SIMD (ReLU) | ~20 µs | ~3x | ✅ Use SIMD |
| GPU (ReLU) | 6030 µs | 0.011x | ❌ Never use GPU |
| Scalar (Sigmoid) | 3.18 ms | 1.0x | Baseline |
| SIMD (Sigmoid) | ~1 ms | ~3x | ✅ Use SIMD |
| GPU (Sigmoid) | 5.81 ms | 0.55x | ❌ Never use GPU |
Recommendation: Always use SIMD, even for compute-heavy activations.
Matrix Multiplication
| Backend | Time (1000×1000) | Speedup vs Scalar | Verdict |
|---|---|---|---|
| Scalar | 638.7 ms | 1.0x | Baseline |
| SIMD | ~160 ms | ~4x | ✅ Use for <500×500 |
| GPU | 7.84 ms | 81.5x | ✅ Use for >500×500 |
Recommendation: Use GPU for matrices >500×500, otherwise SIMD.
Threshold Guidelines
Current Trueno Thresholds
```rust
// Vector operations (src/vector.rs:1316)
const GPU_THRESHOLD: usize = usize::MAX; // GPU DISABLED

// Matrix operations (src/matrix.rs:268)
const GPU_THRESHOLD: usize = 500; // 500×500 minimum
```
Size-Based Recommendations
| Workload Size | Vector Ops | Matrix Ops | Rationale |
|---|---|---|---|
| <100 | Scalar/SIMD | Scalar/SIMD | SIMD overhead marginal |
| 100-1K | SIMD | SIMD | Sweet spot for SIMD |
| 1K-100K | SIMD | SIMD | SIMD still optimal |
| 100K-500×500 | SIMD | SIMD | GPU overhead too high |
| 500×500-1000×1000 | SIMD | GPU | O(n³) amortizes overhead |
| >1000×1000 | SIMD | GPU | Massive compute dominates |
Operation Complexity Classes
Trueno categorizes operations by complexity:
```rust
pub enum OpComplexity {
    Low,    // Simple ops: add, mul (GPU disabled)
    Medium, // Moderate: dot, reduce (GPU at 100K+)
    High,   // Complex: matmul, conv2d (GPU at 500×500+)
}
```
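One plausible way these classes drive dispatch is a per-class size threshold. The helper below is a hypothetical sketch that mirrors the thresholds noted in the comments above; it is not Trueno's actual dispatch code.

```rust
// Repeated from above so the sketch is self-contained.
pub enum OpComplexity {
    Low,    // add, mul
    Medium, // dot, reduce
    High,   // matmul, conv2d
}

/// Hypothetical mapping from complexity class to the minimum element count
/// at which the GPU path is considered (mirrors the comments above).
fn gpu_threshold(complexity: &OpComplexity) -> usize {
    match complexity {
        OpComplexity::Low => usize::MAX, // GPU disabled for memory-bound ops
        OpComplexity::Medium => 100_000, // reductions: 100K+ elements
        OpComplexity::High => 500 * 500, // matmul: 500×500 and larger
    }
}

fn should_use_gpu(complexity: &OpComplexity, elements: usize) -> bool {
    elements >= gpu_threshold(complexity)
}

fn main() {
    assert!(should_use_gpu(&OpComplexity::High, 1_000 * 1_000)); // 1000×1000 matmul
    assert!(should_use_gpu(&OpComplexity::Medium, 200_000));     // large reduction
    assert!(!should_use_gpu(&OpComplexity::Low, 10_000));        // 10K element-wise add
}
```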
Performance Validation
Golden Trace Baselines
Performance budgets in renacer.toml ensure SIMD doesn't regress:
```toml
[performance.budgets]
backend_detection = { max_time_ms = 2.0, max_syscalls = 200 }
matrix_operations = { max_time_ms = 2.0, max_syscalls = 200 }
activation_functions = { max_time_ms = 2.0, max_syscalls = 200 }
```
All SIMD operations must complete in <2ms with <200 syscalls.
Validation Tests
tests/golden_trace_validation.rs ensures (a simplified sketch follows this list):
- SIMD performance matches golden traces (±10%)
- No unexpected syscall patterns
- Runtime stays under budget
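A heavily simplified version of the runtime-budget part of such a test might look like the sketch below. The names and numbers are illustrative (the 2 ms budget comes from renacer.toml above); the golden-trace diff and syscall checks are not reproduced here.

```rust
use std::time::Instant;

/// Heavily simplified runtime-budget check; not the actual contents of
/// tests/golden_trace_validation.rs.
#[test]
fn dot_product_stays_under_budget() {
    const MAX_TIME_MS: f64 = 2.0;

    let a = vec![1.0_f32; 10_000];
    let b = vec![2.0_f32; 10_000];

    let start = Instant::now();
    let dot: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
    let elapsed_ms = start.elapsed().as_secs_f64() * 1e3;

    assert!(dot.is_finite()); // keep the computation from being optimized away
    assert!(
        elapsed_ms < MAX_TIME_MS,
        "dot product took {:.3} ms, budget is {} ms",
        elapsed_ms,
        MAX_TIME_MS
    );
}
```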
Future: Hybrid Scheduling (v2.0)
Current API forces a single backend per operation. Future hybrid scheduling will:
- Profile operation characteristics at runtime
- Dynamically select backend based on actual compute time
- Batch GPU operations to amortize transfer overhead
- Overlap CPU and GPU work for pipeline parallelism
Example future API:
```rust
let scheduler = HybridScheduler::new()
    .prefer_simd_threshold_ms(5.0) // Use SIMD if op <5ms
    .gpu_batch_window_ms(10.0);    // Batch GPU ops within 10ms

scheduler.execute_pipeline(|pipe| {
    let a = pipe.add(&x, &y);    // SIMD (fast)
    let b = pipe.dot(&a, &z);    // SIMD (fast)
    let c = pipe.matmul(&b, &w); // GPU (queued)
    let d = pipe.matmul(&c, &v); // GPU (batched!)
    d
});
```
Recommendations Summary
For Vector Operations
- Always use SIMD - Zero overhead, 2-4x speedup
- Never use GPU - 2000x+ slower due to transfer overhead
- Use scalar only for <100 elements or debugging
For Matrix Operations
- Use SIMD for matrices <500×500
- Use GPU for matrices ≥500×500 (16-81x speedup)
- Consider batching multiple GPU operations in future
General Guidelines
- Latency-critical: Always SIMD (microsecond-scale)
- Throughput-critical: GPU for large batches, SIMD otherwise
- Portable: SIMD works everywhere (x86, ARM, WASM)
- Maximum performance: Profile and choose dynamically
References
- GPU Performance - Detailed GPU benchmarks (RTX 4090)
- SIMD Performance - SIMD optimization techniques
- Benchmarks Overview - Complete benchmark methodology
- Full report: docs/gpu-benchmark-report-2025-11-23.md
- Golden traces: golden_traces/ANALYSIS.md
- Configuration: renacer.toml