Backend Comparison

This chapter compares Trueno's three execution backends (Scalar, SIMD, GPU) across different operation types and workload sizes, providing guidance on when to use each.

Backend Overview

| Backend | Availability | Typical Speedup | Best Use Case |
|---------|--------------|-----------------|---------------|
| Scalar | All platforms | 1x (baseline) | Small workloads, reference implementation |
| SIMD | x86_64 (SSE2+), ARM (NEON), WASM | 2-4x | Most operations, <1M elements |
| GPU | Vulkan/Metal/DX12 systems | 10-80x | Large matrix ops (>500×500) |

Decision Matrix

Use this table to choose the optimal backend for your workload:

| Operation Type | Size Range | Recommended Backend | Expected Speedup |
|----------------|------------|---------------------|------------------|
| Vector Add/Mul | Any | SIMD | 1.1-1.3x |
| Dot Product | <1M | SIMD | 3-4x |
| Dot Product | >1M | SIMD | 3-4x |
| Matrix Mul | <500×500 | SIMD | 2-4x |
| Matrix Mul | 500×500-1000×1000 | GPU | 16-81x |
| Matrix Mul | >1000×1000 | GPU | 80x+ |
| Activations (ReLU, Sigmoid) | Any | SIMD | 1.2-7x |
| Reductions (Sum, Max) | Any | SIMD | 3-4x |
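
For illustration, the matrix above can be collapsed into a simple size check. The sketch below is hypothetical (`Backend` and `choose_backend` are not part of Trueno's API), but it encodes the same decisions:

// Hypothetical backend selector encoding the decision matrix above.
// `n` is the element count for vector ops, or the square dimension for matmul.
enum Backend {
    Scalar,
    Simd,
    Gpu,
}

fn choose_backend(is_matmul: bool, n: usize) -> Backend {
    if is_matmul && n >= 500 {
        Backend::Gpu // O(n³) work amortizes the ~3.5 ms transfer overhead
    } else if n < 100 {
        Backend::Scalar // workload too small for SIMD setup to pay off
    } else {
        Backend::Simd // default choice for everything else
    }
}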

Scalar Backend

Characteristics

  • Pros:

    • Zero overhead
    • Simple, maintainable code
    • Predictable performance
    • Works everywhere
  • Cons:

    • No parallelism
    • Slowest for compute-heavy operations

When to Use Scalar

  • Reference implementation for correctness testing (see the test sketch after this list)
  • Platforms without SIMD support (rare)
  • Debugging (simpler code paths)
  • Very small workloads (<100 elements) where SIMD overhead dominates
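
The correctness-testing role can be sketched as a simple oracle check: compute the result with a plain scalar loop and require an optimized path to match it within a small tolerance (floating-point reassociation means the results differ slightly). The code below is illustrative only; dot_reordered stands in for a SIMD kernel and none of the names are Trueno's test code.

// Scalar reference used as the oracle.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Stand-in for a SIMD kernel: four independent accumulators, mirroring
// a 4-lane reduction, so the summation order differs from the reference.
fn dot_reordered(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    for (i, (x, y)) in a.iter().zip(b).enumerate() {
        acc[i % 4] += x * y;
    }
    acc.iter().sum()
}

#[test]
fn optimized_path_matches_scalar_reference() {
    let a: Vec<f32> = (0..10_000).map(|i| (i as f32).sin()).collect();
    let b: Vec<f32> = (0..10_000).map(|i| (i as f32).cos()).collect();
    let reference = dot_scalar(&a, &b);
    let optimized = dot_reordered(&a, &b);
    assert!((reference - optimized).abs() <= 1e-2 * reference.abs().max(1.0));
}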

Performance

| Operation | Size | Time | Throughput |
|-----------|------|------|------------|
| Vector Add | 10K | 819 ns | 12.21 Gelem/s |
| Dot Product | 10K | 6.30 µs | 1.59 Gelem/s |
| Matrix Mul | 1000×1000 | 638.7 ms | 1.57 Gelem/s |

SIMD Backend

Characteristics

  • Pros:

    • Zero transfer overhead
    • 2-4x speedup for most operations
    • Low latency (<10µs for typical ops)
    • Works on all modern CPUs
  • Cons:

    • Limited parallelism (4-8 elements)
    • Complex implementation
    • Platform-specific code

SIMD Instruction Sets

| ISA | Register Width | Elements (f32) | Availability |
|-----|----------------|----------------|--------------|
| SSE2 | 128-bit | 4 | All x86_64 CPUs |
| AVX | 256-bit | 8 | Intel Sandy Bridge+ (2011+) |
| AVX2 | 256-bit + FMA | 8 | Intel Haswell+ (2013+) |
| AVX-512 | 512-bit | 16 | Intel Skylake-X+ (2017+), AMD Zen 4+ (2022+) |
| NEON | 128-bit | 4 | All ARM64 CPUs |
| SIMD128 | 128-bit | 4 | Modern browsers (WASM) |
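
At runtime, the widest available ISA can be probed with the standard library's feature-detection macro. The snippet below is a general sketch of that pattern (not Trueno's actual detection code):

// Report the number of f32 lanes the widest detected x86_64 ISA provides.
// A sketch using std's is_x86_feature_detected!; Trueno's own probe may differ.
#[cfg(target_arch = "x86_64")]
fn simd_lanes_f32() -> usize {
    if is_x86_feature_detected!("avx512f") {
        16 // 512-bit registers
    } else if is_x86_feature_detected!("avx2") {
        8 // 256-bit registers, FMA typically available alongside
    } else if is_x86_feature_detected!("avx") {
        8 // 256-bit registers
    } else {
        4 // SSE2 baseline, guaranteed on x86_64
    }
}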

SIMD Performance (SSE2)

From golden traces (golden_traces/performance_demo_summary.txt):

| Operation | Size | Scalar | SIMD (SSE2) | Speedup | Runtime | Syscalls |
|-----------|------|--------|-------------|---------|---------|----------|
| Dot Product | 10K | 6.26 µs | 1.55 µs | 4.0x | 1.507 ms | 138 |
| Sum Reduction | 10K | 7.12 µs | 1.69 µs | 4.2x | 1.507 ms | 138 |
| Max Finding | 10K | 4.19 µs | 1.06 µs | 4.0x | 1.507 ms | 138 |
| Element-wise Add | 10K | 1.44 µs | 1.10 µs | 1.3x | 1.507 ms | 138 |
| Element-wise Mul | 10K | 1.10 µs | 1.10 µs | 1.0x | 1.507 ms | 138 |

Why SIMD Excels

Zero-overhead architecture:

  • No data transfer (operates directly on CPU cache)
  • No synchronization (single-threaded execution)
  • Immediate execution (no queuing or dispatch)

Optimal for:

  • ✅ Reduction operations (dot, sum, max): Parallel accumulation (sketched below)
  • ✅ Compute-intensive ops (tanh, sigmoid): Amortizes instruction overhead
  • ⚠️ Memory-bound ops (add, mul): Limited by RAM bandwidth, not compute
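
At the instruction level, the reduction case looks like the sketch below: each 128-bit SSE2 register carries four partial sums that are combined only once at the end. This is a minimal illustration of the technique, assuming SSE2, and is not Trueno's implementation.

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Illustrative SSE2 dot product: four partial sums per 128-bit register.
#[cfg(target_arch = "x86_64")]
unsafe fn dot_sse2(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = _mm_setzero_ps();
    let chunks = a.len() / 4;
    for i in 0..chunks {
        let va = _mm_loadu_ps(a.as_ptr().add(i * 4));
        let vb = _mm_loadu_ps(b.as_ptr().add(i * 4));
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb)); // 4 multiply-adds per iteration
    }
    // Horizontal sum of the four lanes.
    let mut lanes = [0.0f32; 4];
    _mm_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths not divisible by 4.
    for i in chunks * 4..a.len() {
        sum += a[i] * b[i];
    }
    sum
}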

GPU Backend

Characteristics

  • Pros:

    • Massive parallelism (thousands of cores)
    • 80x+ speedup for large matrix operations
    • Excellent for O(n³) algorithms
  • Cons:

    • 3.5ms fixed overhead per operation
    • Requires PCIe transfer (CPU↔GPU)
    • Only beneficial for large workloads
    • Not always available

GPU Transfer Overhead

Critical limitation: every GPU operation incurs a fixed overhead of roughly 3.5 ms:

| Component | Time | Description |
|-----------|------|-------------|
| Buffer creation | 0.5 ms | Allocate GPU-side memory |
| CPU→GPU transfer | 1.5 ms | PCIe bandwidth limitation |
| Kernel dispatch | 0.3 ms | GPU scheduling |
| GPU→CPU readback | 1.2 ms | PCIe bandwidth limitation |
| Total | 3.5 ms | Minimum per operation |

GPU Performance (RTX 4090)

Vector operations (❌ GPU fails):

| Operation | Size | GPU Time | SIMD Time | Verdict |
|-----------|------|----------|-----------|---------|
| Vector Add | 10K | 3.44 ms | 1.10 µs | SIMD 3127x faster |
| Dot Product | 10K | 3.32 ms | 1.55 µs | SIMD 2144x faster |
| ReLU | 1M | 6.03 ms | 67.1 µs | SIMD 90x faster |
| Sigmoid | 1M | 5.81 ms | 3.18 ms | SIMD 1.8x faster |

Matrix operations (✅ GPU wins):

| Size | GPU Time | Scalar Time | Speedup |
|------|----------|-------------|---------|
| 100×100 | 4.14 ms | 530.8 µs | 0.13x ❌ |
| 500×500 | 4.59 ms | 77.4 ms | 16.9x |
| 1000×1000 | 7.84 ms | 638.7 ms | 81.5x |

Why GPU Fails for Vector Operations

Transfer overhead dominates:

  • 10K vector add: 1.1 µs compute vs 3500 µs transfer → transfer costs 3182x the compute time
  • 1M vector add: 96.5 µs compute vs 3500 µs transfer → transfer costs 36x the compute time

Even compute-heavy ops suffer:

  • 1M sigmoid: 3.18ms compute vs 3.5ms transfer → Barely competitive

Why GPU Wins for Matrix Operations

O(n³) complexity overwhelms transfer cost:

  • 500×500 matmul: 125M multiply-adds → 77 ms scalar → GPU wins at 4.6 ms (16.9x speedup)
  • 1000×1000 matmul: 1B multiply-adds → 639 ms scalar → GPU wins at 7.8 ms (81.5x speedup)

GPU becomes competitive when: compute_time_scalar > 10 × transfer_overhead

For matrix multiplication:

  • 500×500: 77ms compute >> 3.5ms transfer → GPU wins
  • 100×100: 531µs compute << 3.5ms transfer → GPU loses
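
The break-even rule can be turned into a back-of-the-envelope estimate: approximate the scalar matmul time from the measured ~1.57 Gelem/s throughput and compare it to ten times the ~3.5 ms transfer overhead. The function below is a sketch under those two assumptions, not a profiler:

// Rough break-even check for offloading an n×n matmul to the GPU.
// Assumes ~1.57e9 scalar multiply-adds per second and a ~3.5 ms fixed
// transfer cost (both taken from the benchmarks above).
fn gpu_worth_it(n: usize) -> bool {
    const SCALAR_OPS_PER_SEC: f64 = 1.57e9;
    const TRANSFER_MS: f64 = 3.5;

    let ops = (n as f64).powi(3); // n³ multiply-adds
    let scalar_ms = ops / SCALAR_OPS_PER_SEC * 1e3;
    scalar_ms > 10.0 * TRANSFER_MS
}

// gpu_worth_it(100)  -> false  (~0.6 ms of scalar work)
// gpu_worth_it(500)  -> true   (~80 ms of scalar work)
// gpu_worth_it(1000) -> true   (~637 ms of scalar work)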

Backend Comparison by Operation Type

Element-Wise Operations (add, mul, scale)

| Backend | Typical Time (10K) | Speedup vs Scalar | Verdict |
|---------|--------------------|-------------------|---------|
| Scalar | 800 ns | 1.0x | Baseline |
| SIMD | 600 ns | 1.3x | ✅ Use SIMD |
| GPU | 3400 µs | 0.0002x | ❌ Never use GPU |

Recommendation: Always use SIMD. These operations are memory-bound, so the gain is modest, but SIMD adds no overhead.

Reduction Operations (dot, sum, max)

| Backend | Typical Time (10K) | Speedup vs Scalar | Verdict |
|---------|--------------------|-------------------|---------|
| Scalar | 6.3 µs | 1.0x | Baseline |
| SIMD | 1.5 µs | 4.0x | ✅ Use SIMD |
| GPU | 3320 µs | 0.002x | ❌ Never use GPU |

Recommendation: Always use SIMD. Excellent parallel accumulation, zero overhead.

Activation Functions (relu, sigmoid, tanh)

| Backend | Typical Time (1M) | Speedup vs Scalar | Verdict |
|---------|-------------------|-------------------|---------|
| Scalar (ReLU) | 67.1 µs | 1.0x | Baseline |
| SIMD (ReLU) | ~20 µs | ~3x | ✅ Use SIMD |
| GPU (ReLU) | 6030 µs | 0.011x | ❌ Never use GPU |
| Scalar (Sigmoid) | 3.18 ms | 1.0x | Baseline |
| SIMD (Sigmoid) | ~1 ms | ~3x | ✅ Use SIMD |
| GPU (Sigmoid) | 5.81 ms | 0.55x | ❌ Never use GPU |

Recommendation: Always use SIMD, even for compute-heavy activations.

Matrix Multiplication

| Backend | Time (1000×1000) | Speedup vs Scalar | Verdict |
|---------|------------------|-------------------|---------|
| Scalar | 638.7 ms | 1.0x | Baseline |
| SIMD | ~160 ms | ~4x | ✅ Use for <500×500 |
| GPU | 7.84 ms | 81.5x | ✅ Use for >500×500 |

Recommendation: Use GPU for matrices >500×500, otherwise SIMD.

Threshold Guidelines

Current Trueno Thresholds

// Vector operations (src/vector.rs:1316)
const GPU_THRESHOLD: usize = usize::MAX; // GPU DISABLED

// Matrix operations (src/matrix.rs:268)
const GPU_THRESHOLD: usize = 500; // 500×500 minimum

Size-Based Recommendations

| Workload Size | Vector Ops | Matrix Ops | Rationale |
|---------------|------------|------------|-----------|
| <100 | Scalar/SIMD | Scalar/SIMD | SIMD overhead marginal |
| 100-1K | SIMD | SIMD | Sweet spot for SIMD |
| 1K-100K | SIMD | SIMD | SIMD still optimal |
| 100K-500×500 | SIMD | SIMD | GPU overhead too high |
| 500×500-1000×1000 | SIMD | GPU | O(n³) amortizes overhead |
| >1000×1000 | SIMD | GPU | Massive compute dominates |

Operation Complexity Classes

Trueno categorizes operations by complexity:

pub enum OpComplexity {
    Low,    // Simple ops: add, mul (GPU disabled)
    Medium, // Moderate: dot, reduce (GPU at 100K+)
    High,   // Complex: matmul, conv2d (GPU at 500×500+)
}
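
The class determines how large a workload must be before GPU dispatch is considered. A hypothetical mapping, using the OpComplexity enum above, might look like the following (the function name and exact numbers are illustrative; the real thresholds live next to the operations in Trueno's source):

// Minimum workload size per complexity class before the GPU is attempted.
// Returns None when the GPU path is disabled outright.
fn gpu_threshold(complexity: OpComplexity) -> Option<usize> {
    match complexity {
        OpComplexity::Low => None,             // add/mul: transfer cost always dominates
        OpComplexity::Medium => Some(100_000), // dot/reduce: element count
        OpComplexity::High => Some(500),       // matmul/conv2d: square matrix dimension
    }
}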

Performance Validation

Golden Trace Baselines

Performance budgets in renacer.toml ensure SIMD doesn't regress:

[performance.budgets]
backend_detection = { max_time_ms = 2.0, max_syscalls = 200 }
matrix_operations = { max_time_ms = 2.0, max_syscalls = 200 }
activation_functions = { max_time_ms = 2.0, max_syscalls = 200 }

Each budgeted workload must complete in under 2 ms with fewer than 200 syscalls.

Validation Tests

tests/golden_trace_validation.rs ensures:

  • SIMD performance matches golden traces (±10%)
  • No unexpected syscall patterns
  • Runtime stays under budget
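
A budget check of this shape can be written as an ordinary timed test: run the operation, then assert that the elapsed time stays under the configured limit. The sketch below uses a plain iterator dot product and hard-coded limits to illustrate the pattern; it is not the contents of tests/golden_trace_validation.rs.

use std::time::Instant;

// Illustrative budget assertion; the operation, data size, and 2 ms limit
// are placeholders for the real golden-trace checks.
#[test]
fn dot_product_stays_under_time_budget() {
    let a = vec![1.0f32; 10_000];
    let b = vec![2.0f32; 10_000];

    let start = Instant::now();
    let dot: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
    let elapsed = start.elapsed();

    assert!(dot > 0.0); // keep the computation from being optimized away
    assert!(
        elapsed.as_secs_f64() * 1e3 < 2.0,
        "exceeded the 2 ms budget: {elapsed:?}"
    );
}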

Future: Hybrid Scheduling (v2.0)

Current API forces a single backend per operation. Future hybrid scheduling will:

  1. Profile operation characteristics at runtime
  2. Dynamically select backend based on actual compute time
  3. Batch GPU operations to amortize transfer overhead
  4. Overlap CPU and GPU work for pipeline parallelism

Example future API:

let scheduler = HybridScheduler::new()
    .prefer_simd_threshold_ms(5.0)  // Use SIMD if op <5ms
    .gpu_batch_window_ms(10.0);     // Batch GPU ops within 10ms

scheduler.execute_pipeline(|pipe| {
    let a = pipe.add(&x, &y);       // SIMD (fast)
    let b = pipe.dot(&a, &z);       // SIMD (fast)
    let c = pipe.matmul(&b, &w);    // GPU (queued)
    let d = pipe.matmul(&c, &v);    // GPU (batched!)
    d
});

Recommendations Summary

For Vector Operations

  1. Always use SIMD - Zero overhead, 2-4x speedup
  2. Never use GPU - 2000x+ slower due to transfer overhead
  3. Use scalar only for <100 elements or debugging

For Matrix Operations

  1. Use SIMD for matrices <500×500
  2. Use GPU for matrices ≥500×500 (16-81x speedup)
  3. Consider batching multiple GPU operations in future

General Guidelines

  • Latency-critical: Always SIMD (microsecond-scale)
  • Throughput-critical: GPU for large batches, SIMD otherwise
  • Portable: SIMD works everywhere (x86, ARM, WASM)
  • Maximum performance: Profile and choose dynamically

References

  • GPU Performance - Detailed GPU benchmarks (RTX 4090)
  • SIMD Performance - SIMD optimization techniques
  • Benchmarks Overview - Complete benchmark methodology
  • Full report: docs/gpu-benchmark-report-2025-11-23.md
  • Golden traces: golden_traces/ANALYSIS.md
  • Configuration: renacer.toml