Backend Comparison

This chapter compares Trueno's three execution backends (Scalar, SIMD, GPU) across different operation types and workload sizes, providing guidance on when to use each.

Backend Overview

| Backend | Availability | Typical Speedup | Best Use Case |
|---------|--------------|-----------------|---------------|
| Scalar | All platforms | 1x (baseline) | Small workloads, reference implementation |
| SIMD | x86_64 (SSE2+), ARM (NEON), WASM | 2-4x | Most operations, <1M elements |
| GPU | Vulkan/Metal/DX12 systems | 10-80x | Large matrix ops (>500×500) |

Decision Matrix

Use this table to choose the optimal backend for your workload:

| Operation Type | Size Range | Recommended Backend | Expected Speedup |
|----------------|------------|---------------------|------------------|
| Vector Add/Mul | Any | SIMD | 1.1-1.3x |
| Dot Product | <1M | SIMD | 3-4x |
| Dot Product | >1M | SIMD | 3-4x |
| Matrix Mul | <500×500 | SIMD | 2-4x |
| Matrix Mul | 500×500-1000×1000 | GPU | 16-81x |
| Matrix Mul | >1000×1000 | GPU | 80x+ |
| Activations (ReLU, Sigmoid) | Any | SIMD | 1.2-7x |
| Reductions (Sum, Max) | Any | SIMD | 3-4x |
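
For illustration, the matrix above can be collapsed into a simple size check. The sketch below is hypothetical (`Backend` and `choose_backend` are not part of Trueno's API), but it encodes the same decisions:

// Hypothetical backend selector encoding the decision matrix above.
// `n` is the element count for vector ops, or the square dimension for matmul.
enum Backend {
    Scalar,
    Simd,
    Gpu,
}

fn choose_backend(is_matmul: bool, n: usize) -> Backend {
    if is_matmul && n >= 500 {
        Backend::Gpu // O(n³) work amortizes the ~3.5 ms transfer overhead
    } else if n < 100 {
        Backend::Scalar // workload too small for SIMD setup to pay off
    } else {
        Backend::Simd // default choice for everything else
    }
}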

Scalar Backend

Characteristics

  • Pros:

    • Zero overhead
    • Simple, maintainable code
    • Predictable performance
    • Works everywhere
  • Cons:

    • No parallelism
    • Slowest for compute-heavy operations

When to Use Scalar

  • Reference implementation for correctness testing (see the test sketch after this list)
  • Platforms without SIMD support (rare)
  • Debugging (simpler code paths)
  • Very small workloads (<100 elements) where SIMD overhead dominates
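
The correctness-testing role can be sketched as a simple oracle check: compute the result with a plain scalar loop and require an optimized path to match it within a small tolerance (floating-point reassociation means the results differ slightly). The code below is illustrative only; dot_reordered stands in for a SIMD kernel and none of the names are Trueno's test code.

// Scalar reference used as the oracle.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Stand-in for a SIMD kernel: four independent accumulators, mirroring
// a 4-lane reduction, so the summation order differs from the reference.
fn dot_reordered(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    for (i, (x, y)) in a.iter().zip(b).enumerate() {
        acc[i % 4] += x * y;
    }
    acc.iter().sum()
}

#[test]
fn optimized_path_matches_scalar_reference() {
    let a: Vec<f32> = (0..10_000).map(|i| (i as f32).sin()).collect();
    let b: Vec<f32> = (0..10_000).map(|i| (i as f32).cos()).collect();
    let reference = dot_scalar(&a, &b);
    let optimized = dot_reordered(&a, &b);
    assert!((reference - optimized).abs() <= 1e-2 * reference.abs().max(1.0));
}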

Performance

| Operation | Size | Time | Throughput |
|-----------|------|------|------------|
| Vector Add | 10K | 819 ns | 12.21 Gelem/s |
| Dot Product | 10K | 6.30 µs | 1.59 Gelem/s |
| Matrix Mul | 1000×1000 | 638.7 ms | 1.57 Gelem/s |

SIMD Backend

Characteristics

  • Pros:

    • Zero transfer overhead
    • 2-4x speedup for most operations
    • Low latency (<10µs for typical ops)
    • Works on all modern CPUs
  • Cons:

    • Limited parallelism (4-8 elements)
    • Complex implementation
    • Platform-specific code

SIMD Instruction Sets

| ISA | Register Width | Elements (f32) | Availability |
|-----|----------------|----------------|--------------|
| SSE2 | 128-bit | 4 | All x86_64 CPUs |
| AVX | 256-bit | 8 | Intel Sandy Bridge+ (2011+) |
| AVX2 | 256-bit + FMA | 8 | Intel Haswell+ (2013+) |
| AVX-512 | 512-bit | 16 | Intel Skylake-X+ (2017+), AMD Zen 4+ (2022+) |
| NEON | 128-bit | 4 | All ARM64 CPUs |
| SIMD128 | 128-bit | 4 | Modern browsers (WASM) |
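
At runtime, the widest available ISA can be probed with the standard library's feature-detection macro. The snippet below is a general sketch of that pattern (not Trueno's actual detection code):

// Report the number of f32 lanes the widest detected x86_64 ISA provides.
// A sketch using std's is_x86_feature_detected!; Trueno's own probe may differ.
#[cfg(target_arch = "x86_64")]
fn simd_lanes_f32() -> usize {
    if is_x86_feature_detected!("avx512f") {
        16 // 512-bit registers
    } else if is_x86_feature_detected!("avx2") {
        8 // 256-bit registers, FMA typically available alongside
    } else if is_x86_feature_detected!("avx") {
        8 // 256-bit registers
    } else {
        4 // SSE2 baseline, guaranteed on x86_64
    }
}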

SIMD Performance (SSE2)

From golden traces (golden_traces/performance_demo_summary.txt):

| Operation | Size | Scalar | SIMD (SSE2) | Speedup | Runtime | Syscalls |
|-----------|------|--------|-------------|---------|---------|----------|
| Dot Product | 10K | 6.26 µs | 1.55 µs | 4.0x | 1.507 ms | 138 |
| Sum Reduction | 10K | 7.12 µs | 1.69 µs | 4.2x | 1.507 ms | 138 |
| Max Finding | 10K | 4.19 µs | 1.06 µs | 4.0x | 1.507 ms | 138 |
| Element-wise Add | 10K | 1.44 µs | 1.10 µs | 1.3x | 1.507 ms | 138 |
| Element-wise Mul | 10K | 1.10 µs | 1.10 µs | 1.0x | 1.507 ms | 138 |

Why SIMD Excels

Zero-overhead architecture:

  • No data transfer (operates directly on CPU cache)
  • No synchronization (single-threaded execution)
  • Immediate execution (no queuing or dispatch)

Optimal for:

  • ✅ Reduction operations (dot, sum, max): Parallel accumulation (sketched below)
  • ✅ Compute-intensive ops (tanh, sigmoid): Amortizes instruction overhead
  • ⚠️ Memory-bound ops (add, mul): Limited by RAM bandwidth, not compute
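
At the instruction level, the reduction case looks like the sketch below: each 128-bit SSE2 register carries four partial sums that are combined only once at the end. This is a minimal illustration of the technique, assuming SSE2, and is not Trueno's implementation.

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Illustrative SSE2 dot product: four partial sums per 128-bit register.
#[cfg(target_arch = "x86_64")]
unsafe fn dot_sse2(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = _mm_setzero_ps();
    let chunks = a.len() / 4;
    for i in 0..chunks {
        let va = _mm_loadu_ps(a.as_ptr().add(i * 4));
        let vb = _mm_loadu_ps(b.as_ptr().add(i * 4));
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb)); // 4 multiply-adds per iteration
    }
    // Horizontal sum of the four lanes.
    let mut lanes = [0.0f32; 4];
    _mm_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths not divisible by 4.
    for i in chunks * 4..a.len() {
        sum += a[i] * b[i];
    }
    sum
}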

GPU Backend

Characteristics

  • Pros:

    • Massive parallelism (thousands of cores)
    • 80x+ speedup for large matrix operations
    • Excellent for O(n³) algorithms
  • Cons:

    • 3.5ms fixed overhead per operation
    • Requires PCIe transfer (CPU↔GPU)
    • Only beneficial for large workloads
    • Not always available

GPU Transfer Overhead

Critical limitation: every GPU operation incurs a fixed overhead of roughly 3.5 ms:

| Component | Time | Description |
|-----------|------|-------------|
| Buffer creation | 0.5 ms | Allocate GPU-side memory |
| CPU→GPU transfer | 1.5 ms | PCIe bandwidth limitation |
| Kernel dispatch | 0.3 ms | GPU scheduling |
| GPU→CPU readback | 1.2 ms | PCIe bandwidth limitation |
| Total | 3.5 ms | Minimum per operation |

GPU Performance (RTX 4090)

Vector operations (❌ GPU fails):

| Operation | Size | GPU Time | SIMD Time | Verdict |
|-----------|------|----------|-----------|---------|
| Vector Add | 10K | 3.44 ms | 1.10 µs | SIMD 3127x faster |
| Dot Product | 10K | 3.32 ms | 1.55 µs | SIMD 2144x faster |
| ReLU | 1M | 6.03 ms | 67.1 µs | SIMD 90x faster |
| Sigmoid | 1M | 5.81 ms | 3.18 ms | SIMD 1.8x faster |

Matrix operations (✅ GPU wins):

| Size | GPU Time | Scalar Time | Speedup |
|------|----------|-------------|---------|
| 100×100 | 4.14 ms | 530.8 µs | 0.13x ❌ |
| 500×500 | 4.59 ms | 77.4 ms | 16.9x |
| 1000×1000 | 7.84 ms | 638.7 ms | 81.5x |

Why GPU Fails for Vector Operations

Transfer overhead dominates:

  • 10K vector add: 1.1 µs compute vs 3500 µs transfer → transfer costs 3182x the compute time
  • 1M vector add: 96.5 µs compute vs 3500 µs transfer → transfer costs 36x the compute time

Even compute-heavy ops suffer:

  • 1M sigmoid: 3.18ms compute vs 3.5ms transfer → Barely competitive

Why GPU Wins for Matrix Operations

O(n³) complexity overwhelms transfer cost:

  • 500×500 matmul: 125M multiply-adds → 77 ms scalar → GPU wins at 4.6 ms (16.9x speedup)
  • 1000×1000 matmul: 1B multiply-adds → 639 ms scalar → GPU wins at 7.8 ms (81.5x speedup)

GPU becomes competitive when: compute_time_scalar > 10 × transfer_overhead

For matrix multiplication:

  • 500×500: 77ms compute >> 3.5ms transfer → GPU wins
  • 100×100: 531µs compute << 3.5ms transfer → GPU loses
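
The break-even rule can be turned into a back-of-the-envelope estimate: approximate the scalar matmul time from the measured ~1.57 Gelem/s throughput and compare it to ten times the ~3.5 ms transfer overhead. The function below is a sketch under those two assumptions, not a profiler:

// Rough break-even check for offloading an n×n matmul to the GPU.
// Assumes ~1.57e9 scalar multiply-adds per second and a ~3.5 ms fixed
// transfer cost (both taken from the benchmarks above).
fn gpu_worth_it(n: usize) -> bool {
    const SCALAR_OPS_PER_SEC: f64 = 1.57e9;
    const TRANSFER_MS: f64 = 3.5;

    let ops = (n as f64).powi(3); // n³ multiply-adds
    let scalar_ms = ops / SCALAR_OPS_PER_SEC * 1e3;
    scalar_ms > 10.0 * TRANSFER_MS
}

// gpu_worth_it(100)  -> false  (~0.6 ms of scalar work)
// gpu_worth_it(500)  -> true   (~80 ms of scalar work)
// gpu_worth_it(1000) -> true   (~637 ms of scalar work)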

Backend Comparison by Operation Type

Element-Wise Operations (add, mul, scale)

| Backend | Typical Time (10K) | Speedup vs Scalar | Verdict |
|---------|--------------------|-------------------|---------|
| Scalar | 800 ns | 1.0x | Baseline |
| SIMD | 600 ns | 1.3x | ✅ Use SIMD |
| GPU | 3400 µs | 0.0002x | ❌ Never use GPU |

Recommendation: Always use SIMD. These operations are memory-bound, so the gain is modest, but SIMD adds no overhead.

Reduction Operations (dot, sum, max)

| Backend | Typical Time (10K) | Speedup vs Scalar | Verdict |
|---------|--------------------|-------------------|---------|
| Scalar | 6.3 µs | 1.0x | Baseline |
| SIMD | 1.5 µs | 4.0x | ✅ Use SIMD |
| GPU | 3320 µs | 0.002x | ❌ Never use GPU |

Recommendation: Always use SIMD. Excellent parallel accumulation, zero overhead.

Activation Functions (relu, sigmoid, tanh)

| Backend | Typical Time (1M) | Speedup vs Scalar | Verdict |
|---------|-------------------|-------------------|---------|
| Scalar (ReLU) | 67.1 µs | 1.0x | Baseline |
| SIMD (ReLU) | ~20 µs | ~3x | ✅ Use SIMD |
| GPU (ReLU) | 6030 µs | 0.011x | ❌ Never use GPU |
| Scalar (Sigmoid) | 3.18 ms | 1.0x | Baseline |
| SIMD (Sigmoid) | ~1 ms | ~3x | ✅ Use SIMD |
| GPU (Sigmoid) | 5.81 ms | 0.55x | ❌ Never use GPU |

Recommendation: Always use SIMD, even for compute-heavy activations.

Matrix Multiplication

| Backend | Time (1000×1000) | Speedup vs Scalar | Verdict |
|---------|------------------|-------------------|---------|
| Scalar | 638.7 ms | 1.0x | Baseline |
| SIMD | ~160 ms | ~4x | ✅ Use for <500×500 |
| GPU | 7.84 ms | 81.5x | ✅ Use for >500×500 |

Recommendation: Use GPU for matrices >500×500, otherwise SIMD.

Threshold Guidelines

Current Trueno Thresholds

// Vector operations (src/vector.rs:1316)
const GPU_THRESHOLD: usize = usize::MAX; // GPU DISABLED

// Matrix operations (src/matrix.rs:268)
const GPU_THRESHOLD: usize = 500; // 500×500 minimum

Size-Based Recommendations

| Workload Size | Vector Ops | Matrix Ops | Rationale |
|---------------|------------|------------|-----------|
| <100 | Scalar/SIMD | Scalar/SIMD | SIMD overhead marginal |
| 100-1K | SIMD | SIMD | Sweet spot for SIMD |
| 1K-100K | SIMD | SIMD | SIMD still optimal |
| 100K-500×500 | SIMD | SIMD | GPU overhead too high |
| 500×500-1000×1000 | SIMD | GPU | O(n³) amortizes overhead |
| >1000×1000 | SIMD | GPU | Massive compute dominates |

Operation Complexity Classes

Trueno categorizes operations by complexity:

pub enum OpComplexity {
    Low,    // Simple ops: add, mul (GPU disabled)
    Medium, // Moderate: dot, reduce (GPU at 100K+)
    High,   // Complex: matmul, conv2d (GPU at 500×500+)
}
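
The class determines how large a workload must be before GPU dispatch is considered. A hypothetical mapping, using the OpComplexity enum above, might look like the following (the function name and exact numbers are illustrative; the real thresholds live next to the operations in Trueno's source):

// Minimum workload size per complexity class before the GPU is attempted.
// Returns None when the GPU path is disabled outright.
fn gpu_threshold(complexity: OpComplexity) -> Option<usize> {
    match complexity {
        OpComplexity::Low => None,             // add/mul: transfer cost always dominates
        OpComplexity::Medium => Some(100_000), // dot/reduce: element count
        OpComplexity::High => Some(500),       // matmul/conv2d: square matrix dimension
    }
}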

Performance Validation

Golden Trace Baselines

Performance budgets in renacer.toml ensure SIMD doesn't regress:

[performance.budgets]
backend_detection = { max_time_ms = 2.0, max_syscalls = 200 }
matrix_operations = { max_time_ms = 2.0, max_syscalls = 200 }
activation_functions = { max_time_ms = 2.0, max_syscalls = 200 }

Each budgeted workload must complete in under 2 ms with fewer than 200 syscalls.

Validation Tests

tests/golden_trace_validation.rs ensures:

  • SIMD performance matches golden traces (±10%)
  • No unexpected syscall patterns
  • Runtime stays under budget
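
A budget check of this shape can be written as an ordinary timed test: run the operation, then assert that the elapsed time stays under the configured limit. The sketch below uses a plain iterator dot product and hard-coded limits to illustrate the pattern; it is not the contents of tests/golden_trace_validation.rs.

use std::time::Instant;

// Illustrative budget assertion; the operation, data size, and 2 ms limit
// are placeholders for the real golden-trace checks.
#[test]
fn dot_product_stays_under_time_budget() {
    let a = vec![1.0f32; 10_000];
    let b = vec![2.0f32; 10_000];

    let start = Instant::now();
    let dot: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
    let elapsed = start.elapsed();

    assert!(dot > 0.0); // keep the computation from being optimized away
    assert!(
        elapsed.as_secs_f64() * 1e3 < 2.0,
        "exceeded the 2 ms budget: {elapsed:?}"
    );
}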

Future: Hybrid Scheduling (v2.0)

Current API forces a single backend per operation. Future hybrid scheduling will:

  1. Profile operation characteristics at runtime
  2. Dynamically select backend based on actual compute time
  3. Batch GPU operations to amortize transfer overhead
  4. Overlap CPU and GPU work for pipeline parallelism

Example future API:

let scheduler = HybridScheduler::new()
    .prefer_simd_threshold_ms(5.0)  // Use SIMD if op <5ms
    .gpu_batch_window_ms(10.0);     // Batch GPU ops within 10ms

scheduler.execute_pipeline(|pipe| {
    let a = pipe.add(&x, &y);       // SIMD (fast)
    let b = pipe.dot(&a, &z);       // SIMD (fast)
    let c = pipe.matmul(&b, &w);    // GPU (queued)
    let d = pipe.matmul(&c, &v);    // GPU (batched!)
    d
});

Recommendations Summary

For Vector Operations

  1. Always use SIMD - Zero overhead, 2-4x speedup
  2. Never use GPU - 2000x+ slower due to transfer overhead
  3. Use scalar only for <100 elements or debugging

For Matrix Operations

  1. Use SIMD for matrices <500×500
  2. Use GPU for matrices ≥500×500 (16-81x speedup)
  3. Consider batching multiple GPU operations in future

General Guidelines

  • Latency-critical: Always SIMD (microsecond-scale)
  • Throughput-critical: GPU for large batches, SIMD otherwise
  • Portable: SIMD works everywhere (x86, ARM, WASM)
  • Maximum performance: Profile and choose dynamically

References

  • GPU Performance - Detailed GPU benchmarks (RTX 4090)
  • SIMD Performance - SIMD optimization techniques
  • Benchmarks Overview - Complete benchmark methodology
  • Full report: docs/gpu-benchmark-report-2025-11-23.md
  • Golden traces: golden_traces/ANALYSIS.md
  • Configuration: renacer.toml