Benchmarks Overview

This chapter presents comprehensive benchmark results for Trueno across different backends and workload sizes.

Latest Benchmark Results

Date: 2025-11-18
Platform: x86_64 Linux (AVX2-capable)
Compiler: rustc 1.83 (release mode, opt-level=3, LTO=true)
Tool: Criterion.rs (statistical benchmarking)

Executive Summary

Trueno's SIMD backends deliver 2-8x speedups on reduction and compute-intensive operations, while simple element-wise operations gain only a few percent because they are memory-bound. The GPU backend pays off only for large matrix operations.

Key Findings

  • Average speedup: 178.5% across all operations
  • Best speedup: 8.8x (tanh activation, AVX2, 100 elements)
  • Operations meeting ≥10% target: 66.7%
  • Reduction operations: 2-4x speedup (dot, sum, max)
  • Activation functions: 1.2-8.8x speedup (relu, tanh)
  • Element-wise ops: 1.03-1.15x speedup, i.e. 3-15% (varies by operation and size)

Benchmark Results by Operation

Reduction Operations (Excellent Performance)

Reduction operations show exceptional SIMD performance due to parallel accumulation:

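| Operation | Size  | Scalar (ns) | SSE2 (ns) | AVX2 (ns) | SSE2 Speedup | AVX2 Speedup |
|-----------|-------|-------------|-----------|-----------|--------------|--------------|
| dot       | 100   | 36.11       | 10.79     | -         | 3.3x         | -            |
| dot       | 1000  | 574.92      | 130.79    | -         | 4.4x         | -            |
| dot       | 10000 | 6126.80     | 1475.60   | -         | 4.2x         | -            |
| sum       | 100   | 32.77       | 10.53     | -         | 3.1x         | -            |
| sum       | 1000  | 575.20      | 138.60    | -         | 4.2x         | -            |
| sum       | 10000 | 5883.10     | 1491.00   | -         | 3.9x         | -            |
| max       | 100   | 26.57       | 6.86      | -         | 3.9x         | -            |
| max       | 1000  | 395.04      | 88.24     | -         | 4.5x         | -            |
| max       | 10000 | 4193.30     | 1033.90   | -         | 4.1x         | -            |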

Why reduction operations excel:

  • Combines multiple operations in SIMD lanes (4-8 parallel accumulations)
  • No memory write bottleneck (single scalar result)
  • Horizontal reduction is highly optimized
  • Minimal overhead from setup/cleanup
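
The points above can be sketched in portable Rust. This is a minimal illustration of the multi-accumulator pattern, not Trueno's actual implementation: the function name `dot_lanes` and the lane count of 8 (matching eight f32 values in a 256-bit register) are assumptions made for the example. Each lane accumulates independently, the horizontal reduction happens once after the main loop, and leftover elements go through a scalar cleanup pass.

```rust
/// Dot product with eight independent accumulators, mirroring the eight
/// f32 lanes of a 256-bit register. Illustrative sketch only.
pub fn dot_lanes(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    const LANES: usize = 8; // assumption: AVX2-style width

    let mut acc = [0.0f32; LANES];
    let a_chunks = a.chunks_exact(LANES);
    let b_chunks = b.chunks_exact(LANES);
    // Elements that do not fill a whole chunk are handled after the loop.
    let (a_tail, b_tail) = (a_chunks.remainder(), b_chunks.remainder());

    for (ca, cb) in a_chunks.zip(b_chunks) {
        // Each lane accumulates on its own: no cross-lane dependency.
        for ((slot, x), y) in acc.iter_mut().zip(ca).zip(cb) {
            *slot += x * y;
        }
    }

    // Horizontal reduction: done once, producing a single scalar result.
    let mut total: f32 = acc.iter().sum();

    // Scalar cleanup for the remainder.
    for (x, y) in a_tail.iter().zip(b_tail) {
        total += x * y;
    }
    total
}
```

Because the accumulators are summed in a different order than a plain left-to-right loop, results can differ from a scalar sum in the last bits for general floating-point inputs.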

Activation Functions (Good to Excellent Performance)

Activation functions benefit from SIMD, especially for compute-intensive operations:

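| Operation | Size | Scalar (ns) | SSE2 (ns) | AVX2 (ns) | SSE2 Speedup | AVX2 Speedup |
|-----------|------|-------------|-----------|-----------|--------------|--------------|
| tanh      | 100  | 891         | 137       | 101       | 6.5x         | 8.8x         |
| tanh      | 1000 | 8000        | 1080      | -         | 7.4x         | -            |
| relu      | 100  | 54.1        | 44.8      | 49.3      | 1.21x        | 1.10x        |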

Why activation functions perform well:

  • Compute-intensive (tanh requires exp calculations)
  • SIMD processes 4-8 elements in parallel
  • No data dependencies between elements
  • AVX2 benefits from wider registers (8 f32 vs 4 for SSE2)

Element-Wise Operations (Mixed Performance)

Element-wise operations show variable performance, often limited by memory bandwidth:

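| Operation | Size  | Scalar (ns) | SSE2 (ns) | AVX2 (ns) | SSE2 Speedup | AVX2 Speedup |
|-----------|-------|-------------|-----------|-----------|--------------|--------------|
| add       | 100   | 46.89       | 42.50     | -         | 1.10x        | -            |
| add       | 1000  | 124.91      | 121.51    | -         | 1.03x        | -            |
| add       | 10000 | 1098.60     | 1044.60   | -         | 1.05x        | -            |
| mul       | 100   | 41.03       | 38.75     | -         | 1.06x        | -            |
| mul       | 1000  | 119.03      | 112.86    | -         | 1.05x        | -            |
| mul       | 10000 | 1029.10     | 1064.30   | -         | 0.97x ❌     | -            |
| scale     | 100   | 43.9        | 41.8      | 39.6      | 1.05x        | 1.11x        |
| scale     | 1000  | 104         | 111       | 90.8      | 0.94x        | 1.15x        |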

Why element-wise ops show limited speedups:

  • Memory bandwidth bottleneck: Simple operations (add, mul) are memory-bound, not compute-bound
  • Cache effects: Small workloads fit in L1 cache, scalar loop is efficient
  • Large workloads: Both scalar and SIMD become memory-bound
  • Overhead: SIMD setup/cleanup costs hurt small workloads (<1000 elements)
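
One way to see the data-movement limit is to convert a timing into effective traffic. The helper below is a rough sketch under a stated assumption: an element-wise f32 `add` reads two operands and writes one result, so it moves 12 bytes per element. The function name `effective_gb_per_s` is hypothetical and not part of Trueno.

```rust
/// Effective data traffic in GB/s for an element-wise f32 operation.
/// Assumption: two 4-byte loads and one 4-byte store per element.
pub fn effective_gb_per_s(elements: usize, time_ns: f64) -> f64 {
    const BYTES_PER_ELEMENT: f64 = 3.0 * 4.0;
    // Bytes per nanosecond is numerically equal to GB/s (1e9 B / 1e9 ns).
    (elements as f64 * BYTES_PER_ELEMENT) / time_ns
}
```

Applied to the `add` row at 10000 elements, the scalar time (1098.60 ns) and the SSE2 time (1044.60 ns) give effective rates within about 5% of each other, which is what one would expect if both loops are limited by moving data rather than by arithmetic.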

Performance by Backend

SSE2 (128-bit SIMD)

Availability: Guaranteed on all x86_64 CPUs
Register width: 128 bits (4 × f32 or 2 × f64)
Typical speedup: 2-4x for reduction ops, 1.05-1.15x for element-wise

Best operations:

  • ✅ Reduction (dot, sum, max): 3-4.5x
  • ✅ Activation functions (tanh, relu): 1.2-7.4x
  • ⚠️ Element-wise (add, mul): 0.97-1.10x

Limitations:

  • Limited to 4-way parallelism
  • Some operations (div, sigmoid) show regressions
  • Memory bandwidth limited for large workloads

AVX2 (256-bit SIMD)

Availability: Intel Haswell and later (2013+), AMD Zen and later (2017+)
Register width: 256 bits (8 × f32 or 4 × f64)
Typical speedup: 1.10-1.15x for element-wise; 4-8x expected for reduction ops (not yet benchmarked)

Best operations:

  • ✅ Activation functions (tanh): 8.8x
  • ✅ Scalar operations (scale): 1.15x
  • ✅ Reduction (expected 2x over SSE2, not yet benchmarked)

Advantages over SSE2:

  • 2x wider registers (8 vs 4 elements)
  • FMA (fused multiply-add) instructions
  • Better memory bandwidth utilization

GPU (WebGPU via wgpu)

Availability: Systems with Vulkan/Metal/DX12 support
Typical speedup: 16-81x for large matrix operations (>500×500)

IMPORTANT: Benchmarks on an RTX 4090 show a fixed transfer overhead of about 3.5 ms per GPU operation, which makes the GPU slower than SIMD for vector operations at all sizes.

GPU Performance Summary (2025-11-23, RTX 4090):

  • Matrix multiplication: 81x speedup on 1000×1000
  • Vector operations: 2000x+ slower than SIMD due to transfer overhead
  • 🎯 Recommendation: GPU only for matrix ops >500×500, otherwise use SIMD

Current Thresholds:

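| Workload Type         | Size Range | Recommended Backend |
|-----------------------|------------|---------------------|
| Vector operations     | Any        | SIMD (GPU disabled) |
| Matrix multiplication | <500×500   | SIMD                |
| Matrix multiplication | ≥500×500   | GPU                 |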

GPU Transfer Overhead: ~3.5ms per operation for CPU↔GPU↔CPU transfer
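
The threshold table can be read as a small dispatch rule. The sketch below encodes it under stated assumptions: the types `Backend` and `Workload`, the constant `GPU_MATMUL_MIN_DIM`, and the function `choose_backend` are hypothetical names invented for illustration, and "≥500×500" is interpreted here as both dimensions being at least 500. Trueno's real dispatch logic may differ.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Simd,
    Gpu,
}

#[derive(Debug, Clone, Copy)]
pub enum Workload {
    Vector { len: usize },
    MatMul { rows: usize, cols: usize },
}

/// Assumed reading of the "≥500×500" threshold: each dimension >= 500.
pub const GPU_MATMUL_MIN_DIM: usize = 500;

/// Picks a backend following the threshold table (illustrative only).
pub fn choose_backend(workload: Workload, gpu_available: bool) -> Backend {
    match workload {
        // Vector ops never go to the GPU: the fixed transfer cost dominates.
        Workload::Vector { .. } => Backend::Simd,
        Workload::MatMul { rows, cols }
            if gpu_available
                && rows >= GPU_MATMUL_MIN_DIM
                && cols >= GPU_MATMUL_MIN_DIM =>
        {
            Backend::Gpu
        }
        // Small matrices, or no GPU present: stay on the CPU.
        Workload::MatMul { .. } => Backend::Simd,
    }
}
```

Keeping the threshold in a single constant makes it easy to retune if benchmarks on other hardware show a different crossover point.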

See GPU Performance for detailed RTX 4090 benchmark results and analysis.

Performance by Workload Size

Small (100 elements)

Recommended backend: SSE2 or Scalar
SIMD benefit: 5-10% for most ops, 1.2-6.5x for activation/reduction

At small sizes, SIMD overhead (setup, remainder handling) can exceed benefits for simple operations.

Medium (1K-10K elements)

Recommended backend: SSE2/AVX2
SIMD benefit: 1.03-4.4x depending on operation

Sweet spot for SIMD: large enough to amortize overhead, small enough to avoid memory bottlenecks.

Large (100K+ elements)

Recommended backend: AVX2 for vector operations; GPU (if available) only for matrix operations ≥500×500
SIMD benefit: 1-4x (memory-bound for simple ops, good for reductions)

At large sizes:

  • Element-wise ops become memory-bound
  • Reduction ops still benefit from SIMD
  • GPU provides the best performance only when the work amortizes the ~3.5 ms transfer overhead (large matrix operations, not vector operations)

Benchmark Methodology

Tool: Criterion.rs

All benchmarks use Criterion.rs for statistical rigor:

  • Samples: 100 per benchmark
  • Warmup: 3 seconds
  • Measurement: 5 seconds
  • Outlier detection: Automated
  • Statistical analysis: Mean, median, standard deviation

Test Data

  • Sequential floats: (i as f32) * 0.5
  • Workload sizes: 100, 1000, 10000, 100000 elements
  • Backend comparison: Scalar vs SSE2 vs AVX2 vs GPU
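
The input pattern is simple enough to reproduce directly. A minimal sketch, with `sequential_input` and `WORKLOAD_SIZES` as hypothetical names chosen for this example:

```rust
/// Workload sizes listed above.
pub const WORKLOAD_SIZES: [usize; 4] = [100, 1_000, 10_000, 100_000];

/// Builds the sequential test vector: element i has the value i * 0.5.
pub fn sequential_input(len: usize) -> Vec<f32> {
    (0..len).map(|i| (i as f32) * 0.5).collect()
}
```

Deterministic input keeps runs comparable across backends, at the cost of not exercising data-dependent behavior such as branch-heavy code paths.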

Environment

  • CPU: x86_64 with AVX2 support
  • RAM: 16GB+ (prevents swapping)
  • Compiler flags: -C opt-level=3 -C lto=true -C codegen-units=1
  • CPU affinity: Pinned to single core (reduces variance)
  • Background processes: Minimized

Quality Standards

Every benchmark must meet these criteria:

  1. Coefficient of Variation (CV) < 5% - Consistent results across runs
  2. No regressions >5% - SIMD should not be slower than scalar
  3. Statistical significance - 100+ samples for reliable mean/median
  4. Baseline comparison - Always compare against scalar implementation
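
The first criterion can be checked mechanically. The sketch below computes the coefficient of variation as sample standard deviation divided by mean; the function names are hypothetical, and Criterion.rs reports its own statistics, so this is only meant to show what the 5% threshold measures.

```rust
/// Coefficient of variation: sample standard deviation divided by the mean.
/// Returns None when it is undefined (fewer than 2 samples or a zero mean).
pub fn coefficient_of_variation(samples: &[f64]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    // Unbiased sample variance (divides by n - 1).
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some(variance.sqrt() / mean)
}

/// True when the samples satisfy the CV < 5% criterion.
pub fn meets_cv_target(samples: &[f64]) -> bool {
    matches!(coefficient_of_variation(samples), Some(cv) if cv < 0.05)
}
```

For example, timings of 98, 99, 100, 101 and 102 ns have a CV of about 1.6% and pass, while 50, 100 and 150 ns have a CV of 50% and fail.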

Interpreting Results

Speedup calculation: (scalar_time / simd_time)

| Speedup  | Status        | Interpretation                    |
|----------|---------------|-----------------------------------|
| ≥2.0x    | ✅ Excellent  | SIMD delivers significant value   |
| 1.5-2.0x | ✅ Good       | SIMD worth the complexity         |
| 1.1-1.5x | ⚠️ Marginal   | Consider simpler scalar code      |
| 1.0-1.1x | ⚠️ Minimal    | SIMD overhead may not be worth it |
| <1.0x    | ❌ Regression | Fix implementation or use scalar  |
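
The speedup formula and the status bands translate into a few lines of code. In this sketch the names `speedup`, `classify` and `Status` are hypothetical. The table's ranges share their endpoints, so the example assumes each boundary value belongs to the better band (for instance, exactly 1.5x counts as Good).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Excellent,
    Good,
    Marginal,
    Minimal,
    Regression,
}

/// Speedup as defined above: scalar time divided by SIMD time.
pub fn speedup(scalar_ns: f64, simd_ns: f64) -> f64 {
    scalar_ns / simd_ns
}

/// Maps a speedup ratio onto the status bands of the table.
/// Assumption: a value on a boundary falls into the higher band.
pub fn classify(ratio: f64) -> Status {
    if ratio >= 2.0 {
        Status::Excellent
    } else if ratio >= 1.5 {
        Status::Good
    } else if ratio >= 1.1 {
        Status::Marginal
    } else if ratio >= 1.0 {
        Status::Minimal
    } else {
        Status::Regression
    }
}
```

With the figures reported earlier, `dot` at 100 elements (36.11 ns scalar, 10.79 ns SSE2) classifies as Excellent, while `mul` at 10000 elements (1029.10 ns scalar, 1064.30 ns SSE2) classifies as Regression.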

Reproducing Benchmarks

Run all benchmarks:

cargo bench --bench vector_ops

Run specific operation:

cargo bench --bench vector_ops -- dot

Generate HTML report:

cargo bench --bench vector_ops
xdg-open target/criterion/report/index.html  # on macOS, use `open` instead

Compare against baseline:

# Save current results as baseline
cargo bench --bench vector_ops -- --save-baseline main

# Make changes, then compare
cargo bench --bench vector_ops -- --baseline main

Next Steps