SIMD Performance Analysis

Date: 2025-11-18
System: x86_64 Linux (AVX2-capable)
Benchmark Tool: Criterion.rs

This chapter provides a deep dive into Trueno's SIMD performance characteristics, analyzing when SIMD provides speedups and when it doesn't.
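Measurements of this kind are typically collected with a Criterion.rs harness along the lines of the sketch below. The kernel and benchmark names here are placeholders, not Trueno's actual bench suite or API; the SSE2 and AVX2 backends would each get their own registered benchmark.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Placeholder scalar kernel standing in for one backend; Trueno's real
// API may differ from this sketch.
fn tanh_scalar(input: &[f32], out: &mut [f32]) {
    for (o, &x) in out.iter_mut().zip(input) {
        *o = x.tanh();
    }
}

fn bench_tanh(c: &mut Criterion) {
    let input: Vec<f32> = (0..1000).map(|i| i as f32 * 0.01 - 5.0).collect();
    let mut out = vec![0.0f32; input.len()];

    // One benchmark ID per backend/size combination, e.g. "tanh/scalar/1000";
    // SSE2 and AVX2 variants would get their own bench_function calls.
    c.bench_function("tanh/scalar/1000", |b| {
        b.iter(|| tanh_scalar(black_box(&input), black_box(&mut out)))
    });
}

criterion_group!(benches, bench_tanh);
criterion_main!(benches);
```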

Executive Summary

Comprehensive benchmarking reveals mixed results across operations. While some operations show excellent SIMD speedups (tanh: 6.5-8.8x), many element-wise operations show little benefit or an outright slowdown, especially with the SSE2 backend.

Key Findings

  1. Activation functions (relu, tanh): Good to excellent SIMD speedups (1.2-8.8x)
  2. Reduction operations (dot, sum, max): Excellent SIMD speedups (3-4.5x; see the dot-product sketch after this list)
  3. Element-wise operations (add, sub, div, fma): Little or no SIMD benefit, sometimes a net slowdown
  4. SSE2 backend: Frequently slower than scalar for simple operations
  5. Small workloads (<1000 elements): SIMD overhead often exceeds benefit
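To illustrate why reductions vectorize so well, here is a minimal AVX2 dot-product sketch (illustrative only, not Trueno's implementation): the accumulator stays in a register for the entire loop, so memory traffic is one load per input element plus a single horizontal reduction at the end.

```rust
use std::arch::x86_64::*;

/// AVX2 dot-product sketch. Caller must first verify
/// is_x86_feature_detected!("avx2") and ("fma").
#[target_feature(enable = "avx2")]
#[target_feature(enable = "fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let mut acc = _mm256_setzero_ps(); // running sum stays in a register
    let mut i = 0;
    while i + 8 <= n {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        acc = _mm256_fmadd_ps(va, vb, acc); // 8 multiply-adds per iteration
        i += 8;
    }
    // One horizontal reduction at the end, plus a scalar tail.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    for j in i..n {
        sum += a[j] * b[j];
    }
    sum
}
```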

Performance by Operation Category

Excellent SIMD Performance (>5x speedup)

| Operation | Size | Scalar | SSE2 | AVX2 | SSE2 Speedup | AVX2 Speedup |
|-----------|------|--------|------|------|--------------|--------------|
| tanh | 100 | 891 ns | 137 ns | 101 ns | 6.5x | 8.8x |
| tanh | 1000 | 8.0 µs | 1.08 µs | - | 7.4x | - |

Why tanh excels (a simplified kernel sketch follows this list):

  • Compute-intensive operation (requires exp calculations)
  • SIMD processes 4-8 exponentials in parallel
  • No memory bottleneck (compute dominates)
  • AVX2's wider registers (8 lanes vs. SSE2's 4) offer up to 2x additional throughput, though the measured gain over SSE2 here is closer to 1.35x (8.8x vs. 6.5x)
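The sketch below shows what lane parallelism looks like for tanh. It uses the Padé approximation tanh(x) ≈ x·(27 + x²)/(27 + 9x²) purely to keep the example short; Trueno's actual kernel is exp-based and more accurate. The point is structural: each loop iteration produces 8 results instead of 1.

```rust
use std::arch::x86_64::*;

/// AVX2 sketch of a lane-parallel tanh using a low-accuracy Padé
/// approximation (valid only for small |x|, no clamping to ±1).
/// Caller must first check is_x86_feature_detected!("avx2").
#[target_feature(enable = "avx2")]
unsafe fn tanh_avx2_approx(input: &[f32], out: &mut [f32]) {
    let n = input.len().min(out.len());
    let c27 = _mm256_set1_ps(27.0);
    let c9 = _mm256_set1_ps(9.0);
    let mut i = 0;
    while i + 8 <= n {
        let x = _mm256_loadu_ps(input.as_ptr().add(i));
        let x2 = _mm256_mul_ps(x, x);
        // tanh(x) ≈ x·(27 + x²) / (27 + 9x²), computed on 8 lanes at once.
        let num = _mm256_mul_ps(x, _mm256_add_ps(c27, x2));
        let den = _mm256_add_ps(c27, _mm256_mul_ps(c9, x2));
        _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_div_ps(num, den));
        i += 8;
    }
    // Scalar tail for the last (len % 8) elements.
    for j in i..n {
        out[j] = input[j].tanh();
    }
}
```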

Good SIMD Performance (~1.1-1.2x speedup)

| Operation | Size | Scalar | SSE2 | AVX2 | SSE2 Speedup | AVX2 Speedup |
|-----------|------|--------|------|------|--------------|--------------|
| relu | 100 | 54.1 ns | 44.8 ns | 49.3 ns | 1.21x | 1.10x |
| scale | 100 | 43.9 ns | 41.8 ns | 39.6 ns | 1.05x | 1.11x |
| scale | 1000 | 104 ns | 111 ns | 90.8 ns | 0.94x | 1.15x |
| div | 100 | 58.3 ns | 55.7 ns | 53.3 ns | 1.05x | 1.09x |

Poor SIMD Performance (≈1x or slower)

| Operation | Size | Scalar | SSE2 | AVX2 | SSE2 Speedup | AVX2 Speedup |
|-----------|------|--------|------|------|--------------|--------------|
| sigmoid | 100 | 364 ns | 405 ns | 393 ns | 0.90x | 0.93x |
| fma | 100 | 46.8 ns | 48.8 ns | 42.8 ns | 0.96x | 1.09x |
| sub | 100 | 46.0 ns | 59.9 ns | 49.9 ns | 0.77x | 0.92x |
| div | 1000 | 142 ns | 218 ns | 142 ns | 0.65x | 1.00x |

Root Cause Analysis

1. Memory Bandwidth Bottleneck

For simple element-wise operations, loads and stores dominate the runtime. SIMD widens the arithmetic units but does not increase cache or memory bandwidth, so these kernels gain little from vectorization.
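As a rough roofline-style estimate (an illustrative calculation, not a measured Trueno figure): an element-wise f32 add performs one FLOP per element while moving 12 bytes (two 4-byte loads and one 4-byte store), so its arithmetic intensity is

\[
I_{\text{add}} = \frac{1\ \text{FLOP}}{(4 + 4 + 4)\ \text{bytes}} \approx 0.083\ \text{FLOP/byte},
\]

which is far too low to keep the vector units of a typical x86 core busy, whether they are 4 or 8 lanes wide. Compute-heavy kernels such as tanh perform many FLOPs per byte loaded, which is why they scale with SIMD width.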

2. SIMD Overhead for Small Workloads

Each call into a SIMD backend carries a fixed overhead of roughly 20-50 ns from backend dispatch, setup, alignment checks, and scalar handling of the remainder elements. For small inputs this fixed cost can rival the useful work.
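A minimal sketch of where that fixed cost lives, assuming a typical dispatch wrapper (the function names and structure are illustrative, not Trueno's actual API):

```rust
use std::arch::x86_64::*;

pub fn add(a: &[f32], b: &[f32], out: &mut [f32]) {
    // Fixed cost #1: per-call feature detection and dispatch branch.
    if is_x86_feature_detected!("avx2") {
        unsafe { add_avx2(a, b, out) }
    } else {
        add_scalar(a, b, out)
    }
}

#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = a.len().min(b.len()).min(out.len());
    let mut i = 0;
    // Main vector loop: 8 f32 lanes per iteration.
    while i + 8 <= n {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_add_ps(va, vb));
        i += 8;
    }
    // Fixed cost #2: scalar remainder loop for up to 7 trailing elements.
    for j in i..n {
        out[j] = a[j] + b[j];
    }
}

fn add_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}
```

Hoisting the feature check out of the hot path and skipping SIMD entirely for short slices (see Next Steps) are the usual mitigations.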

3. Suboptimal Implementations

Some kernels (notably SSE2 div and sigmoid) are slower than their scalar counterparts, which points to implementation issues rather than a fundamental limit and warrants further investigation.

Next Steps

  • Fix SSE2 div, sigmoid, fma, sub implementations
  • Implement adaptive backend selection (see the sketch after this list)
  • Benchmark against NumPy/PyTorch
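One possible shape for adaptive backend selection, reusing the add_avx2 kernel from the dispatch sketch above. The threshold value is a placeholder to be tuned from the benchmark data, not a measured Trueno constant:

```rust
/// Placeholder threshold; the real value would be derived from the
/// crossover points in the benchmarks above rather than hard-coded.
const SIMD_THRESHOLD: usize = 256;

pub fn add_adaptive(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = a.len().min(b.len()).min(out.len());

    // Cache the CPUID-based feature check so it is not paid on every call.
    static HAS_AVX2: std::sync::OnceLock<bool> = std::sync::OnceLock::new();
    let has_avx2 = *HAS_AVX2.get_or_init(|| is_x86_feature_detected!("avx2"));

    if n < SIMD_THRESHOLD || !has_avx2 {
        // Small or unsupported input: plain scalar loop, no SIMD setup cost.
        for i in 0..n {
            out[i] = a[i] + b[i];
        }
    } else {
        // add_avx2: the #[target_feature(enable = "avx2")] kernel from the
        // dispatch sketch in the root-cause section.
        unsafe { add_avx2(a, b, out) }
    }
}
```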