SIMD Performance Analysis
Date: 2025-11-18
System: x86_64 Linux (AVX2-capable)
Benchmark Tool: Criterion.rs
This chapter provides a deep dive into Trueno's SIMD performance characteristics, analyzing when SIMD provides speedups and when it doesn't.
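All figures in this chapter were collected with Criterion.rs. As a rough illustration of the measurement setup (a sketch only: the kernel, sizes, and names below are placeholders, not Trueno's actual benchmark suite), a scalar-versus-SIMD comparison for tanh might look like this:

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use std::hint::black_box;

// Placeholder scalar kernel; the SIMD backends would be benchmarked identically.
fn tanh_scalar(input: &[f32], out: &mut [f32]) {
    for (o, &x) in out.iter_mut().zip(input) {
        *o = x.tanh();
    }
}

fn bench_tanh(c: &mut Criterion) {
    let mut group = c.benchmark_group("tanh");
    for size in [100usize, 1_000, 10_000] {
        let input: Vec<f32> = (0..size).map(|i| i as f32 * 0.01 - 5.0).collect();
        let mut out = vec![0.0f32; size];
        group.bench_with_input(BenchmarkId::new("scalar", size), &input, |b, input| {
            b.iter(|| tanh_scalar(black_box(input), &mut out));
        });
        // The SSE2/AVX2 backends would get their own BenchmarkId entries here,
        // e.g. BenchmarkId::new("avx2", size), calling the vectorized kernel.
    }
    group.finish();
}

criterion_group!(benches, bench_tanh);
criterion_main!(benches);
```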
Executive Summary
Comprehensive benchmarking reveals mixed results across operations. Some operations show excellent SIMD speedups (tanh: 6.5-8.8x), but many element-wise operations show minimal gains or outright slowdowns, especially on the SSE2 backend.
Key Findings
- Activation functions (relu, tanh): Good to excellent SIMD speedups (1.2-8.8x)
- Reduction operations (dot, sum, max): Excellent SIMD speedups (3-4.5x)
- Element-wise operations (add, sub, div, fma): Minimal or negative SIMD benefit
- SSE2 backend: Frequently slower than scalar for simple operations
- Small workloads (<1000 elements): SIMD overhead often exceeds benefit
Performance by Operation Category
Excellent SIMD Performance (>5x speedup)
| Operation | Size | Scalar | SSE2 | AVX2 | SSE2 Speedup | AVX2 Speedup |
|---|---|---|---|---|---|---|
| tanh | 100 | 891 ns | 137 ns | 101 ns | 6.5x | 8.8x |
| tanh | 1000 | 8.0 µs | 1.08 µs | - | 7.4x | - |
Why tanh excels:
- Compute-intensive operation (requires exp calculations)
- SIMD processes 4-8 exponentials in parallel
- No memory bottleneck (compute dominates)
- AVX2's wider registers (8 vs 4 elements) provide 2x improvement over SSE2
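To make the lane parallelism concrete, here is a minimal sketch of an AVX2 kernel that processes 8 floats per iteration. It uses a crude rational approximation (tanh(x) ≈ x·(27 + x²)/(27 + 9x²), only reasonable for small |x|) rather than Trueno's actual exp-based implementation, but the structure, a wide main loop plus a scalar remainder loop, is the same:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Approximate tanh over a slice, 8 lanes per iteration (AVX2).
/// Safety: the caller must verify AVX2 support at runtime before calling.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn tanh_approx_avx2(input: &[f32], out: &mut [f32]) {
    let n = input.len().min(out.len());
    let c27 = _mm256_set1_ps(27.0);
    let c9 = _mm256_set1_ps(9.0);

    let mut i = 0;
    // Main loop: 8 elements per iteration.
    while i + 8 <= n {
        let x = _mm256_loadu_ps(input.as_ptr().add(i));
        let x2 = _mm256_mul_ps(x, x);
        // tanh(x) ≈ x * (27 + x^2) / (27 + 9 * x^2)  (crude, small-|x| only)
        let num = _mm256_mul_ps(x, _mm256_add_ps(c27, x2));
        let den = _mm256_add_ps(c27, _mm256_mul_ps(c9, x2));
        _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_div_ps(num, den));
        i += 8;
    }
    // Scalar remainder loop: one source of the fixed overhead discussed below.
    while i < n {
        let x = input[i];
        out[i] = x * (27.0 + x * x) / (27.0 + 9.0 * x * x);
        i += 1;
    }
}
```

The equivalent SSE2 kernel would use `_mm_*` intrinsics and advance 4 lanes per iteration, which is the 2x register-width gap noted above.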
Good SIMD Performance (1.1-2x speedup)
| Operation | Size | Scalar | SSE2 | AVX2 | SSE2 Speedup | AVX2 Speedup |
|---|---|---|---|---|---|---|
| relu | 100 | 54.1 ns | 44.8 ns | 49.3 ns | 1.21x | 1.10x |
| scale | 100 | 43.9 ns | 41.8 ns | 39.6 ns | 1.05x | 1.11x |
| scale | 1000 | 104 ns | 111 ns | 90.8 ns | 0.94x | 1.15x |
| div | 100 | 58.3 ns | 55.7 ns | 53.3 ns | 1.05x | 1.09x |
Poor SIMD Performance (<1.1x or negative)
| Operation | Size | Scalar | SSE2 | AVX2 | SSE2 Speedup | AVX2 Speedup |
|---|---|---|---|---|---|---|
| sigmoid | 100 | 364 ns | 405 ns | 393 ns | 0.90x ❌ | 0.93x ❌ |
| fma | 100 | 46.8 ns | 48.8 ns | 42.8 ns | 0.96x ❌ | 1.09x |
| sub | 100 | 46.0 ns | 59.9 ns | 49.9 ns | 0.77x ❌ | 0.92x ❌ |
| div | 1000 | 142 ns | 218 ns | 142 ns | 0.65x ❌ | 1.00x |
Root Cause Analysis
1. Memory Bandwidth Bottleneck
For simple element-wise operations, memory access dominates compute time: the arithmetic finishes long before the next cache line arrives. Wider SIMD registers raise arithmetic throughput but cannot raise memory bandwidth, so these kernels stay memory-bound regardless of backend.
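As a rough estimate derived from the table above (not a separate measurement): the 1000-element div reads two 4 KB inputs and writes one 4 KB output, about 12 KB of traffic. Moving 12 KB in the measured 142 ns works out to roughly 86 GB/s of sustained load/store throughput, already near what a single core can stream from cache. Making the arithmetic 4 or 8 lanes wide does not raise that ceiling, consistent with AVX2 landing at exactly 1.00x for that case.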
2. SIMD Overhead for Small Workloads
Each SIMD call carries a fixed ~20-50 ns overhead from backend setup, alignment checks, and the scalar remainder loop that handles leftover elements. For 100-element workloads that complete in under 100 ns, this overhead alone can cancel the vector speedup.
3. Suboptimal Implementations
Some operations (notably div and sigmoid) regress below scalar performance; these look like implementation issues rather than fundamental limits and require investigation.
Next Steps
- Fix SSE2 div, sigmoid, fma, sub implementations
- Implement adaptive backend selection (dispatch by workload size and CPU features; see the sketch after this list)
- Benchmark against NumPy/PyTorch
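A minimal sketch of what adaptive selection could look like (a hypothetical design, not Trueno's current API; the threshold and function names are placeholders): fall back to scalar below a size cutoff where fixed overhead dominates, and prefer the widest backend that actually wins on the benchmarks above.

```rust
/// Hypothetical element-wise add with size- and feature-based dispatch.
/// The 1024-element cutoff is a placeholder; the real threshold should be
/// derived per operation from the benchmark data above.
pub fn add(a: &[f32], b: &[f32], out: &mut [f32]) {
    const SIMD_THRESHOLD: usize = 1024;

    if a.len() < SIMD_THRESHOLD {
        return add_scalar(a, b, out);
    }

    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: AVX2 support was just verified at runtime.
            return unsafe { add_avx2(a, b, out) };
        }
    }
    // SSE2 is skipped deliberately: the data above shows it is often
    // slower than scalar for simple element-wise operations.
    add_scalar(a, b, out)
}

fn add_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[f32], b: &[f32], out: &mut [f32]) {
    use std::arch::x86_64::*;
    let n = a.len().min(b.len()).min(out.len());
    let mut i = 0;
    while i + 8 <= n {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_add_ps(va, vb));
        i += 8;
    }
    while i < n {
        out[i] = a[i] + b[i];
        i += 1;
    }
}
```

The cutoff itself should come from per-operation benchmark data rather than a single hard-coded constant, since compute-heavy kernels like tanh pay off at much smaller sizes than memory-bound ones like add.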