# Profiling and Performance Tuning
This chapter documents performance profiling techniques and optimization discoveries from the Sovereign AI Stack.
## Thread Pool Optimization

### The 2.05x Discovery
A major performance breakthrough was discovered through systematic profiling: reducing thread count from 48 to 16 yielded a 2.05x speedup in CPU inference.
| Metric | 48 Threads | 16 Threads | Improvement |
|---|---|---|---|
| Throughput | 12.4 tok/s | 25.4 tok/s | 2.05x |
| Overhead | 3.5x | 1.7x | 2.06x |
| Per-token latency | 80.6 ms | 39.4 ms | 2.05x |
### Root Cause Analysis
The default rayon thread pool uses all available logical cores (hyperthreads). For small work units like single-token inference, this causes:
- **Cache line bouncing** - 48 threads constantly invalidating each other's L1/L2 lines
- **False sharing** - adjacent output writes causing coherency traffic (see the sketch after this list)
- **Hyperthread contention** - HT pairs competing for the same FPU
- **Rayon sync overhead** - work units too small to amortize a 48-way split
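To make the false-sharing point concrete, here is a minimal sketch (illustrative only, not realizar code) of the standard mitigation: pad each per-thread accumulator to a full 64-byte cache line so neighboring threads never write to the same line.

```rust
/// Illustrative only: align each per-thread slot to a 64-byte cache line.
/// Without the alignment, sixteen f32 accumulators would share a single
/// line and every write would invalidate it in the other cores' caches.
#[repr(align(64))]
#[derive(Default)]
struct PaddedSlot {
    value: f32,
}

fn main() {
    // One padded slot per worker thread.
    let mut partials: Vec<PaddedSlot> = (0..16).map(|_| PaddedSlot::default()).collect();
    partials[0].value += 1.0;
    assert_eq!(std::mem::size_of::<PaddedSlot>(), 64);
}
```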
### Optimal Thread Count Formula

`optimal_threads = min(physical_cores, work_size / cache_line_size)`
For Qwen 1.5B with a hidden dimension of 1536 (f32 elements, so a 64-byte cache line holds 16 of them):
- 1536 elements / 16 elements per cache line = 96 cache lines
- 12-16 threads = 6-8 cache lines per thread (optimal)
- 48 threads = 2 cache lines per thread (too fine-grained)
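A hypothetical helper (not part of realizar) that applies the formula:

```rust
/// Sketch of the heuristic above. `cache_line_elems` is how many elements
/// fit in one 64-byte cache line (16 for f32).
fn optimal_threads(physical_cores: usize, work_size: usize, cache_line_elems: usize) -> usize {
    let cache_lines = work_size / cache_line_elems; // 1536 / 16 = 96
    physical_cores.min(cache_lines)
}

fn main() {
    // On a 16-core machine the work is still coarse enough
    // (6 cache lines per thread) to use every physical core.
    assert_eq!(optimal_threads(16, 1536, 16), 16);
}
```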
### Implementation

The `configure_optimal_thread_pool()` function in realizar sets the optimal thread count:
```rust
#![allow(unused)]
fn main() {
    use realizar::inference::configure_optimal_thread_pool;

    // Set to 16 threads (or the physical core count)
    configure_optimal_thread_pool();

    // Or set explicitly via the environment (read when rayon's
    // global pool is first initialized)
    std::env::set_var("RAYON_NUM_THREADS", "16");
}
```
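Without realizar's helper, the same effect can be achieved through rayon's public API; a minimal sketch:

```rust
// Build rayon's global pool with an explicit thread count. This must run
// before any parallel iterator first touches the global pool; otherwise
// build_global() returns an error.
fn init_pool() -> Result<(), rayon::ThreadPoolBuildError> {
    rayon::ThreadPoolBuilder::new()
        .num_threads(16)
        .build_global()
}

fn main() {
    init_pool().expect("rayon global pool was already initialized");
}
```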
## Profiling Tools

### Micro-Level Profiling

```bash
cargo run --release --example micro_profile
```
Profiles individual operations (matmul, attention, FFN) to identify bottlenecks.
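The example's internals are not reproduced here, but operation-level timing amounts to something like this sketch (`time_op` is a hypothetical helper):

```rust
use std::time::Instant;

/// Time one operation over many iterations and report the mean cost.
fn time_op<F: FnMut()>(name: &str, iters: u32, mut op: F) {
    op(); // warm-up pass so caches and branch predictors settle
    let start = Instant::now();
    for _ in 0..iters {
        op();
    }
    println!("{name}: {:?} per call", start.elapsed() / iters);
}

fn main() {
    let a = vec![1.0f32; 1536];
    let b = vec![2.0f32; 1536];
    time_op("dot_1536", 10_000, || {
        // A dot product stands in for a matmul inner loop.
        let dot: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
        std::hint::black_box(dot); // keep the optimizer from deleting the work
    });
}
```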
### Layer-Level Profiling

```bash
cargo run --release --example layer_profile
```
Profiles generation timing to measure per-token latency and throughput.
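As a sanity check on the figures in the thread-count table, throughput and per-token latency are reciprocals; the numbers below are made up to reproduce the 16-thread row:

```rust
fn main() {
    // Hypothetical measurements from one generation run.
    let tokens = 128.0_f64;   // tokens generated
    let elapsed_s = 5.04_f64; // wall-clock seconds
    let throughput = tokens / elapsed_s;   // 25.4 tok/s
    let latency_ms = 1000.0 / throughput;  // 39.4 ms per token
    println!("{throughput:.1} tok/s, {latency_ms:.1} ms/token");
}
```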
### Thread Sweep

```bash
for t in 8 10 12 14 16 18 20 24 32 48; do
  echo "=== $t threads ==="
  RAYON_NUM_THREADS=$t cargo run --release --example instrumented_forward 2>&1 | grep -E "Throughput|Per token"
done
```
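Each run inherits its own `RAYON_NUM_THREADS` value; rayon reads the variable when its global pool is first initialized, so no code changes are needed between sweep iterations.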
### Results Interpretation
| Symptom | Likely Cause | Solution |
|---|---|---|
| Low throughput, high thread count | Thread overhead | Reduce threads |
| Low bandwidth utilization (<20%) | Compute-bound | SIMD optimization |
| High bandwidth, low throughput | Memory-bound | Better tiling |
| Variable latency | Cache thrashing | Thread affinity |
## Tile-Level Profiling (TILING-SPEC-001)

Trueno's `BrickProfiler` supports hierarchical tile profiling:
```rust
#![allow(unused)]
fn main() {
    use trueno::{BrickProfiler, TileLevel};

    let mut profiler = BrickProfiler::new();
    profiler.enable_tile_profiling();

    // Profile a macro tile (L3/global memory level)
    let timer = profiler.start_tile(TileLevel::Macro, 0, 0);
    // ... execute computation ...

    // Elements processed and FLOPs performed by the tile
    // (placeholder values; supply the real counts for your kernel)
    let (elements, flops) = (1536 * 1536, 2 * 1536 * 1536);
    profiler.stop_tile(timer, elements, flops);

    // Get results
    println!("{}", profiler.tile_summary());
}
```
### Tile Hierarchy
| Level | Memory | Typical Size | Use Case |
|---|---|---|---|
| Macro | L3/Global | 32MB | Layer-level |
| Midi | L2/Shared | 256KB | Head-level |
| Micro | L1/Registers | 32KB | SIMD-level |
### Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| GFLOP/s | `flops / seconds / 1e9` | Compute throughput |
| Arithmetic Intensity | `flops / bytes` | >10 = compute-bound |
| Cache Efficiency | `actual / peak` | Target >50% |
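To make the arithmetic-intensity row concrete, here is a sketch for a plain f32 matmul, assuming each operand is moved between memory and cache exactly once:

```rust
/// FLOPs per byte for an M x K by K x N f32 matmul, assuming each input
/// matrix is read once and the output is written once.
fn matmul_intensity(m: f64, k: f64, n: f64) -> f64 {
    let flops = 2.0 * m * k * n;               // one multiply + one add per MAC
    let bytes = 4.0 * (m * k + k * n + m * n); // f32 = 4 bytes per element
    flops / bytes
}

fn main() {
    // Single-token decode (M = 1) over a 1536 x 1536 weight matrix:
    // ~0.5 FLOP/byte, far below the >10 compute-bound threshold,
    // i.e. firmly memory-bound.
    println!("{:.2}", matmul_intensity(1.0, 1536.0, 1536.0));
}
```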
## Remaining Optimization Opportunities

After thread optimization (25.4 tok/s), the remaining gap to the 42 tok/s target is about 1.65x:
| Optimization | Expected Gain | Status |
|---|---|---|
| Thread count optimization | 2.05x | Done |
| Fuse parallel regions | 1.2-1.3x | Pending |
| SIMD attention (AVX-512) | 1.2-1.4x | Pending |
| Reduce Vec allocations | 1.1x | Pending |