# BLIS-Style Matrix Multiplication
High-performance GEMM (General Matrix Multiply) implementation based on the BLIS framework.
## Overview
The BLIS implementation reaches roughly 72 GFLOP/s on modern x86_64 CPUs with AVX2, roughly a 7× improvement over a naive triple-loop GEMM (~10 GFLOP/s).
## Running the Benchmark

```bash
cargo run --release --example blis_benchmark
```
Expected output:
```text
=== BLIS GEMM Benchmark ===

--- Reference vs BLIS Comparison ---
64x64:     Reference 0.1ms,   BLIS 0.0ms, Speedup: 3.5x
128x128:   Reference 1.2ms,   BLIS 0.1ms, Speedup: 8.7x
256x256:   Reference 12.4ms,  BLIS 0.8ms, Speedup: 14.6x
512x512:   Reference 159.0ms, BLIS 4.5ms, Speedup: 35.1x

--- BLIS Performance ---
BLIS   64x  64:     20.3 us, 25.9 GFLOP/s
BLIS  128x 128:     99.5 us, 42.2 GFLOP/s
BLIS  256x 256:    544.1 us, 61.7 GFLOP/s
BLIS  512x 512:   3752.0 us, 71.5 GFLOP/s
BLIS 1024x1024:  30965.0 us, 69.4 GFLOP/s
```
## Algorithm
The implementation uses the 5-loop BLIS algorithm:
```text
Loop 5: jc over N in steps of NC   (L3 blocking)
  Loop 4: pc over K in steps of KC (L1 blocking)
    Pack B → B̃ (kc × nc panel)
    Loop 3: ic over M in steps of MC (L2 blocking)
      Pack A → Ã (mc × kc panel)
      Loop 2: jr over nc in steps of NR
        Loop 1: ir over mc in steps of MR
          MICROKERNEL: C[ir, jr] += Ã[ir, :] × B̃[:, jr]
```
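The loop structure maps directly onto nested Rust loops. The sketch below is a simplified, scalar illustration of the five loops plus packing; the block sizes and the inner "microkernel" are stand-ins, not the optimized trueno kernels:

```rust
/// Simplified 5-loop GEMM sketch: C += A * B, row-major f32 buffers.
/// Illustrative only; the real implementation uses a SIMD microkernel.
const MC: usize = 72;
const NC: usize = 4096;
const KC: usize = 256;
const MR: usize = 8;
const NR: usize = 6;

fn gemm_5loop(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    // Loop 5 (jc): partition N into NC-wide column panels of B and C.
    for jc in (0..n).step_by(NC) {
        let nc = NC.min(n - jc);
        // Loop 4 (pc): partition K into KC-deep panels; pack B[pc.., jc..] -> b_packed.
        for pc in (0..k).step_by(KC) {
            let kc = KC.min(k - pc);
            let b_packed: Vec<f32> = (0..kc)
                .flat_map(|p| (0..nc).map(move |j| b[(pc + p) * n + (jc + j)]))
                .collect(); // kc x nc, row-major
            // Loop 3 (ic): partition M into MC-tall blocks; pack A[ic.., pc..] -> a_packed.
            for ic in (0..m).step_by(MC) {
                let mc = MC.min(m - ic);
                let a_packed: Vec<f32> = (0..mc)
                    .flat_map(|i| (0..kc).map(move |p| a[(ic + i) * k + (pc + p)]))
                    .collect(); // mc x kc, row-major
                // Loop 2 (jr) and Loop 1 (ir): walk the NR x MR micro-tiles.
                for jr in (0..nc).step_by(NR) {
                    let nr = NR.min(nc - jr);
                    for ir in (0..mc).step_by(MR) {
                        let mr = MR.min(mc - ir);
                        // "Microkernel": scalar update of one MR x NR tile of C.
                        for i in 0..mr {
                            for j in 0..nr {
                                let mut acc = 0.0f32;
                                for p in 0..kc {
                                    acc += a_packed[(ir + i) * kc + p]
                                        * b_packed[p * nc + (jr + j)];
                                }
                                c[(ic + i) * n + (jc + jr + j)] += acc;
                            }
                        }
                    }
                }
            }
        }
    }
}
```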
## Configuration Constants
| Constant | Value | Purpose |
|---|---|---|
| MR | 8 | Microkernel rows (AVX2 = 8 f32) |
| NR | 6 | Microkernel columns |
| KC | 256 | K-blocking for L1 cache |
| MC | 72 | M-blocking for L2 cache |
| NC | 4096 | N-blocking for L3 cache |
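In code these are plain compile-time constants. The snippet below is a hedged sketch showing how they might be declared and what cache footprints the table's values imply; the footprint arithmetic is illustrative, not a measurement of the actual packed buffers:

```rust
// Blocking parameters (values from the table above).
pub const MR: usize = 8;    // micro-tile rows: one AVX2 register holds 8 f32
pub const NR: usize = 6;    // micro-tile columns
pub const KC: usize = 256;  // depth of packed panels
pub const MC: usize = 72;   // rows of the packed A block (72 = 9 * MR)
pub const NC: usize = 4096; // columns of the packed B panel

fn main() {
    let f32_bytes = std::mem::size_of::<f32>();
    // Footprints the blocking is intended to keep cache-resident:
    // KC x NR micro-panel of B (L1), MC x KC block of A (L2), KC x NC panel of B (L3).
    println!("B micro-panel: {} bytes", KC * NR * f32_bytes); // 6 KiB
    println!("packed A block: {} bytes", MC * KC * f32_bytes); // 72 KiB
    println!("packed B panel: {} bytes", KC * NC * f32_bytes); // 4 MiB
}
```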
## API Usage

### Basic GEMM
```rust
use trueno::blis::gemm;

let m = 256;
let n = 256;
let k = 256;

// A is m×k, B is k×n, C is m×n.
let a: Vec<f32> = vec![1.0; m * k];
let b: Vec<f32> = vec![1.0; k * n];
let mut c: Vec<f32> = vec![0.0; m * n];

// C += A × B
gemm(m, n, k, &a, &b, &mut c).unwrap();
```
### With Profiling
```rust
use trueno::blis::{gemm_profiled, BlisProfiler};

// Reuses m, n, k, a, b, c from the example above.
let mut profiler = BlisProfiler::enabled();
gemm_profiled(m, n, k, &a, &b, &mut c, &mut profiler).unwrap();
println!("{}", profiler.summary());
```
Output:
```text
BLIS Profiler Summary
=====================
Macro:  578.8us avg, 58.0 GFLOP/s,    1 calls
Midi:   130.0us avg, 64.6 GFLOP/s,    4 calls
Micro:    0.4us avg, 69.2 GFLOP/s, 1376 calls
Pack:     9.7us avg,                  5 calls
Total: 58.0 GFLOP/s
```
## Toyota Production System Integration

### Jidoka (Stop on Defect)
Runtime guards catch numerical errors:
```rust
use trueno::blis::{JidokaGuard, gemm_reference_with_jidoka};

let guard = JidokaGuard::strict();

// Will return Err if NaN/Inf detected
gemm_reference_with_jidoka(m, n, k, &a, &b, &mut c, &guard)?;
```
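The kind of check a guard like this performs can be approximated by scanning the output for non-finite values. A minimal stand-in sketch (not the actual JidokaGuard internals):

```rust
/// Minimal "stop on defect" check: reject non-finite outputs.
/// Illustration only; not the trueno JidokaGuard implementation.
fn check_finite(c: &[f32]) -> Result<(), String> {
    match c.iter().position(|x| !x.is_finite()) {
        Some(i) => Err(format!("non-finite value {} at index {}", c[i], i)),
        None => Ok(()),
    }
}

fn main() {
    let c = vec![1.0f32, 2.0, f32::NAN];
    assert!(check_finite(&c).is_err()); // defect detected -> stop the line
}
```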
### Heijunka (Load Leveling)
Parallel execution uses balanced partitioning:
```rust
use trueno::blis::HeijunkaScheduler;

let scheduler = HeijunkaScheduler::default();
let partitions = scheduler.partition_m(1024, 72);
// Returns balanced work ranges for each thread
```
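The idea behind load leveling: instead of handing out fixed-size chunks and leaving one small straggler block at the end, spread the rows as evenly as possible. A hedged sketch of that logic (illustrative only, not the HeijunkaScheduler internals, which may also round block sizes to multiples of MR):

```rust
/// Split `m` rows into blocks of at most `max_rows` rows, sized as evenly as possible.
fn leveled_blocks(m: usize, max_rows: usize) -> Vec<std::ops::Range<usize>> {
    let parts = (m + max_rows - 1) / max_rows; // number of blocks needed
    let base = m / parts;
    let extra = m % parts; // the first `extra` blocks get one additional row
    let mut start = 0;
    (0..parts)
        .map(|i| {
            let len = base + usize::from(i < extra);
            let r = start..start + len;
            start += len;
            r
        })
        .collect()
}

fn main() {
    // 1024 rows with a 72-row cap -> 15 blocks of 68 or 69 rows,
    // instead of fourteen 72-row blocks plus one 16-row straggler.
    for r in leveled_blocks(1024, 72) {
        println!("{:?} ({} rows)", r, r.len());
    }
}
```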
## Performance Analysis

### Roofline Model
For GEMM with optimal blocking, elements of A and B are reused from cache, so the dominant memory traffic is roughly one 4-byte store per element of C against the 2K FLOPs that element requires:

- Arithmetic intensity: AI ≈ 2K FLOPs / 4 bytes ≈ K/2 FLOP/byte
- At K = 512: AI ≈ 256 FLOP/byte, far above typical machine balance, so the kernel is compute-bound (see the sketch below)
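A quick way to sanity-check this is to compare AI against an assumed machine balance (peak FLOP rate divided by memory bandwidth). The peak and bandwidth figures below are placeholder assumptions, not measurements:

```rust
/// Rough roofline check for f32 GEMM under the idealized model above:
/// 2*K FLOPs per C element vs ~4 bytes of traffic per C element.
fn main() {
    let peak_gflops = 400.0;  // assumed peak (see the table below)
    let bandwidth_gbs = 40.0; // assumed sustained DRAM bandwidth
    let machine_balance = peak_gflops / bandwidth_gbs; // FLOP/byte

    for k in [64u32, 128, 256, 512, 1024] {
        let ai = (2.0 * k as f64) / 4.0; // ≈ K/2 FLOP/byte
        let bound = if ai >= machine_balance { "compute-bound" } else { "memory-bound" };
        println!("K = {:4}: AI ≈ {:6.1} FLOP/byte -> {}", k, ai, bound);
    }
}
```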
### Achieved vs Theoretical
| Size | Achieved | Theoretical | Efficiency |
|---|---|---|---|
| 256×256 | 61 GFLOP/s | ~400 GFLOP/s | 15% |
| 512×512 | 71 GFLOP/s | ~400 GFLOP/s | 18% |
| 1024×1024 | 72 GFLOP/s | ~400 GFLOP/s | 18% |
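The ~400 GFLOP/s theoretical figure corresponds to a peak of the form cores × clock × SIMD lanes × FMA units × 2 FLOPs per FMA. The breakdown below is an illustrative assumption chosen to land near that figure, not a statement about any specific CPU:

```rust
fn main() {
    // Peak f32 FLOP/s = cores * clock * lanes * FMA units * 2 (mul + add per FMA).
    // All figures are assumptions for a small AVX2 desktop part.
    let cores = 4.0;
    let clock_ghz = 3.2;
    let f32_lanes = 8.0; // 256-bit AVX2 registers
    let fma_units = 2.0; // two FMA ports per core
    let peak = cores * clock_ghz * f32_lanes * fma_units * 2.0;
    println!("theoretical peak ≈ {:.0} GFLOP/s", peak); // ≈ 410 GFLOP/s

    let achieved = 72.0; // measured BLIS result at 1024x1024 (table above)
    println!("efficiency ≈ {:.0}%", 100.0 * achieved / peak); // ≈ 18%
}
```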
## References
- Goto, K., & Van de Geijn, R. A. (2008). Anatomy of High-Performance Matrix Multiplication. *ACM TOMS*, 34(3).
- Van Zee, F. G., & Van de Geijn, R. A. (2015). BLIS: A Framework for Rapidly Instantiating BLAS Functionality. *ACM TOMS*, 41(3).
- Low, T. M., et al. (2016). Analytical Modeling Is Enough for High-Performance BLIS. *ACM TOMS*, 43(2).