Phase 15: Fused Q4K Quantized GEMV

Version: 1.0.0
Status: SPECIFICATION
Date: 2025-01-15
Target: 2x Ollama CPU throughput for APR format
Scope: Fused dequantization + dot product for Q4_K quantized inference


Executive Summary

This specification defines the implementation of fused Q4K dequant+dot kernels for trueno, enabling the APR format to reach performance parity with GGUF/llama.cpp. The current APR CPU path achieves ~15 tok/s vs llama.cpp's ~42 tok/s on TinyLlama-1.1B Q4_0.

Root Cause: APR's current path separates dequantization from GEMV, causing:

  1. Extra memory traffic (dequant to temp buffer, then dot)
  2. Cache pollution from intermediate f32 expansion
  3. Missed SIMD fusion opportunities

Solution: Fused Q4K kernels that dequantize directly into SIMD registers during dot product computation.


1. Problem Statement

1.1 Current Architecture Gap

| Implementation | Q4K Approach | Throughput (1.1B) |
|---|---|---|
| llama.cpp | Fused Q4×Q8 SIMD (llamafile) | ~42 tok/s |
| candle | Inline AVX2 dequant+dot | ~10 tok/s |
| APR (current) | Separate dequant → f32 dot | ~15 tok/s |
| APR (target) | Fused Q4×Q8 SIMD | ≥42 tok/s |

1.2 Memory Bandwidth Analysis

For a 4096×4096 Q4_K matmul (single token decode):

| Approach | Memory Read | Memory Write | Efficiency |
|---|---|---|---|
| Separate dequant | 8 MB (Q4) + 64 MB (f32 temp) | 64 MB | 17% |
| Fused Q4×Q8 | 8 MB (Q4) + 4 MB (Q8 input) | 0 | 100% |
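
These figures follow directly from the 4096×4096 shape; a quick back-of-the-envelope check (4-bit payload only, scales add roughly another 0.5 bits/element):

  Q4_K weights:  4096 × 4096 × 4 bits  = 8 MB
  f32 expansion: 4096 × 4096 × 4 bytes = 64 MB
  Separate path: read 8 MB of Q4, write 64 MB of f32 temp, re-read 64 MB for the dot
  Fused path:    read 8 MB of Q4 plus the quantized Q8 input, no intermediate writes
  Savings:       64 MB write + 64 MB read = 128 MB of traffic avoided per matmul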

Key Insight: Fused approach eliminates 128 MB of unnecessary memory traffic per matmul.

1.3 Scope

IN SCOPE:

  • Fused Q4_K × Q8_K SIMD dot product (AVX2, AVX-512, NEON)
  • Cache-blocked quantized GEMV
  • Integration with realizar APR transformer
  • Benchmarks against llama.cpp baseline

OUT OF SCOPE:

  • GPU quantized kernels (see trueno-gpu Phase 16)
  • Training quantization
  • New quantization formats

2. Root Cause Analysis (5 Whys)

Why #1: Why is APR 2.8x slower than llama.cpp on CPU?

Answer: Each token generation performs ~200 matmuls, and APR's matmul is 2.8x slower.

Why #2: Why is APR matmul 2.8x slower?

Answer: APR dequantizes Q4_K weights to f32 before GEMV, while llama.cpp keeps data quantized.

Why #3: Why does dequantization hurt performance?

Answer: Dequantizing 4096×4096 Q4_K (8 MB) produces 64 MB of f32, exceeding L3 cache.

Why #4: Why not keep data quantized during GEMV?

Answer: trueno's current SIMD kernels only support f32 dot products, not Q4×Q8.

Why #5: Why weren't fused Q4×Q8 kernels implemented?

Answer: Phase 2 focused on f32 matmul parity with NumPy. This is the root cause.


3. Solution Architecture

3.1 Q4_K Block Format

Q4_K quantization uses 256-element blocks, each split into 8 sub-blocks of 32 elements:

Block layout (256 elements = 8 sub-blocks × 32 elements, matching llama.cpp's block_q4_K):
  - d: f16 (2 bytes) - block-wide scale
  - dmin: f16 (2 bytes) - block-wide minimum scale
  - scales: [u8; 12] (12 bytes) - 8 sub-block scales + 8 sub-block mins, 6 bits each, packed
  - qs: [u8; 128] (128 bytes) - packed 4-bit quantized values

Total: 144 bytes per 256 elements = 4.5 bits/element
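
In Rust this maps onto two plain #[repr(C)] structs. The sketch below is this spec's assumption for quantize/formats.rs (in particular, storing f16 fields as raw u16 bits and carrying per-sub-block sums in BlockQ8K.bsums are design assumptions, not existing trueno code):

/// One Q4_K block: 256 weights in 144 bytes (4.5 bits/element).
#[repr(C)]
pub struct BlockQ4K {
    pub d: u16,           // f16 bits: block-wide scale
    pub dmin: u16,        // f16 bits: block-wide min scale
    pub scales: [u8; 12], // 8 scales + 8 mins for the sub-blocks, 6 bits each
    pub qs: [u8; 128],    // 256 × 4-bit quantized values
}

/// One Q8_K block: 256 quantized activations.
#[repr(C)]
pub struct BlockQ8K {
    pub d: f32,           // block-wide scale
    pub qs: [i8; 256],    // quantized values
    pub bsums: [i16; 8],  // sum of each 32-element sub-block of qs
}

Carrying bsums lets the fused kernel apply Q4_K's per-sub-block min offset without re-reading the quantized input.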

3.2 Fused Q4K×Q8K Kernel Design

The fused kernel computes dot(Q4_K_weights, Q8_K_input) without intermediate dequantization:

use std::arch::x86_64::*;

/// Fused Q4K × Q8K dot product over one 256-element block (AVX2 + FMA)
///
/// Computes sum(dequant(q4) * dequant(q8)) directly in SIMD registers,
/// with no intermediate f32 buffer.
#[target_feature(enable = "avx2,fma")]
unsafe fn fused_q4k_q8k_dot_avx2(
    q4_block: &BlockQ4K,  // 256 quantized weights
    q8_block: &BlockQ8K,  // 256 quantized inputs
) -> f32 {
    // Step 1: Combined block-wide scales (stay in registers)
    let d = f16_to_f32(q4_block.d) * q8_block.d;
    let dmin = f16_to_f32(q4_block.dmin) * q8_block.d;

    // Step 2: Unpack the packed 6-bit sub-block scales and mins (helper sketched below)
    let (scales, mins) = unpack_q4k_scales(&q4_block.scales);

    // Step 3: Min correction. Q4_K quants are unsigned and offset by a per-sub-block
    // min, so subtract dmin * sum_j(min_j * sum(q8 in sub-block j)) at the end.
    let mut min_dot = 0i32;
    for j in 0..8 {
        min_dot += mins[j] as i32 * q8_block.bsums[j] as i32;
    }

    // Step 4: Main loop. Each iteration consumes 32 qs bytes = 64 weights:
    // low nibbles belong to sub-block 2j, high nibbles to sub-block 2j+1.
    let low_mask = _mm256_set1_epi8(0x0F);
    let ones = _mm256_set1_epi16(1);
    let mut acc = _mm256_setzero_ps();

    for j in 0..4 {
        // Load 32 packed Q4 bytes and split into low/high nibbles (32 values each)
        let q4_bytes = _mm256_loadu_si256(q4_block.qs.as_ptr().add(32 * j) as *const __m256i);
        let q4_lo = _mm256_and_si256(q4_bytes, low_mask);
        let q4_hi = _mm256_and_si256(_mm256_srli_epi16(q4_bytes, 4), low_mask);

        // Load the matching 2 × 32 int8 inputs
        let q8_lo = _mm256_loadu_si256(q8_block.qs.as_ptr().add(64 * j) as *const __m256i);
        let q8_hi = _mm256_loadu_si256(q8_block.qs.as_ptr().add(64 * j + 32) as *const __m256i);

        // u8 × i8 multiply with pairwise add → i16 (no saturation: 15 × 127 × 2 < 32767)
        let prod_lo = _mm256_maddubs_epi16(q4_lo, q8_lo);
        let prod_hi = _mm256_maddubs_epi16(q4_hi, q8_hi);

        // Widen to i32, apply the per-sub-block scale, accumulate in f32
        acc = _mm256_fmadd_ps(
            _mm256_cvtepi32_ps(_mm256_madd_epi16(prod_lo, ones)),
            _mm256_set1_ps(d * scales[2 * j] as f32),
            acc,
        );
        acc = _mm256_fmadd_ps(
            _mm256_cvtepi32_ps(_mm256_madd_epi16(prod_hi, ones)),
            _mm256_set1_ps(d * scales[2 * j + 1] as f32),
            acc,
        );
    }

    // Step 5: Horizontal sum, then apply the min correction
    horizontal_sum_avx2(acc) - dmin * min_dot as f32
}
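
The kernel above leans on two small helpers that live alongside it in fused_avx2.rs. These are sketches under this spec's assumptions (the names unpack_q4k_scales and horizontal_sum_avx2 do not exist in trueno yet); the 6-bit unpacking mirrors llama.cpp's get_scale_min_k4:

/// Unpack the 12-byte Q4_K scale field into 8 sub-block scales and 8 mins (6 bits each).
fn unpack_q4k_scales(packed: &[u8; 12]) -> ([u8; 8], [u8; 8]) {
    let mut scales = [0u8; 8];
    let mut mins = [0u8; 8];
    for j in 0..8 {
        if j < 4 {
            scales[j] = packed[j] & 63;
            mins[j] = packed[j + 4] & 63;
        } else {
            scales[j] = (packed[j + 4] & 0x0F) | ((packed[j - 4] >> 6) << 4);
            mins[j] = (packed[j + 4] >> 4) | ((packed[j] >> 6) << 4);
        }
    }
    (scales, mins)
}

/// Sum the 8 f32 lanes of an AVX2 register into a scalar.
#[target_feature(enable = "avx2")]
unsafe fn horizontal_sum_avx2(v: __m256) -> f32 {
    let hi = _mm256_extractf128_ps(v, 1);
    let lo = _mm256_castps256_ps128(v);
    let sum4 = _mm_add_ps(lo, hi);                                 // 8 → 4 lanes
    let sum2 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));        // 4 → 2 lanes
    let sum1 = _mm_add_ss(sum2, _mm_shuffle_ps(sum2, sum2, 0x55)); // 2 → 1 lane
    _mm_cvtss_f32(sum1)
}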

3.3 Cache-Blocked GEMV

For large matrices, apply L2 cache blocking:

use rayon::prelude::*;

/// Cache-blocked Q4K GEMV
///
/// y[M] = A[M×K] × x[K] where A is Q4K quantized and x is Q8K quantized
pub fn q4k_gemv_blocked(
    output: &mut [f32],        // M outputs
    weights: &[BlockQ4K],      // M × K/256 blocks, row-major
    input: &[BlockQ8K],        // K/256 blocks (quantized input vector)
    m: usize,
    k: usize,
) {
    const BLOCK_M: usize = 64; // Rows per L2 block
    let blocks_per_row = k / 256;
    debug_assert_eq!(weights.len(), m * blocks_per_row);
    debug_assert_eq!(input.len(), blocks_per_row);

    // Process output rows in L2-friendly chunks; the quantized input
    // (K/256 blocks, a few KB) stays cache-resident across each chunk.
    for (chunk_idx, out_chunk) in output[..m].chunks_mut(BLOCK_M).enumerate() {
        let m_start = chunk_idx * BLOCK_M;

        // Parallel over output rows within the chunk
        out_chunk
            .par_iter_mut()
            .with_min_len(16) // Avoid Rayon overhead on tiny work items
            .enumerate()
            .for_each(|(i, out)| {
                let row = m_start + i;
                let row_blocks = &weights[row * blocks_per_row..(row + 1) * blocks_per_row];

                // Walk the K dimension one 256-element block at a time
                let mut sum = 0.0f32;
                for (w_blk, x_blk) in row_blocks.iter().zip(input.iter()) {
                    // SAFETY: caller (q4k_gemv) has verified AVX2 + FMA support
                    sum += unsafe { fused_q4k_q8k_dot_avx2(w_blk, x_blk) };
                }
                *out = sum;
            });
    }
}
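
At the public boundary, backend selection happens once per call. A minimal sketch of the q4k_gemv entry point from section 4.2, assuming a portable scalar fallback named q4k_gemv_scalar exists alongside the SIMD backends:

/// Public entry point: pick the best available backend at runtime.
pub fn q4k_gemv(
    output: &mut [f32],
    weights: &[BlockQ4K],
    input: &[BlockQ8K],
    m: usize,
    k: usize,
) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // q4k_gemv_blocked drives the fused AVX2+FMA kernel
            return q4k_gemv_blocked(output, weights, input, m, k);
        }
    }

    // Portable fallback (assumed; same signature, scalar inner loop)
    q4k_gemv_scalar(output, weights, input, m, k);
}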

4. Implementation Plan

4.1 File Structure

trueno/src/
├── quantize/
│   ├── mod.rs           # Module exports
│   ├── formats.rs       # Q4_K, Q5_K, Q6_K, Q8_K structs
│   ├── fused_avx2.rs    # AVX2 fused kernels
│   ├── fused_avx512.rs  # AVX-512 fused kernels
│   ├── fused_neon.rs    # ARM NEON fused kernels
│   └── blocked_gemv.rs  # Cache-blocked GEMV
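
A sketch of how quantize/mod.rs could wire these modules together (the cfg gates and the avx512 cargo feature name are assumptions of this sketch):

// trueno/src/quantize/mod.rs
mod blocked_gemv;
mod formats;

#[cfg(target_arch = "x86_64")]
mod fused_avx2;
#[cfg(all(target_arch = "x86_64", feature = "avx512"))]
mod fused_avx512;
#[cfg(target_arch = "aarch64")]
mod fused_neon;

pub use blocked_gemv::{q4k_gemm, q4k_gemv};
pub use formats::{
    quantize_f32_to_q4k, quantize_f32_to_q8k,
    BlockQ4K, BlockQ5K, BlockQ6K, BlockQ8K,
};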

4.2 API Design

// trueno/src/lib.rs
pub mod quantize;

// Public API
pub use quantize::{
    BlockQ4K, BlockQ5K, BlockQ6K, BlockQ8K,
    q4k_gemv, q4k_gemm,
    quantize_f32_to_q4k, quantize_f32_to_q8k,
};

4.3 Integration with realizar

// realizar/src/apr_transformer.rs
use trueno::quantize::{q4k_gemv, quantize_f32_to_q8k, BlockQ8K};

fn forward_ffn(&mut self, hidden: &[f32]) {
    // Quantize the input activations to Q8_K blocks
    let input_q8: Vec<BlockQ8K> = quantize_f32_to_q8k(hidden);

    // Fused Q4K × Q8K GEMV (no intermediate f32 weights)
    q4k_gemv(
        &mut self.up_out,
        &self.up_weights_q4k,
        &input_q8,
        self.intermediate_dim,
        self.hidden_dim,
    );

    // ... rest of FFN
}
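
quantize_f32_to_q8k is the only new per-token cost the fused path introduces. A minimal per-block sketch under the BlockQ8K layout assumed in section 3.1 (the full function would chunk hidden into 256-element blocks and pad the tail):

/// Quantize 256 f32 activations into one Q8_K block using symmetric absmax scaling.
pub fn quantize_f32_to_q8k_block(src: &[f32; 256]) -> BlockQ8K {
    // Scale so the largest magnitude maps to ±127
    let amax = src.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let d = if amax > 0.0 { amax / 127.0 } else { 1.0 };
    let inv_d = 1.0 / d;

    let mut qs = [0i8; 256];
    let mut bsums = [0i16; 8];
    for (i, &v) in src.iter().enumerate() {
        let q = (v * inv_d).round().clamp(-127.0, 127.0) as i8;
        qs[i] = q;
        bsums[i / 32] += q as i16; // per-sub-block sums feed the kernel's min correction
    }

    BlockQ8K { d, qs, bsums }
}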

5. Falsifiable Hypotheses

H1: Fused Kernel Throughput

Claim: Fused Q4K×Q8K dot product achieves ≥2x throughput vs separate dequant+dot.

Falsification: Benchmark 10M dot products. If fused < 1.5x separate, hypothesis falsified.

Prediction: fused_throughput / separate_throughput ≥ 2.0

H2: Memory Bandwidth Reduction

Claim: Fused approach reduces memory traffic by ≥80% for Q4K matmul.

Falsification: Profile with perf stat. If LLC-load-misses reduced <50%, hypothesis falsified.

Prediction: memory_traffic_fused / memory_traffic_separate ≤ 0.2

H3: End-to-End Inference Speedup

Claim: APR with fused kernels achieves ≥2x Ollama throughput on TinyLlama-1.1B.

Falsification: Benchmark 100 tokens. If throughput < 1.5x Ollama, hypothesis falsified.

Prediction: apr_fused_throughput ≥ 2.0 × ollama_throughput

H4: Numerical Accuracy

Claim: Fused kernel produces results within 1e-3 relative error of f32 reference.

Falsification: Compare 1000 random dot products. If max_rel_error > 1e-3, falsified.

Prediction: max_relative_error < 1e-3


6. Benchmark Targets

6.1 Micro-Benchmarks

| Kernel | Current (ns) | Target (ns) | Speedup |
|---|---|---|---|
| Q4K dequant (256 elem) | 180 | N/A | - |
| f32 dot (256 elem) | 45 | N/A | - |
| Separate (dequant+dot) | 225 | N/A | baseline |
| Fused Q4K×Q8K | N/A | <100 | >2.2x |

6.2 End-to-End Targets

| Model | Format | Current | Target | vs Ollama |
|---|---|---|---|---|
| TinyLlama-1.1B | APR Q4_K | 15 tok/s | ≥42 tok/s | ≥2x |
| Qwen2.5-0.5B | APR Q4_K | 21 tok/s | ≥60 tok/s | ≥2x |
| Phi-2 2.7B | APR Q4_K | 7 tok/s | ≥20 tok/s | ≥2x |

7. llamafile Reference Analysis

7.1 Key Techniques from llamafile sgemm

  1. Matrix Repacking: Transpose and repack B matrix for sequential access
  2. 4×1 Micro-kernel: Process 4 output rows simultaneously (see the sketch after this list)
  3. L2 Cache Blocking: 64×64 blocks fit in L2 (256KB)
  4. Fused Dequant: Q4 nibble extraction inline with FMA

7.2 Adaptation for trueno

| llamafile | trueno Adaptation |
|---|---|
| C++ with inline asm | Rust with std::arch intrinsics |
| Fixed block sizes | Configurable via BlockConfig |
| OpenMP parallelism | Rayon with with_min_len() |
| Platform-specific files | Feature-gated backends |
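
The BlockConfig mentioned above is not yet defined anywhere; one possible shape, with defaults taken from the blocked GEMV in section 3.3 (field names are assumptions):

/// Tuning knobs for the blocked quantized GEMV/GEMM paths.
#[derive(Clone, Copy, Debug)]
pub struct BlockConfig {
    pub block_m: usize,           // output rows per L2 chunk
    pub block_k: usize,           // K elements per chunk (multiple of 256)
    pub min_rows_per_task: usize, // forwarded to Rayon's with_min_len()
}

impl Default for BlockConfig {
    fn default() -> Self {
        // block_m / min_rows_per_task match section 3.3; block_k targets the GEMM path
        Self { block_m: 64, block_k: 4096, min_rows_per_task: 16 }
    }
}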

8. Testing Strategy

8.1 Unit Tests

#[test]
fn test_fused_q4k_q8k_dot_correctness() {
    if !(is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma")) {
        return; // skip on CPUs without the required features
    }

    let q4 = random_q4k_block();
    let q8 = random_q8k_block();

    // SAFETY: AVX2 + FMA availability checked above
    let fused_result = unsafe { fused_q4k_q8k_dot_avx2(&q4, &q8) };
    let reference = reference_q4k_q8k_dot(&q4, &q8);

    assert!((fused_result - reference).abs() < 1e-3 * reference.abs());
}

#[test]
fn test_fused_kernel_speedup() {
    let (q4_blocks, q8_blocks) = setup_benchmark_data();

    let separate_time = bench(|| separate_dequant_dot(&q4_blocks, &q8_blocks));
    let fused_time = bench(|| fused_q4k_q8k_dot(&q4_blocks, &q8_blocks));

    assert!(separate_time / fused_time >= 1.5, "Fused must be ≥1.5x faster");
}
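
Both tests compare against reference_q4k_q8k_dot. A minimal scalar sketch, consistent with the block layout in section 3.1 and the unpack_q4k_scales helper from section 3.2:

/// Scalar reference: dequantize both blocks element by element, then dot them.
fn reference_q4k_q8k_dot(q4: &BlockQ4K, q8: &BlockQ8K) -> f32 {
    let d = f16_to_f32(q4.d);
    let dmin = f16_to_f32(q4.dmin);
    let (scales, mins) = unpack_q4k_scales(&q4.scales);

    let mut sum = 0.0f32;
    for i in 0..256 {
        // Each 64-element group shares 32 qs bytes: low nibbles first, high nibbles second
        let byte = q4.qs[(i / 64) * 32 + (i % 32)];
        let nibble = if (i % 64) < 32 { byte & 0x0F } else { byte >> 4 };

        let sb = i / 32; // 32-element sub-block index
        let w = d * scales[sb] as f32 * nibble as f32 - dmin * mins[sb] as f32;
        let x = q8.d * q8.qs[i] as f32;
        sum += w * x;
    }
    sum
}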

8.2 Integration Tests

#[test]
fn test_apr_inference_with_fused_kernels() {
    let model = load_apr_model("tinyllama-1.1b-q4k.apr");
    let input = "Hello, world!";

    let (output, throughput) = benchmark_inference(&model, input, 50);

    assert!(throughput >= 30.0, "Must achieve ≥30 tok/s");
    assert!(output.contains_coherent_text());
}

9. Revision History

| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2025-01-15 | Initial specification |

10. References

[1] Goto, K., & Van Geijn, R. A. (2008). "Anatomy of High-Performance Matrix Multiplication." ACM TOMS.
[2] Intel Corporation. (2024). "Intel 64 and IA-32 Architectures Optimization Reference Manual."
[3] Dettmers, T., et al. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS.
[4] llamafile sgemm: https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/sgemm.cpp
[5] Trueno Phase 2 Micro-Kernel: phase2-microkernel.md


Specification for Trueno Phase 15 (2025-01-15)
Zero excuses. Zero defects. APR IS THE FORMAT.