Phase 15: Fused Q4K Quantized GEMV
Version: 1.0.0
Status: SPECIFICATION
Date: 2025-01-15
Target: 2x Ollama CPU throughput for APR format
Scope: Fused dequantization + dot product for Q4_K quantized inference
Executive Summary
This specification defines the implementation of fused Q4K dequant+dot kernels for trueno, enabling APR format to achieve performance parity with GGUF/llama.cpp. The current APR CPU path achieves ~15 tok/s vs llama.cpp's ~42 tok/s on TinyLlama-1.1B Q4_0.
Root Cause: APR's current path separates dequantization from GEMV, causing:
- Extra memory traffic (dequant to temp buffer, then dot)
- Cache pollution from intermediate f32 expansion
- Missed SIMD fusion opportunities
Solution: Fused Q4K kernels that dequantize directly into SIMD registers during dot product computation.
1. Problem Statement
1.1 Current Architecture Gap
| Implementation | Q4K Approach | Throughput (1.1B) |
|---|---|---|
| llama.cpp | Fused Q4×Q8 SIMD (llamafile) | ~42 tok/s |
| candle | Inline AVX2 dequant+dot | ~10 tok/s |
| APR (current) | Separate dequant → f32 dot | ~15 tok/s |
| APR (target) | Fused Q4×Q8 SIMD | ≥42 tok/s |
1.2 Memory Bandwidth Analysis
For a 4096×4096 Q4_K matmul (single token decode):
| Approach | Memory Read | Memory Write | Efficiency |
|---|---|---|---|
| Separate dequant | 8 MB (Q4) + 64 MB (f32 temp) | 64 MB | 17% |
| Fused Q4×Q8 | 8 MB (Q4) + ~4 KB (Q8 input) | 0 | 100% |
Key Insight: The fused approach eliminates roughly 128 MB of unnecessary memory traffic per matmul.
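A quick sanity check on that figure, assuming the 144-byte Q4_K block from Section 3.1 and the conventional 292-byte Q8_K block (4-byte scale + 256 values + 16 group sums); this is a back-of-envelope sketch, not a measurement:
// Back-of-envelope traffic for one 4096×4096 Q4_K GEMV (single-token decode).
fn main() {
    let (m, k) = (4096usize, 4096usize);
    let q4_weights = m * k / 256 * 144; // 144 B per 256-element Q4_K block ≈ 9 MiB
    let f32_temp = m * k * 4;           // dequantized weight buffer = 64 MiB
    let q8_input = k / 256 * 292;       // Q8_K input blocks ≈ 4.6 KiB
    // Separate path: read Q4, write the f32 temp, then read it back for the dot product.
    let separate = q4_weights + 2 * f32_temp;
    // Fused path: stream the Q4 weights once; the quantized input stays cache-resident.
    let fused = q4_weights + q8_input;
    let mib = |b: usize| b as f64 / (1 << 20) as f64;
    println!("separate ≈ {:.0} MiB", mib(separate));     // ≈ 137
    println!("fused    ≈ {:.0} MiB", mib(fused));        // ≈ 9
    println!("saved    ≈ {:.0} MiB", mib(separate - fused)); // ≈ 128
}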
1.3 Scope
IN SCOPE:
- Fused Q4_K × Q8_K SIMD dot product (AVX2, AVX-512, NEON)
- Cache-blocked quantized GEMV
- Integration with realizar APR transformer
- Benchmarks against llama.cpp baseline
OUT OF SCOPE:
- GPU quantized kernels (see trueno-gpu Phase 16)
- Training quantization
- New quantization formats
2. Root Cause Analysis (5 Whys)
Why #1: Why is APR 2.8x slower than llama.cpp on CPU?
Answer: Each token generation performs ~200 matmuls, and APR's matmul is 2.8x slower.
Why #2: Why is APR matmul 2.8x slower?
Answer: APR dequantizes Q4_K weights to f32 before GEMV, while llama.cpp keeps data quantized.
Why #3: Why does dequantization hurt performance?
Answer: Dequantizing 4096×4096 Q4_K (8 MB) produces 64 MB of f32, exceeding L3 cache.
Why #4: Why not keep data quantized during GEMV?
Answer: trueno's current SIMD kernels only support f32 dot products, not Q4×Q8.
Why #5: Why weren't fused Q4×Q8 kernels implemented?
Answer: Phase 2 focused on f32 matmul parity with NumPy. This is the root cause.
3. Solution Architecture
3.1 Q4_K Block Format
Q4_K quantization uses 256-element super-blocks, each split into 8 sub-blocks of 32 elements:
Block layout (256 elements = 8 sub-blocks × 32 elements):
- scales: [u8; 12] (12 bytes) - packed 6-bit scales and minimums, one pair per sub-block
- d: f16 (2 bytes) - super-block scale
- dmin: f16 (2 bytes) - super-block scale for the sub-block minimums
- qs: [u8; 128] (128 bytes) - packed 4-bit quantized values, two per byte
Total: 144 bytes per 256 elements = 4.5 bits/element
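A minimal Rust sketch of these blocks as they might appear in trueno/src/quantize/formats.rs. The names match the Section 4.2 API; the Q8_K layout with per-group bsums is an assumption borrowed from the common GGML convention, and the half crate is assumed for f16 storage:
use half::f16; // assumed dependency for f16 scale storage

/// One Q4_K block: 256 weights in 144 bytes (4.5 bits/element).
#[repr(C)]
pub struct BlockQ4K {
    pub scales: [u8; 12], // packed 6-bit scales/mins for the 8 sub-blocks
    pub d: f16,           // super-block scale
    pub dmin: f16,        // super-block scale for the sub-block minimums
    pub qs: [u8; 128],    // 256 × 4-bit quantized values, two per byte
}

/// One Q8_K block: 256 quantized activations plus per-group sums.
#[repr(C)]
pub struct BlockQ8K {
    pub d: f32,           // block-wide scale
    pub qs: [i8; 256],    // quantized values
    pub bsums: [i16; 16], // sums of each 16-value group (for the dmin correction)
}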
3.2 Fused Q4K×Q8K Kernel Design
The fused kernel computes dot(Q4_K_weights, Q8_K_input) without intermediate dequantization:
use core::arch::x86_64::*;

/// Fused Q4K × Q8K dot product (AVX2)
///
/// Computes sum(dequant(q4) * dequant(q8)) directly in SIMD registers.
/// Layout assumption: within each 32-element sub-block, byte j of qs packs
/// element j in its low nibble and element j+16 in its high nibble.
/// The Q4_K minimum correction term (-dmin * min * sum(q8), computed from the
/// Q8_K bsums) is omitted here for brevity.
#[target_feature(enable = "avx2,fma")]
unsafe fn fused_q4k_q8k_dot_avx2(
    q4_block: &BlockQ4K, // 256 quantized weights
    q8_block: &BlockQ8K, // 256 quantized inputs
) -> f32 {
    // Step 1: Load block-wide scales (stay in scalar registers)
    let d = f16_to_f32(q4_block.d);
    let d8 = q8_block.d; // Q8_K per-block scale
    // Step 2: Process 32 elements per iteration (one sub-block)
    let mut acc = _mm256_setzero_ps();
    for sb in 0..8 { // 8 sub-blocks × 32 elements = 256
        // Load 16 packed bytes = 32 Q4 nibbles for this sub-block
        let q4_packed = _mm_loadu_si128(q4_block.qs.as_ptr().add(sb * 16) as *const __m128i);
        // Unpack nibbles to bytes: low nibbles are elements 0..16, high nibbles 16..32
        let q4_lo = _mm_and_si128(q4_packed, _mm_set1_epi8(0x0F));
        let q4_hi = _mm_and_si128(_mm_srli_epi16::<4>(q4_packed), _mm_set1_epi8(0x0F));
        let q4_bytes = _mm256_set_m128i(q4_hi, q4_lo); // 32 × u8 in [0, 15]
        // Load 32 Q8 values (32 bytes of int8)
        let q8_bytes = _mm256_loadu_si256(q8_block.qs.as_ptr().add(sb * 32) as *const __m256i);
        // Integer multiply-add: u8 × i8 → pairwise i16 sums (|15·127·2| fits in i16)
        let prod_i16 = _mm256_maddubs_epi16(q4_bytes, q8_bytes);
        // Widen the i16 pair sums to i32 lanes
        let prod_i32 = _mm256_madd_epi16(prod_i16, _mm256_set1_epi16(1));
        // Accumulate with the combined scale for this sub-block
        let scale = get_scale(&q4_block.scales, sb); // decoded 6-bit sub-block scale
        acc = _mm256_fmadd_ps(
            _mm256_cvtepi32_ps(prod_i32),
            _mm256_set1_ps(d * d8 * scale),
            acc,
        );
    }
    // Horizontal sum of the 8 f32 lanes
    horizontal_sum_avx2(acc)
}
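Section 8 compares against reference_q4k_q8k_dot. A scalar sketch under the same layout assumptions as the kernel above (same f16_to_f32 and get_scale helpers, same omitted dmin correction), accumulating in f64 to minimize rounding error:
/// Scalar reference for the fused kernel, used to validate SIMD results.
fn reference_q4k_q8k_dot(q4_block: &BlockQ4K, q8_block: &BlockQ8K) -> f32 {
    let d = f16_to_f32(q4_block.d);
    let d8 = q8_block.d;
    let mut sum = 0.0f64;
    for sb in 0..8 {
        let scale = get_scale(&q4_block.scales, sb);
        for j in 0..16 {
            let byte = q4_block.qs[sb * 16 + j];
            let lo = (byte & 0x0F) as i32; // element sb*32 + j
            let hi = (byte >> 4) as i32;   // element sb*32 + j + 16
            let q8_lo = q8_block.qs[sb * 32 + j] as i32;
            let q8_hi = q8_block.qs[sb * 32 + j + 16] as i32;
            sum += (d * d8 * scale) as f64 * (lo * q8_lo + hi * q8_hi) as f64;
        }
    }
    sum as f32
}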
3.3 Cache-Blocked GEMV
For large matrices, apply L2 cache blocking:
use rayon::prelude::*;

/// Cache-blocked Q4K GEMV
///
/// y[M] = A[M×K] × x[K] where A is Q4_K quantized and x is Q8_K quantized.
/// Caller must ensure AVX2+FMA are available (see the runtime dispatch in Section 7.2).
pub fn q4k_gemv_blocked(
    output: &mut [f32],   // M outputs
    weights: &[BlockQ4K], // M × (K/256) blocks, row-major
    input: &[BlockQ8K],   // K/256 blocks (quantized input vector)
    m: usize,
    k: usize,
) {
    const BLOCK_M: usize = 64; // rows per L2 block of weights
    let k_blocks = k / 256;    // the Q8_K input (~4.6 KB at K=4096) stays cache-resident
    // Process output rows in L2-friendly chunks of the weight matrix
    for m_start in (0..m).step_by(BLOCK_M) {
        let m_end = (m_start + BLOCK_M).min(m);
        // Parallel over output rows within the chunk
        output[m_start..m_end]
            .par_iter_mut()
            .with_min_len(16) // avoid Rayon task overhead for short strips
            .enumerate()
            .for_each(|(i, out)| {
                let row = m_start + i;
                let mut sum = 0.0f32;
                // Walk the K dimension one 256-element block at a time
                for kb in 0..k_blocks {
                    sum += unsafe {
                        fused_q4k_q8k_dot_avx2(&weights[row * k_blocks + kb], &input[kb])
                    };
                }
                *out = sum;
            });
    }
}
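A usage sketch from a hypothetical caller; the quantize_f32_to_q8k helper is the one exported in Section 4.2, and the weight layout is the row-major block order assumed above:
// Hypothetical wrapper: one decode-step projection using the blocked GEMV.
// `weights` holds m * (k / 256) row-major BlockQ4K values.
fn project(weights: &[BlockQ4K], hidden: &[f32], m: usize, k: usize) -> Vec<f32> {
    let input_q8 = quantize_f32_to_q8k(hidden); // k / 256 BlockQ8K blocks
    let mut output = vec![0.0f32; m];
    q4k_gemv_blocked(&mut output, weights, &input_q8, m, k);
    output
}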
4. Implementation Plan
4.1 File Structure
trueno/src/
├── quantize/
│ ├── mod.rs # Module exports
│ ├── formats.rs # Q4_K, Q5_K, Q6_K, Q8_K structs
│ ├── fused_avx2.rs # AVX2 fused kernels
│ ├── fused_avx512.rs # AVX-512 fused kernels
│ ├── fused_neon.rs # ARM NEON fused kernels
│ └── blocked_gemv.rs # Cache-blocked GEMV
4.2 API Design
// trueno/src/lib.rs
pub mod quantize;
// Public API
pub use quantize::{
BlockQ4K, BlockQ5K, BlockQ6K, BlockQ8K,
q4k_gemv, q4k_gemm,
quantize_f32_to_q4k, quantize_f32_to_q8k,
};
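Of these exports, quantize_f32_to_q8k is the quantizer the decode hot path calls every step. A simplified scalar sketch follows; the production version would be SIMD, and the block layout matches the assumed BlockQ8K in Section 3.1:
/// Quantize f32 activations into Q8_K blocks (scalar sketch).
/// `input.len()` must be a multiple of 256.
pub fn quantize_f32_to_q8k(input: &[f32]) -> Vec<BlockQ8K> {
    debug_assert_eq!(input.len() % 256, 0);
    input
        .chunks_exact(256)
        .map(|chunk| {
            // Scale so the largest magnitude maps to ±127.
            let amax = chunk.iter().fold(0.0f32, |a, &x| a.max(x.abs()));
            let d = if amax > 0.0 { amax / 127.0 } else { 0.0 };
            let inv_d = if d > 0.0 { 1.0 / d } else { 0.0 };
            let mut qs = [0i8; 256];
            for (q, &x) in qs.iter_mut().zip(chunk) {
                *q = (x * inv_d).round().clamp(-127.0, 127.0) as i8;
            }
            // Per-group sums used by the dmin correction term.
            let mut bsums = [0i16; 16];
            for g in 0..16 {
                bsums[g] = qs[g * 16..(g + 1) * 16].iter().map(|&v| v as i16).sum();
            }
            BlockQ8K { d, qs, bsums }
        })
        .collect()
}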
4.3 Integration with realizar
// realizar/src/apr_transformer.rs
use trueno::quantize::{q4k_gemv, quantize_f32_to_q8k, BlockQ4K, BlockQ8K};
fn forward_ffn(&mut self, hidden: &mut [f32]) {
    // Quantize the activation vector to Q8_K once per projection
    let input_q8 = quantize_f32_to_q8k(hidden);
    // Fused Q4_K × Q8_K GEMV (no intermediate f32 weight buffer)
    q4k_gemv(
        &mut self.up_out,
        &self.up_weights_q4k,
        &input_q8,
        self.intermediate_dim,
        self.hidden_dim,
    );
    // ... rest of FFN
}
5. Falsifiable Hypotheses
H1: Fused Kernel Throughput
Claim: Fused Q4K×Q8K dot product achieves ≥2x throughput vs separate dequant+dot.
Falsification: Benchmark 10M dot products. If fused < 1.5x separate, hypothesis falsified.
Prediction: fused_throughput / separate_throughput ≥ 2.0
H2: Memory Bandwidth Reduction
Claim: Fused approach reduces memory traffic by ≥80% for Q4K matmul.
Falsification: Profile with perf stat. If LLC-load-misses reduced <50%, hypothesis falsified.
Prediction: memory_traffic_fused / memory_traffic_separate ≤ 0.2
H3: End-to-End Inference Speedup
Claim: APR with fused kernels achieves ≥2x Ollama throughput on TinyLlama-1.1B.
Falsification: Benchmark 100 tokens. If throughput < 1.5x Ollama, hypothesis falsified.
Prediction: apr_fused_throughput ≥ 2.0 × ollama_throughput
H4: Numerical Accuracy
Claim: Fused kernel produces results within 1e-3 relative error of f32 reference.
Falsification: Compare 1000 random dot products. If max_rel_error > 1e-3, falsified.
Prediction: max_relative_error < 1e-3
6. Benchmark Targets
6.1 Micro-Benchmarks
| Kernel | Current (ns) | Target (ns) | Speedup |
|---|---|---|---|
| Q4K dequant (256 elem) | 180 | N/A | - |
| f32 dot (256 elem) | 45 | N/A | - |
| Separate (dequant+dot) | 225 | N/A | baseline |
| Fused Q4K×Q8K | N/A | <100 | >2.2x |
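These numbers could be collected with, for example, a criterion micro-benchmark along the following lines; random_q4k_block and random_q8k_block are the hypothetical test helpers sketched in Section 8:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_fused_dot(c: &mut Criterion) {
    // One 256-element block per side; helpers are the test sketches from Section 8.
    let q4 = random_q4k_block();
    let q8 = random_q8k_block();
    c.bench_function("fused_q4k_q8k_dot_256", |b| {
        b.iter(|| unsafe { fused_q4k_q8k_dot_avx2(black_box(&q4), black_box(&q8)) })
    });
}

criterion_group!(benches, bench_fused_dot);
criterion_main!(benches);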
6.2 End-to-End Targets
| Model | Format | Current | Target | vs Ollama |
|---|---|---|---|---|
| TinyLlama-1.1B | APR Q4_K | 15 tok/s | ≥42 tok/s | ≥2x |
| Qwen2.5-0.5B | APR Q4_K | 21 tok/s | ≥60 tok/s | ≥2x |
| Phi-2 2.7B | APR Q4_K | 7 tok/s | ≥20 tok/s | ≥2x |
7. llamafile Reference Analysis
7.1 Key Techniques from llamafile sgemm
- Matrix Repacking: Transpose and repack B matrix for sequential access
- 4×1 Micro-kernel: Process 4 output rows simultaneously
- L2 Cache Blocking: 64×64 blocks fit in L2 (256KB)
- Fused Dequant: Q4 nibble extraction inline with FMA
7.2 Adaptation for trueno
| llamafile | trueno Adaptation |
|---|---|
| C++ with inline asm | Rust with std::arch intrinsics |
| Fixed block sizes | Configurable via BlockConfig |
| OpenMP parallelism | Rayon with with_min_len() |
| Platform-specific files | Feature-gated backends |
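Feature-gated backends imply a runtime dispatcher on x86_64. A sketch, assuming the AVX-512 variant and the scalar reference from Section 3.2 exist per the Section 4.1 file layout:
/// Select the best available fused dot kernel at runtime (x86_64 sketch).
pub fn fused_q4k_q8k_dot(q4: &BlockQ4K, q8: &BlockQ8K) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") {
            // Assumed to exist in fused_avx512.rs.
            return unsafe { fused_q4k_q8k_dot_avx512(q4, q8) };
        }
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            return unsafe { fused_q4k_q8k_dot_avx2(q4, q8) };
        }
    }
    // Portable fallback: same results, no SIMD.
    reference_q4k_q8k_dot(q4, q8)
}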
8. Testing Strategy
8.1 Unit Tests
#[test]
fn test_fused_q4k_q8k_dot_correctness() {
let q4 = random_q4k_block();
let q8 = random_q8k_block();
let fused_result = fused_q4k_q8k_dot_avx2(&q4, &q8);
let reference = reference_q4k_q8k_dot(&q4, &q8);
assert!((fused_result - reference).abs() < 1e-3 * reference.abs());
}
#[test]
fn test_fused_kernel_speedup() {
let (q4_blocks, q8_blocks) = setup_benchmark_data();
let separate_time = bench(|| separate_dequant_dot(&q4_blocks, &q8_blocks));
let fused_time = bench(|| fused_q4k_q8k_dot(&q4_blocks, &q8_blocks));
assert!(separate_time / fused_time >= 1.5, "Fused must be ≥1.5x faster");
}
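The helpers random_q4k_block and random_q8k_block are not defined yet; a minimal sketch using the rand crate, consistent with the assumed block structs from Section 3.1, might be:
use rand::Rng;

/// Fill one Q4_K block with random but structurally valid data (test helper sketch).
fn random_q4k_block() -> BlockQ4K {
    let mut rng = rand::thread_rng();
    let mut qs = [0u8; 128];
    rng.fill(&mut qs[..]);
    let mut scales = [0u8; 12];
    rng.fill(&mut scales[..]);
    BlockQ4K {
        scales,
        d: half::f16::from_f32(rng.gen_range(0.001f32..0.1)),
        dmin: half::f16::from_f32(rng.gen_range(0.0f32..0.01)),
        qs,
    }
}

/// Random Q8_K block with bsums kept consistent with qs (test helper sketch).
fn random_q8k_block() -> BlockQ8K {
    let mut rng = rand::thread_rng();
    let mut qs = [0i8; 256];
    for q in qs.iter_mut() {
        *q = rng.gen_range(-127i8..=127);
    }
    let mut bsums = [0i16; 16];
    for g in 0..16 {
        bsums[g] = qs[g * 16..(g + 1) * 16].iter().map(|&v| v as i16).sum();
    }
    BlockQ8K { d: rng.gen_range(0.001f32..0.1), qs, bsums }
}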
8.2 Integration Tests
#[test]
fn test_apr_inference_with_fused_kernels() {
let model = load_apr_model("tinyllama-1.1b-q4k.apr");
let input = "Hello, world!";
let (output, throughput) = benchmark_inference(&model, input, 50);
assert!(throughput >= 30.0, "Must achieve ≥30 tok/s");
assert!(output.contains_coherent_text());
}
9. Revision History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2025-01-15 | Initial specification |
10. References
[1] Goto, K., & Van Geijn, R. A. (2008). "Anatomy of High-Performance Matrix Multiplication." ACM TOMS.
[2] Intel Corporation. (2024). "Intel 64 and IA-32 Architectures Optimization Reference Manual."
[3] Dettmers, T., et al. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS.
[4] llamafile sgemm: https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/sgemm.cpp
[5] Trueno Phase 2 Micro-Kernel: phase2-microkernel.md
Specification for Trueno Phase 15 (2025-01-15) Zero excuses. Zero defects. APR IS THE FORMAT.