M286 — M32d MoE KV cache shipped
Date: 2026-05-20
aprender PR: #1832
What it shipped: forward_single_qwen3_moe_with_cache — a per-token cache-aware MoE forward path for the qwen3_moe architecture.
Why it was necessary
The original qwen3_moe inference path in apr serve was per-full-prompt: every new token required re-processing the entire context from scratch. For a 1024-token max-tokens cap on a 7-turn conversation (~3000 prompt tokens accumulated), this meant O(n²) work per turn.
Empirically: a single 20-turn fixture on Qwen3-Coder-30B-A3B at this regime took ~34min per turn on CPU. The M286 cache implementation cut it to ~6min per fixture (across all turns) — a 19× speedup.
What it changed structurally
old: prompt → embed → 48× (attention + MoE FFN) → LM head → next_token
(re-runs entire context every token)
new: if first_token:
prompt → embed → 48× (attention with cache.append + MoE FFN) → LM head → next_token
else:
last_token_embed → 48× (attention with cache.get_k/get_v GQA + MoE FFN) → LM head → next_token
(only the new token is processed; cache provides past K/V)
The implementation lives in crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs. The single_cache_final_output helper (final norm + LM head) was bumped from private to pub(crate) to allow the MoE module to share it with the dense path.
Falsifiers shipped with it
qwen3-moe-serve-dispatch-v1(V1_001) → ACTIVE_RUNTIMEmoe_kv_cache_equivalence— numerical-equivalence test: cache-on vs cache-off forward passes produce identical logits modulo F32 precisionm32d_perf— ≥5 tok/s floor under CPU compute mode
Why this was the unlock
Without M286, V1_004 was a memory-cost problem (the test couldn't be run within reasonable wall-time on the operator's GPU/CPU budget). With M286, the wall-cost dropped 19×, enabling the empirical chain that followed (M287, M291, M294).
M286 is the load-bearing inference infrastructure for every Phase 6 dispatch.