M286 — M32d MoE KV cache shipped

Date: 2026-05-20

aprender PR: #1832

What it shipped: forward_single_qwen3_moe_with_cache — a per-token cache-aware MoE forward path for the qwen3_moe architecture.

Why it was necessary

The original qwen3_moe inference path in apr serve was per-full-prompt: every new token required re-processing the entire context from scratch. For a 1024-token max-tokens cap on a 7-turn conversation (~3000 prompt tokens accumulated), this meant O(n²) work per turn.

Empirically: a single 20-turn fixture on Qwen3-Coder-30B-A3B at this regime took ~34min per turn on CPU. The M286 cache implementation cut it to ~6min per fixture (across all turns) — a 19× speedup.

What it changed structurally

old:  prompt → embed → 48× (attention + MoE FFN) → LM head → next_token
       (re-runs entire context every token)

new:  if first_token:
        prompt → embed → 48× (attention with cache.append + MoE FFN) → LM head → next_token
      else:
        last_token_embed → 48× (attention with cache.get_k/get_v GQA + MoE FFN) → LM head → next_token
       (only the new token is processed; cache provides past K/V)

The implementation lives in crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs. The single_cache_final_output helper (final norm + LM head) was bumped from private to pub(crate) to allow the MoE module to share it with the dense path.

Falsifiers shipped with it

qwen3-moe-serve-dispatch-v1 (V1_001) → ACTIVE_RUNTIME
moe_kv_cache_equivalence — numerical-equivalence test: cache-on vs cache-off forward passes produce identical logits modulo F32 precision
m32d_perf — ≥5 tok/s floor under CPU compute mode

Why this was the unlock

Without M286, V1_004 was a memory-cost problem (the test couldn't be run within reasonable wall-time on the operator's GPU/CPU budget). With M286, the wall-cost dropped 19×, enabling the empirical chain that followed (M287, M291, M294).

M286 is the load-bearing inference infrastructure for every Phase 6 dispatch.

CCPA — The Claude Code Parity Harness

M286 — M32d MoE KV cache shipped

Why it was necessary

What it changed structurally

Falsifiers shipped with it

Why this was the unlock