M287 — greedy baseline pattern
Date: 2026-05-20
Bench wall: ~5hr (20 fixtures × ~15min each, with wall_seconds = 3600 per fixture)
Configuration
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 \
bash scripts/phase-6-bench.sh
Greedy decoding (no temperature, no top_k, no top_p, no repetition penalty). Apr binary post-M32d but pre-3-knob plumbing.
Result
20/20 fixtures: uniform outcome=driver_error.
- Student pass rate: 0/20 (0.00)
- Teacher pass rate: 19/20 (0.95)
- Recovery rate: 0.225
What the trace showed
Inspecting fixture 10's student.bench.json (oo__05-builder-pattern, 7 turns to driver_error):
turn 1 invocation:
"Human: I need to see the full implementation..."
Human: I need to see the full implementation...
Human: I need to see the full implementation...
Human: I need to see the full implementation...
...
The model emitted its own user-turn boundary ("Human:") repeatedly, never stopping. The text grew until the per-turn timeout (900s) fired. The driver then exited with the timeout error, which phase-6-bench.sh recorded as driver_error.
Root cause diagnosis (three independent gaps)
-
No EOS stop_token:
try_qwen3_moe_backendinapr servedidn't populateQuantizedGenerateConfig.stop_tokenswith the model's<|im_end|>EOS, so the decode loop ignored the natural turn boundary. -
No post-decode cleanup:
try_qwen3_moe_backenddidn't callclean_chat_outputto strip leaking "Human:" / "User:" /<|im_end|>prefixes — the runaway leaked into the captured chat response verbatim. -
No format adherence guidance:
CODE_SYSTEM_PROMPTdescribed the<tool_call>format but gave no concrete examples. The 30B-Coder model's training distribution favored Markdown code blocks; without explicit examples it didn't emit<tool_call>JSON.
The dense GGUF path in apr serve handled (1) and (2) correctly; the MoE chat-backend path (added later for qwen3_moe) had a gap.
What M287 unlocked
The uniform driver_error pattern made the failure mode legible. Before M287, the assumption was "Qwen3-Coder-30B can't do agentic coding"; M287's evidence sharpened it to "the runaway is a fixable infrastructure issue, not a fundamental model limit."
The three gaps motivated M288-M290's 5-PR fix burst:
- aprender#1832 — M32d KV cache (already merged)
- aprender#1837 — qwen3-moe-sampling-v1 contract
- aprender#1842 — sampling impl
- aprender#1844 — repetition penalty
- aprender#1846 — 3-knob HTTP wire-up (the operator-facing surface)
- aprender#1849 — few-shot
<tool_call>examples (Gap 3) - aprender#1852 — EOS stop_token + clean_chat_output (Gaps 1 + 2)
- aprender#1853 — clean_chat_output start-of-string leading-prefix strip (M291 follow-on)