M287 — greedy baseline pattern

Date: 2026-05-20

Bench wall: ~5hr (20 fixtures × ~15min each, with wall_seconds = 3600 per fixture)

Configuration

APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 \
bash scripts/phase-6-bench.sh

Greedy decoding (no temperature, no top_k, no top_p, no repetition penalty). Apr binary post-M32d but pre-3-knob plumbing.

Result

20/20 fixtures: uniform outcome=driver_error.

  • Student pass rate: 0/20 (0.00)
  • Teacher pass rate: 19/20 (0.95)
  • Recovery rate: 0.225

What the trace showed

Inspecting fixture 10's student.bench.json (oo__05-builder-pattern, 7 turns to driver_error):

turn 1 invocation:
  "Human: I need to see the full implementation..."
  Human: I need to see the full implementation...
  Human: I need to see the full implementation...
  Human: I need to see the full implementation...
  ...

The model emitted its own user-turn boundary ("Human:") repeatedly, never stopping. The text grew until the per-turn timeout (900s) fired. The driver then exited with the timeout error, which phase-6-bench.sh recorded as driver_error.

Root cause diagnosis (three independent gaps)

  1. No EOS stop_token: try_qwen3_moe_backend in apr serve didn't populate QuantizedGenerateConfig.stop_tokens with the model's <|im_end|> EOS, so the decode loop ignored the natural turn boundary.

  2. No post-decode cleanup: try_qwen3_moe_backend didn't call clean_chat_output to strip leaking "Human:" / "User:" / <|im_end|> prefixes — the runaway leaked into the captured chat response verbatim.

  3. No format adherence guidance: CODE_SYSTEM_PROMPT described the <tool_call> format but gave no concrete examples. The 30B-Coder model's training distribution favored Markdown code blocks; without explicit examples it didn't emit <tool_call> JSON.

The dense GGUF path in apr serve handled (1) and (2) correctly; the MoE chat-backend path (added later for qwen3_moe) had a gap.

What M287 unlocked

The uniform driver_error pattern made the failure mode legible. Before M287, the assumption was "Qwen3-Coder-30B can't do agentic coding"; M287's evidence sharpened it to "the runaway is a fixable infrastructure issue, not a fundamental model limit."

The three gaps motivated M288-M290's 5-PR fix burst:

  • aprender#1832 — M32d KV cache (already merged)
  • aprender#1837 — qwen3-moe-sampling-v1 contract
  • aprender#1842 — sampling impl
  • aprender#1844 — repetition penalty
  • aprender#1846 — 3-knob HTTP wire-up (the operator-facing surface)
  • aprender#1849 — few-shot <tool_call> examples (Gap 3)
  • aprender#1852 — EOS stop_token + clean_chat_output (Gaps 1 + 2)
  • aprender#1853 — clean_chat_output start-of-string leading-prefix strip (M291 follow-on)