The V1_004 chain
V1_004 — "Phase 6 bench non-zero student pass rate against a Qwen3-Coder-30B-A3B-Instruct GGUF" — is the open gate. The chain of work toward discharging it has produced the most empirically interesting body of findings in CCPA's history.
This chapter is the canonical record of that chain.
The chain at a glance
| M-row | Date (2026) | What it shipped |
|---|---|---|
| M280 | 05-19 | Phase 6 SUSPENSION declared (1.5B model below testability floor) |
| M286 | 05-20 | M32d MoE KV cache shipped (19× speedup on Qwen3-MoE) |
| M287 | 05-20 | Greedy baseline: uniform driver_error ("Human:" infinite loop) |
| M288 | 05-20 | Diagnosis: 3 root causes (no EOS stop_token, no clean_chat_output, no few-shot prompt) |
| M289 | 05-20 | Plumbing shipped: 3-knob HTTP wire-up (APR_AGENT_TEMPERATURE, etc.) |
| M290 | 05-20 | 5-PR snapshot: aprender#1832, #1837, #1842, #1844, #1846 all merged |
| M291 | 05-21 | sub-bench B pattern shift: driver_error → oracle_failed_after_max_turns (text-only loops, 0 tool_calls) |
| M292 | 05-21 | ArenaOutcome::AgentTextLoop detector + 7 tests (Gap 3 closure) |
| M293 | 05-21 | PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring at script level |
| M294 | 05-22 | Scope doc for non-Coder Qwen3-30B-A3B-Instruct-2507 A/B; download + smoke confirmed tool_call JSON emission |
The hypothesis-evolution narrative
Hypothesis 1 (start of chain): inference stack is the bottleneck
Premise: V1_004 can't discharge because the apr serve inference path for qwen3_moe is too slow or too broken to fit 20 turns × 1024 max_tokens within a 60-minute wall budget.
Test: ship M32d MoE KV cache (19× speedup), enable 3-knob sampling, add EOS stop_token and clean_chat_output post-strip.
Result: the M287 driver_error pattern (infinite "Human:" loop) no longer occurred. Sub-bench B on Qwen3-Coder-30B-A3B shifted from that uniform failure to a diverse outcome distribution.
Conclusion: inference stack was a necessary but not sufficient fix.
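The budget in the premise can be turned into a rough throughput floor. The sketch below is a back-of-envelope check, not a measurement: it assumes every turn runs to the full max_tokens limit and ignores prefill time and tool execution, so the real floor is higher. The function name is illustrative.

```python
# Back-of-envelope check of the Hypothesis 1 budget.
# Assumptions: every turn decodes the full max_tokens, and prefill and
# tool-execution time are ignored.
TURNS = 20
MAX_TOKENS_PER_TURN = 1024
WALL_BUDGET_S = 60 * 60
SPEEDUP = 19  # the M32d MoE KV cache figure quoted in the chain table


def min_decode_tok_per_s(turns: int, max_tokens: int, budget_s: int) -> float:
    """Lowest sustained decode rate that fits the worst case in the budget."""
    return turns * max_tokens / budget_s


floor = min_decode_tok_per_s(TURNS, MAX_TOKENS_PER_TURN, WALL_BUDGET_S)
print(f"worst-case tokens: {TURNS * MAX_TOKENS_PER_TURN}")   # 20480
print(f"required decode rate: {floor:.2f} tok/s")            # 5.69 tok/s
# A pre-speedup rate below this would still miss the budget after a 19x gain.
print(f"pre-speedup floor: {floor / SPEEDUP:.2f} tok/s")     # 0.30 tok/s
```

Under these assumptions the stack needs to sustain roughly 5.7 tok/s of decode to fit a worst-case fixture in the wall budget.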
Hypothesis 2 (M291): few-shot prompt is the bottleneck
Premise: the model is now finite-output (M287 runaway broken), but it emits Markdown rust blocks instead of <tool_call> JSON. Adding 3 concrete <tool_call> few-shot examples in CODE_SYSTEM_PROMPT (#1849) should override the Markdown prior.
Test: sub-bench B with #1849's few-shot prompt + 3-knob sampling + EOS + clean_chat_output.
Result: fixture 1 of sub-bench B → oracle_failed_after_max_turns turns=20, ALL 20 turns text-only, tool_use_count: 0. The prompt fix didn't shift behavior.
Conclusion: refuted. Few-shot examples didn't override the model's training distribution.
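The all-text-turns result is the pattern the M292 detector and the M293 env var target. The following is a minimal Python sketch of that kind of check, not the shipped ArenaOutcome::AgentTextLoop logic: the Turn type, the function names, and the default threshold of 5 are assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass
class Turn:
    text: str
    tool_calls: int  # number of tool calls parsed from this turn


def max_consecutive_text_turns(default: int = 5) -> int:
    """Read the threshold from the environment; the default is an assumption."""
    raw = os.environ.get("PHASE6_MAX_CONSECUTIVE_TEXT_TURNS", "")
    return int(raw) if raw.isdigit() and int(raw) > 0 else default


def is_agent_text_loop(turns: list, limit: int) -> bool:
    """True once `limit` turns in a row contain no tool call."""
    streak = 0
    for turn in turns:
        # Any tool call resets the streak; a text-only turn extends it.
        streak = 0 if turn.tool_calls > 0 else streak + 1
        if streak >= limit:
            return True
    return False


# Fixture-1-shaped input: 20 turns, all text-only.
fixture_1 = [Turn(text="(Markdown rust block)", tool_calls=0) for _ in range(20)]
print(is_agent_text_loop(fixture_1, limit=max_consecutive_text_turns()))
```

With a check like this, a run shaped like fixture 1 could be classified after a handful of turns instead of consuming all 20.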
Hypothesis 3 (M291): active-params count is the bottleneck
Premise: Qwen3-Coder-30B-A3B is 30B-total / 3B-active (MoE routing). Maybe 3B active params is below the agentic-code floor. A dense 7B (Qwen2.5-Coder-7B-Instruct), with roughly 2.3× the active params (7B vs 3B), should fare better.
Test: 17 of the 20 fixtures run against Qwen2.5-Coder-7B-Instruct under the same 3-knob config.
Result: 12× wall_timeout, 3× oracle_failed_after_max_turns, 2× driver_error, 0 oracle_passed, 0 tool_calls across all inspected fixtures. Same Markdown-block pattern.
Conclusion: refuted. Active params count isn't the variable.
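As a quick consistency check, the outcome counts reported above can be tallied directly. The dictionary keys mirror the outcome labels in the result line; nothing here is new data.

```python
from collections import Counter

# Outcome counts copied from the Hypothesis 3 result line.
outcomes = Counter({
    "wall_timeout": 12,
    "oracle_failed_after_max_turns": 3,
    "driver_error": 2,
    "oracle_passed": 0,
})

fixtures_run = sum(outcomes.values())
pass_rate = outcomes["oracle_passed"] / fixtures_run

print(f"fixtures run: {fixtures_run}/20")                                       # 17/20
print(f"pass rate: {pass_rate:.1%}")                                            # 0.0%
print(f"wall_timeout share: {outcomes['wall_timeout'] / fixtures_run:.1%}")     # 70.6%
```

The counts sum to the 17 fixtures in the test line, and wall_timeout dominates at about 71% of them.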
Hypothesis 4 (M294, current): Qwen-Coder finetune family is the bottleneck
Premise: both tested models (Qwen3-Coder-30B-A3B and Qwen2.5-Coder-7B-Instruct) are Qwen-Coder finetunes. Maybe the Coder finetune family specifically has a sticky Markdown-block training prior. A non-Coder Instruct variant — same Qwen3-MoE architecture, same active-param count — should fare better.
Test: smoke-test Qwen3-30B-A3B-Instruct-2507 (non-Coder) with the same CODE_SYSTEM_PROMPT + fixture 1 prompt.
Result: the model emitted {"name": "file_read", "input": {"path": "src/lib.rs"}} + </tool_call> in 20 completion tokens, finish_reason: stop. This is categorically different from the Coder family, which always emitted 500+ tokens of Markdown.
Conclusion: empirically confirmed at smoke level. Full bench corpus in progress as of 2026-05-22.
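The difference between the two families can be made mechanical. The sketch below classifies a completion as a tool call, a Markdown block, or plain text. It is a hypothetical helper, not part of the bench: the three labels and the required "name"/"input" keys are assumptions inferred from the smoke output quoted above.

```python
import json

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable


def classify_completion(text: str) -> str:
    """Return 'tool_call', 'markdown_block', or 'text_only'."""
    # Take everything before the closing tag and drop an opening tag if present.
    head = text.split("</tool_call>", 1)[0].replace("<tool_call>", "").strip()
    try:
        payload = json.loads(head)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict) and "name" in payload and "input" in payload:
        return "tool_call"
    if FENCE in text:
        return "markdown_block"
    return "text_only"


smoke = '{"name": "file_read", "input": {"path": "src/lib.rs"}}\n</tool_call>'
coder_style = "Here is the fix:\n" + FENCE + "rust\nfn main() {}\n" + FENCE

print(classify_completion(smoke))        # tool_call
print(classify_completion(coder_style))  # markdown_block
```

Running a classifier like this over the full bench corpus would show whether the smoke-level result holds beyond fixture 1.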
What this means for V1_004
V1_004's gate text names Qwen3-Coder-30B-A3B-Instruct specifically. A successful Qwen3-30B-A3B-Instruct-2507 (non-Coder) dispatch is diagnostic evidence, not a contract-level discharge of V1_004 as written.
The path forward, post-empirical-confirmation:
- (a) Amend V1_004's gate text to allow any qwen3_moe architecture (via the M22 5-step ritual: contract bump in aprender → fixture update → coverage rerun → calibration record → CCPA-side mirror PR)
- (b) OR propose a new gate (V1_005?) against the non-Coder variant
- (c) OR engineer a post-decode Markdown→tool_call parser in apr code to unlock the Qwen-Coder family for the existing V1_004 gate
This is an operator-coordinated decision tree. The empirical work has produced the evidence; the contract-level choice is upstream.
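To make option (c) concrete, a post-decode rewrite could look roughly like the following. This is an illustrative Python sketch only, not a proposal for the actual apr code implementation: the file_write tool name, its "content" field, and the caller-supplied target path are all hypothetical, since the source only shows a file_read call.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence
# Non-greedy match of the first rust-tagged fenced block.
BLOCK_RE = re.compile(FENCE + r"rust\n(.*?)" + FENCE, re.DOTALL)


def markdown_to_tool_call(completion: str, target_path: str):
    """Wrap the first rust block in a hypothetical file_write tool call.

    Returns None when the completion contains no rust block.
    """
    match = BLOCK_RE.search(completion)
    if match is None:
        return None
    call = {
        "name": "file_write",  # hypothetical tool name
        "input": {"path": target_path, "content": match.group(1)},
    }
    return "<tool_call>\n" + json.dumps(call) + "\n</tool_call>"


completion = (
    "Fix:\n" + FENCE + "rust\npub fn add(a: i32, b: i32) -> i32 { a + b }\n" + FENCE
)
print(markdown_to_tool_call(completion, "src/lib.rs"))
```

The hard part such a parser would face is not the extraction but deciding which file a block targets, which is why the path is left to the caller here.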