The V1_004 chain

V1_004 — "Phase 6 bench non-zero student pass rate against a Qwen3-Coder-30B-A3B-Instruct GGUF" — is the open gate. The chain of work toward discharging it has produced the most empirically interesting body of findings in CCPA's history.

This chapter is the canonical record of that chain.

The chain at a glance

M-row	Date (2026)	What it shipped
M280	05-19	Phase 6 SUSPENSION declared (1.5B model below testability floor)
M286	05-20	M32d MoE KV cache shipped (19× speedup on Qwen3-MoE)
M287	05-20	Greedy baseline: uniform `driver_error` ("Human:" infinite loop)
M288	05-20	Diagnosis: 3 root causes (no EOS stop_token, no clean_chat_output, no few-shot prompt)
M289	05-20	Plumbing shipped: 3-knob HTTP wire-up (`APR_AGENT_TEMPERATURE`, etc.)
M290	05-20	5-PR snapshot: aprender#1832, #1837, #1842, #1844, #1846 all merged
M291	05-21	sub-bench B pattern shift: `driver_error` → `oracle_failed_after_max_turns` (text-only loops, 0 tool_calls)
M292	05-21	`ArenaOutcome::AgentTextLoop` detector + 7 tests (Gap 3 closure)
M293	05-21	`PHASE6_MAX_CONSECUTIVE_TEXT_TURNS` env var wiring at script level
M294	05-22	Scope doc for non-Coder Qwen3-30B-A3B-Instruct-2507 A/B; download + smoke confirmed tool_call JSON emission

The hypothesis-evolution narrative

Hypothesis 1 (start of chain): inference stack is the bottleneck

Premise: V1_004 can't discharge because the apr serve inference path for qwen3_moe is too slow or too broken to fit 20 turns × 1024 max_tokens within a 60-minute wall budget.
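The premise implies a minimum sustained decode rate. A back-of-the-envelope check, assuming every turn uses its full max_tokens allowance and ignoring prompt-processing time and tool-execution overhead (both are simplifying assumptions, not measured figures):

```python
# Worst-case token budget implied by the premise: 20 turns, 1024 max_tokens each,
# inside a 60-minute wall budget. Prefill and tool latency are ignored (assumption).
TURNS = 20
MAX_TOKENS_PER_TURN = 1024
WALL_BUDGET_S = 60 * 60

def min_decode_rate(turns: int, max_tokens: int, budget_s: int) -> float:
    """Tokens per second needed to emit turns * max_tokens within budget_s."""
    return (turns * max_tokens) / budget_s

total_tokens = TURNS * MAX_TOKENS_PER_TURN
rate = min_decode_rate(TURNS, MAX_TOKENS_PER_TURN, WALL_BUDGET_S)
print(f"{total_tokens} tokens -> at least {rate:.2f} tok/s")
```

Any real run needs headroom above this floor, since prompt processing grows with conversation length and each tool call adds wall time.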

Test: ship M32d MoE KV cache (19× speedup), enable 3-knob sampling, add EOS stop_token and clean_chat_output post-strip.

Result: the M287 driver_error pattern (the infinite "Human:" loop) no longer occurred. Sub-bench B on Qwen3-Coder-30B-A3B shifted to a diverse outcome distribution.

Conclusion: the inference-stack fix was necessary but not sufficient.
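A minimal sketch of the kind of post-strip named in the test step: truncate a completion at the first hallucinated turn marker. The name clean_chat_output comes from the text, but its real behavior and signature in apr are not shown in this chapter; the marker list and the logic below are assumptions for illustration only.

```python
# Hypothetical stand-in for a clean_chat_output-style post-strip.
# ASSUMPTION: the real implementation may use different markers and rules.
TURN_MARKERS = ("\nHuman:", "\nUser:")  # hypothetical marker list

def clean_chat_output(text: str, markers=TURN_MARKERS) -> str:
    """Cut the completion at the earliest turn marker, if any is present."""
    cut = len(text)
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()

raw = "Here is the fix.\nHuman: thanks\nHuman: thanks\nHuman: thanks"
print(clean_chat_output(raw))
```

A post-strip like this only cleans up text after the fact; an EOS stop_token is what actually ends generation early, which is why the test step pairs the two.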

Hypothesis 2 (M291): few-shot prompt is the bottleneck

Premise: the model now produces finite output (the M287 runaway is fixed), but it emits Markdown rust blocks instead of <tool_call> JSON. Adding 3 concrete <tool_call> few-shot examples in CODE_SYSTEM_PROMPT (#1849) should override the Markdown prior.

Test: sub-bench B with #1849's few-shot prompt + 3-knob sampling + EOS + clean_chat_output.

Result: fixture 1 of sub-bench B ended in oracle_failed_after_max_turns at turns=20; all 20 turns were text-only, with tool_use_count: 0. The prompt fix didn't shift behavior.

Conclusion: refuted. Few-shot examples didn't override the model's training distribution.
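The all-text-turn pattern in this result is what the M292 AgentTextLoop detector targets. A minimal sketch of a consecutive-text-turn check follows. The env var name PHASE6_MAX_CONSECUTIVE_TEXT_TURNS is from the chain table; the turn representation, the default threshold, and the function name are assumptions, and the actual ArenaOutcome::AgentTextLoop logic is not shown in this chapter.

```python
import os

def detect_text_loop(tool_calls_per_turn, max_text_turns):
    """Return the 1-based turn at which the run of text-only turns reaches
    max_text_turns, or None if it never does."""
    streak = 0
    for turn, n_calls in enumerate(tool_calls_per_turn, start=1):
        # A turn with zero tool calls extends the streak; any tool call resets it.
        streak = streak + 1 if n_calls == 0 else 0
        if streak >= max_text_turns:
            return turn
    return None

# Hypothetical default of 5; the real default is not stated in this chapter.
limit = int(os.environ.get("PHASE6_MAX_CONSECUTIVE_TEXT_TURNS", "5"))
all_text_run = [0] * 20  # tool-call counts for a run like fixture 1 above
print(detect_text_loop(all_text_run, limit))
```

Stopping early on such a streak saves the wall time that would otherwise be spent running all 20 turns to reach oracle_failed_after_max_turns.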

Hypothesis 3 (M291): active-params count is the bottleneck

Premise: Qwen3-Coder-30B-A3B is 30B-total / 3B-active (MoE routing). Maybe 3B active params is below the agentic-code floor. A dense 7B (Qwen2.5-Coder-7B-Instruct), with roughly 2.3× as many active params, should fare better.

Test: 17/20 fixtures of Qwen2.5-Coder-7B-Instruct under same 3-knob config.

Result: 12× wall_timeout, 3× oracle_failed_after_max_turns, 2× driver_error, 0 oracle_passed, 0 tool_calls across all inspected fixtures. Same Markdown-block pattern.

Conclusion: refuted. Active params count isn't the variable.
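The reported counts can be sanity-checked with a small tally. The figures are copied from the result above; the 7/3 ratio uses the nominal parameter counts from the premise, not exact model sizes.

```python
from collections import Counter

# Outcome counts as reported for the 17 inspected fixtures.
outcomes = Counter({
    "wall_timeout": 12,
    "oracle_failed_after_max_turns": 3,
    "driver_error": 2,
    "oracle_passed": 0,
})
inspected = sum(outcomes.values())
pass_rate = outcomes["oracle_passed"] / inspected
# Nominal active-parameter ratio used in the premise (7B dense vs 3B active).
active_ratio = 7 / 3
print(inspected, pass_rate, round(active_ratio, 1))
```

The three non-passing categories account for all 17 inspected fixtures, so the pass rate over the inspected set is exactly zero.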

Hypothesis 4 (M294, current): Qwen-Coder finetune family is the bottleneck

Premise: both tested models (Qwen3-Coder-30B-A3B and Qwen2.5-Coder-7B-Instruct) are Qwen-Coder finetunes. Maybe the Coder finetune family specifically has a sticky Markdown-block training prior. A non-Coder Instruct variant — same Qwen3-MoE architecture, same active-param count — should fare better.

Test: smoke Qwen3-30B-A3B-Instruct-2507 (non-Coder) with same CODE_SYSTEM_PROMPT + fixture 1 prompt.

Result: the model emitted {"name": "file_read", "input": {"path": "src/lib.rs"}} + </tool_call> in 20 completion tokens, with finish_reason: stop. This is categorically different from the Coder family, which always emitted 500+ tokens of Markdown.

Conclusion: empirically confirmed at smoke level only (one fixture prompt). The full bench corpus run was in progress as of 2026-05-22.
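A minimal sketch of how a completion like the one in this result could be recognized as a tool call. The JSON shape and the closing tag are taken from the smoke output above; the function name, the optional opening-tag handling, and the validation rules are assumptions, not the parser apr actually uses.

```python
import json

OPEN_TAG, CLOSE_TAG = "<tool_call>", "</tool_call>"

def extract_tool_call(completion: str):
    """Return the parsed tool call dict, or None if no well-formed call is found."""
    end = completion.find(CLOSE_TAG)
    if end == -1:
        return None
    body = completion[:end]
    # The opening tag is treated as optional; the smoke output shows only the close.
    start = body.rfind(OPEN_TAG)
    if start != -1:
        body = body[start + len(OPEN_TAG):]
    try:
        call = json.loads(body.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call

smoke = '{"name": "file_read", "input": {"path": "src/lib.rs"}}</tool_call>'
call = extract_tool_call(smoke)
print(call["name"], call["input"]["path"])
```

A Markdown-only reply of the kind the Coder models produced contains no closing tag, so this sketch returns None for it.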

What this means for V1_004

V1_004's gate text names Qwen3-Coder-30B-A3B-Instruct specifically. A successful Qwen3-30B-A3B-Instruct-2507 (non-Coder) dispatch is diagnostic evidence, not a contract-level discharge of V1_004 as written.

The path forward, once the full bench confirms the smoke result:

(a) Amend V1_004's gate text to allow any qwen3_moe architecture (via the M22 5-step ritual: contract bump in aprender → fixture update → coverage rerun → calibration record → CCPA-side mirror PR)
(b) OR propose a new gate (V1_005?) against the non-Coder variant
(c) OR engineer a post-decode Markdown→tool_call parser in apr code to unlock Qwen-Coder family for the existing V1_004 gate
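Option (c) can be sketched as a small post-decode transform. This is purely illustrative: the tool name file_write and its content field are hypothetical (only file_read with an input.path field appears in this chapter), the target path must come from the caller because a Markdown block does not carry one, and nothing here reflects an existing apr code path.

```python
import json
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example
# Matches the first fenced block tagged rust and captures its body.
RUST_FENCE = re.compile(FENCE + r"rust\s*\n(.*?)" + FENCE, re.DOTALL)

def markdown_to_tool_call(completion: str, path: str):
    """Wrap the first fenced rust block as a serialized tool call, or return None.

    ASSUMPTION: "file_write" and its "content" field are hypothetical names.
    """
    match = RUST_FENCE.search(completion)
    if match is None:
        return None
    call = {"name": "file_write", "input": {"path": path, "content": match.group(1)}}
    return "<tool_call>" + json.dumps(call) + "</tool_call>"

reply = ("Here you go:\n" + FENCE
         + "rust\npub fn add(a: i32, b: i32) -> i32 { a + b }\n" + FENCE + "\n")
out = markdown_to_tool_call(reply, "src/lib.rs")
print(out)
```

The hard part such a parser leaves open is intent: deciding which file a block belongs to and whether the model meant to write it at all, which is why this route is an engineering option rather than a given.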

This is an operator-coordinated decision tree. The empirical work has produced the evidence; the contract-level choice is upstream.

CCPA — The Claude Code Parity Harness