M294 — finetune-distribution A/B
Date: 2026-05-22
Source PR: CCPA#262 (scope doc)
The hypothesis (refined to its sharpest form)
Through M286-M293 + the 17/20 Qwen2.5-Coder-7B-Instruct follow-on, four candidate variables were tested as the load-bearing one behind the 0%-tool_call signature:
| Variable | Test | Outcome |
|---|---|---|
| Inference stack quality | M286 KV cache + 3-knob + EOS + clean_chat_output | Necessary fix; not sufficient |
| Active params count | 3B (30B-A3B-MoE) vs 7B (dense 7B-Coder) | Both show same 0 tool_calls — refuted |
| MoE vs dense | qwen3_moe (30B-A3B) vs qwen2 (7B-dense) | Both show same pattern — refuted |
| Few-shot prompt examples | 3 concrete <tool_call> examples + anti-Markdown rule | No shift in pattern — refuted |
The remaining variable: Qwen-Coder finetune family specifically. Both tested models (Qwen3-Coder-30B-A3B + Qwen2.5-Coder-7B-Instruct) share the Coder-specific finetune.
The hypothesis being tested at M294: hold architecture, size, inference stack constant; vary only the finetune. Specifically: swap Qwen3-Coder-30B-A3B-Instruct for Qwen3-30B-A3B-Instruct-2507 (non-Coder, same MoE arch, same size, same active params, broader instruction + tool-use training distribution).
The smoke test (one-shot, no full bench)
While downloading the 18GB Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf, the operator pointed out that waiting 40 minutes for fixture 1 was unnecessary — a single targeted smoke against the exact same system prompt + user prompt the bench would use would give the answer in 30 seconds.
The smoke payload:
- System: full
CODE_SYSTEM_PROMPT(the same one inapr code, with the 3<tool_call>few-shot examples and anti-Markdown rule) - User: fixture 1 (
leetcode__01-two-sum) prompt - Config: temp=0.3, top_k=50, top_p=0.95, repeat_penalty=1.2, repeat_last_n=64 (sub-bench B config)
- max_tokens: 400
The response:
{"name": "file_read", "input": {"path": "src/lib.rs"}}
</tool_call>
- 20 completion tokens
finish_reason: "stop"- Structured JSON tool_call (missing leading
<tool_call>tag, but the body is exactly what the parser expects) - No "Human:" leak, no Markdown
rustblock, no rambling
Empirical conclusion
The Coder-finetune-distribution hypothesis is empirically confirmed at the smoke level. The non-Coder Instruct variant emits structured tool_call JSON in 20 tokens; the Coder variant emits 500+ tokens of Markdown explanation.
Whether the full bench discharges V1_004 (i.e., oracle_passed > 0) depends on whether:
- The arena parser handles the missing leading
<tool_call>opening tag (bare JSON body) - The model maintains the tool_call format across all 20 turns of a fixture
- The model's code quality is correct (separately from format adherence)
What M294 unblocks
If the full bench shows ≥1 oracle_passed:
- V1_004's open question is empirically answered: the bottleneck is finetune-distribution.
- V1_004 as written names Qwen3-Coder-30B-A3B-Instruct specifically — a discharge requires either a contract amendment (M22 5-step ritual) or a new V1_005 gate.
- M280 SUSPENSION can be lifted on a contract-level basis.
If the full bench still shows 0 oracle_passed:
- The tool_call emission is necessary but not sufficient.
- Code quality / correctness becomes the next variable to investigate.
- A post-decode parser in
apr codethat converts Markdownrustblocks tofile_editcalls becomes a higher-priority engineering target (which would unlock Qwen-Coder family for V1_004 as written).