M294 — finetune-distribution A/B

Date: 2026-05-22

Source PR: CCPA#262 (scope doc)

The hypothesis (refined to its sharpest form)

Through M286-M293 + the 17/20 Qwen2.5-Coder-7B-Instruct follow-on, four candidate variables were tested as the load-bearing one behind the 0%-tool_call signature:

VariableTestOutcome
Inference stack qualityM286 KV cache + 3-knob + EOS + clean_chat_outputNecessary fix; not sufficient
Active params count3B (30B-A3B-MoE) vs 7B (dense 7B-Coder)Both show same 0 tool_calls — refuted
MoE vs denseqwen3_moe (30B-A3B) vs qwen2 (7B-dense)Both show same pattern — refuted
Few-shot prompt examples3 concrete <tool_call> examples + anti-Markdown ruleNo shift in pattern — refuted

The remaining variable: Qwen-Coder finetune family specifically. Both tested models (Qwen3-Coder-30B-A3B + Qwen2.5-Coder-7B-Instruct) share the Coder-specific finetune.

The hypothesis being tested at M294: hold architecture, size, inference stack constant; vary only the finetune. Specifically: swap Qwen3-Coder-30B-A3B-Instruct for Qwen3-30B-A3B-Instruct-2507 (non-Coder, same MoE arch, same size, same active params, broader instruction + tool-use training distribution).

The smoke test (one-shot, no full bench)

While downloading the 18GB Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf, the operator pointed out that waiting 40 minutes for fixture 1 was unnecessary — a single targeted smoke against the exact same system prompt + user prompt the bench would use would give the answer in 30 seconds.

The smoke payload:

  • System: full CODE_SYSTEM_PROMPT (the same one in apr code, with the 3 <tool_call> few-shot examples and anti-Markdown rule)
  • User: fixture 1 (leetcode__01-two-sum) prompt
  • Config: temp=0.3, top_k=50, top_p=0.95, repeat_penalty=1.2, repeat_last_n=64 (sub-bench B config)
  • max_tokens: 400

The response:

{"name": "file_read", "input": {"path": "src/lib.rs"}}
</tool_call>
  • 20 completion tokens
  • finish_reason: "stop"
  • Structured JSON tool_call (missing leading <tool_call> tag, but the body is exactly what the parser expects)
  • No "Human:" leak, no Markdown rust block, no rambling

Empirical conclusion

The Coder-finetune-distribution hypothesis is empirically confirmed at the smoke level. The non-Coder Instruct variant emits structured tool_call JSON in 20 tokens; the Coder variant emits 500+ tokens of Markdown explanation.

Whether the full bench discharges V1_004 (i.e., oracle_passed > 0) depends on whether:

  1. The arena parser handles the missing leading <tool_call> opening tag (bare JSON body)
  2. The model maintains the tool_call format across all 20 turns of a fixture
  3. The model's code quality is correct (separately from format adherence)

What M294 unblocks

If the full bench shows ≥1 oracle_passed:

  • V1_004's open question is empirically answered: the bottleneck is finetune-distribution.
  • V1_004 as written names Qwen3-Coder-30B-A3B-Instruct specifically — a discharge requires either a contract amendment (M22 5-step ritual) or a new V1_005 gate.
  • M280 SUSPENSION can be lifted on a contract-level basis.

If the full bench still shows 0 oracle_passed:

  • The tool_call emission is necessary but not sufficient.
  • Code quality / correctness becomes the next variable to investigate.
  • A post-decode parser in apr code that converts Markdown rust blocks to file_edit calls becomes a higher-priority engineering target (which would unlock Qwen-Coder family for V1_004 as written).