M291 — sub-bench B pattern shift
Date: 2026-05-21
Source PR: CCPA#259 (merged)
## What changed from M287

| | M287 (greedy) | M291 (sub-bench B) |
|---|---|---|
| Sampling | greedy (temp=0) | temp=0.3, top_k=50, top_p=0.95 |
| Repetition penalty | none | repeat_penalty=1.2, repeat_last_n=64 |
| EOS stop_token | NOT plumbed | `< |
| clean_chat_output | NOT called in MoE path | called via #1852 |
| CODE_SYSTEM_PROMPT | no `<tool_call>` examples | 3 concrete examples + explicit anti-Markdown rule via #1849 |

## Result on fixture 1 (leetcode__01-two-sum)
Before: outcome=driver_error turns_before_error=7 (M287 pattern).
After: outcome=oracle_failed_after_max_turns turns=20.
```json
{
  "outcome": { "kind": "oracle_failed_after_max_turns", "turns": 20 },
  "history_len": 20,
  "tool_use_count": 0,
  "kinds": [ { "k": "text", "n": 20 } ]
}
```
All 20 turns were text-only: none contained a `<tool_call>`, and `result.kind` was `"skipped"` on every turn.
## Trace excerpt (fixture 1, turn 1)
Human: Here's what I have so far:
```rust
pub fn two_sum(nums: &[i32], target: i32) -> (usize, usize) {
    for i in 0..nums.len() {
        for j in (i + 1)..nums.len() {
            if nums[i] + nums[j] == target {
                return (i, j);
            }
        }
    }
    panic!("No two sum solution found");
}
```
The model's **code is functionally correct** (it matches what the oracle expects: `return (i, j)`). But the fix is wrapped in a Markdown ```rust``` block, not in a `<tool_call>` JSON payload. The arena driver therefore classifies the turn as text-only: no file edit happens and the oracle never re-runs.
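
The text-only classification can be illustrated with a minimal sketch. It assumes the driver looks for a literal `<tool_call>` opening tag followed by a `</tool_call>` closing tag; the closing tag, `TurnKind`, and `classify_turn` are hypothetical names for illustration, not the arena driver's actual API.

```rust
/// Hypothetical turn classification; the real arena driver's types may differ.
#[derive(Debug, PartialEq, Eq)]
enum TurnKind {
    /// The reply contains a complete `<tool_call>...</tool_call>` span.
    ToolCall,
    /// Anything else, including correct code inside a Markdown fence.
    Text,
}

/// Classify a model reply by looking for a complete tool-call span.
/// Assumption: tool calls are delimited by literal `<tool_call>` tags.
fn classify_turn(reply: &str) -> TurnKind {
    const OPEN: &str = "<tool_call>";
    const CLOSE: &str = "</tool_call>";
    match reply.find(OPEN) {
        Some(start) => {
            // Only count the call if the closing tag appears after the opening tag.
            let rest = &reply[start + OPEN.len()..];
            if rest.contains(CLOSE) {
                TurnKind::ToolCall
            } else {
                TurnKind::Text
            }
        }
        None => TurnKind::Text,
    }
}
```

Under these assumptions, a reply like the turn 1 excerpt (correct Rust inside a Markdown fence) comes back as `TurnKind::Text`, which is why no edit is applied even though the code itself would pass the oracle.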
## Three independent gaps surfaced
### Gap 1 — `clean_chat_output` start-of-string leak
`clean_chat_output`'s stop sequences anchor on `\nHuman:` and `\n\nHuman:`, both of which require a preceding newline. When the model leaks "Human:" at the very start of the string, with no newline before it, the truncate-at-earliest loop finds no match and the leak passes through. Fixed in [aprender#1853](https://github.com/paiml/aprender/pull/1853).
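
The gap can be sketched as follows. This is a minimal illustration, not the actual `clean_chat_output` code or the #1853 patch: `truncate_at_leaked_turn` is a hypothetical name, and truncating a start-of-string leak to the empty string is an assumed behavior.

```rust
/// Truncate `text` at the earliest leaked "Human:" speaker marker.
/// Sketch only: the function name and the empty-string result for a
/// start-of-string leak are assumptions, not the real implementation.
fn truncate_at_leaked_turn(text: &str) -> &str {
    const MARKER: &str = "Human:";
    // Start-of-string case: there is no preceding newline, so a search
    // for "\nHuman:" alone would miss this leak.
    if text.starts_with(MARKER) {
        return "";
    }
    // Newline-anchored case: "\n\nHuman:" contains "\nHuman:", so one
    // search covers both stop sequences; extra newlines are trimmed after.
    match text.find("\nHuman:") {
        Some(idx) => text[..idx].trim_end_matches('\n'),
        None => text,
    }
}
```

A "Human:" that appears mid-line, with no newline before it and not at the start of the string, is left alone in this sketch, which mirrors the newline-anchored behavior described above.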
### Gap 2 — few-shot prompt insufficient to override Markdown distribution
`CODE_SYSTEM_PROMPT` post-#1849 contains 3 concrete `<tool_call>` examples plus an explicit "DO NOT use Markdown ```rust``` code blocks" rule. Empirically, on Qwen3-Coder-30B, this guidance is overridden by the model's training distribution. **No PR closes this; it's a model-class-dependent finding.**
### Gap 3 — arena driver doesn't recover from skipped turns
Had the model emitted a `<tool_call>` in turn 1 and the file edit succeeded, fixture 1's oracle (cargo test) would have passed, because the model's code is correct. But the arena driver doesn't recognize "0 tool_uses across 20 turns" as a stuck state: it keeps prompting "Continue:" and the model keeps re-emitting variations of its already-correct code in Markdown form.
Fixed in [CCPA#260 (M292)](https://github.com/paiml/claude-code-parity-apr/pull/260): `ArenaOutcome::AgentTextLoop` variant + opt-in detector.
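
A stuck-state check of this kind might look like the sketch below. It is not the CCPA#260 implementation: the `Turn` struct, the `is_agent_text_loop` function, and the `window` threshold are all hypothetical, chosen only to show how an opt-in detector could flag a run of turns with zero tool uses.

```rust
/// Hypothetical per-turn record; only the field the detector needs is shown.
struct Turn {
    tool_use_count: usize,
}

/// Returns true when the most recent `window` turns contain no tool uses.
/// `window` is an assumed opt-in threshold: `None` disables detection.
fn is_agent_text_loop(history: &[Turn], window: Option<usize>) -> bool {
    let w = match window {
        Some(w) if w > 0 => w,
        // Detector not enabled, or a zero-length window that proves nothing.
        _ => return false,
    };
    if history.len() < w {
        // Not enough turns yet to call it a loop.
        return false;
    }
    history[history.len() - w..]
        .iter()
        .all(|turn| turn.tool_use_count == 0)
}
```

With a check like this, a driver could stop after a handful of consecutive text-only turns and report a dedicated outcome instead of running all 20 turns to `oracle_failed_after_max_turns`.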
## Empirical conclusion (M291)
V1_004 is **partially discharged**: the M287 prerequisite-violation pattern (uniform `driver_error` from an infinite "Human:" loop) no longer occurs. The new pattern (`oracle_failed_after_max_turns` from training-distribution stickiness) is a **different class of failure**: finite, reproducible, debuggable.
V1_004 is **not fully discharged**: no fixture has yet shown `outcome=oracle_passed`. The bench continues; fixtures 2-20 will reveal whether the pattern is uniform (training-distribution-locked across all task types) or sporadic (some fixtures elicit the tool_call format).