Phase 6 — under-contract dispatch

Phase 6 (M250+) extends the Arena to measure not just "did the agent solve the bug?" but "did the agent solve the bug in a compliance-respecting way?"

What "under contract" means

In Phase 5, the only oracle is cargo test. An agent can pass that oracle while emitting code that violates pmat comply check --strict (the project's quality posture: complexity caps, lint rules, allowed-unwrap policy, etc.).

In Phase 6, the oracle is compound:

oracle_passed iff (cargo_test_exit_code == 0
                   AND grep "test result: ok" in test output
                   AND pmat comply check --strict exit_code == 0)

pmat comply always runs at the end of the session. If --compliance-enforced is set, it also runs after every Write / Edit (per-turn compliance gating).
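
As a rough illustration, the compound oracle can be written as a pure predicate over the three observations. This is a minimal Rust sketch, not the Arena's actual code: `OracleInputs` and both function names are hypothetical, and it assumes the exit codes and the captured test output have already been collected.

```rust
/// Observations gathered at the end of a session.
/// (Hypothetical struct; field names mirror the pseudocode above.)
pub struct OracleInputs<'a> {
    pub cargo_test_exit_code: i32,
    pub test_output: &'a str,
    pub comply_exit_code: i32,
}

/// Phase 5 oracle: the test command exits 0 and prints the expected marker.
pub fn phase5_oracle_passed(inputs: &OracleInputs<'_>) -> bool {
    inputs.cargo_test_exit_code == 0 && inputs.test_output.contains("test result: ok")
}

/// Phase 6 compound oracle: the Phase 5 conditions AND a clean
/// `pmat comply check --strict` exit code.
pub fn phase6_oracle_passed(inputs: &OracleInputs<'_>) -> bool {
    phase5_oracle_passed(inputs) && inputs.comply_exit_code == 0
}
```

A run whose tests pass but whose compliance check exits non-zero satisfies `phase5_oracle_passed` and fails `phase6_oracle_passed`, which is exactly the gap Phase 6 is meant to expose.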

The four Phase-6-specific outcomes

| Outcome | When |
| --- | --- |
| `ComplianceFailed { check, turn }` | cargo test passed, but the final-state compliance check rejected the result. Distinct from `OracleFailedAfterMaxTurns`. |
| `ComplianceTrap { file, last_reason, consecutive_count }` | The same `(file, sha256)` failed compliance N turns in a row (default 3). Stopping at that point saves token cost. |
| `AgentTextLoop { consecutive_text_turns, last_text_excerpt }` (M292) | N consecutive text-only turns (no tool_call). Opt-in via `PHASE6_MAX_CONSECUTIVE_TEXT_TURNS > 0`. |
| `OraclePassed` (Phase 6 sense) | Both cargo test and `pmat comply check --strict` pass. |
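
The outcomes above could be modelled roughly as follows. This is a sketch under assumptions: the variant and field names come from the table, but the field types, the `TrapDetector` helper, and the rule that a passing check resets the streak are all hypothetical and may differ from the real `ArenaOutcome`.

```rust
/// Illustrative shape of the four Phase 6 outcomes; field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum ArenaOutcome {
    OraclePassed,
    ComplianceFailed { check: String, turn: u32 },
    ComplianceTrap { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop { consecutive_text_turns: u32, last_text_excerpt: String },
}

/// Hypothetical helper: counts consecutive compliance failures on an
/// unchanged (file, sha256) pair.
pub struct TrapDetector {
    threshold: u32,
    last_key: Option<(String, String)>,
    count: u32,
}

impl TrapDetector {
    pub fn new(threshold: u32) -> Self {
        Self { threshold, last_key: None, count: 0 }
    }

    /// Record one failed check. Returns `ComplianceTrap` once the same
    /// (file, sha256) has failed `threshold` times in a row.
    pub fn record_failure(&mut self, file: &str, sha256: &str, reason: &str) -> Option<ArenaOutcome> {
        let key = (file.to_string(), sha256.to_string());
        if self.last_key.as_ref() == Some(&key) {
            self.count += 1;
        } else {
            // A different file or different content starts a new streak.
            self.last_key = Some(key);
            self.count = 1;
        }
        if self.count >= self.threshold {
            Some(ArenaOutcome::ComplianceTrap {
                file: file.to_string(),
                last_reason: reason.to_string(),
                consecutive_count: self.count,
            })
        } else {
            None
        }
    }

    /// Assumption: a passing check breaks the streak.
    pub fn record_pass(&mut self) {
        self.last_key = None;
        self.count = 0;
    }
}
```

Keying on the content hash rather than the file name alone means an agent that keeps rewriting a file to different (still non-compliant) contents is not flagged as trapped; only an unchanged file that keeps failing is.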

The V1 falsifiers added at Phase 6

| ID | Name | Status | Asserted by |
| --- | --- | --- | --- |
| `V1_001` | `qwen3_moe_serve_dispatch_v1` | ACTIVE_RUNTIME | `aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs` |
| `V1_002` | `qwen3_moe_sampling_v1` | ACTIVE_RUNTIME | sampling integration tests |
| `V1_003` | `qwen3_moe_streaming_sse_v1` | DISCHARGED on gx10 Blackwell | streaming SSE test + evidence |
| `V1_004` | `phase_6_bench_non_zero_student_pass_rate` | OPEN | per-fixture `student_pass_rate > 0` |

Current state of V1_004

V1_004 is the OPEN gate. The bar: "ANY single Phase 6 fixture passes the compound oracle on the student side."
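
Read literally, the bar reduces to an existence check over per-fixture results. A minimal sketch, assuming a hypothetical `FixtureResult` record that carries the `student_pass_rate` named in the table (the struct and function names are illustrative only):

```rust
/// Hypothetical per-fixture bench result.
pub struct FixtureResult {
    pub name: String,
    /// Fraction of student-side runs that passed the compound oracle.
    pub student_pass_rate: f64,
}

/// V1_004 is discharged as soon as ANY fixture has a non-zero
/// student pass rate.
pub fn v1_004_discharged(results: &[FixtureResult]) -> bool {
    results.iter().any(|r| r.student_pass_rate > 0.0)
}
```

An empty result set leaves the gate open, since `any` over no elements is false.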

The M286-M294 chain has shipped 6 aprender PRs + 4 CCPA PRs working toward V1_004 discharge:

M286 — M32d MoE KV cache (19× speedup; the load-bearing inference infrastructure)
M287 — greedy baseline confirms M287 driver_error pattern (model entered "Human:" infinite loop)
M288-M290 — diagnosed 3 root causes; shipped sampling (temperature/top_k/top_p), repetition penalty, EOS stop_token, clean_chat_output, few-shot CODE_SYSTEM_PROMPT
M291 — sub-bench B on Qwen3-Coder-30B-A3B with all fixes: pattern shifted from driver_error to oracle_failed_after_max_turns with tool_use_count: 0
M292 — ArenaOutcome::AgentTextLoop detector + opt-in cap (Gap 3 closure)
M293 — PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring
M294 — scope doc for the non-Coder Qwen3-30B-A3B-Instruct-2507 A/B
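
The M292/M293 pair (text-loop detector plus env var wiring) can be pictured along these lines. This is only a sketch: apart from the `PHASE6_MAX_CONSECUTIVE_TEXT_TURNS` variable name, every identifier is hypothetical, and it assumes that an unset, unparsable, or zero value leaves the detector disabled and that the cap triggers on the N-th consecutive text-only turn.

```rust
/// Parse the cap from a raw env value.
/// Assumption: unset, unparsable, or 0 all mean "disabled" (opt-in).
pub fn parse_text_turn_cap(raw: Option<&str>) -> Option<u32> {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|&n| n > 0)
}

/// Read the cap from the process environment.
pub fn cap_from_env() -> Option<u32> {
    parse_text_turn_cap(std::env::var("PHASE6_MAX_CONSECUTIVE_TEXT_TURNS").ok().as_deref())
}

/// Hypothetical detector: counts consecutive turns without a tool call.
pub struct TextLoopDetector {
    cap: Option<u32>,
    consecutive: u32,
}

impl TextLoopDetector {
    pub fn new(cap: Option<u32>) -> Self {
        Self { cap, consecutive: 0 }
    }

    /// Observe one agent turn. Returns true when the cap is enabled and
    /// the run of text-only turns has reached it.
    pub fn observe_turn(&mut self, has_tool_call: bool) -> bool {
        if has_tool_call {
            // Any tool call breaks the streak.
            self.consecutive = 0;
            return false;
        }
        self.consecutive += 1;
        matches!(self.cap, Some(cap) if self.consecutive >= cap)
    }

    pub fn consecutive_text_turns(&self) -> u32 {
        self.consecutive
    }
}
```

A driver would construct the detector with `TextLoopDetector::new(cap_from_env())` and end the session with an `AgentTextLoop` outcome when `observe_turn` returns true.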

See The V1_004 chain for the empirical narrative.

Phase 6 corpus — `fixtures/under-contract/`

20 fixtures across 4 classes:

leetcode (5) — algorithmic bugs: two_sum, valid-parentheses, longest-common-prefix, merge-sorted-arrays, binary-search
oo (5) — object-oriented Rust patterns: bank-account, library-borrowing, shape-hierarchy, observer-pattern, builder-pattern
transpile (5) — format converters: json-to-toml, csv-to-jsonl, markdown-to-html, ini-to-yaml, regex-to-glob
unix (5) — CLI utility reimplementations: wc, head, tail, cut, sort

Each fixture's meta.toml includes oracle_cmd = "cargo test 2>&1" and expected_pattern = "test result: ok". The compound oracle adds pmat comply check --strict on the post-mutation tree.
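
One way to evaluate the two `meta.toml` fields is to run `oracle_cmd` through a shell and search its output for `expected_pattern`. The sketch below is an assumption about how such a runner might look, not the harness's actual implementation: `FixtureMeta` and `run_base_oracle` are hypothetical names, it requires a POSIX `sh`, and it covers only the cargo test half of the compound oracle (the `pmat comply check --strict` step would follow separately).

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// The two oracle fields described above (hypothetical struct).
pub struct FixtureMeta {
    pub oracle_cmd: String,
    pub expected_pattern: String,
}

/// Run the fixture's oracle command inside `fixture_dir` and report
/// whether it exited 0 AND printed the expected pattern.
pub fn run_base_oracle(meta: &FixtureMeta, fixture_dir: &Path) -> io::Result<bool> {
    // oracle_cmd uses shell redirection (`2>&1`), so it goes through `sh -c`
    // rather than being split into argv by hand.
    let output = Command::new("sh")
        .arg("-c")
        .arg(&meta.oracle_cmd)
        .current_dir(fixture_dir)
        .output()?;
    let stdout = String::from_utf8_lossy(&output.stdout);
    Ok(output.status.success() && stdout.contains(meta.expected_pattern.as_str()))
}
```

Because the command itself merges stderr into stdout, checking only the captured stdout is enough to see the `test result: ok` line.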

CCPA — The Claude Code Parity Harness