Phase 6 — under-contract dispatch
Phase 6 (M250+) extends the Arena to measure not just "did the agent solve the bug?" but "did the agent solve the bug in a compliance-respecting way?"
What "under contract" means
In Phase 5, the only oracle is cargo test. An agent can pass that oracle while emitting code that violates pmat comply check --strict (the project's quality posture: complexity caps, lint rules, allowed-unwrap policy, etc.).
In Phase 6, the oracle is compound:
oracle_passed iff (cargo_test_exit_code == 0
AND grep "test result: ok" in test output
AND pmat comply check --strict exit_code == 0)
pmat comply always runs at the end of the session. If --compliance-enforced is set, it also runs after every Write / Edit (per-turn compliance gating).
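The compound oracle above can be sketched as a pure function over the three observed values. This is a minimal illustration, not the Arena's actual implementation: the `OracleInputs` struct and its field names are hypothetical, chosen to mirror the terms in the oracle definition.

```rust
/// Hypothetical container for the three observations the compound oracle needs.
#[derive(Debug, Clone)]
pub struct OracleInputs<'a> {
    /// Exit code of `cargo test`.
    pub cargo_test_exit_code: i32,
    /// Captured test output (stdout and stderr merged).
    pub test_output: &'a str,
    /// Exit code of `pmat comply check --strict`.
    pub comply_exit_code: i32,
}

/// Returns true only when all three conditions of the Phase 6 oracle hold.
pub fn oracle_passed(inputs: &OracleInputs<'_>) -> bool {
    inputs.cargo_test_exit_code == 0
        && inputs.test_output.contains("test result: ok")
        && inputs.comply_exit_code == 0
}
```

Under this sketch, a run whose tests pass but whose compliance check exits non-zero evaluates to false, which is the case Phase 5 could not distinguish.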
The four Phase-6-specific outcomes
| Outcome | When |
|---|---|
| ComplianceFailed { check, turn } | cargo test passed, but the final-state compliance check rejected the result. Distinct from OracleFailedAfterMaxTurns. |
| ComplianceTrap { file, last_reason, consecutive_count } | The same (file, sha256) pair failed compliance N turns in a row (default 3). Ending the session early saves token cost. |
| AgentTextLoop { consecutive_text_turns, last_text_excerpt } (M292) | N consecutive text-only turns (no tool_call). Opt-in via PHASE6_MAX_CONSECUTIVE_TEXT_TURNS > 0. |
| OraclePassed (Phase 6 sense) | BOTH cargo test AND pmat comply check --strict pass. |
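As an illustration of how these outcomes and the ComplianceTrap rule might fit together, here is a minimal sketch. The variant and field names come from the table; the payload types, the `TrapDetector` helper, and its method names are assumptions made for the example and may not match the real code.

```rust
/// Phase-6-specific outcomes. Field names follow the table; types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum ArenaOutcome {
    ComplianceFailed { check: String, turn: u32 },
    ComplianceTrap { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop { consecutive_text_turns: u32, last_text_excerpt: String },
    OraclePassed,
}

/// Hypothetical helper: counts consecutive compliance failures for an
/// identical (file, sha256) pair.
pub struct TrapDetector {
    threshold: u32,
    last: Option<(String, String)>,
    count: u32,
}

impl TrapDetector {
    /// `threshold` is N from the table (the documented default is 3).
    pub fn new(threshold: u32) -> Self {
        Self { threshold, last: None, count: 0 }
    }

    /// Record one failed per-turn check. Returns the trap outcome once the
    /// same file content has failed `threshold` turns in a row.
    pub fn record_failure(&mut self, file: &str, sha256: &str, reason: &str) -> Option<ArenaOutcome> {
        let key = (file.to_string(), sha256.to_string());
        if self.last.as_ref() == Some(&key) {
            self.count += 1;
        } else {
            // Different file or different content: start a new streak.
            self.last = Some(key);
            self.count = 1;
        }
        (self.count >= self.threshold).then(|| ArenaOutcome::ComplianceTrap {
            file: file.to_string(),
            last_reason: reason.to_string(),
            consecutive_count: self.count,
        })
    }

    /// A passing check breaks the streak.
    pub fn record_pass(&mut self) {
        self.last = None;
        self.count = 0;
    }
}
```

Keying on the content hash rather than the path alone means an agent that keeps editing the file, even unsuccessfully, is not flagged; only byte-identical resubmissions count toward the streak.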
The V1 falsifiers added at Phase 6
| ID | Name | Status | Asserted by |
|---|---|---|---|
| V1_001 | qwen3_moe_serve_dispatch_v1 | ACTIVE_RUNTIME | aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs |
| V1_002 | qwen3_moe_sampling_v1 | ACTIVE_RUNTIME | sampling integration tests |
| V1_003 | qwen3_moe_streaming_sse_v1 | DISCHARGED on gx10 Blackwell | streaming SSE test + evidence |
| V1_004 | phase_6_bench_non_zero_student_pass_rate | open | per-fixture student_pass_rate > 0 |
Current state of V1_004
V1_004 is the OPEN gate. The bar: "ANY single Phase 6 fixture passes the compound oracle on the student side."
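The bar can be read as an existential check over per-fixture results. The sketch below assumes a hypothetical `FixtureResult` record with a `student_pass_rate` field expressed as a fraction; the real bench output format may differ.

```rust
/// Hypothetical per-fixture bench result.
pub struct FixtureResult {
    pub fixture: String,
    /// Fraction of student runs that passed the compound oracle (0.0 to 1.0).
    pub student_pass_rate: f64,
}

/// V1_004 is discharged as soon as any single fixture has a non-zero
/// student pass rate.
pub fn v1_004_discharged(results: &[FixtureResult]) -> bool {
    results.iter().any(|r| r.student_pass_rate > 0.0)
}
```

An empty result set never discharges the gate, since `any` over an empty iterator is false.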
The M286-M294 chain has shipped 6 aprender PRs + 4 CCPA PRs working toward V1_004 discharge:
- M286 — M32d MoE KV cache (19× speedup; the load-bearing inference infrastructure)
- M287 — greedy baseline confirms M287 driver_error pattern (model entered "Human:" infinite loop)
- M288-M290 — diagnosed 3 root causes; shipped sampling (temperature/top_k/top_p), repetition penalty, EOS stop_token, clean_chat_output, few-shot CODE_SYSTEM_PROMPT
- M291 — sub-bench B on Qwen3-Coder-30B-A3B with all fixes: pattern shifted from driver_error to oracle_failed_after_max_turns with tool_use_count: 0
- M292 — ArenaOutcome::AgentTextLoop detector + opt-in cap (Gap 3 closure)
- M293 — PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring
- M294 — scope doc for the non-Coder Qwen3-30B-A3B-Instruct-2507 A/B
See The V1_004 chain for the empirical narrative.
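The M292 detector and the M293 env var wiring can be sketched roughly as follows. Only the env var name and the "opt-in when > 0" rule come from this page. The `TextLoopDetector` type, its methods, and the choice to treat an unparsable value as "disabled" are assumptions for illustration.

```rust
/// Parse the cap. Unset, unparsable, or 0 all mean the detector is disabled
/// (treating unparsable input as disabled is an assumption of this sketch).
pub fn parse_text_turn_cap(raw: Option<&str>) -> Option<u32> {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|&n| n > 0)
}

/// Hypothetical detector for runs of text-only turns (turns with no tool call).
pub struct TextLoopDetector {
    cap: Option<u32>,
    consecutive: u32,
}

impl TextLoopDetector {
    pub fn new(cap: Option<u32>) -> Self {
        Self { cap, consecutive: 0 }
    }

    /// Read the cap from PHASE6_MAX_CONSECUTIVE_TEXT_TURNS.
    pub fn from_env() -> Self {
        let raw = std::env::var("PHASE6_MAX_CONSECUTIVE_TEXT_TURNS").ok();
        Self::new(parse_text_turn_cap(raw.as_deref()))
    }

    /// Observe one agent turn. Returns the streak length once the cap is
    /// reached; a turn containing a tool call resets the streak.
    pub fn observe_turn(&mut self, had_tool_call: bool) -> Option<u32> {
        if had_tool_call {
            self.consecutive = 0;
            return None;
        }
        self.consecutive += 1;
        match self.cap {
            Some(cap) if self.consecutive >= cap => Some(self.consecutive),
            _ => None,
        }
    }
}
```

With the cap unset the detector never fires, so existing runs keep their previous behavior and end in oracle_failed_after_max_turns as before.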
Phase 6 corpus — fixtures/under-contract/
20 fixtures across 4 classes:
- leetcode (5) — algorithmic bugs: two_sum, valid-parentheses, longest-common-prefix, merge-sorted-arrays, binary-search
- oo (5) — object-oriented Rust patterns: bank-account, library-borrowing, shape-hierarchy, observer-pattern, builder-pattern
- transpile (5) — format converters: json-to-toml, csv-to-jsonl, markdown-to-html, ini-to-yaml, regex-to-glob
- unix (5) — CLI utility reimplementations: wc, head, tail, cut, sort
Each fixture's meta.toml includes oracle_cmd = "cargo test 2>&1" and expected_pattern = "test result: ok". The compound oracle adds pmat comply check --strict on the post-mutation tree.
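To make the two meta.toml fields concrete, here is a rough sketch of evaluating the Phase 5 portion of the oracle for one fixture. It assumes the command is run through `sh -c` (which the `2>&1` redirection suggests but this page does not confirm); `FixtureMeta` and `run_fixture_oracle` are hypothetical names, and the compliance step is left out.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// The two meta.toml fields named above, already parsed (hypothetical struct).
pub struct FixtureMeta {
    pub oracle_cmd: String,
    pub expected_pattern: String,
}

/// Run `oracle_cmd` inside the fixture directory and check both the exit
/// status and the expected pattern in the captured output.
pub fn run_fixture_oracle(meta: &FixtureMeta, fixture_dir: &Path) -> io::Result<bool> {
    // Assumption: a POSIX shell is available and interprets the redirection.
    let output = Command::new("sh")
        .arg("-c")
        .arg(&meta.oracle_cmd)
        .current_dir(fixture_dir)
        .output()?;
    // `2>&1` in the command merges stderr into stdout, so stdout suffices.
    let stdout = String::from_utf8_lossy(&output.stdout);
    Ok(output.status.success() && stdout.contains(meta.expected_pattern.as_str()))
}
```

A Phase 6 harness would then combine this result with the exit code of pmat comply check --strict on the same tree.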