Phase 6 — under-contract dispatch

Phase 6 (M250+) extends the Arena to measure not just "did the agent solve the bug?" but "did the agent solve the bug in a compliance-respecting way?"

What "under contract" means

In Phase 5, the only oracle is cargo test. An agent can pass that oracle while emitting code that violates pmat comply check --strict (the project's quality posture: complexity caps, lint rules, allowed-unwrap policy, etc.).

In Phase 6, the oracle is compound:

oracle_passed iff (cargo_test_exit_code == 0
                   AND grep "test result: ok" in test output
                   AND pmat comply check --strict exit_code == 0)

pmat comply check --strict always runs at the end of the session. When --compliance-enforced is set, it also runs after every Write / Edit (per-turn compliance gating).
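The compound oracle can be sketched as a pure predicate over the three signals listed above. This is a minimal illustration, not the Arena's actual code: the `OracleInputs` struct and both function names are hypothetical, and only the three conditions themselves come from the text.

```rust
/// Signals collected at the end of a session.
/// Hypothetical struct: the field names are chosen for illustration only.
pub struct OracleInputs<'a> {
    pub cargo_test_exit_code: i32,
    pub test_output: &'a str,
    pub comply_exit_code: i32,
}

/// Test half of the oracle: cargo test exited 0 and its output
/// contains the expected "test result: ok" marker.
pub fn tests_passed(inputs: &OracleInputs<'_>) -> bool {
    inputs.cargo_test_exit_code == 0 && inputs.test_output.contains("test result: ok")
}

/// Phase 6 compound oracle: the test half AND a zero exit code
/// from `pmat comply check --strict`.
pub fn oracle_passed(inputs: &OracleInputs<'_>) -> bool {
    tests_passed(inputs) && inputs.comply_exit_code == 0
}
```

A session whose tests pass but whose compliance check exits non-zero satisfies `tests_passed` and fails `oracle_passed`, which is the case the Phase 6 outcomes are designed to distinguish.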

The four Phase-6-specific outcomes

| Outcome | When |
|---|---|
| ComplianceFailed { check, turn } | Cargo test passed, but final-state compliance check rejected. Distinct from OracleFailedAfterMaxTurns. |
| ComplianceTrap { file, last_reason, consecutive_count } | Same (file, sha256) failed compliance N turns in a row (default 3). Saves token cost. |
| AgentTextLoop { consecutive_text_turns, last_text_excerpt } (M292) | N consecutive text-only turns (no tool_call). Opt-in via PHASE6_MAX_CONSECUTIVE_TEXT_TURNS > 0. |
| OraclePassed (Phase 6 sense) | BOTH cargo test AND pmat comply check --strict pass. |
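To make the outcome shapes concrete, here is a rough sketch of the four variants as a Rust enum, together with one way a ComplianceTrap streak could be detected. The variant and field names come from the table; the field types, the `TrapDetector` type, and its methods are assumptions made for illustration and may not match the real implementation.

```rust
/// Phase 6 outcomes. Field types are assumed, not taken from the source.
#[derive(Debug, Clone, PartialEq)]
pub enum ArenaOutcome {
    ComplianceFailed { check: String, turn: u32 },
    ComplianceTrap { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop { consecutive_text_turns: u32, last_text_excerpt: String },
    OraclePassed,
}

/// Hypothetical detector: counts consecutive compliance failures
/// on the same (file, sha256) pair.
pub struct TrapDetector {
    threshold: u32,
    last_key: Option<(String, String)>,
    count: u32,
}

impl TrapDetector {
    /// `threshold` is the N from the table (default 3); clamped to at least 1.
    pub fn new(threshold: u32) -> Self {
        Self { threshold: threshold.max(1), last_key: None, count: 0 }
    }

    /// Record one failed compliance check. Returns a ComplianceTrap once the
    /// same (file, sha256) has failed `threshold` times in a row.
    pub fn record_failure(&mut self, file: &str, sha256: &str, reason: &str) -> Option<ArenaOutcome> {
        let key = (file.to_string(), sha256.to_string());
        if self.last_key.as_ref() == Some(&key) {
            self.count += 1;
        } else {
            // Different file or different content: start a new streak.
            self.last_key = Some(key);
            self.count = 1;
        }
        if self.count >= self.threshold {
            Some(ArenaOutcome::ComplianceTrap {
                file: file.to_string(),
                last_reason: reason.to_string(),
                consecutive_count: self.count,
            })
        } else {
            None
        }
    }

    /// A passing compliance check breaks the streak.
    pub fn record_pass(&mut self) {
        self.last_key = None;
        self.count = 0;
    }
}
```

Keying on the content hash as well as the path means an agent that keeps rewriting a file with different content is not flagged. Only byte-identical retries count toward the streak.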

The V1 falsifiers added at Phase 6

| ID | Name | Status | Asserted by |
|---|---|---|---|
| V1_001 | qwen3_moe_serve_dispatch_v1 | ACTIVE_RUNTIME | aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs |
| V1_002 | qwen3_moe_sampling_v1 | ACTIVE_RUNTIME | sampling integration tests |
| V1_003 | qwen3_moe_streaming_sse_v1 | DISCHARGED on gx10 Blackwell | streaming SSE test + evidence |
| V1_004 | phase_6_bench_non_zero_student_pass_rate | open | per-fixture student_pass_rate > 0 |

Current state of V1_004

V1_004 is the OPEN gate. The bar: "ANY single Phase 6 fixture passes the compound oracle on the student side."

The M286-M294 chain has shipped 6 aprender PRs + 4 CCPA PRs working toward V1_004 discharge:

  • M286 — M32d MoE KV cache (19× speedup; the load-bearing inference infrastructure)
  • M287 — greedy baseline confirms M287 driver_error pattern (model entered "Human:" infinite loop)
  • M288-M290 — diagnosed 3 root causes; shipped sampling (temperature/top_k/top_p), repetition penalty, EOS stop_token, clean_chat_output, few-shot CODE_SYSTEM_PROMPT
  • M291 — sub-bench B on Qwen3-Coder-30B-A3B with all fixes: pattern shifted from driver_error to oracle_failed_after_max_turns with tool_use_count: 0
  • M292 — ArenaOutcome::AgentTextLoop detector + opt-in cap (Gap 3 closure)
  • M293 — PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring
  • M294 — scope doc for the non-Coder Qwen3-30B-A3B-Instruct-2507 A/B

See The V1_004 chain for the empirical narrative.
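The text-loop detector and its opt-in cap (M292, M293) might look roughly like the sketch below. Only the environment variable name and the "enabled when > 0" rule come from this page. The function names, the `TextLoopDetector` type, and the choice to treat an unparsable value as "disabled" are assumptions.

```rust
/// Interpret the raw value of PHASE6_MAX_CONSECUTIVE_TEXT_TURNS.
/// Unset or 0 means the detector is off; treating an unparsable
/// value as off is an assumption of this sketch.
pub fn parse_text_turn_cap(raw: Option<&str>) -> Option<u32> {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|&n| n > 0)
}

/// Read the cap from the process environment.
pub fn text_turn_cap_from_env() -> Option<u32> {
    parse_text_turn_cap(
        std::env::var("PHASE6_MAX_CONSECUTIVE_TEXT_TURNS").ok().as_deref(),
    )
}

/// Hypothetical detector: counts consecutive turns with no tool call.
pub struct TextLoopDetector {
    cap: Option<u32>,
    consecutive: u32,
}

impl TextLoopDetector {
    pub fn new(cap: Option<u32>) -> Self {
        Self { cap, consecutive: 0 }
    }

    /// Record one agent turn. Returns true when the cap is enabled and
    /// the run of text-only turns has reached it.
    pub fn record_turn(&mut self, had_tool_call: bool) -> bool {
        if had_tool_call {
            // Any tool call resets the streak.
            self.consecutive = 0;
            return false;
        }
        self.consecutive += 1;
        matches!(self.cap, Some(cap) if self.consecutive >= cap)
    }

    pub fn consecutive_text_turns(&self) -> u32 {
        self.consecutive
    }
}
```

With a cap in place, a run like the M291 pattern (tool_use_count: 0 until max turns) would end as soon as the cap is reached rather than consuming the whole turn budget.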

Phase 6 corpus — fixtures/under-contract/

20 fixtures across 4 classes:

  • leetcode (5) — algorithmic bugs: two_sum, valid-parentheses, longest-common-prefix, merge-sorted-arrays, binary-search
  • oo (5) — object-oriented Rust patterns: bank-account, library-borrowing, shape-hierarchy, observer-pattern, builder-pattern
  • transpile (5) — format converters: json-to-toml, csv-to-jsonl, markdown-to-html, ini-to-yaml, regex-to-glob
  • unix (5) — CLI utility reimplementations: wc, head, tail, cut, sort

Each fixture's meta.toml includes oracle_cmd = "cargo test 2>&1" and expected_pattern = "test result: ok". The compound oracle adds pmat comply check --strict on the post-mutation tree.
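As an illustration of how a harness could evaluate these two meta.toml fields and then the compliance step, consider the sketch below. It assumes the command is run through `sh -c` in the fixture directory. The `FixtureMeta` struct and both function names are hypothetical, and TOML parsing is left out.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// The two meta.toml fields named above (hypothetical struct).
pub struct FixtureMeta {
    pub oracle_cmd: String,
    pub expected_pattern: String,
}

/// Run `oracle_cmd` through the shell inside the fixture directory.
/// Because the command ends in `2>&1`, stderr is merged into stdout,
/// so scanning stdout for the pattern is sufficient.
pub fn run_test_oracle(meta: &FixtureMeta, fixture_dir: &Path) -> io::Result<bool> {
    let output = Command::new("sh")
        .arg("-c")
        .arg(&meta.oracle_cmd)
        .current_dir(fixture_dir)
        .output()?;
    let stdout = String::from_utf8_lossy(&output.stdout);
    Ok(output.status.success() && stdout.contains(meta.expected_pattern.as_str()))
}

/// Run the compliance half on the post-mutation tree.
/// Returns an error if the `pmat` binary cannot be started.
pub fn run_compliance(fixture_dir: &Path) -> io::Result<bool> {
    let status = Command::new("pmat")
        .args(["comply", "check", "--strict"])
        .current_dir(fixture_dir)
        .status()?;
    Ok(status.success())
}
```

A fixture would count as passing the compound oracle only when both functions return `Ok(true)` for the same working tree.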