Outcome variants

ArenaOutcome is the closed enum that captures every way an Arena session can end. It is the unit that aggregate scoring pattern-matches on.

The full enum (post-M292)

#[serde(tag = "kind", rename_all = "snake_case")]
pub enum ArenaOutcome {
    OraclePassed                  { turns: u32, wall_seconds: u64 },
    OracleFailedAfterMaxTurns     { turns: u32, partial_pass_rate: Option<f64> },
    WallTimeout                   { turns_at_timeout: u32, max_wall_seconds: u64 },
    DriverError                   { reason: String, turns_before_error: u32 },
    ComplianceFailed              { check: ComplianceCheck, turn: u32 },
    ComplianceTrap                { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop                 { consecutive_text_turns: u32, last_text_excerpt: String },
}

Decision matrix

Outcome	Meaning	How aggregate scoring should treat it
`OraclePassed`	Agent fully solved the fixture. (Phase 6: AND compliance passed.)	`oracle_passed = true`
`OracleFailedAfterMaxTurns`	Agent engaged, but didn't solve within 20 turns.	`oracle_passed = false`
`WallTimeout`	Agent ran out of wall-clock budget mid-session.	`oracle_passed = false`
`DriverError`	Driver subprocess crashed / hung / lost connection.	`oracle_passed = false`, count as infrastructure failure
`ComplianceFailed` (Phase 6)	`cargo test` passed, `pmat comply check` rejected.	`oracle_passed = false`, count toward compliance_cost_ratio numerator
`ComplianceTrap` (Phase 6)	Same `(file, sha256)` failed N consecutive turns.	`oracle_passed = false`, count toward token-cost-avoidance
`AgentTextLoop` (M292, opt-in)	N consecutive text-only turns (no tool_call).	`oracle_passed = false`, agent didn't engage
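
The matrix above can be read as a mapping from variant to scoring bucket. The following is a minimal sketch of that mapping, not the harness's actual scorer: `Outcome` is a payload-free stand-in for `ArenaOutcome`, and the `Bucket` enum and its names are hypothetical, introduced only for illustration.

```rust
// Simplified stand-in for ArenaOutcome: payload fields are omitted because
// only the variant matters for bucketing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    OraclePassed,
    OracleFailedAfterMaxTurns,
    WallTimeout,
    DriverError,
    ComplianceFailed,
    ComplianceTrap,
    AgentTextLoop,
}

// Hypothetical bucket names (an assumption); the real scorer may label
// these categories differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Bucket {
    Passed,
    AgentFailure,
    Infrastructure,
    Compliance,
    NoEngagement,
}

fn bucket(outcome: Outcome) -> Bucket {
    // No wildcard arm: adding a variant to the enum forces this match
    // to be revisited at compile time.
    match outcome {
        Outcome::OraclePassed => Bucket::Passed,
        Outcome::OracleFailedAfterMaxTurns | Outcome::WallTimeout => Bucket::AgentFailure,
        Outcome::DriverError => Bucket::Infrastructure,
        Outcome::ComplianceFailed | Outcome::ComplianceTrap => Bucket::Compliance,
        Outcome::AgentTextLoop => Bucket::NoEngagement,
    }
}

// Only one variant ever yields oracle_passed = true.
fn oracle_passed(outcome: Outcome) -> bool {
    bucket(outcome) == Bucket::Passed
}
```

Because the enum is closed, an exhaustive match like this is what keeps the matrix and the scorer from drifting apart when a new variant is added.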

Why this many variants

Each variant captures a distinct failure mode that the team has empirically observed and decided is worth distinguishing. Conflating them loses signal:

OracleFailedAfterMaxTurns says "the agent worked but produced wrong output." Diagnostic action: inspect history for off-by-one fixes, missing edge cases.
WallTimeout says "the agent worked too slowly." Diagnostic action: check inference tok/s, max_tokens cap, network latency.
DriverError says "the infrastructure broke." Diagnostic action: check apr serve crash logs, network, ports, GPU OOM.
ComplianceTrap says "the agent is stuck making the same violating edit." Diagnostic action: check whether the agent has the compliance rules in context.
AgentTextLoop says "the agent talked but didn't act." Diagnostic action: check tool_call format adherence (this is the M291 finding signature).

Before M292, every "talked but didn't act" case was reported as OracleFailedAfterMaxTurns and was therefore conflated with "did real work but got the wrong answer." Adding the AgentTextLoop variant lets us measure the two separately.
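
As an illustration of the "N consecutive text-only turns" condition, here is a minimal sketch of a detector. It assumes a per-turn record that only distinguishes tool calls from text; `TurnKind` and `text_loop_len` are hypothetical names, and the real driver's turn type and threshold handling are not shown in this document.

```rust
// Hypothetical per-turn record (an assumption, not the driver's real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TurnKind {
    ToolCall,
    TextOnly,
}

/// Returns the length of the trailing run of text-only turns if it has
/// reached `threshold`, and `None` otherwise. A threshold of 0 is treated
/// as "detection disabled", matching the opt-in nature of the check.
fn text_loop_len(turns: &[TurnKind], threshold: u32) -> Option<u32> {
    if threshold == 0 {
        return None;
    }
    // Count text-only turns from the most recent turn backwards, stopping
    // at the first tool call.
    let run = turns
        .iter()
        .rev()
        .take_while(|t| **t == TurnKind::TextOnly)
        .count() as u32;
    if run >= threshold {
        Some(run)
    } else {
        None
    }
}
```

A caller could evaluate this after every turn and end the session with an AgentTextLoop outcome as soon as it returns `Some`, using the returned value for `consecutive_text_turns`.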

How aggregate scoring uses outcomes

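impl ArenaOutcome {
    /// True only when the oracle passed; every other variant is a failure.
    fn passed(&self) -> bool {
        matches!(self, Self::OraclePassed { .. })
    }

    /// True for either compliance-related outcome.
    fn compliance_failed(&self) -> bool {
        matches!(
            self,
            Self::ComplianceFailed { .. } | Self::ComplianceTrap { .. }
        )
    }
}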

recovery_rate (Phase 5) counts OraclePassed fixtures in which the agent recovered from at least one non-zero exit. compliance_cost_ratio (Phase 6) is compliance_failed_under_contract / oracle_passed_baseline, i.e. the fraction of fixtures that pass without the contract but fail under it.
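
The compliance_cost_ratio definition can be sketched as follows. This is one reading of the formula, under the assumption that each fixture is run twice (once without the contract, once with it) and that the numerator only counts fixtures that passed at baseline. `FixtureResult` and its field names are hypothetical.

```rust
/// Paired result for one fixture. The struct and field names are
/// illustrative assumptions, not the harness's real types.
struct FixtureResult {
    /// Baseline run (no contract) ended in OraclePassed.
    passed_baseline: bool,
    /// Contract run ended in ComplianceFailed or ComplianceTrap.
    compliance_failed_under_contract: bool,
}

/// Fraction of baseline-passing fixtures that fail compliance under contract.
/// Returns `None` when no fixture passed at baseline, since the ratio is
/// undefined in that case.
fn compliance_cost_ratio(results: &[FixtureResult]) -> Option<f64> {
    let baseline = results.iter().filter(|r| r.passed_baseline).count();
    if baseline == 0 {
        return None;
    }
    let failed = results
        .iter()
        .filter(|r| r.passed_baseline && r.compliance_failed_under_contract)
        .count();
    Some(failed as f64 / baseline as f64)
}
```

Returning `Option<f64>` rather than 0.0 for an empty baseline keeps "no compliance cost" distinguishable from "nothing to measure".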
