Outcome variants
`ArenaOutcome` is the closed enum that captures every way an Arena session can end. It is the unit that aggregate scoring pattern-matches on.
The full enum (post-M292)
```rust
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum ArenaOutcome {
    OraclePassed { turns: u32, wall_seconds: u64 },
    OracleFailedAfterMaxTurns { turns: u32, partial_pass_rate: Option<f64> },
    WallTimeout { turns_at_timeout: u32, max_wall_seconds: u64 },
    DriverError { reason: String, turns_before_error: u32 },
    ComplianceFailed { check: ComplianceCheck, turn: u32 },
    ComplianceTrap { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop { consecutive_text_turns: u32, last_text_excerpt: String },
}
```
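With `tag = "kind"` and `rename_all = "snake_case"`, serde uses its internally tagged representation, so each serialized outcome carries a `kind` field holding the snake_case variant name. The sketch below mirrors that mapping by hand, without depending on serde. The `kind` method is a hypothetical helper for illustration, and `ComplianceCheck` is a stand-in because its real definition is not shown in this section.

```rust
// Stand-in for the real ComplianceCheck type (assumption: its definition is not shown here).
#[derive(Debug, Clone, PartialEq)]
pub struct ComplianceCheck(pub String);

#[derive(Debug, Clone, PartialEq)]
pub enum ArenaOutcome {
    OraclePassed { turns: u32, wall_seconds: u64 },
    OracleFailedAfterMaxTurns { turns: u32, partial_pass_rate: Option<f64> },
    WallTimeout { turns_at_timeout: u32, max_wall_seconds: u64 },
    DriverError { reason: String, turns_before_error: u32 },
    ComplianceFailed { check: ComplianceCheck, turn: u32 },
    ComplianceTrap { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop { consecutive_text_turns: u32, last_text_excerpt: String },
}

impl ArenaOutcome {
    /// Hypothetical helper: the value serde would be expected to write into
    /// the `kind` field for each variant under `rename_all = "snake_case"`.
    pub fn kind(&self) -> &'static str {
        match self {
            Self::OraclePassed { .. } => "oracle_passed",
            Self::OracleFailedAfterMaxTurns { .. } => "oracle_failed_after_max_turns",
            Self::WallTimeout { .. } => "wall_timeout",
            Self::DriverError { .. } => "driver_error",
            Self::ComplianceFailed { .. } => "compliance_failed",
            Self::ComplianceTrap { .. } => "compliance_trap",
            Self::AgentTextLoop { .. } => "agent_text_loop",
        }
    }
}
```

Assuming the enum also derives `Serialize`, an `OraclePassed { turns: 7, wall_seconds: 412 }` value would be expected to serialize to JSON along the lines of `{"kind":"oracle_passed","turns":7,"wall_seconds":412}`, with the variant's fields sitting next to the tag.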
Decision matrix
| Outcome | Means | How aggregate scoring should treat it |
|---|---|---|
| `OraclePassed` | Agent fully solved the fixture. (Phase 6: AND compliance passed.) | `oracle_passed = true` |
| `OracleFailedAfterMaxTurns` | Agent engaged, but didn't solve within 20 turns. | `oracle_passed = false` |
| `WallTimeout` | Agent ran out of wall-clock budget mid-session. | `oracle_passed = false` |
| `DriverError` | Driver subprocess crashed / hung / lost connection. | `oracle_passed = false`, count as infrastructure failure |
| `ComplianceFailed` (Phase 6) | cargo test passed, but the pmat comply check rejected the result. | `oracle_passed = false`, count toward `compliance_cost_ratio` numerator |
| `ComplianceTrap` (Phase 6) | Same (file, sha256) failed N consecutive turns. | `oracle_passed = false`, count toward token-cost-avoidance |
| `AgentTextLoop` (M292, opt-in) | N consecutive text-only turns (no tool_call). | `oracle_passed = false`, agent didn't engage |
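The matrix above can be sketched as a tally over a batch of outcomes. This is a minimal illustration, not the project's scoring code: the `Tally` struct, its field names, and the `tally` function are all hypothetical, and `ComplianceCheck` is a stand-in type.

```rust
// Stand-in for the real ComplianceCheck type (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct ComplianceCheck(pub String);

#[derive(Debug, Clone, PartialEq)]
pub enum ArenaOutcome {
    OraclePassed { turns: u32, wall_seconds: u64 },
    OracleFailedAfterMaxTurns { turns: u32, partial_pass_rate: Option<f64> },
    WallTimeout { turns_at_timeout: u32, max_wall_seconds: u64 },
    DriverError { reason: String, turns_before_error: u32 },
    ComplianceFailed { check: ComplianceCheck, turn: u32 },
    ComplianceTrap { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop { consecutive_text_turns: u32, last_text_excerpt: String },
}

/// Hypothetical per-bucket counters, one per row group of the decision matrix.
#[derive(Debug, Default, PartialEq)]
pub struct Tally {
    pub oracle_passed: u32,
    pub infrastructure_failures: u32,
    pub compliance_failed: u32,
    pub compliance_traps: u32,
    pub agent_disengaged: u32,
    pub other_failures: u32,
}

/// Sorts each outcome into the bucket the matrix assigns it to.
/// The match has no wildcard arm, so adding a variant to the enum
/// fails to compile until a bucket is chosen for it.
pub fn tally(outcomes: &[ArenaOutcome]) -> Tally {
    let mut t = Tally::default();
    for outcome in outcomes {
        match outcome {
            ArenaOutcome::OraclePassed { .. } => t.oracle_passed += 1,
            ArenaOutcome::OracleFailedAfterMaxTurns { .. }
            | ArenaOutcome::WallTimeout { .. } => t.other_failures += 1,
            ArenaOutcome::DriverError { .. } => t.infrastructure_failures += 1,
            ArenaOutcome::ComplianceFailed { .. } => t.compliance_failed += 1,
            ArenaOutcome::ComplianceTrap { .. } => t.compliance_traps += 1,
            ArenaOutcome::AgentTextLoop { .. } => t.agent_disengaged += 1,
        }
    }
    t
}
```

Because the enum is closed, the exhaustive match is what keeps the scoring code and the variant list from drifting apart.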
Why this many variants
Each variant captures a distinct failure mode that the team has empirically observed and decided is worth distinguishing. Conflating them loses signal:
- `OracleFailedAfterMaxTurns` says "the agent worked but produced wrong output." Diagnostic action: inspect history for off-by-one fixes, missing edge cases.
- `WallTimeout` says "the agent worked too slowly." Diagnostic action: check inference tok/s, max_tokens cap, network latency.
- `DriverError` says "the infrastructure broke." Diagnostic action: check apr serve crash logs, network, ports, GPU OOM.
- `ComplianceTrap` says "the agent is stuck making the same violating edit." Diagnostic action: check whether the agent has the compliance rules in context.
- `AgentTextLoop` says "the agent talked but didn't act." Diagnostic action: check tool_call format adherence (this is the M291 finding signature).
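The `ComplianceTrap` signature, the same violating edit repeated turn after turn, could be detected with a small streak counter keyed on the (file, sha256) pair. The sketch below is only one plausible shape for that check: `ComplianceFailure`, `TrapDetector`, and the threshold parameter are hypothetical names, not the driver's actual types.

```rust
/// Hypothetical record of one rejected compliance check on one turn.
#[derive(Debug, Clone, PartialEq)]
pub struct ComplianceFailure {
    pub file: String,
    pub sha256: String,
    pub reason: String,
}

/// Tracks how many turns in a row failed on the same (file, sha256) pair.
#[derive(Debug, Default)]
pub struct TrapDetector {
    last: Option<(String, String)>,
    streak: u32,
}

impl TrapDetector {
    /// Records one turn and returns the current streak length.
    /// `None` means the turn had no compliance failure, which resets the streak.
    pub fn observe(&mut self, failure: Option<&ComplianceFailure>) -> u32 {
        match failure {
            None => {
                self.last = None;
                self.streak = 0;
            }
            Some(f) => {
                let key = (f.file.clone(), f.sha256.clone());
                if self.last.as_ref() == Some(&key) {
                    // Same file with the same content hash failed again.
                    self.streak += 1;
                } else {
                    // A different file or different content starts a new streak.
                    self.last = Some(key);
                    self.streak = 1;
                }
            }
        }
        self.streak
    }

    /// True once the streak has reached `threshold` consecutive turns.
    pub fn is_trapped(&self, threshold: u32) -> bool {
        threshold > 0 && self.streak >= threshold
    }
}
```

Keying on the content hash rather than the file name alone means an agent that keeps editing the same file but produces different content is not counted as trapped.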
Before M292, every "talked but didn't act" case was recorded as `OracleFailedAfterMaxTurns`, which conflated it with "did real work but got the wrong answer." Adding the `AgentTextLoop` variant lets us measure the two separately.
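To make the "talked but didn't act" condition concrete, here is a minimal sketch of counting the trailing run of text-only turns in a session history. The `Turn` type, the `detect_text_loop` function, the threshold parameter, and the 80-character excerpt limit are all assumptions made for illustration; the real driver's history format is not shown in this section.

```rust
/// Hypothetical, simplified view of one agent turn.
#[derive(Debug, Clone, PartialEq)]
pub enum Turn {
    /// The agent replied with prose only.
    Text(String),
    /// The agent issued a tool call.
    ToolCall { name: String },
}

/// Assumed cap on the excerpt length, in characters.
const EXCERPT_CHARS: usize = 80;

/// Returns `(consecutive_text_turns, last_text_excerpt)` when the history
/// ends with at least `threshold` text-only turns, and `None` otherwise.
pub fn detect_text_loop(history: &[Turn], threshold: u32) -> Option<(u32, String)> {
    if threshold == 0 {
        return None;
    }
    let mut run = 0u32;
    let mut last_text: Option<&str> = None;
    // Walk backwards from the most recent turn until a tool call breaks the run.
    for turn in history.iter().rev() {
        match turn {
            Turn::Text(text) => {
                if last_text.is_none() {
                    last_text = Some(text.as_str());
                }
                run += 1;
            }
            Turn::ToolCall { .. } => break,
        }
    }
    if run >= threshold {
        // Truncate by characters, not bytes, so multi-byte text is never split.
        let excerpt: String = last_text.unwrap_or("").chars().take(EXCERPT_CHARS).collect();
        Some((run, excerpt))
    } else {
        None
    }
}
```

The two returned values line up with the `consecutive_text_turns` and `last_text_excerpt` fields of `AgentTextLoop`.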
How aggregate scoring uses outcomes
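```rust
impl ArenaOutcome {
    /// True only when the oracle passed; every other variant counts as a failure.
    fn passed(&self) -> bool {
        matches!(self, Self::OraclePassed { .. })
    }

    /// True for both compliance outcomes: a rejected check or a repeated-edit trap.
    fn compliance_failed(&self) -> bool {
        matches!(
            self,
            Self::ComplianceFailed { .. } | Self::ComplianceTrap { .. }
        )
    }
}
```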
`recovery_rate` (Phase 5) counts `OraclePassed` fixtures where the agent recovered from at least one non-zero exit.

`compliance_cost_ratio` (Phase 6) is `compliance_failed_under_contract / oracle_passed_baseline`, i.e. the fraction of fixtures that pass without a contract but fail under one.
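As a worked illustration of the ratio, the sketch below computes `compliance_cost_ratio` from paired runs of the same fixture. The pairing of a baseline run with an under-contract run, the `compliance_cost_ratio` function signature, and the `ComplianceCheck` stand-in are assumptions; only the two predicate methods come from the code above.

```rust
// Stand-in for the real ComplianceCheck type (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct ComplianceCheck(pub String);

#[derive(Debug, Clone, PartialEq)]
pub enum ArenaOutcome {
    OraclePassed { turns: u32, wall_seconds: u64 },
    OracleFailedAfterMaxTurns { turns: u32, partial_pass_rate: Option<f64> },
    WallTimeout { turns_at_timeout: u32, max_wall_seconds: u64 },
    DriverError { reason: String, turns_before_error: u32 },
    ComplianceFailed { check: ComplianceCheck, turn: u32 },
    ComplianceTrap { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop { consecutive_text_turns: u32, last_text_excerpt: String },
}

impl ArenaOutcome {
    fn passed(&self) -> bool {
        matches!(self, Self::OraclePassed { .. })
    }

    fn compliance_failed(&self) -> bool {
        matches!(self, Self::ComplianceFailed { .. } | Self::ComplianceTrap { .. })
    }
}

/// Each pair is (baseline outcome, under-contract outcome) for one fixture.
/// Returns `None` when no fixture passed at baseline, since the ratio is
/// undefined with a zero denominator.
pub fn compliance_cost_ratio(pairs: &[(ArenaOutcome, ArenaOutcome)]) -> Option<f64> {
    // Denominator: fixtures that passed without a contract.
    let baseline_passed = pairs.iter().filter(|(base, _)| base.passed()).count();
    if baseline_passed == 0 {
        return None;
    }
    // Numerator: of those, the ones that hit a compliance outcome under contract.
    let failed_under_contract = pairs
        .iter()
        .filter(|(base, contract)| base.passed() && contract.compliance_failed())
        .count();
    Some(failed_under_contract as f64 / baseline_passed as f64)
}
```

For example, if four fixtures pass at baseline and two of them end in `ComplianceFailed` or `ComplianceTrap` under contract, the ratio is 0.5. A fixture that times out under contract is not counted in the numerator, because `WallTimeout` is not a compliance outcome.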