Outcome variants

ArenaOutcome is the closed enum capturing every way an Arena session can end. Aggregate scoring pattern-matches on it, one outcome per session.

The full enum (post-M292)

#[serde(tag = "kind", rename_all = "snake_case")]
pub enum ArenaOutcome {
    OraclePassed                  { turns: u32, wall_seconds: u64 },
    OracleFailedAfterMaxTurns     { turns: u32, partial_pass_rate: Option<f64> },
    WallTimeout                   { turns_at_timeout: u32, max_wall_seconds: u64 },
    DriverError                   { reason: String, turns_before_error: u32 },
    ComplianceFailed              { check: ComplianceCheck, turn: u32 },
    ComplianceTrap                { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop                 { consecutive_text_turns: u32, last_text_excerpt: String },
}
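With tag = "kind" and rename_all = "snake_case", serde's internally tagged representation writes the snake_case variant name under a "kind" key next to the variant's own fields. Assuming the enum derives Serialize (the excerpt above does not show its derives), OraclePassed { turns: 3, wall_seconds: 41 } would come out as {"kind":"oracle_passed","turns":3,"wall_seconds":41}. The sketch below spells out the expected tag for each variant; OutcomeKind and tag() are hypothetical names used only for illustration and are not part of the codebase.

```rust
/// Trimmed, hypothetical mirror of ArenaOutcome. Payload fields are omitted
/// because only the variant name determines the "kind" tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutcomeKind {
    OraclePassed,
    OracleFailedAfterMaxTurns,
    WallTimeout,
    DriverError,
    ComplianceFailed,
    ComplianceTrap,
    AgentTextLoop,
}

impl OutcomeKind {
    /// The string expected under the "kind" key, following serde's
    /// snake_case renaming of the variant name.
    pub fn tag(self) -> &'static str {
        match self {
            OutcomeKind::OraclePassed => "oracle_passed",
            OutcomeKind::OracleFailedAfterMaxTurns => "oracle_failed_after_max_turns",
            OutcomeKind::WallTimeout => "wall_timeout",
            OutcomeKind::DriverError => "driver_error",
            OutcomeKind::ComplianceFailed => "compliance_failed",
            OutcomeKind::ComplianceTrap => "compliance_trap",
            OutcomeKind::AgentTextLoop => "agent_text_loop",
        }
    }
}
```

Downstream tooling that filters result files by outcome can match on these strings without deserializing the payload.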

Decision matrix

| Outcome | Means | What aggregate score should treat it as |
| --- | --- | --- |
| OraclePassed | Agent fully solved the fixture. (Phase 6: AND compliance passed.) | oracle_passed = true |
| OracleFailedAfterMaxTurns | Agent engaged, but didn't solve within 20 turns. | oracle_passed = false |
| WallTimeout | Agent ran out of wall-clock budget mid-session. | oracle_passed = false |
| DriverError | Driver subprocess crashed / hung / lost connection. | oracle_passed = false, count as infrastructure failure |
| ComplianceFailed (Phase 6) | cargo test passed, pmat comply check rejected. | oracle_passed = false, count toward compliance_cost_ratio numerator |
| ComplianceTrap (Phase 6) | Same (file, sha256) failed N consecutive turns. | oracle_passed = false, count toward token-cost-avoidance |
| AgentTextLoop (M292, opt-in) | N consecutive text-only turns (no tool_call). | oracle_passed = false, agent didn't engage |
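The matrix above can be read as one exhaustive match. The following is a minimal sketch, not the project's scoring code: ScoreBucket and bucket() are hypothetical names, and ComplianceCheck is stubbed because its real definition is not shown in this section. Because the match has no wildcard arm, adding a new variant to the enum would fail to compile until the scoring side decides how to treat it.

```rust
/// Placeholder for the real ComplianceCheck type, which this section does not show.
#[derive(Debug, Clone, PartialEq)]
pub struct ComplianceCheck(pub String);

#[derive(Debug, Clone, PartialEq)]
pub enum ArenaOutcome {
    OraclePassed { turns: u32, wall_seconds: u64 },
    OracleFailedAfterMaxTurns { turns: u32, partial_pass_rate: Option<f64> },
    WallTimeout { turns_at_timeout: u32, max_wall_seconds: u64 },
    DriverError { reason: String, turns_before_error: u32 },
    ComplianceFailed { check: ComplianceCheck, turn: u32 },
    ComplianceTrap { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop { consecutive_text_turns: u32, last_text_excerpt: String },
}

/// Hypothetical buckets, one per distinct treatment in the decision matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScoreBucket {
    Solved,
    Unsolved,
    InfrastructureFailure,
    ComplianceCost,
    TokenCostAvoidance,
    DidNotEngage,
}

/// Maps an outcome to its bucket. Only `Solved` implies oracle_passed = true.
pub fn bucket(outcome: &ArenaOutcome) -> ScoreBucket {
    match outcome {
        ArenaOutcome::OraclePassed { .. } => ScoreBucket::Solved,
        ArenaOutcome::OracleFailedAfterMaxTurns { .. } => ScoreBucket::Unsolved,
        ArenaOutcome::WallTimeout { .. } => ScoreBucket::Unsolved,
        ArenaOutcome::DriverError { .. } => ScoreBucket::InfrastructureFailure,
        ArenaOutcome::ComplianceFailed { .. } => ScoreBucket::ComplianceCost,
        ArenaOutcome::ComplianceTrap { .. } => ScoreBucket::TokenCostAvoidance,
        ArenaOutcome::AgentTextLoop { .. } => ScoreBucket::DidNotEngage,
    }
}
```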

Why this many variants

Each variant captures a distinct failure mode that the team has empirically observed and decided is worth distinguishing. Conflating them loses signal:

  • OracleFailedAfterMaxTurns says "the agent worked but produced wrong output." Diagnostic action: inspect history for off-by-one fixes, missing edge cases.
  • WallTimeout says "the agent worked too slowly." Diagnostic action: check inference tok/s, max_tokens cap, network latency.
  • DriverError says "the infrastructure broke." Diagnostic action: check apr serve crash logs, network, ports, GPU OOM.
  • ComplianceTrap says "the agent is stuck making the same violating edit." Diagnostic action: check whether the agent has the compliance rules in context.
  • AgentTextLoop says "the agent talked but didn't act." Diagnostic action: check tool_call format adherence (this is the M291 finding signature).

Before M292, all the "talked but didn't act" cases were OracleFailedAfterMaxTurns — conflated with "did real work but wrong answer." Adding the AgentTextLoop variant let us measure the difference cleanly.
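To make the "talked but didn't act" condition concrete, here is a hedged sketch of how a run of text-only turns might be detected. The Turn type, detect_text_loop, the threshold parameter, and the 80-character excerpt length are all assumptions made for illustration; the actual detector and its configuration are not shown in this section.

```rust
/// Hypothetical per-turn record; the real history type is not shown here.
#[derive(Debug, Clone, PartialEq)]
pub enum Turn {
    /// The agent replied with prose only (no tool_call).
    Text(String),
    /// The agent issued a tool call (payload kept opaque).
    ToolCall(String),
}

/// Scans the history in order and returns `(consecutive_text_turns, excerpt)`
/// as soon as `threshold` text-only turns occur back to back.
/// Any tool call resets the run. A threshold of 0 disables detection.
pub fn detect_text_loop(history: &[Turn], threshold: u32) -> Option<(u32, String)> {
    if threshold == 0 {
        return None;
    }
    let mut run = 0u32;
    for turn in history {
        match turn {
            Turn::Text(text) => {
                run += 1;
                if run >= threshold {
                    // Keep only a short excerpt (80 chars is an arbitrary choice).
                    let excerpt: String = text.chars().take(80).collect();
                    return Some((run, excerpt));
                }
            }
            Turn::ToolCall(_) => run = 0,
        }
    }
    None
}
```

A Some((n, excerpt)) result corresponds to the fields of AgentTextLoop { consecutive_text_turns, last_text_excerpt }; a None result leaves the session to end through one of the other variants.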

How aggregate scoring uses outcomes

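impl ArenaOutcome {
    /// True only for a fully solved fixture.
    fn passed(&self) -> bool {
        matches!(self, Self::OraclePassed { .. })
    }

    /// True for both Phase 6 compliance outcomes.
    fn compliance_failed(&self) -> bool {
        matches!(self,
            Self::ComplianceFailed { .. } | Self::ComplianceTrap { .. }
        )
    }
}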

recovery_rate (Phase 5) counts OraclePassed fixtures where the agent recovered from at least one non-zero exit. compliance_cost_ratio (Phase 6) is compliance_failed_under_contract / oracle_passed_baseline, i.e., the fraction of fixtures that pass without the contract but would fail under it.
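As a worked illustration of the ratio, the sketch below computes compliance_cost_ratio from two outcome lists, one from a baseline run and one from an under-contract run. It is an assumption-laden sketch: the Outcome enum is a payload-free stand-in for ArenaOutcome, the function signature is hypothetical, and returning None for an empty denominator is a choice made here rather than documented behavior. recovery_rate is left out because it needs per-session exit-code history that the outcome alone does not carry.

```rust
/// Payload-free stand-in for ArenaOutcome, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    OraclePassed,
    OracleFailedAfterMaxTurns,
    WallTimeout,
    DriverError,
    ComplianceFailed,
    ComplianceTrap,
    AgentTextLoop,
}

impl Outcome {
    fn passed(&self) -> bool {
        matches!(self, Self::OraclePassed)
    }

    fn compliance_failed(&self) -> bool {
        matches!(self, Self::ComplianceFailed | Self::ComplianceTrap)
    }
}

/// compliance_failed_under_contract / oracle_passed_baseline.
/// Returns None when no baseline fixture passed, since the ratio is undefined.
pub fn compliance_cost_ratio(baseline: &[Outcome], under_contract: &[Outcome]) -> Option<f64> {
    let passed_baseline = baseline.iter().filter(|o| o.passed()).count();
    if passed_baseline == 0 {
        return None;
    }
    let failed_under_contract = under_contract
        .iter()
        .filter(|o| o.compliance_failed())
        .count();
    Some(failed_under_contract as f64 / passed_baseline as f64)
}
```

For example, if four fixtures pass in the baseline run and two of the under-contract sessions end in ComplianceFailed or ComplianceTrap, the ratio is 2 / 4 = 0.5.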