Phase 5 — project-scale Arena

Phase 5 (M194-M234) was the first Arena dispatch against real GitHub-issue Rust fixtures. It produced the Popperian-falsification result that established project-scale measurement as the ground truth.

Corpus

fixtures/project-scale/ — 5 real Rust bug fixtures hand-curated from GitHub issues:

  • Each fixture has a cwd-tree/ (a snapshot of the repo at the buggy commit), a prompt.txt (the issue text or a derived task), and a test-shaped oracle (cargo test + an expected pattern).
  • Fixtures span error-handling, async edge cases, FFI boundaries, lifetime issues, and macro-related bugs.
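A test-shaped oracle of this kind can be sketched as follows. This is a minimal illustration, not the project's actual oracle: the function names are hypothetical, the expected pattern is assumed to be a plain substring, and the rule that both the exit status and the pattern must match is an assumption.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Hypothetical verdict rule: the test run must exit successfully AND its
/// output must contain the expected pattern (a plain substring here).
pub fn oracle_passes(exit_ok: bool, output: &str, expected_pattern: &str) -> bool {
    exit_ok && output.contains(expected_pattern)
}

/// Runs `cargo test` inside a fixture's cwd-tree/ and applies the rule above.
pub fn run_oracle(cwd_tree: &Path, expected_pattern: &str) -> io::Result<bool> {
    let out = Command::new("cargo")
        .arg("test")
        .current_dir(cwd_tree)
        .output()?;
    // Cargo reports build progress on stderr and test results on stdout,
    // so the pattern is searched for in both streams.
    let mut text = String::from_utf8_lossy(&out.stdout).into_owned();
    text.push_str(&String::from_utf8_lossy(&out.stderr));
    Ok(oracle_passes(out.status.success(), &text, expected_pattern))
}
```

Keeping the verdict rule separate from the process call means the rule can be unit-tested without building a fixture.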

Headline result (M234)

Side                Oracle pass  Recovery (one bash-fail then pass)  Recovery rate
claude (teacher)    1/5          1                                   1.00 (1 of 1 passes had recovery)
apr code (student)  0/5          0                                   undefined (0/0)
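
The recovery-rate figures above follow from dividing the number of passes that included a recovery by the total number of oracle passes. A minimal sketch, assuming that definition; the signature is hypothetical and is not taken from the project's recovery_rate implementation:

```rust
/// Recovery rate = passes that contained a recovery / total oracle passes.
/// Returns None when there are no passes, which is the "undefined (0/0)" case.
pub fn recovery_rate(passes_with_recovery: u32, oracle_passes: u32) -> Option<f64> {
    if oracle_passes == 0 {
        None
    } else {
        Some(f64::from(passes_with_recovery) / f64::from(oracle_passes))
    }
}
```

Returning an Option rather than 0.0 keeps "no passes to measure" distinct from "passes, but none with recovery".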

apr code's 0/5 was uniform OracleFailedAfterMaxTurns — the agent engaged but couldn't solve the bugs within the 20-turn / 900s budget.
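
The budget behaviour described above can be sketched as a bounded loop. This is an illustration under stated assumptions: only the name OracleFailedAfterMaxTurns comes from this page; the Outcome enum, the TimedOut variant, and run_with_budget are hypothetical, and the real loop may map a wall-clock overrun differently.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The oracle passed on the given (1-based) turn.
    OraclePassed { turns_used: u32 },
    /// Every turn was used and the oracle never passed.
    OracleFailedAfterMaxTurns,
    /// Hypothetical variant: the wall-clock budget ran out first.
    TimedOut,
}

/// Runs `turn` up to `max_turns` times. `turn` receives the turn number and
/// returns true once the oracle passes.
pub fn run_with_budget<F>(max_turns: u32, wall_clock: Duration, mut turn: F) -> Outcome
where
    F: FnMut(u32) -> bool,
{
    let start = Instant::now();
    for n in 1..=max_turns {
        // The wall-clock check happens before each turn starts.
        if start.elapsed() >= wall_clock {
            return Outcome::TimedOut;
        }
        if turn(n) {
            return Outcome::OraclePassed { turns_used: n };
        }
    }
    Outcome::OracleFailedAfterMaxTurns
}
```

With the budget quoted above, a call would look like run_with_budget(20, Duration::from_secs(900), |n| ...), where the closure stands in for one agent turn followed by an oracle check.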

What M234 falsified

The static-fixture parity score of 1.0000 on the canonical corpus (fixtures/canonical/, n=30, M150) does NOT predict project-scale Arena performance. The two systems are functionally interchangeable on single-prompt code generation (HumanEval-class) but diverge on multi-turn project-scale work.

Per the Popperian discipline, this is a clean falsification, not a contradiction. Both measurements are valid; they measure different things. The static path measures the meter; the Arena path measures the system.

docs/specifications/completeness-assessment.md is the honest record of this. The README's "honest framing" paragraph quotes the same finding.

Why the Arena bench is operator-coordinated

A full Arena run consumes:

  • claude API costs (one paid claude --print invocation per turn × up to 20 turns × 5 fixtures × 2 dispatches per measurement)
  • Local GPU/CPU compute for apr code's apr serve (GGUF model loaded into VRAM/RAM)
  • A claude login session that must not be reused across machines or intercepted by intermediate proxies
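
The first bullet implies an upper bound on paid invocations per measurement. A small worked sketch of that arithmetic (the function name is hypothetical, and the bound assumes every turn of every fixture is used):

```rust
/// Upper bound on paid `claude --print` invocations for one measurement:
/// turns per fixture x fixtures x dispatches.
pub const fn max_claude_invocations(max_turns: u32, fixtures: u32, dispatches: u32) -> u32 {
    max_turns * fixtures * dispatches
}
```

With the figures above, max_claude_invocations(20, 5, 2) gives 200 invocations per measurement at most.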

These costs are kept out of CI, which runs static-path tests only. Arena runs are dispatched by an operator, captured as evidence, and stamped into evidence/phase-5/arena-scores.json. This is contract-gated by FALSIFY-CCPA-019 (calibration_required_before_verdict).

Sub-deliverables (P5.1-P5.5)

  • P5.1 (M194-M196) — ArenaSession scaffolding type
  • P5.2 (M197-M210) — multi-turn loop body, tool dispatch, oracle integration, MockDriver for tests
  • P5.3 (M211-M222) — corpus walker (ccpa-arena-bench), aggregate scoring, recovery_rate
  • P5.4 (M223-M228) — bidirectional sensitivity calibration + the M196-M224 4-bug stack closure
  • P5.5 (M229-M234) — first end-to-end Arena dispatch + scores.json + Popperian-falsification finding