Phase 5 — project-scale Arena
Phase 5 (M194-M234) was the first Arena dispatch against real GitHub-issue Rust fixtures. It produced the Popperian-falsification result that established project-scale measurement as the ground truth.
Corpus
fixtures/project-scale/ — 5 real Rust bug fixtures hand-curated from GitHub issues:
- Each fixture has a cwd-tree/ (a snapshot of the repo at the buggy commit), a prompt.txt (the issue text or a derived task), and a test-shaped oracle (cargo test + an expected pattern).
- Fixtures span error-handling, async edge cases, FFI boundaries, lifetime issues, and macro-related bugs.
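The test-shaped oracle described above can be sketched as a small predicate. This is a minimal illustration, not the project's actual code: the `TestRun` struct and `oracle_passes` function are hypothetical names. It also assumes that the oracle requires both a successful `cargo test` exit and the expected pattern, and that the pattern is matched as a plain substring.

```rust
/// Hypothetical summary of one `cargo test` run inside a fixture's cwd-tree/.
/// The field names are illustrative, not the project's actual types.
pub struct TestRun {
    /// Whether `cargo test` exited with status 0.
    pub exit_success: bool,
    /// Captured stdout and stderr, concatenated.
    pub output: String,
}

/// Test-shaped oracle: the run must succeed AND its output must contain
/// the expected pattern. Plain substring matching is an assumption here;
/// the real oracle may use a regex or a structured check.
pub fn oracle_passes(run: &TestRun, expected_pattern: &str) -> bool {
    run.exit_success && run.output.contains(expected_pattern)
}
```

Requiring both conditions means a run that exits cleanly without ever reaching the relevant test does not count as a pass.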
Headline result (M234)
| Side | Oracle pass | Recovery (one bash-fail then pass) | Recovery rate |
|---|---|---|---|
| claude (teacher) | 1/5 | 1 | 1.00 (1 of 1 passes had recovery) |
| apr code (student) | 0/5 | 0 | undefined (0/0) |
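The recovery-rate column divides recoveries by oracle passes, which is why the student row is undefined and not zero. A minimal sketch of that calculation follows. The function name and signature are assumptions for illustration and may not match the project's `recovery_rate` implementation.

```rust
/// Recovery rate as the table reports it: of the fixtures that passed the
/// oracle, the fraction that passed after a bash failure earlier in the run.
/// Returns None when there are no passes, since 0/0 is undefined.
pub fn recovery_rate(oracle_passes: u32, recoveries: u32) -> Option<f64> {
    if oracle_passes == 0 {
        return None;
    }
    Some(f64::from(recoveries) / f64::from(oracle_passes))
}
```

With the table's numbers, `recovery_rate(1, 1)` yields `Some(1.0)` for the teacher and `recovery_rate(0, 0)` yields `None` for the student.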
All five apr code failures were OracleFailedAfterMaxTurns: the agent engaged with each fixture but could not solve the bug within the 20-turn / 900s budget.
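The budget behaviour can be sketched as a bounded loop. This is an illustration under stated assumptions, not the Arena loop itself. It assumes the 20-turn / 900s budget applies per fixture, that the oracle is checked after every turn, and that the wall-clock limit is checked between turns. `OracleFailedAfterMaxTurns` comes from the text; `OraclePassed`, `WallClockExceeded`, and `run_with_budget` are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Terminal states for one fixture run. `OracleFailedAfterMaxTurns` is the
/// name used in the text; the other two variants are hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    OraclePassed { turns_used: u32 },
    OracleFailedAfterMaxTurns,
    WallClockExceeded,
}

/// Runs up to `max_turns` agent turns, checking the oracle after each one.
/// `take_turn` stands in for one agent turn and `oracle` for the fixture's
/// test-shaped check; both are placeholders for the real driver.
pub fn run_with_budget(
    max_turns: u32,
    wall_clock: Duration,
    mut take_turn: impl FnMut(u32),
    mut oracle: impl FnMut() -> bool,
) -> Outcome {
    let started = Instant::now();
    for turn in 1..=max_turns {
        // Stop before starting a new turn once the time budget is spent.
        if started.elapsed() >= wall_clock {
            return Outcome::WallClockExceeded;
        }
        take_turn(turn);
        if oracle() {
            return Outcome::OraclePassed { turns_used: turn };
        }
    }
    // Every allowed turn was used and the oracle never passed.
    Outcome::OracleFailedAfterMaxTurns
}
```

Calling `run_with_budget(20, Duration::from_secs(900), ...)` with an oracle that never passes returns `Outcome::OracleFailedAfterMaxTurns` after exactly 20 turns, which is the shape of the failure reported here.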
What M234 falsified
The static-fixture parity score of 1.0000 on the canonical corpus (fixtures/canonical/, n=30, M150) does NOT predict project-scale Arena performance. The two systems are functionally interchangeable on single-prompt code generation (HumanEval-class) but diverge on multi-turn project-scale work.
Per the Popperian discipline, this is a clean falsification, not a contradiction. Both measurements are valid; they measure different things. The static path measures the meter; the Arena path measures the system.
docs/specifications/completeness-assessment.md is the honest record of this. The README's "honest framing" paragraph quotes the same finding.
Why the Arena bench is operator-coordinated
A full Arena run consumes:
- claude API costs (one paid claude --print invocation per turn × up to 20 turns × 5 fixtures × 2 dispatches per measurement)
- Local GPU/CPU compute for apr code's apr serve (GGUF model loaded into VRAM/RAM)
- A claude login session that must not be reused across machines or breached by intermediate proxies
These costs are externalized — CI dispatches static-path tests only. Arena dispatches are operator-dispatched, evidence-captured, and stamped into evidence/phase-5/arena-scores.json. This is contract-gated by FALSIFY-CCPA-019 (calibration_required_before_verdict).
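To make the API cost concrete, the product in the cost list gives an upper bound of 20 × 5 × 2 = 200 paid invocations per measurement. The helper below is a hypothetical sketch of that arithmetic and is not part of the project. The actual count is lower whenever a fixture finishes before its turn limit.

```rust
/// Upper bound on paid `claude --print` invocations for one measurement,
/// following the product given in the cost list.
pub const fn max_paid_invocations(max_turns: u32, fixtures: u32, dispatches: u32) -> u32 {
    max_turns * fixtures * dispatches
}
```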
Sub-deliverables (P5.1-P5.5)
- P5.1 (M194-M196): ArenaSession scaffolding type
- P5.2 (M197-M210): multi-turn loop body, tool dispatch, oracle integration, MockDriver for tests
- P5.3 (M211-M222): corpus walker (ccpa-arena-bench), aggregate scoring, recovery_rate
- P5.4 (M223-M228): bidirectional sensitivity calibration + the M196-M224 4-bug stack closure
- P5.5 (M229-M234): first end-to-end Arena dispatch + scores.json + Popperian-falsification finding