The two measurement paths
CCPA's parity score is the output of two complementary measurement paths that cross-falsify each other.
Path 1 — Static (the meter)
fixtures/canonical/<id>/teacher.ccpa-trace.jsonl ◄── AUTHORED
▲
│ per-tool equivalence rules
│ + hook + skill projections
▼
fixtures/canonical/<id>/student.ccpa-trace.jsonl ◄── AUTHORED
│
▼
ccpa-differ::compute_parity_score
│
▼
ParityReport
{ score, drifts[] }
- What it validates: the meter. Does the differ recognize equivalent actions? Does it catch the kinds of drift we care about? Does it ignore the noise we choose to ignore?
- How it's wired: 30 canonical fixtures + a regression corpus (bidirectional sensitivity proof, M9) + per-PR CI hard-blocker (FALSIFY-CCPA-007 since M16).
- What it cannot do: tell you whether apr code actually solves real tasks. Trace pairs are AUTHORED; they prove the differ logic, not the real-world capability gap.
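To make the shape of the static path concrete, here is a minimal sketch of a positional trace differ. It is not the ccpa-differ implementation: the Action and Drift types, the action helper, the sketch_parity_score function, and the equivalence rule (case-insensitive tool name, whitespace-trimmed target) are all assumptions made for illustration. Only the ParityReport shape { score, drifts[] } comes from the diagram above.

```rust
// Hypothetical sketch, NOT the real ccpa-differ logic.
// Everything except the `score` and `drifts` field names is assumed.

#[derive(Debug, Clone, PartialEq)]
pub struct Action {
    pub tool: String,   // assumed field: tool name, e.g. "file_write"
    pub target: String, // assumed field: path or command the tool acted on
}

#[derive(Debug, PartialEq)]
pub struct Drift {
    pub index: usize,            // position in the aligned traces
    pub teacher: Option<Action>, // None if the teacher trace ended early
    pub student: Option<Action>, // None if the student trace ended early
}

#[derive(Debug)]
pub struct ParityReport {
    pub score: f64,
    pub drifts: Vec<Drift>,
}

/// Convenience constructor for building traces by hand.
pub fn action(tool: &str, target: &str) -> Action {
    Action { tool: tool.to_string(), target: target.to_string() }
}

/// Assumed equivalence rule: tool names match ignoring ASCII case,
/// targets match after trimming surrounding whitespace.
fn equivalent(a: &Action, b: &Action) -> bool {
    a.tool.eq_ignore_ascii_case(&b.tool) && a.target.trim() == b.target.trim()
}

/// Compare two traces position by position and report every mismatch.
/// The score is the fraction of positions that are equivalent.
pub fn sketch_parity_score(teacher: &[Action], student: &[Action]) -> ParityReport {
    let len = teacher.len().max(student.len());
    let mut drifts = Vec::new();
    for i in 0..len {
        let (t, s) = (teacher.get(i), student.get(i));
        let same = match (t, s) {
            (Some(t), Some(s)) => equivalent(t, s),
            _ => false, // a missing action on either side counts as drift
        };
        if !same {
            drifts.push(Drift { index: i, teacher: t.cloned(), student: s.cloned() });
        }
    }
    // Two empty traces are treated as trivially equivalent.
    let score = if len == 0 { 1.0 } else { (len - drifts.len()) as f64 / len as f64 };
    ParityReport { score, drifts }
}
```

A two-action teacher trace compared against a student trace that differs only in its second tool yields a score of 0.5 and one drift at index 1. The real differ applies per-tool equivalence rules plus hook and skill projections, which this sketch does not attempt to model.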
Path 2 — Arena (the system)
fixtures/project-scale/<id>/{prompt.txt, cwd-tree/}
│
▼
Arena runner: live claude + live apr code
(multi-turn, max_turns=20, wall=900s default)
│
▼
per-fixture oracle (cargo test 2>&1 | grep "test result: ok")
│
▼
ArenaOutcome
{ OraclePassed | OracleFailedAfterMaxTurns
| WallTimeout | DriverError | ComplianceFailed
| ComplianceTrap | AgentTextLoop (M292) }
│
▼
evidence/phase-{5,6}/arena-scores.json
- What it validates: the system. Does apr code solve real Rust bugs the way claude does?
- How it's wired: multi-turn live subprocess dispatch. Operator-coordinated (requires claude login + a local GGUF model + GPU/CPU compute budget). Phase 5 (M194-M210) shipped the project-scale corpus; Phase 6 (M250+) adds the under-contract dispatch (per-turn pmat comply check --strict to measure compliance cost).
- What it cannot do: tell you that the differ logic is right. Arena measures end-to-end behavior, not action-stream equivalence.
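The oracle step and the outcome type can be sketched as follows. The variant names are taken from the diagram above, and the oracle mirrors the grep for "test result: ok". The RunFacts struct, its fields, and the order of checks in classify are assumptions; the real Arena runner may decide differently, and the compliance and text-loop variants are listed but not classified here.

```rust
// Hypothetical sketch of outcome classification, not the Arena runner itself.

#[derive(Debug, PartialEq)]
pub enum ArenaOutcome {
    OraclePassed,
    OracleFailedAfterMaxTurns,
    WallTimeout,
    DriverError,
    ComplianceFailed,
    ComplianceTrap,
    AgentTextLoop, // added in M292 per the diagram
}

/// Mirrors `cargo test 2>&1 | grep "test result: ok"`: like grep, this
/// succeeds if ANY line of the captured output contains the marker.
pub fn oracle_passed(cargo_test_output: &str) -> bool {
    cargo_test_output
        .lines()
        .any(|line| line.contains("test result: ok"))
}

/// Assumed summary of one finished run; all field names are illustrative.
pub struct RunFacts<'a> {
    pub driver_ok: bool,            // did the subprocess driver run cleanly?
    pub elapsed_secs: u64,          // wall-clock time the run took
    pub wall_secs: u64,             // wall budget (900 by default in the diagram)
    pub cargo_test_output: &'a str, // combined stdout+stderr of the oracle command
}

/// Assumed precedence: driver errors first, then the wall budget,
/// then the oracle verdict.
pub fn classify(run: &RunFacts) -> ArenaOutcome {
    if !run.driver_ok {
        return ArenaOutcome::DriverError;
    }
    if run.elapsed_secs > run.wall_secs {
        return ArenaOutcome::WallTimeout;
    }
    if oracle_passed(run.cargo_test_output) {
        ArenaOutcome::OraclePassed
    } else {
        ArenaOutcome::OracleFailedAfterMaxTurns
    }
}
```

One property worth noting: because the oracle is a substring match, a workspace whose test output contains several "test result" lines passes as soon as one of them reads ok.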
Why both?
Each path has a different failure mode that the other catches:
- Static path alone would let apr code "pass" by producing traces that look like claude's but cover none of the real-world capability surface. A perfect 1.0 parity score on a curated corpus means nothing if apr code can't solve a real bug.
- Arena path alone would let apr code "pass" by producing solutions that happen to work but via wildly different action sequences (e.g., a single 5000-line file_write vs. claude's careful read-edit-test loop). Outcome parity ≠ action parity; both matter.
FALSIFY-CCPA-019 (calibration_required_before_verdict) and FALSIFY-CCPA-016 (outcome_parity_bound) jointly enforce that the two paths' verdicts must agree, or the disagreement must be calibrated and explained.
When the paths disagree — the Popperian discipline
The M234 finding (phase-5 results) was a clean Popperian falsification of the static-fixture approach as a project-scale predictor:
- Static path: 1.0000 parity on canonical corpus (n=30, M150-M161)
- Arena path: claude 1/5, apr code 0/5 on phase-5 project-scale corpus (M234)
Direction agrees (claude > apr code), but magnitude diverges: the static path reports 1.0 parity, while on Arena apr code scores 0/5 (0.0) against claude's 1/5 (0.2). The static result over-predicts at project scale. This is recorded in docs/specifications/completeness-assessment.md, and the Arena scores are the ground truth for project-scale claims.
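The arithmetic behind that comparison is small enough to write down. The sketch below uses only the M234 numbers quoted above; pass_rate and over_prediction_gap are hypothetical helpers, and the gap metric (static score minus Arena pass rate) is one naive way to quantify the divergence, not the calculation defined by FALSIFY-CCPA-016 or FALSIFY-CCPA-019.

```rust
// Hypothetical helpers for illustrating the M234 disagreement.

/// Fraction of Arena fixtures whose oracle passed; 0.0 for an empty corpus.
pub fn pass_rate(passed: u32, total: u32) -> f64 {
    if total == 0 {
        0.0
    } else {
        f64::from(passed) / f64::from(total)
    }
}

/// Assumed metric: how far the static parity score sits above the
/// Arena pass rate it is supposed to predict.
pub fn over_prediction_gap(static_score: f64, arena_rate: f64) -> f64 {
    static_score - arena_rate
}
```

With the phase-5 figures, pass_rate(1, 5) gives 0.2 for claude and pass_rate(0, 5) gives 0.0 for apr code, so over_prediction_gap(1.0, 0.0) is 1.0: the static score over-predicts by the full width of the scale.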