The two measurement paths

CCPA's parity score is the output of two complementary measurement paths that cross-falsify each other.

Path 1 — Static (the meter)

fixtures/canonical/<id>/teacher.ccpa-trace.jsonl  ◄── AUTHORED
                                ▲
                                │  per-tool equivalence rules
                                │  + hook + skill projections
                                ▼
fixtures/canonical/<id>/student.ccpa-trace.jsonl  ◄── AUTHORED
                        │
                        ▼
            ccpa-differ::compute_parity_score
                        │
                        ▼
                    ParityReport
                  { score, drifts[] }
  • What it validates: the meter. Does the differ recognize equivalent actions? Does it catch the kinds of drift we care about? Does it ignore the noise we choose to ignore?
  • How it's wired: 30 canonical fixtures + a regression corpus (bidirectional sensitivity proof, M9) + per-PR CI hard-blocker (FALSIFY-CCPA-007 since M16).
  • What it cannot do: tell you whether apr code actually solves real tasks. Trace pairs are AUTHORED, so they validate the differ logic, not the real-world capability gap.
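
To make "the meter" concrete, here is a minimal sketch of a positional trace differ. Everything in it is an assumption for illustration: the Action type, the whitespace-trimming equivalence rule, the file_read and shell tool names, and the matched/total formula are hypothetical and are not the actual ccpa-differ::compute_parity_score implementation. Only the { score, drifts[] } report shape and the file_write tool name come from this document.

```rust
/// Hypothetical trace action: a tool name plus its primary argument.
#[derive(Debug, Clone, PartialEq)]
struct Action {
    tool: String,
    arg: String,
}

/// Report shape mirroring `{ score, drifts[] }` from the diagram.
#[derive(Debug)]
struct ParityReport {
    score: f64,
    drifts: Vec<String>,
}

/// Assumed equivalence rule: tool names must match exactly; arguments
/// are compared after trimming whitespace (noise we choose to ignore).
fn equivalent(a: &Action, b: &Action) -> bool {
    a.tool == b.tool && a.arg.trim() == b.arg.trim()
}

/// Assumed scoring: compare step by step and report
/// matched / max(teacher_len, student_len). Two empty traces score 1.0.
fn compute_parity_sketch(teacher: &[Action], student: &[Action]) -> ParityReport {
    let total = teacher.len().max(student.len());
    let mut drifts = Vec::new();
    let mut matched = 0usize;
    for i in 0..total {
        match (teacher.get(i), student.get(i)) {
            (Some(t), Some(s)) if equivalent(t, s) => matched += 1,
            (t, s) => drifts.push(format!("step {i}: teacher={t:?} student={s:?}")),
        }
    }
    let score = if total == 0 { 1.0 } else { matched as f64 / total as f64 };
    ParityReport { score, drifts }
}

/// Small constructor to keep the example readable.
fn act(tool: &str, arg: &str) -> Action {
    Action { tool: tool.to_string(), arg: arg.to_string() }
}

fn main() {
    let teacher = vec![act("file_read", "src/lib.rs"), act("shell", "cargo test")];
    // Step 0 differs only by trailing whitespace; step 1 uses a different tool.
    let student = vec![act("file_read", "src/lib.rs "), act("file_write", "src/lib.rs")];
    let report = compute_parity_sketch(&teacher, &student);
    println!("score = {:.2}, drifts = {}", report.score, report.drifts.len());
}
```

In this sketch the trailing-space difference is ignored as noise while the tool mismatch is reported as a drift, which are the two properties the static path is meant to check.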

Path 2 — Arena (the system)

fixtures/project-scale/<id>/{prompt.txt, cwd-tree/}
                        │
                        ▼
       Arena runner: live claude + live apr code
        (multi-turn, max_turns=20, wall=900s default)
                        │
                        ▼
            per-fixture oracle (cargo test 2>&1 | grep "test result: ok")
                        │
                        ▼
                    ArenaOutcome
            { OraclePassed | OracleFailedAfterMaxTurns
              | WallTimeout | DriverError | ComplianceFailed
              | ComplianceTrap | AgentTextLoop (M292) }
                        │
                        ▼
              evidence/phase-{5,6}/arena-scores.json
  • What it validates: the system. Does apr code solve real Rust bugs the way claude does?
  • How it's wired: multi-turn live subprocess dispatch. Runs are operator-coordinated because they require a claude login, a local GGUF model, and a GPU/CPU compute budget. Phase 5 (M194-M210) shipped the project-scale corpus; Phase 6 (M250+) adds under-contract dispatch, which runs pmat comply check --strict on every turn to measure compliance cost.
  • What it cannot do: tell you that the differ logic is right. Arena measures end-to-end behavior, not action-stream equivalence.
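
The oracle step above can be sketched as a pure function over captured cargo test output. The ArenaOutcome variant names are taken from the diagram; the oracle_passed and classify_final helpers, and the simplification that a run ends in only one of three outcomes, are assumptions for illustration and do not describe the real Arena runner.

```rust
/// Outcome variants as listed in the diagram above.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum ArenaOutcome {
    OraclePassed,
    OracleFailedAfterMaxTurns,
    WallTimeout,
    DriverError,
    ComplianceFailed,
    ComplianceTrap,
    AgentTextLoop,
}

/// Mirrors `cargo test 2>&1 | grep "test result: ok"`: like grep, this
/// passes as soon as any single line contains the marker string.
fn oracle_passed(cargo_test_output: &str) -> bool {
    cargo_test_output
        .lines()
        .any(|line| line.contains("test result: ok"))
}

/// Hypothetical end-of-run classification. It assumes the wall-clock
/// timeout takes precedence over the oracle, and it ignores the driver
/// and compliance variants entirely.
fn classify_final(cargo_test_output: &str, wall_timed_out: bool) -> ArenaOutcome {
    if wall_timed_out {
        ArenaOutcome::WallTimeout
    } else if oracle_passed(cargo_test_output) {
        ArenaOutcome::OraclePassed
    } else {
        ArenaOutcome::OracleFailedAfterMaxTurns
    }
}

fn main() {
    let output = "running 3 tests\ntest result: ok. 3 passed; 0 failed; 0 ignored";
    println!("{:?}", classify_final(output, false));
}
```

Because the check is a substring match on any line, a run that prints one passing summary line counts as a pass even if later output reports a failure. The sketch shares this property with the grep form of the oracle.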

Why both?

Each path has a different failure mode that the other catches:

  • Static path alone would let apr code "pass" by producing traces that look like claude's but cover none of the real-world capability surface. A perfect 1.0 parity score on a curated corpus means nothing if apr code can't solve a real bug.
  • Arena path alone would let apr code "pass" by producing solutions that happen to work but via wildly different action sequences (e.g., a single 5000-line file_write vs. claude's careful read-edit-test loop). Outcome parity ≠ action parity; both matter.

FALSIFY-CCPA-019 (calibration_required_before_verdict) and FALSIFY-CCPA-016 (outcome_parity_bound) jointly require that the two paths' verdicts agree, or that any disagreement be calibrated and explained.

When the paths disagree — the Popperian discipline

The M234 finding (phase-5 results) was a clean Popperian falsification of the static-fixture approach as a project-scale predictor:

  • Static path: 1.0000 parity on canonical corpus (n=30, M150-M161)
  • Arena path: claude 1/5, apr code 0/5 on phase-5 project-scale corpus (M234)

Direction agrees (claude > apr code), but magnitude diverges: on Arena, claude passes 1/5 and apr code 0/5, despite 1.0 parity on static. The static result over-predicts at project scale. This is recorded in docs/specifications/completeness-assessment.md, and the Arena scores are the ground truth for project-scale claims.
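
The M234 arithmetic can be worked through in a few lines. The sketch below assumes a simple disagreement test: the gap predicted by the static score (1 - parity) is compared with the observed gap in Arena pass rates. The paths_disagree function and the 0.05 tolerance are hypothetical and are not the actual FALSIFY-CCPA-016 or FALSIFY-CCPA-019 rules. Only the input numbers (1.0 parity, 1/5, 0/5) come from this document.

```rust
/// Pass rate as a fraction in [0, 1]; an empty corpus is treated as 0.0.
fn pass_rate(passed: u32, total: u32) -> f64 {
    if total == 0 {
        0.0
    } else {
        f64::from(passed) / f64::from(total)
    }
}

/// Hypothetical disagreement test: the static path predicts an outcome
/// gap of (1 - parity); the Arena path observes |teacher - student|.
/// The paths disagree when the two gaps differ by more than `tolerance`.
fn paths_disagree(
    static_parity: f64,
    teacher_rate: f64,
    student_rate: f64,
    tolerance: f64,
) -> bool {
    let predicted_gap = 1.0 - static_parity;
    let observed_gap = (teacher_rate - student_rate).abs();
    (observed_gap - predicted_gap).abs() > tolerance
}

fn main() {
    let claude = pass_rate(1, 5); // phase-5 Arena result for claude
    let apr_code = pass_rate(0, 5); // phase-5 Arena result for apr code
    // 0.05 is an illustrative tolerance, not a project constant.
    let flagged = paths_disagree(1.0, claude, apr_code, 0.05);
    println!("claude={claude:.1} apr_code={apr_code:.1} disagree={flagged}");
}
```

With these inputs the predicted gap is 0.0 and the observed gap is 0.2, so the sketch flags a disagreement. This is the situation in which the Arena scores, not the static score, are treated as authoritative.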