Bidirectional sensitivity

A parity meter has two failure modes:

False positive — declaring drift when traces are actually equivalent. Caught by the canonical corpus (fixtures/canonical/ MUST PASS).
False negative — declaring equivalence when traces actually diverge. Caught by the regression corpus (fixtures/regression/ MUST FAIL).

A meter that passes only the canonical corpus is not validated. It may be passing everything trivially. The regression corpus is the falsifier for the differ itself.

What "bidirectional" means here

The differ must be sensitive in both directions:

                   teacher == student (equivalent)
                              │
                              ▼
                       parity_score == 1.0
                              │
                       (canonical corpus
                        proves this direction)


                   teacher != student (deliberate drift)
                              │
                              ▼
                       parity_score < threshold
                              │
                       (regression corpus
                        proves this direction)

If either direction breaks, the meter is broken. The regression corpus exists because in M9 we caught a class of drift the differ wasn't sensitive to — the canonical corpus passed, but a known-bad pair also passed. That's a Tier 2 meter bug. Bidirectional sensitivity is the falsifier for it.

The M196-M224 bug stack

Through M196-M224 the team encountered four meter bugs in a row, each caught only by bidirectional sensitivity:

Bash command tokenization — cargo test --release and cargo test tokenized identically (the regression fixture for this case exposed it).
Glob result-set hashing — glob.results[] was being compared as a set, not a sequence, allowing reordered results to slip through.
Hook trigger projection — PreToolUse and PostToolUse hooks were collapsing onto the same target.
Sovereignty check ordering — Tier3 detection ran AFTER score computation, so a sovereignty violation could silently lower the score below threshold without being categorically flagged.

Each was caught by a regression fixture that the canonical corpus didn't catch. The four-bug stack is the empirical justification for FALSIFY-CCPA-019 (calibration_required_before_verdict) — every Arena verdict requires a fresh bidirectional sensitivity record on file.

The calibration contract — `FALSIFY-CCPA-019`

Shipped at M236. Codifies the M196-M224 lesson as a permanent gate:

no Arena verdict ships without a CalibrationRecord stamped within the last 90 days

The CalibrationRecord JSON shape lives in crates/ccpa-differ/src/calibration.rs. Each record contains: (a) canonical-corpus passes, (b) regression-corpus fails, (c) Tier3 sovereignty exercises, (d) cross-tool equivalence spot-checks. A stale record fails CI on the next Arena dispatch.

This is the only FALSIFY-CCPA- gate that fires on a measured artifact (a JSON file with a timestamp), not on a code-level test. It's the closest thing CCPA has to a runtime-only contract — and it's there for a hard-earned reason.

CCPA — The Claude Code Parity Harness

Bidirectional sensitivity

What "bidirectional" means here

The M196-M224 bug stack

The calibration contract — FALSIFY-CCPA-019

The calibration contract — `FALSIFY-CCPA-019`