CCPA — The Claude Code Parity Harness

A record-replay-distill harness measuring apr code against Claude Code at the action-stream level.

This book is the reference companion to the claude-code-parity-apr repository. It explains the methodology, the falsifier gates, the empirical findings, and the architectural decisions that shape every measurement.

Why this exists

A sovereign, locally-hosted coding agent (apr code) needs an honest, falsifiable yardstick to measure itself against the industry baseline (Claude Code). Without a rigorous yardstick:

"It works" claims drift from "it works like the reference"
Regressions hide behind narrative
The compliance posture of code an agent emits has no contract gate

CCPA closes that gap with three commitments:

Contract-first. Every behavior gate (FALSIFY-CCPA-001..020) is encoded as a falsifiable assertion in a YAML contract before code lands. Tests prove the gate; pv validate proves the contract; pmat comply proves the project's compliance posture. No code ships without a contract.
Two complementary measurement paths. A static path — authored teacher/student trace pairs scored by a deterministic differ — validates the meter. An Arena path — multi-turn live dispatches of real claude + real apr code against real Rust fixtures with test-shaped oracles — validates the system. The two paths cross-falsify each other.
Empirical calibration. Every Arena verdict requires a fresh bidirectional-sensitivity calibration on file (FALSIFY-CCPA-019). Static-fixture parity is calibrated against project-scale Arena reality; any drift between them is recorded and explained.

Honest framing

At function-scale (single-prompt code generation on HumanEval-style fixtures), claude and apr code are functionally interchangeable — both pass each other's tests (1.0000 parity, n=5, M150).

At project-scale (multi-turn Arena with real GitHub-issue fixtures), the static-fixture approach is Popperian-falsified as a project-scale predictor: claude solves 1/5 and apr code 0/5 on the phase-5 corpus (M234). The direction agrees with the static verdict; the magnitudes diverge.

The empirical chain in this book — M1 → M294 — is the honest record of what we measured, when, and how confident we are. Negative results are evidence; this book treats them as such.

Status as of writing

Contract v1.32.0 — 20 gates registered (16 ACTIVE_RUNTIME, 4 PROPOSED)
M0 → M294 all SHIPPED
Phase 6 under-contract dispatch in active operator-coordinated bench cycles against Qwen3-30B-A3B-Instruct-2507
V1_004 (Phase 6 non-zero student pass rate) is the open gate

How to read this book

Want the methodology in 10 minutes? → What is CCPA? + Methodology
Want to add a fixture or run a bench? → CLI reference
Want the empirical story (the interesting part)? → V1_004 chain
Want the academic basis? → Academic basis

License

Apache-2.0 OR MIT. See the repository root.

What is CCPA?

CCPA — the Claude Code Parity for apr code harness — is a measurement system. It does one job: produce a falsifiable, contract-gated parity score between two AI coding agents.

Teacher (the reference): Claude Code — Anthropic's official CLI, treated as the orchestrator and the action-stream baseline.
Student (the sovereign system under test): apr code — a locally-hosted, pure-Rust coding agent that runs against a local GGUF model with no data leaving the machine.

What "parity" means here

Parity is not "the two systems produce identical bytes." Parity is action-stream semantic equivalence under a per-tool rule set.

For each pair of trace records — teacher and student — the differ asks:

Did they invoke the same logical tool? (Bash ↔ Bash, Write ↔ Write, etc.)
Did the tool inputs differ in ways that matter? (commands semantically equivalent? file paths normalized? content byte-equal or text-equivalent?)
Did the resulting file-system mutations agree? (hash-checked)
Did the OS-event trace agree, modulo allowed nondeterminism?

The output is a parity score in [0.0, 1.0] plus, for any mismatch, a DriftCategory drawn from a closed enum. The score and category are mechanically asserted by FALSIFY-CCPA-004 through FALSIFY-CCPA-008.

What CCPA is NOT

Not a benchmark suite for general LLMs. The corpus is curated for the apr code ↔ claude parity question. SWE-bench, HumanEval, and similar exist for general benchmarking.
Not a record-from-API tool. The original HTTPS-proxy recording path has been intentionally out of scope since the M222 directive. claude is driven as a subprocess via session-based auth (claude login); CCPA does not use ANTHROPIC_API_KEY and does not call the Anthropic API directly.
Not a unit-test framework for claude. It's a parity harness — the meter between two systems.

Three deliverables, one repository

Deliverable	What it is	Where it lives
The differ	`ccpa-differ` crate + `ccpa diff` / `ccpa corpus` CLI	`crates/ccpa-differ/`
The Arena runner	`ccpa-arena` crate + `ccpa-arena-bench` binary	`crates/ccpa-arena/`
The fixtures	Canonical, regression, project-scale, calibration-and-scale, under-contract	`fixtures/`

All three are governed by one contract YAML — see Methodology.

Methodology — contract-first + falsifier-driven

CCPA is governed by a single methodology, applied uniformly: every behavior gate is an assertion in a YAML contract; the assertion exists before the code that proves it; CI mechanically validates both.

The cycle

1. Behavior identified              →  written prose
2. Falsifier composed               →  "this is exactly the assertion that would
                                       prove the gate WRONG if it failed"
3. Contract entry added             →  contracts/claude-code-parity-apr-v1.yaml
                                       (status: PROPOSED at first)
4. pv validate the contract         →  syntax + schema gate
5. Test that exercises the falsifier→  crates/ccpa-{differ,arena,...}/tests/
                                       (links the gate ID by name)
6. CI hard-blocks                   →  status flips ACTIVE_ALGORITHM_LEVEL
                                       once the test passes deterministically
7. Empirical evidence on file       →  flips ACTIVE_RUNTIME once a real
                                       measured discharge is recorded

No step is optional. No step happens in a different order. The cycle is enforced by FALSIFY-CCPA-012 (pre-commit + CI pv validate) and FALSIFY-CCPA-007 (corpus coverage).

Status flow for any gate

PROPOSED  ──── algorithm-level test passes deterministically ────→  ACTIVE_ALGORITHM_LEVEL
                                                                              │
                                                              measured discharge on file
                                                                              ▼
                                                                       ACTIVE_RUNTIME

PROPOSED: defined in the YAML, not yet asserted by a passing test.
ACTIVE_ALGORITHM_LEVEL: a deterministic test asserts the gate, but no real-world measurement has been recorded yet.
ACTIVE_RUNTIME: a real measured bench run (operator-dispatched, evidence captured) discharged the gate.

See Status flow for the exhaustive transition table.
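
The flow above can be read as a two-step state machine. The sketch below is a minimal illustration, not code from the repository: `GateStatus`, `Evidence`, and `advance` are hypothetical names, and only the three status labels come from the contract.

```rust
/// Gate statuses named in the contract (the Rust spelling is an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GateStatus {
    Proposed,
    ActiveAlgorithmLevel,
    ActiveRuntime,
}

/// Hypothetical evidence kinds that can move a gate forward.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Evidence {
    DeterministicTestPasses,
    MeasuredDischargeOnFile,
}

/// Returns the next status, or None when the evidence does not apply
/// to the current status.
fn advance(status: GateStatus, evidence: Evidence) -> Option<GateStatus> {
    match (status, evidence) {
        (GateStatus::Proposed, Evidence::DeterministicTestPasses) => {
            Some(GateStatus::ActiveAlgorithmLevel)
        }
        (GateStatus::ActiveAlgorithmLevel, Evidence::MeasuredDischargeOnFile) => {
            Some(GateStatus::ActiveRuntime)
        }
        _ => None,
    }
}

fn main() {
    let next = advance(GateStatus::Proposed, Evidence::DeterministicTestPasses);
    println!("{:?}", next);
}
```

Every pair not listed returns None, which mirrors the rule that no step is skipped and no step happens out of order.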

Three sources of truth

Concern	Lives in	Why
Contract YAML	`paiml/aprender/contracts/claude-code-parity-apr-v1.yaml` (canonical), pinned here via `contracts/pin.lock`	aprender is the org-wide single-source-of-truth for paiml contracts
Spec text	`docs/specifications/claude-code-parity-apr-poc.md`	This repo since M1
Implementation, fixtures, CI, coverage, pmat-comply	this repo	The harness IS the implementation

The split mirrors aprender's monorepo policy: aprender stays canonical for contract TEXT (the shared schema across all paiml contracts), while this repo is canonical for runtime ENFORCEMENT (the tests, fixtures, CI, and pmat comply posture).

Forbidden tools

cargo tarpaulin — slow, unreliable. Use cargo llvm-cov only.
bash re-implementations of pv / pmat / cargo-llvm-cov checks — if pv validate rejects a contract, fix the contract or extend aprender-contracts/src/schema/; do not duplicate validation logic in bash.

Code search policy

Prefer pmat query over grep for any Rust code search. pmat query returns quality-annotated, semantically ranked results (TDG grades, complexity, fault patterns); grep / rg returns only matching lines.

grep is acceptable only for non-Rust files (TOML, YAML, Markdown) or quick one-off debugging.

The two measurement paths

CCPA's parity score is the output of two complementary measurement paths that cross-falsify each other.

Path 1 — Static (the meter)

fixtures/canonical/<id>/teacher.ccpa-trace.jsonl  ◄── AUTHORED
                                ▲
                                │  per-tool equivalence rules
                                │  + hook + skill projections
                                ▼
fixtures/canonical/<id>/student.ccpa-trace.jsonl  ◄── AUTHORED
                        │
                        ▼
            ccpa-differ::compute_parity_score
                        │
                        ▼
                    ParityReport
                  { score, drifts[] }

What it validates: the meter. Does the differ recognize equivalent actions? Does it catch the kinds of drift we care about? Does it ignore the noise we choose to ignore?
How it's wired: 30 canonical fixtures + a regression corpus (bidirectional sensitivity proof, M9) + per-PR CI hard-blocker (FALSIFY-CCPA-007 since M16).
What it cannot do: tell you whether apr code actually solves real tasks. Trace pairs are AUTHORED; they prove the differ logic, not the real-world capability gap.

Path 2 — Arena (the system)

fixtures/project-scale/<id>/{prompt.txt, cwd-tree/}
                        │
                        ▼
       Arena runner: live claude + live apr code
        (multi-turn, max_turns=20, wall=900s default)
                        │
                        ▼
            per-fixture oracle (cargo test 2>&1 | grep "test result: ok")
                        │
                        ▼
                    ArenaOutcome
            { OraclePassed | OracleFailedAfterMaxTurns
              | WallTimeout | DriverError | ComplianceFailed
              | ComplianceTrap | AgentTextLoop (M292) }
                        │
                        ▼
              evidence/phase-{5,6}/arena-scores.json

What it validates: the system. Does apr code solve real Rust bugs the way claude does?
How it's wired: multi-turn live subprocess dispatch. Operator-coordinated (requires claude login + a local GGUF model + GPU/CPU compute budget). Phase 5 (M194-M210) shipped the project-scale corpus; Phase 6 (M250+) adds the under-contract dispatch (per-turn pmat comply check --strict to measure compliance cost).
What it cannot do: tell you that the differ logic is right. Arena measures end-to-end behavior, not action-stream equivalence.

Why both?

Each path has a different failure mode that the other catches:

Static path alone would let apr code "pass" by producing traces that look like claude's but cover none of the real-world capability surface. A perfect 1.0 parity score on a curated corpus means nothing if apr code can't solve a real bug.
Arena path alone would let apr code "pass" by producing solutions that happen to work but via wildly different action sequences (e.g., a single 5000-line file_write vs. claude's careful read-edit-test loop). Outcome parity ≠ action parity; both matter.

FALSIFY-CCPA-019 (calibration_required_before_verdict) and FALSIFY-CCPA-016 (outcome_parity_bound) jointly enforce that the two paths' verdicts must agree, or the disagreement must be calibrated and explained.

When the paths disagree — the Popperian discipline

The M234 finding (phase-5 results) was a clean Popperian-falsification of the static-fixture approach as a project-scale predictor:

Static path: 1.0000 parity on canonical corpus (n=30, M150-M161)
Arena path: claude 1/5, apr code 0/5 on phase-5 project-scale corpus (M234)

The direction agrees (claude > apr code), but the magnitude diverges: 1/5 vs 0/5 on Arena despite 1.0000 parity on static. The static result over-predicts at project-scale. This is recorded in docs/specifications/completeness-assessment.md, and the Arena scores are the ground truth for project-scale claims.

Architecture at a glance

Workspace layout

claude-code-parity-apr/
├── contracts/                 # pin.lock + smoke YAML; canonical YAML lives in aprender
├── crates/
│   ├── ccpa-trace/            # JSONL trace schema, types, validators
│   ├── ccpa-differ/           # per-tool equivalence rules, parity score
│   ├── ccpa-recorder/         # stream-json parser (claude side)
│   ├── ccpa-subproc/          # subprocess driver (deterministic stdout/stderr capture)
│   ├── ccpa-replayer/         # mock harness for replay determinism
│   ├── ccpa-arena/            # multi-turn live runner + bench binary
│   └── ccpa-cli/              # `ccpa` user-facing binary
├── docs/specifications/       # 25 spec files (all <500 LOC, doc-drift gated)
├── evidence/                  # per-phase measured-output snapshots
├── fixtures/                  # canonical, regression, project-scale, calibration-and-scale, under-contract
└── scripts/                   # bench dispatch + drift detectors

Crate dependency graph

                       ccpa-cli
                          │
            ┌─────────────┼─────────────┐
            ▼             ▼             ▼
       ccpa-differ    ccpa-arena   ccpa-recorder
            │             │             │
            └─────────────┼─────────────┘
                          ▼
                     ccpa-trace
                          │
                          ▼
                     ccpa-subproc

ccpa-trace is the schema root: every crate consumes its Trace, Record, ToolUse, and ToolResult types. A new trace record kind is added here first; the schema bump then propagates to every dependent crate through compile-time type checks.

How `ccpa diff` produces a parity score

Load both JSONL files via ccpa-trace::parse::parse_file. The parser hard-enforces schema v2 (hook_event + skill_invocation records added at M15).
Pair records by index. The two traces must have exactly the same number of records (a record-count imbalance is a hard error; see the tool_call_equivalence falsifier).
Project hook events and skill invocations onto their target tool record (M15 hook/skill semantics).
Match each paired record under its per-tool equivalence rule:
- Bash: command tokenization + whitelist of allowed nondeterminism
- Write/Edit: post-state file SHA256 must agree
- Read: path + range + content excerpt
- Skill: invocation site + arguments
- Hook: trigger + target tool's invocation
Score: count matches, divide by total. Score ∈ [0.0, 1.0].
Categorize drifts: any mismatch is classified into a closed DriftCategory enum. Tier 0 = no drift; Tier 1 = cosmetic; Tier 2 = semantic; Tier 3 = sovereignty violation (see crates/ccpa-differ/src/sovereignty.rs).
Report: ParityReport { score, drifts[] } — JSON-serializable, the unit of measurement.
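
The pairing and scoring steps above can be sketched as follows. This is a toy model under stated assumptions: `ToyRecord`, `DiffError`, and `toy_parity_score` are hypothetical names, only two tools are modelled, hashes are passed in as strings rather than computed, and hook/skill projection is omitted. The real entry point is compute_parity_score in ccpa-differ.

```rust
/// Toy stand-in for a trace record (hypothetical; not the ccpa-trace type).
#[derive(Debug, Clone, PartialEq)]
enum ToyRecord {
    Bash { command: String },
    Write { path: String, post_sha256: String },
}

#[derive(Debug, PartialEq)]
enum DiffError {
    /// Teacher and student traces have different lengths.
    RecordImbalance { teacher: usize, student: usize },
}

/// Per-tool rule dispatch: a different tool is never equivalent.
fn equivalent(t: &ToyRecord, s: &ToyRecord) -> bool {
    match (t, s) {
        (ToyRecord::Bash { command: a }, ToyRecord::Bash { command: b }) => {
            // Compare whitespace-split token sequences.
            a.split_whitespace().eq(b.split_whitespace())
        }
        (
            ToyRecord::Write { path: pa, post_sha256: ha },
            ToyRecord::Write { path: pb, post_sha256: hb },
        ) => pa == pb && ha == hb,
        _ => false,
    }
}

/// Pairs records by index, then divides matches by total pairs.
fn toy_parity_score(teacher: &[ToyRecord], student: &[ToyRecord]) -> Result<f64, DiffError> {
    if teacher.len() != student.len() {
        return Err(DiffError::RecordImbalance {
            teacher: teacher.len(),
            student: student.len(),
        });
    }
    if teacher.is_empty() {
        return Ok(1.0); // assumption: two empty traces trivially agree
    }
    let matches = teacher
        .iter()
        .zip(student)
        .filter(|(t, s)| equivalent(t, s))
        .count();
    Ok(matches as f64 / teacher.len() as f64)
}

fn main() {
    let teacher = vec![
        ToyRecord::Bash { command: "cargo test".to_string() },
        ToyRecord::Write { path: "src/lib.rs".to_string(), post_sha256: "aaaa".to_string() },
    ];
    let student = vec![
        ToyRecord::Bash { command: "cargo test".to_string() },
        ToyRecord::Write { path: "src/lib.rs".to_string(), post_sha256: "bbbb".to_string() },
    ];
    println!("{:?}", toy_parity_score(&teacher, &student));
}
```

The length check runs before any pairing, so an imbalance surfaces as an error rather than as a lowered score.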

How `ccpa-arena-bench` runs a fixture

1. Copy fixture's cwd-tree to /tmp/p6-uc-<fixture>-<side>.<rand>
2. Read prompt.txt
3. Launch driver subprocess:
     - teacher: claude --output-format=stream-json --print "<prompt>"
     - student: apr code --model=<path> -p "<prompt>" + apr serve auto-spawned
4. Multi-turn loop (max_turns=20 default; wall=900s in Phase 5, 3600s in Phase 6):
   a. Render history into prompt suffix
   b. driver.next_turn(prompt + history) → NextTurn { blocks, stop_reason }
   c. Extract first ToolUse block → dispatch in fixture cwd
   d. Append TurnRecord to history
   e. Every K turns (oracle_check_interval: 5 in Phase 5, 3 in Phase 6) OR on EndTurn:
      - Run oracle: cargo test 2>&1 | grep "test result: ok"
      - Pass → return OraclePassed
   f. Phase 6 only: if compliance_enforced, per-Write/Edit run pmat comply check
   g. Trap detectors: ComplianceTrap (N consecutive same-(file,sha) failures),
      AgentTextLoop (N consecutive text-only turns, M292, opt-in)
5. On max_turns / wall / driver_error / compliance_trap → return the appropriate ArenaOutcome
6. Emit BenchResult JSON to evidence/<phase>/captures/<fixture>/<side>.bench.json

The cleanly-typed outcome enum lets aggregate scoring (recovery_rate, oracle_passed_rate, compliance_cost_ratio) pattern-match without parsing strings.
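
A minimal sketch of what that pattern matching could look like. The variant names and field names follow this book; the field types and the `oracle_passed_rate` signature are assumptions, and the outcomes in `main` are made up for illustration.

```rust
/// Outcome enum as described in this book (field types are assumptions).
#[derive(Debug, Clone, PartialEq)]
enum ArenaOutcome {
    OraclePassed,
    OracleFailedAfterMaxTurns,
    WallTimeout,
    DriverError { reason: String, turns_before_error: u32 },
    ComplianceFailed,
    ComplianceTrap { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop { consecutive_text_turns: u32, last_text_excerpt: String },
}

/// Fraction of outcomes that are OraclePassed; None for an empty slice.
fn oracle_passed_rate(outcomes: &[ArenaOutcome]) -> Option<f64> {
    if outcomes.is_empty() {
        return None;
    }
    let passed = outcomes
        .iter()
        .filter(|o| matches!(o, ArenaOutcome::OraclePassed))
        .count();
    Some(passed as f64 / outcomes.len() as f64)
}

fn main() {
    // Made-up outcomes, for illustration only.
    let outcomes = vec![
        ArenaOutcome::OraclePassed,
        ArenaOutcome::WallTimeout,
        ArenaOutcome::ComplianceFailed,
        ArenaOutcome::DriverError { reason: "spawn failed".to_string(), turns_before_error: 2 },
        ArenaOutcome::ComplianceTrap {
            file: "src/lib.rs".to_string(),
            last_reason: "lint".to_string(),
            consecutive_count: 3,
        },
        ArenaOutcome::AgentTextLoop {
            consecutive_text_turns: 4,
            last_text_excerpt: "I will now...".to_string(),
        },
        ArenaOutcome::OracleFailedAfterMaxTurns,
    ];
    println!("{:?}", oracle_passed_rate(&outcomes));
}
```

Because the enum is closed, adding a variant forces every exhaustive `match` over it to be revisited at compile time.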

Two binaries, one config space

ccpa — user-facing CLI for the static path (diff, corpus, coverage, validate)
ccpa-arena-bench — Arena dispatcher (operator-coordinated)

Both consume the same Trace/ArenaOutcome types and emit the same JSON shapes downstream tools depend on.

Trace schema

The trace schema is the language CCPA speaks. Everything (the differ, the Arena runner, the replayer) operates on Trace objects: a sequence of Record values, each describing one observable action.

The 7 record kinds (schema v2)

Kind	Fields	When emitted
`session_start`	`session_id`, `cwd`, `git_commit`	First record of every trace
`user_prompt`	`text`, `attachments[]`	User-initiated turn
`assistant_turn`	`text`, `blocks[]`, `stop_reason`	Model response
`tool_result`	`tool_use_id`, `content`, `is_error`	Tool execution result
`session_end`	`reason`	Last record (clean shutdown or interrupt)
`hook_event`	`hook_name`, `trigger`, `tool_use_id?`	Hook fired (schema v2, M15)
`skill_invocation`	`skill_name`, `args`	Skill invoked (schema v2, M15)

assistant_turn.blocks[] is a polymorphic array — each block is one of:

Text { text } — model output text
ToolUse { id, name, input } — a tool call (Bash, Read, Write, Edit, Glob, Grep, Shell, ...)
Thinking { text } — extended thinking (claude-only; optional)

The Rust types are mirrored in crates/ccpa-trace/src/lib.rs; the JSON-schema is in contracts/claude-code-parity-apr-v1.yaml § trace_schema.
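
As a rough picture of the polymorphic block array, the sketch below models the three block kinds as a Rust enum. It is an illustration under assumptions, not the definition in crates/ccpa-trace/src/lib.rs: `input` is kept as raw JSON text to avoid a JSON dependency, and `first_tool_use` is a hypothetical helper.

```rust
/// The three block kinds of assistant_turn.blocks[] (sketch only).
#[derive(Debug, Clone, PartialEq)]
enum Block {
    Text { text: String },
    /// `input` holds raw JSON text here (an assumption made for brevity).
    ToolUse { id: String, name: String, input: String },
    Thinking { text: String },
}

/// Returns the first tool call in a turn, if any.
fn first_tool_use(blocks: &[Block]) -> Option<&Block> {
    blocks.iter().find(|b| matches!(b, Block::ToolUse { .. }))
}

fn main() {
    let blocks = vec![
        Block::Thinking { text: "The test failure points at src/lib.rs.".to_string() },
        Block::Text { text: "I'll start by reading the file.".to_string() },
        Block::ToolUse {
            id: "tu_1".to_string(),
            name: "Read".to_string(),
            input: r#"{"path":"src/lib.rs"}"#.to_string(),
        },
    ];
    if let Some(Block::ToolUse { name, .. }) = first_tool_use(&blocks) {
        println!("first tool call: {name}");
    }
}
```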

File format — JSONL (one record per line)

{"kind":"session_start","session_id":"abc-123","cwd":"/tmp/fixture-0001","git_commit":"deadbeef"}
{"kind":"user_prompt","text":"Fix the failing test."}
{"kind":"assistant_turn","blocks":[{"type":"text","text":"I'll start by reading the file."},{"type":"tool_use","id":"tu_1","name":"Read","input":{"path":"src/lib.rs"}}],"stop_reason":"tool_use"}
{"kind":"tool_result","tool_use_id":"tu_1","content":"<file contents>","is_error":false}
...
{"kind":"session_end","reason":"end_turn"}

JSONL means line-oriented, append-only, streamable. The parser at ccpa-trace::parse::parse_file is O(n) and emits structured errors with line numbers.
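
To make "structured errors with line numbers" concrete, here is a toy line-oriented checker. It is not the ccpa-trace parser: `LineError`, `extract_kind`, and `check_kinds` are hypothetical names, and the kind is pulled out with plain string search (assuming compact JSON) where a real parser would use a JSON library.

```rust
/// The seven record kinds of schema v2.
const KNOWN_KINDS: [&str; 7] = [
    "session_start", "user_prompt", "assistant_turn", "tool_result",
    "session_end", "hook_event", "skill_invocation",
];

/// Error carrying the 1-based line number of the offending record.
#[derive(Debug, PartialEq)]
struct LineError {
    line: usize,
    message: String,
}

/// Extracts the value of the "kind" field by string search.
/// Toy approach: assumes no whitespace around the colon.
fn extract_kind(line: &str) -> Option<&str> {
    let key = "\"kind\":\"";
    let start = line.find(key)? + key.len();
    let len = line[start..].find('"')?;
    Some(&line[start..start + len])
}

/// Single pass over the input, one record per line.
fn check_kinds(jsonl: &str) -> Result<Vec<String>, LineError> {
    let mut kinds = Vec::new();
    for (idx, line) in jsonl.lines().enumerate() {
        let kind = extract_kind(line).ok_or_else(|| LineError {
            line: idx + 1,
            message: "missing \"kind\" field".to_string(),
        })?;
        if !KNOWN_KINDS.contains(&kind) {
            return Err(LineError {
                line: idx + 1,
                message: format!("unknown kind: {kind}"),
            });
        }
        kinds.push(kind.to_string());
    }
    Ok(kinds)
}

fn main() {
    let trace = concat!(
        r#"{"kind":"session_start","session_id":"abc-123"}"#, "\n",
        r#"{"kind":"session_end","reason":"end_turn"}"#, "\n",
    );
    println!("{:?}", check_kinds(trace));
}
```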

Roundtrip falsifier — `FALSIFY-CCPA-001`

Every record kind has a roundtrip test: serialize → parse → re-serialize → compare. If any field is lost, altered, or reordered along the way, the roundtrip falsifier catches it.

17 pin tests in crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs.

Schema versioning

v1 (M0-M14): 5 record kinds (session_start, user_prompt, assistant_turn, tool_result, session_end).
v2 (M15+): adds hook_event and skill_invocation. The differ's hook/skill projection rules require these.

Schema bumps follow the Methodology cycle — contract YAML first, then tests, then code.

The differ

ccpa-differ is the heart of the static path. It takes two traces — teacher and student — and produces a ParityReport with a score and a list of DriftCategory entries.

Entry point — `compute_parity_score`

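use ccpa_differ::{compute_parity_score, ParityReport};
use ccpa_trace::Trace;

// Wrapped in main so that `?` has a Result to propagate into.
// Assumes the parse error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same parser path as named in the prose: ccpa-trace::parse::parse_file.
    let teacher: Trace = ccpa_trace::parse::parse_file("teacher.ccpa-trace.jsonl")?;
    let student: Trace = ccpa_trace::parse::parse_file("student.ccpa-trace.jsonl")?;

    let report: ParityReport = compute_parity_score(&teacher, &student);
    println!("score = {}, drifts = {}", report.score, report.drifts.len());
    Ok(())
}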

Per-tool equivalence rules

The differ's behavior is dispatched on ToolUse.name:

Tool	Rule
`Bash` / `Shell`	Tokenize command; whitelist allowed nondeterminism (`mktemp -p` paths, ISO-8601 timestamps, PID); compare token sequences
`Read`	Path equal (after canonicalization) + range overlap; content excerpt SHA256 equal
`Write`	Path equal; post-state file SHA256 equal (the file mutation IS the equivalence claim)
`Edit`	Path equal; old/new strings equal; post-state file SHA256 equal
`Glob`	Pattern equal; result-count equal modulo cwd; result-paths SHA256-equal
`Grep`	Pattern equal; flag equivalence; result line-count equal
`Hook`	Trigger equal; target tool's invocation equal
`Skill`	Name equal; args structurally equal

Each rule is one Rust function in crates/ccpa-differ/src/; adding a tool requires (1) the rule, (2) a falsifier test, (3) a contract YAML entry.
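
As an illustration of the Bash rule, the sketch below tokenizes on whitespace, replaces tokens shaped like one narrow ISO-8601 form with a placeholder, and compares the resulting sequences. It is a simplified stand-in, not the rule in crates/ccpa-differ/src/: all function names are hypothetical, shell quoting is ignored, and only timestamps are whitelisted (no mktemp paths or PIDs).

```rust
/// Shape of the one timestamp form recognised here, e.g. 2024-01-31T12:34:56Z.
/// 'd' marks a digit position; every other byte must match literally.
const TIMESTAMP_SHAPE: &[u8] = b"dddd-dd-ddTdd:dd:ddZ";

fn looks_like_timestamp(token: &str) -> bool {
    let bytes = token.as_bytes();
    bytes.len() == TIMESTAMP_SHAPE.len()
        && bytes.iter().zip(TIMESTAMP_SHAPE).all(|(&c, &shape)| {
            if shape == b'd' { c.is_ascii_digit() } else { c == shape }
        })
}

/// Whitespace tokenization with whitelisted nondeterminism masked out.
fn normalize(command: &str) -> Vec<String> {
    command
        .split_whitespace()
        .map(|token| {
            if looks_like_timestamp(token) {
                "<TIMESTAMP>".to_string()
            } else {
                token.to_string()
            }
        })
        .collect()
}

/// Two commands are equivalent when their normalized token sequences match.
fn bash_equivalent(teacher: &str, student: &str) -> bool {
    normalize(teacher) == normalize(student)
}

fn main() {
    let a = "echo built-at 2024-01-31T12:34:56Z";
    let b = "echo built-at 2025-06-01T00:00:00Z";
    println!("timestamps masked: {}", bash_equivalent(a, b));
    println!("flag dropped: {}", bash_equivalent("cargo test --release", "cargo test"));
}
```

Comparing sequences rather than sets keeps a dropped or reordered flag visible as a difference.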

DriftCategory — the closed enum

pub enum DriftCategory {
    Tier0NoDrift,
    Tier1Cosmetic { detail: String },        // whitespace, timestamp jitter
    Tier2Semantic { detail: String },        // different file content, different command
    Tier3SovereigntyViolation { detail: String },  // network egress, foreign-API call
}

Tier3 is the hardest gate. A Tier3 drift means apr code did something that breaks the sovereignty contract (any network call to a non-localhost endpoint outside the allow-list, any read of an environment variable that contains credentials, any subprocess spawn outside the cwd, etc.). Even one Tier3 drift hard-fails CI.

How the score is computed

total_pairs = teacher.records.len()                  # must equal student.records.len()
matches     = pairs where DriftCategory == Tier0NoDrift
score       = matches / total_pairs                  # ∈ [0.0, 1.0]

The threshold for FALSIFY-CCPA-008 (parity_score_bound) is configured in the contract YAML; current canonical-corpus threshold is ≥ 0.95 (with 30 fixtures, this means at most 1 fixture can have any drift).
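
The arithmetic behind that parenthetical can be checked directly. The sketch assumes the aggregate is the fraction of fully matching fixtures out of 30, which is how the parenthetical reads; `parity_score` and `meets_threshold` are hypothetical helper names.

```rust
/// matches / total, or None when the inputs cannot form a valid score.
fn parity_score(matches: usize, total: usize) -> Option<f64> {
    if total == 0 || matches > total {
        return None;
    }
    Some(matches as f64 / total as f64)
}

/// True when the score exists and is at or above the threshold.
fn meets_threshold(matches: usize, total: usize, threshold: f64) -> bool {
    parity_score(matches, total).map_or(false, |score| score >= threshold)
}

fn main() {
    // 30 fixtures, threshold 0.95, as stated in the text.
    for matches in [30, 29, 28] {
        println!(
            "{matches}/30 -> {:?}, passes: {}",
            parity_score(matches, 30),
            meets_threshold(matches, 30, 0.95)
        );
    }
}
```

29/30 is roughly 0.967 and clears the bar; 28/30 is roughly 0.933 and does not, hence "at most 1".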

Corpus driver — `ccpa corpus`

ccpa corpus fixtures/canonical/                 # walks every fixture, computes per-fixture + aggregate score
ccpa corpus fixtures/regression/                # MUST FAIL (bidirectional sensitivity proof)
ccpa corpus fixtures/canonical/ --json          # machine-readable for CI

Aggregate scoring respects FALSIFY-CCPA-007 (corpus coverage): every required-row of the apr-code-parity-v1.yaml parity matrix must have at least one fixture exercising it. Missing coverage → exit 2 with a structured error pointing at the gap.

What the differ does NOT do

Does not run code. It reads two traces; that's it. The Arena runner is for live execution.
Does not infer intent. "Same effect, different tool" is not equivalence under CCPA. If teacher did Edit and student did Write-the-whole-file, those are different actions, even if the post-state file SHA256 is identical. The contract gates the action stream, not just the file system.
Does not allow nondeterminism by default. Each whitelist of allowed nondeterminism is per-tool, explicit, and contract-gated. Adding a new whitelist entry requires a contract bump.

Fixtures

CCPA has five distinct fixture corpora, each measuring a different thing.

1. `fixtures/canonical/` — the meter

30 fixtures, every required-row of apr-code-parity-v1.yaml exercised at least once.
AUTHORED teacher/student trace pairs.
MUST score ≥ threshold in ccpa corpus. Per-PR CI hard-blocker via FALSIFY-CCPA-007.
Aggregate parity = 1.0000 at canonical corpus (M150, fixtures/canonical/measured-parity.json).

2. `fixtures/regression/` — bidirectional sensitivity proof

Fixtures with deliberate drift — teacher and student diverge in known ways.
MUST FAIL ccpa corpus. If a regression fixture passes, the differ has lost sensitivity to that drift class.
Catches "the meter agrees on everything" bugs (M9 introduced this corpus).

3. `fixtures/project-scale/` — Phase 5 Arena corpus

5 real GitHub-issue Rust fixtures with full cwd-tree/, prompt.txt, oracle.
Each fixture is a real Rust bug or feature request that an agent must solve in a multi-turn session.
M234 finding: claude 1/5, apr code 0/5. Direction agrees with static verdict; magnitudes diverge.

4. `fixtures/calibration-and-scale/` — synthetic-deterministic project-scale

15 hand-authored Rust bug fixtures.
Deterministic seed; reproducible from clean clone.
Bridges the static path (controlled) and project-scale Arena (real-world) via a controlled Arena-style measurement.

5. `fixtures/under-contract/` — Phase 6 corpus

20 fixtures across 4 classes: leetcode, oo (OO patterns), transpile (format converters), unix (CLI utilities).
Each runs under the Phase 6 compound oracle: cargo test AND pmat comply check --strict.
The corpus that V1_004 dispatches against.

Fixture file layout

fixtures/canonical/0001-edit-readme/
├── meta.toml                       # fixture id, covers[], description
├── teacher.ccpa-trace.jsonl        # AUTHORED teacher action stream
└── student.ccpa-trace.jsonl        # AUTHORED student action stream

fixtures/under-contract/leetcode/01-two-sum/
├── prompt.txt                      # the task description shown to both agents
├── meta.toml                       # oracle_cmd, expected_pattern
└── cwd-tree/
    ├── Cargo.toml
    ├── src/lib.rs                  # the buggy code
    └── tests/...

Adding a fixture

mkdir fixtures/canonical/00XX-my-scenario

cat > fixtures/canonical/00XX-my-scenario/meta.toml <<EOF
[fixture]
id = "00XX-my-scenario"
covers = ["builtin-tools-rwegs"]
description = "What this fixture exercises and why."
EOF

# Author teacher.ccpa-trace.jsonl + student.ccpa-trace.jsonl

ccpa corpus fixtures/canonical/                            # MUST exit 0
ccpa coverage --apr-code-parity-yaml ... --oos-rows ...    # MUST exit 0
make tier3                                                 # full local gate sweep

Coverage gates fail if a fixture is added without a covers[] claim or if covers[] contains a row not in apr-code-parity-v1.yaml. The contract YAML drives fixture validation, not the other way around.
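
A minimal sketch of those coverage conditions, assuming the parity matrix has already been loaded into plain string slices. `Fixture`, `CoverageReport`, and `check_coverage` are hypothetical names, and the row `hypothetical-row` is invented for the example; only `builtin-tools-rwegs` appears in this book.

```rust
use std::collections::BTreeSet;

/// A fixture id with its covers[] claims (hypothetical shape).
struct Fixture<'a> {
    id: &'a str,
    covers: Vec<&'a str>,
}

#[derive(Debug, Default, PartialEq)]
struct CoverageReport {
    /// Required rows that no fixture exercises.
    missing_rows: Vec<String>,
    /// Fixtures with an empty covers[] list.
    fixtures_without_claim: Vec<String>,
    /// "fixture: row" pairs where the row is not in the matrix.
    unknown_claims: Vec<String>,
}

fn check_coverage(
    known_rows: &[&str],
    required_rows: &[&str],
    fixtures: &[Fixture<'_>],
) -> CoverageReport {
    let known: BTreeSet<&str> = known_rows.iter().copied().collect();
    let mut claimed: BTreeSet<&str> = BTreeSet::new();
    let mut report = CoverageReport::default();
    for fixture in fixtures {
        if fixture.covers.is_empty() {
            report.fixtures_without_claim.push(fixture.id.to_string());
        }
        for &row in &fixture.covers {
            if known.contains(row) {
                claimed.insert(row);
            } else {
                report.unknown_claims.push(format!("{}: {}", fixture.id, row));
            }
        }
    }
    report.missing_rows = required_rows
        .iter()
        .filter(|row| !claimed.contains(*row))
        .map(|row| row.to_string())
        .collect();
    report
}

fn main() {
    let rows = ["builtin-tools-rwegs", "hypothetical-row"];
    let fixtures = vec![Fixture { id: "0001-edit-readme", covers: vec!["builtin-tools-rwegs"] }];
    // Both rows are treated as required in this example.
    println!("{:?}", check_coverage(&rows, &rows, &fixtures));
}
```

An empty report corresponds to the gate passing; any non-empty field is a gap to report.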

Bidirectional sensitivity

A parity meter has two failure modes:

False positive — declaring drift when traces are actually equivalent. Caught by the canonical corpus (fixtures/canonical/ MUST PASS).
False negative — declaring equivalence when traces actually diverge. Caught by the regression corpus (fixtures/regression/ MUST FAIL).

A meter that passes only the canonical corpus is not validated. It may be passing everything trivially. The regression corpus is the falsifier for the differ itself.

What "bidirectional" means here

The differ must be sensitive in both directions:

                   teacher == student (equivalent)
                              │
                              ▼
                       parity_score == 1.0
                              │
                       (canonical corpus
                        proves this direction)


                   teacher != student (deliberate drift)
                              │
                              ▼
                       parity_score < threshold
                              │
                       (regression corpus
                        proves this direction)

If either direction breaks, the meter is broken. The regression corpus exists because in M9 we caught a class of drift the differ wasn't sensitive to — the canonical corpus passed, but a known-bad pair also passed. That's a Tier 2 meter bug. Bidirectional sensitivity is the falsifier for it.
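
The two directions can be expressed as one check over both corpora. This is a sketch under assumptions: `MeterVerdict` and `validate_meter` are hypothetical names, scores are supplied as precomputed numbers, and the same threshold is used for both directions.

```rust
#[derive(Debug, PartialEq)]
enum MeterVerdict {
    Validated,
    /// A canonical (equivalent) pair scored below the threshold.
    FalsePositive { fixture: String },
    /// A regression (deliberately drifted) pair scored at or above the threshold.
    FalseNegative { fixture: String },
}

/// Canonical fixtures must pass; regression fixtures must fail.
fn validate_meter(
    canonical: &[(&str, f64)],
    regression: &[(&str, f64)],
    threshold: f64,
) -> MeterVerdict {
    for &(id, score) in canonical {
        if score < threshold {
            return MeterVerdict::FalsePositive { fixture: id.to_string() };
        }
    }
    for &(id, score) in regression {
        if score >= threshold {
            return MeterVerdict::FalseNegative { fixture: id.to_string() };
        }
    }
    MeterVerdict::Validated
}

fn main() {
    // Made-up scores: the regression fixture "passes", so the meter is broken.
    let canonical = [("0001-edit-readme", 1.0)];
    let regression = [("r-001", 1.0)];
    println!("{:?}", validate_meter(&canonical, &regression, 0.95));
}
```

A differ that returns 1.0 for every pair satisfies the first loop and is rejected by the second, which is the trivial-pass case described above.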

The M196-M224 bug stack

Through M196-M224 the team encountered four meter bugs in a row, each caught only by bidirectional sensitivity:

Bash command tokenization — cargo test --release and cargo test tokenized identically (the regression fixture for this case exposed it).
Glob result-set hashing — glob.results[] was being compared as a set, not a sequence, allowing reordered results to slip through.
Hook trigger projection — PreToolUse and PostToolUse hooks were collapsing onto the same target.
Sovereignty check ordering — Tier3 detection ran AFTER score computation, so a sovereignty violation could silently lower the score below threshold without being categorically flagged.

Each was caught by a regression fixture after the canonical corpus had missed it. The four-bug stack is the empirical justification for FALSIFY-CCPA-019 (calibration_required_before_verdict): every Arena verdict requires a fresh bidirectional sensitivity record on file.
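
The fourth bug is an ordering problem, and a small sketch shows the ordering that avoids it: check for a sovereignty violation first, and only then compare the score to the threshold. `Tier`, `Verdict`, and `verdict` are hypothetical names, not the types in ccpa-differ.

```rust
/// Per-pair drift tier (detail strings omitted for brevity).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    NoDrift,
    Cosmetic,
    Semantic,
    SovereigntyViolation,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    SovereigntyViolation,
    BelowThreshold { score: f64 },
    Pass { score: f64 },
}

fn verdict(pairs: &[Tier], threshold: f64) -> Verdict {
    // Categorical check first: one Tier3 drift fails the run whatever the score is.
    if pairs.iter().any(|tier| *tier == Tier::SovereigntyViolation) {
        return Verdict::SovereigntyViolation;
    }
    let matches = pairs.iter().filter(|tier| **tier == Tier::NoDrift).count();
    // Assumption: an empty trace counts as a perfect score.
    let score = if pairs.is_empty() {
        1.0
    } else {
        matches as f64 / pairs.len() as f64
    };
    if score >= threshold {
        Verdict::Pass { score }
    } else {
        Verdict::BelowThreshold { score }
    }
}

fn main() {
    let pairs = [Tier::NoDrift, Tier::Cosmetic, Tier::NoDrift, Tier::SovereigntyViolation];
    println!("{:?}", verdict(&pairs, 0.5));
}
```

With the check placed after scoring, the run in `main` would only show up as a score of 0.5 and the violation itself would go unreported.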

The calibration contract — `FALSIFY-CCPA-019`

Shipped at M236. Codifies the M196-M224 lesson as a permanent gate:

no Arena verdict ships without a CalibrationRecord stamped within the last 90 days

The CalibrationRecord JSON shape lives in crates/ccpa-differ/src/calibration.rs. Each record contains: (a) canonical-corpus passes, (b) regression-corpus fails, (c) Tier3 sovereignty exercises, (d) cross-tool equivalence spot-checks. A stale record fails CI on the next Arena dispatch.
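
A freshness check along these lines might look like the sketch below. The struct is a hypothetical stand-in for the real CalibrationRecord (its field names and types are assumptions), and treating exactly 90 days as still fresh is also an assumption.

```rust
use std::time::{Duration, SystemTime};

/// Maximum record age: 90 days.
const MAX_AGE: Duration = Duration::from_secs(90 * 24 * 60 * 60);

/// Hypothetical shape; the real type lives in
/// crates/ccpa-differ/src/calibration.rs and may differ.
struct CalibrationRecord {
    stamped_at: SystemTime,
    canonical_corpus_passes: bool,
    regression_corpus_fails: bool,
    tier3_exercised: bool,
    cross_tool_spot_checks_ok: bool,
}

/// A record allows a verdict when it is fresh and all four parts hold.
fn allows_verdict(record: &CalibrationRecord, now: SystemTime) -> bool {
    let fresh = match now.duration_since(record.stamped_at) {
        Ok(age) => age <= MAX_AGE,
        Err(_) => false, // assumption: a stamp in the future is invalid
    };
    fresh
        && record.canonical_corpus_passes
        && record.regression_corpus_fails
        && record.tier3_exercised
        && record.cross_tool_spot_checks_ok
}

fn main() {
    let record = CalibrationRecord {
        stamped_at: SystemTime::now(),
        canonical_corpus_passes: true,
        regression_corpus_fails: true,
        tier3_exercised: true,
        cross_tool_spot_checks_ok: true,
    };
    println!("{}", allows_verdict(&record, SystemTime::now()));
}
```

Passing `now` in as a parameter keeps the check deterministic in tests.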

This is the only FALSIFY-CCPA- gate that fires on a measured artifact (a JSON file with a timestamp), not on a code-level test. It's the closest thing CCPA has to a runtime-only contract — and it's there for a hard-earned reason.

Arena runner overview

The Arena is CCPA's live-execution path. It dispatches real claude and real apr code subprocesses against real Rust bugs in real cwd-trees, and scores each via a test-shaped oracle.

The Arena loop (per fixture, per side)

1. Copy fixture's cwd-tree to /tmp/p6-uc-<fixture>-<side>.<rand>
2. Read prompt.txt
3. Launch driver subprocess via SubprocessDriver:
     teacher: claude --output-format=stream-json --print "<prompt>"
     student: apr code --model=<path> -p "<prompt>"  (apr serve auto-spawned)
4. Multi-turn ArenaSession::run loop:
   for turn in 1..=max_turns:
     a. Check wall-clock budget
     b. Render history into prompt suffix:
          "<prompt>\n\n<rendered_history>### Continue:\n"
     c. driver.next_turn(prompt) → NextTurn { blocks, stop_reason }
     d. Extract first ToolUse block from blocks:
          some → dispatch the tool in cwd, record ToolResult
          none → record ToolInvocation::Text
     e. Phase 6 only: ComplianceTrap detector observes ToolResult::FileMutated
     f. M292: AgentTextLoop detector observes ToolInvocation::Text
     g. Append TurnRecord to history
     h. Every oracle_check_interval turns OR on StopReason::EndTurn:
          run_oracle_compound → OracleOutcome { Passed | FailedDueToCompliance | NonZeroExit | ExitZeroNoPatternMatch }
          Passed → return ArenaOutcome::OraclePassed
          FailedDueToCompliance (Phase 6) → return ArenaOutcome::ComplianceFailed
   end for
5. Loop exit → ArenaOutcome::OracleFailedAfterMaxTurns
6. Wall-time exit → ArenaOutcome::WallTimeout
7. Driver error → ArenaOutcome::DriverError { reason, turns_before_error }
8. Compliance trap → ArenaOutcome::ComplianceTrap { file, last_reason, consecutive_count }
9. Text loop (M292) → ArenaOutcome::AgentTextLoop { consecutive_text_turns, last_text_excerpt }
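
The control flow of the loop above, reduced to its skeleton, can be sketched as follows. This is not ArenaSession::run: the `Driver` trait, `ToyDriver`, and `run_session` are hypothetical, tool dispatch, the wall-clock budget and the trap detectors are left out, and the oracle is a closure over the turn number.

```rust
/// Why the model stopped; only the two reasons needed here are modelled.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StopReason {
    ToolUse,
    EndTurn,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    OraclePassed { turn: u32 },
    OracleFailedAfterMaxTurns,
}

/// Hypothetical driver interface: one call per turn.
trait Driver {
    fn next_turn(&mut self, prompt: &str) -> StopReason;
}

/// Toy driver that always asks for another tool call.
struct ToyDriver;

impl Driver for ToyDriver {
    fn next_turn(&mut self, _prompt: &str) -> StopReason {
        StopReason::ToolUse
    }
}

/// `oracle(turn)` stands in for running the fixture's oracle after `turn`.
fn run_session(
    driver: &mut dyn Driver,
    oracle: impl Fn(u32) -> bool,
    max_turns: u32,
    oracle_check_interval: u32,
) -> Outcome {
    let mut history = String::new();
    for turn in 1..=max_turns {
        // Render the history into the prompt suffix.
        let prompt = format!("<prompt>\n\n{history}### Continue:\n");
        let stop = driver.next_turn(&prompt);
        history.push_str(&format!("turn {turn}: {stop:?}\n"));
        // Run the oracle every `oracle_check_interval` turns or on EndTurn.
        let due = oracle_check_interval > 0 && turn % oracle_check_interval == 0;
        if (due || stop == StopReason::EndTurn) && oracle(turn) {
            return Outcome::OraclePassed { turn };
        }
    }
    Outcome::OracleFailedAfterMaxTurns
}

fn main() {
    // Pretend the fix lands on turn 4; the oracle is only consulted on check turns.
    let outcome = run_session(&mut ToyDriver, |turn| turn >= 4, 20, 3);
    println!("{:?}", outcome);
}
```

A fix that lands between two checks is only observed at the next check turn, which is one reason the interval is a tunable knob.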

Default knobs

Knob	Default	Set by
`max_turns`	20	`PHASE6_MAX_TURNS` env / `--max-turns` flag
`max_wall_seconds`	900 (phase 5) / 3600 (phase 6)	`PHASE6_WALL_SECONDS` / `--wall-seconds`
`oracle_check_interval`	5 (phase 5) / 3 (phase 6)	`PHASE6_ORACLE_INTERVAL` / `--oracle-check-interval`
`compliance_enforced`	`false` (phase 5) / `true` (phase 6)	`PHASE6_COMPLIANCE_ENFORCED` / `--compliance-enforced`
`max_consecutive_compliance_failures`	3	`PHASE6_MAX_CONSECUTIVE_COMPLIANCE_FAILURES`
`max_consecutive_text_turns` (M292)	0 (disabled)	`PHASE6_MAX_CONSECUTIVE_TEXT_TURNS`
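
The "0 (disabled)" default for the text-loop knob can be illustrated with a small counter. `TextLoopDetector` and its methods are hypothetical; the sketch only shows the consecutive-count logic and the meaning of a zero limit.

```rust
/// Counts consecutive text-only turns; a limit of 0 disables the trap.
struct TextLoopDetector {
    limit: u32,
    consecutive: u32,
}

impl TextLoopDetector {
    fn new(limit: u32) -> Self {
        Self { limit, consecutive: 0 }
    }

    /// Records one turn and returns true when the trap fires.
    fn observe(&mut self, text_only: bool) -> bool {
        if text_only {
            self.consecutive += 1;
        } else {
            // Any tool call resets the streak.
            self.consecutive = 0;
        }
        self.limit > 0 && self.consecutive >= self.limit
    }
}

fn main() {
    let mut detector = TextLoopDetector::new(3);
    for (turn, text_only) in [true, true, false, true, true, true].into_iter().enumerate() {
        println!("turn {}: fired = {}", turn + 1, detector.observe(text_only));
    }
}
```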

Two binaries

ccpa-arena-bench (in crates/ccpa-arena/src/bin/) — one-fixture dispatcher. Reads prompt, oracle config, driver config from flags; emits BenchResult JSON.
scripts/phase-{5,6}-bench.sh — corpus-walker that calls ccpa-arena-bench per-fixture, aggregates per-side scores.json.

The shell script handles model pre-warming, evidence directory layout, and per-fixture orchestration; the Rust binary handles the loop.

Phase 5 — project-scale Arena

Phase 5 (M194-M210) was the first Arena dispatch against real GitHub-issue Rust fixtures. It produced the Popperian-falsification result that established project-scale measurement as the ground truth.

Corpus

fixtures/project-scale/ — 5 real Rust bug fixtures hand-curated from GitHub issues:

Each fixture has a cwd-tree/ (a snapshot of the repo at the buggy commit), a prompt.txt (the issue text or a derived task), and a test-shaped oracle (cargo test + an expected pattern).
Fixtures span error-handling, async edge cases, FFI boundaries, lifetime issues, and macro-related bugs.

Headline result (M234)

Side	Oracle pass	Recovery (one bash-fail then pass)	Recovery rate
`claude` (teacher)	1/5	1	1.00 (1 of 1 passes had recovery)
`apr code` (student)	0/5	0	undefined (0/0)

apr code's 0/5 was uniform OracleFailedAfterMaxTurns — the agent engaged but couldn't solve the bugs within the 20-turn / 900s budget.

What M234 falsified

The static-fixture parity score of 1.0000 on the canonical corpus (fixtures/canonical/, n=30, M150) does NOT predict project-scale Arena performance. The two systems are functionally interchangeable on single-prompt code generation (HumanEval-class) but diverge on multi-turn project-scale work.

Per the Popperian discipline, this is a clean falsification, not a contradiction. Both measurements are valid; they measure different things. The static path measures the meter; the Arena path measures the system.

docs/specifications/completeness-assessment.md is the honest record of this. The README's "honest framing" paragraph quotes the same finding.

Why the Arena bench is operator-coordinated

A full Arena run consumes:

claude API costs (one paid claude --print invocation per turn × up to 20 turns × 5 fixtures × 2 dispatches per measurement)
Local GPU/CPU compute for apr code's apr serve (GGUF model loaded into VRAM/RAM)
A claude login session that must not be reused across machines or intercepted by intermediate proxies

These costs are externalized — CI dispatches static-path tests only. Arena dispatches are operator-dispatched, evidence-captured, and stamped into evidence/phase-5/arena-scores.json. This is contract-gated by FALSIFY-CCPA-019 (calibration_required_before_verdict).

Sub-deliverables (P5.1-P5.5)

P5.1 (M194-M196) — ArenaSession scaffolding type
P5.2 (M197-M210) — multi-turn loop body, tool dispatch, oracle integration, MockDriver for tests
P5.3 (M211-M222) — corpus walker (ccpa-arena-bench), aggregate scoring, recovery_rate
P5.4 (M223-M228) — bidirectional sensitivity calibration + the M196-M224 4-bug stack closure
P5.5 (M229-M234) — first end-to-end Arena dispatch + scores.json + Popperian-falsification finding

Phase 6 — under-contract dispatch

Phase 6 (M250+) extends the Arena to measure not just "did the agent solve the bug?" but "did the agent solve the bug in a compliance-respecting way?"

What "under contract" means

In Phase 5, the only oracle is cargo test. An agent can pass that oracle while emitting code that violates pmat comply check --strict (the project's quality posture: complexity caps, lint rules, allowed-unwrap policy, etc.).

In Phase 6, the oracle is compound:

oracle_passed iff (cargo_test_exit_code == 0
                   AND grep "test result: ok" in test output
                   AND pmat comply check --strict exit_code == 0)

pmat comply runs at the end of the session AND after every Write / Edit if --compliance-enforced is set (per-turn compliance gating).
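The compound oracle above is a plain conjunction, which the following sketch spells out. It is an illustration under stated assumptions: `OracleObservation`, `phase5_oracle_passed`, and `phase6_oracle_passed` are hypothetical names, and the struct simply carries the three observed values named in the formula.

```rust
/// Hypothetical record of what was observed after running the oracle commands.
struct OracleObservation<'a> {
    cargo_test_exit_code: i32,
    test_output: &'a str,
    /// Exit code of `pmat comply check --strict`; None if it was not run.
    comply_exit_code: Option<i32>,
}

/// Phase 5 oracle: cargo test succeeded and the expected pattern is present.
fn phase5_oracle_passed(obs: &OracleObservation) -> bool {
    obs.cargo_test_exit_code == 0 && obs.test_output.contains("test result: ok")
}

/// Phase 6 oracle: the Phase 5 conditions AND a zero exit from the compliance check.
fn phase6_oracle_passed(obs: &OracleObservation) -> bool {
    phase5_oracle_passed(obs) && obs.comply_exit_code == Some(0)
}

fn main() {
    let obs = OracleObservation {
        cargo_test_exit_code: 0,
        test_output: "test result: ok. 4 passed; 0 failed",
        comply_exit_code: Some(1),
    };
    // Tests pass but compliance rejects: Phase 5 passes, Phase 6 does not.
    println!("phase 5: {}", phase5_oracle_passed(&obs));
    println!("phase 6: {}", phase6_oracle_passed(&obs));
}
```

The case shown in `main` is the one that Phase 6 exists to catch: a session that would count as a pass under the Phase 5 oracle but fails the compound one.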

The four Phase-6-specific outcomes

Outcome	When
`ComplianceFailed { check, turn }`	Cargo test passed, but final-state compliance check rejected. Distinct from `OracleFailedAfterMaxTurns`.
`ComplianceTrap { file, last_reason, consecutive_count }`	Same `(file, sha256)` failed compliance N turns in a row (default 3). Ending the session early saves token cost.
`AgentTextLoop { consecutive_text_turns, last_text_excerpt }` (M292)	N consecutive text-only turns (no tool_call). Opt-in via `PHASE6_MAX_CONSECUTIVE_TEXT_TURNS > 0`.
`OraclePassed` (Phase 6 sense)	BOTH cargo test AND `pmat comply check --strict` pass.
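The `ComplianceTrap` condition is a streak counter keyed on `(file, sha256)`. The sketch below shows one way such a detector could work; `TrapDetector` and its methods are hypothetical names, and treating a limit of 0 as "disabled" is an assumption rather than documented behavior.

```rust
/// Hypothetical detector for repeated compliance failures on the same file state.
struct TrapDetector {
    limit: u32,
    last_key: Option<(String, String)>,
    consecutive: u32,
}

impl TrapDetector {
    fn new(limit: u32) -> Self {
        TrapDetector { limit, last_key: None, consecutive: 0 }
    }

    /// Records one failed compliance check; returns true when the trap fires.
    fn record_failure(&mut self, file: &str, sha256: &str) -> bool {
        let key = (file.to_string(), sha256.to_string());
        if self.last_key.as_ref() == Some(&key) {
            self.consecutive += 1;
        } else {
            // A different file or a different content hash starts a new streak.
            self.last_key = Some(key);
            self.consecutive = 1;
        }
        // Assumption: a limit of 0 disables the trap.
        self.limit > 0 && self.consecutive >= self.limit
    }

    /// A passing compliance check resets the streak.
    fn record_pass(&mut self) {
        self.last_key = None;
        self.consecutive = 0;
    }
}

fn main() {
    let mut detector = TrapDetector::new(3);
    for turn in 1..=3 {
        let fired = detector.record_failure("src/lib.rs", "abc123");
        println!("turn {}: trap fired = {}", turn, fired);
    }
    detector.record_pass();
}
```

Keying on the content hash as well as the path means an agent that keeps editing the file, even unsuccessfully, does not trip the trap; only an unchanged failing state does.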

The V1 falsifiers added at Phase 6

ID	Name	Status	Asserted by
`V1_001`	`qwen3_moe_serve_dispatch_v1`	ACTIVE_RUNTIME	`aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs`
`V1_002`	`qwen3_moe_sampling_v1`	ACTIVE_RUNTIME	sampling integration tests
`V1_003`	`qwen3_moe_streaming_sse_v1`	DISCHARGED on gx10 Blackwell	streaming SSE test + evidence
`V1_004`	`phase_6_bench_non_zero_student_pass_rate`	OPEN	per-fixture `student_pass_rate > 0`

Current state of V1_004

V1_004 is the OPEN gate. The bar: "ANY single Phase 6 fixture passes the compound oracle on the student side."

The M286-M294 chain has shipped 6 aprender PRs + 4 CCPA PRs working toward V1_004 discharge:

M286 — M32d MoE KV cache (19× speedup; the load-bearing inference infrastructure)
M287 — greedy baseline established the driver_error pattern (the model entered a "Human:" infinite loop)
M288-M290 — diagnosed 3 root causes; shipped sampling (temperature/top_k/top_p), repetition penalty, EOS stop_token, clean_chat_output, few-shot CODE_SYSTEM_PROMPT
M291 — sub-bench B on Qwen3-Coder-30B-A3B with all fixes: pattern shifted from driver_error to oracle_failed_after_max_turns with tool_use_count: 0
M292 — ArenaOutcome::AgentTextLoop detector + opt-in cap (Gap 3 closure)
M293 — PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring
M294 — scope doc for the non-Coder Qwen3-30B-A3B-Instruct-2507 A/B

See The V1_004 chain for the empirical narrative.

Phase 6 corpus — `fixtures/under-contract/`

20 fixtures across 4 classes:

leetcode (5) — algorithmic bugs: two-sum, valid-parentheses, longest-common-prefix, merge-sorted-arrays, binary-search
oo (5) — object-oriented Rust patterns: bank-account, library-borrowing, shape-hierarchy, observer-pattern, builder-pattern
transpile (5) — format converters: json-to-toml, csv-to-jsonl, markdown-to-html, ini-to-yaml, regex-to-glob
unix (5) — CLI utility reimplementations: wc, head, tail, cut, sort

Each fixture's meta.toml includes oracle_cmd = "cargo test 2>&1" and expected_pattern = "test result: ok". The compound oracle adds pmat comply check --strict on the post-mutation tree.

Outcome variants

ArenaOutcome is the closed enum capturing every way an Arena session can end. It is the unit that aggregate scoring pattern-matches on.

The full enum (post-M292)

#[serde(tag = "kind", rename_all = "snake_case")]
pub enum ArenaOutcome {
    OraclePassed                  { turns: u32, wall_seconds: u64 },
    OracleFailedAfterMaxTurns     { turns: u32, partial_pass_rate: Option<f64> },
    WallTimeout                   { turns_at_timeout: u32, max_wall_seconds: u64 },
    DriverError                   { reason: String, turns_before_error: u32 },
    ComplianceFailed              { check: ComplianceCheck, turn: u32 },
    ComplianceTrap                { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop                 { consecutive_text_turns: u32, last_text_excerpt: String },
}

Decision matrix

Outcome	Means	What aggregate score should treat it as
`OraclePassed`	Agent fully solved the fixture. (Phase 6: AND compliance passed.)	`oracle_passed = true`
`OracleFailedAfterMaxTurns`	Agent engaged, but didn't solve within 20 turns.	`oracle_passed = false`
`WallTimeout`	Agent ran out of wall-clock budget mid-session.	`oracle_passed = false`
`DriverError`	Driver subprocess crashed / hung / lost connection.	`oracle_passed = false`, count as infrastructure failure
`ComplianceFailed` (Phase 6)	`cargo test` passed, `pmat comply check` rejected.	`oracle_passed = false`, count toward compliance_cost_ratio numerator
`ComplianceTrap` (Phase 6)	Same `(file, sha256)` failed N consecutive turns.	`oracle_passed = false`, count toward token-cost-avoidance
`AgentTextLoop` (M292, opt-in)	N consecutive text-only turns (no tool_call).	`oracle_passed = false`, agent didn't engage

Why this many variants

Each variant captures a distinct failure mode that the team has empirically observed and decided is worth distinguishing. Conflating them loses signal:

OracleFailedAfterMaxTurns says "the agent worked but produced wrong output." Diagnostic action: inspect history for off-by-one fixes, missing edge cases.
WallTimeout says "the agent worked too slowly." Diagnostic action: check inference tok/s, max_tokens cap, network latency.
DriverError says "the infrastructure broke." Diagnostic action: check apr serve crash logs, network, ports, GPU OOM.
ComplianceTrap says "the agent is stuck making the same violating edit." Diagnostic action: check whether the agent has the compliance rules in context.
AgentTextLoop says "the agent talked but didn't act." Diagnostic action: check tool_call format adherence (this is the M291 finding signature).

Before M292, all the "talked but didn't act" cases were OracleFailedAfterMaxTurns — conflated with "did real work but wrong answer." Adding the AgentTextLoop variant let us measure the difference cleanly.
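The text-loop condition can be sketched as a streak count over turn kinds. This is an illustrative reconstruction, not the M292 implementation: `TurnKind` and `text_loop_streak` are hypothetical names, and the only documented behavior relied on is that a cap of 0 disables the detector.

```rust
/// Hypothetical classification of a single agent turn.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TurnKind {
    Text,
    ToolCall,
}

/// Returns Some(streak) at the first point where the number of consecutive
/// text-only turns reaches the cap, or None if the cap is never reached.
fn text_loop_streak(turns: &[TurnKind], max_consecutive_text_turns: u32) -> Option<u32> {
    if max_consecutive_text_turns == 0 {
        return None; // cap disabled (the documented default)
    }
    let mut streak = 0u32;
    for turn in turns {
        match turn {
            TurnKind::Text => {
                streak += 1;
                if streak >= max_consecutive_text_turns {
                    return Some(streak);
                }
            }
            // Any tool call counts as engagement and resets the streak.
            TurnKind::ToolCall => streak = 0,
        }
    }
    None
}

fn main() {
    // Shape of the M291 signature: 20 turns, all text, no tool calls.
    let all_text = vec![TurnKind::Text; 20];
    println!("cap 5: {:?}", text_loop_streak(&all_text, 5));
    println!("cap 0: {:?}", text_loop_streak(&all_text, 0));
}
```

With the cap disabled, the same 20 text-only turns run to the turn limit and surface as `OracleFailedAfterMaxTurns`, which is the conflation the new variant removes.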

How aggregate scoring uses outcomes

fn passed(&self) -> bool {
    matches!(self, Self::OraclePassed { .. })
}

fn compliance_failed(&self) -> bool {
    matches!(self,
        Self::ComplianceFailed { .. } | Self::ComplianceTrap { .. }
    )
}

recovery_rate (Phase 5) counts OraclePassed fixtures where the agent recovered from at least one non-zero exit. compliance_cost_ratio (Phase 6) is compliance_failed_under_contract / oracle_passed_baseline, i.e., the fraction of fixtures that pass without the contract but fail under contract.
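Both metrics are ratios whose denominator can be zero, which is why the M234 student recovery rate is reported as undefined (0/0). A minimal sketch, assuming a hypothetical `FixtureResult` record and using `Option<f64>` to represent the undefined case:

```rust
/// Hypothetical per-fixture summary used only for this illustration.
struct FixtureResult {
    passed: bool,
    /// True if the agent recovered from at least one non-zero exit.
    recovered: bool,
}

/// Division that reports an undefined ratio as None instead of NaN.
fn ratio(numerator: u32, denominator: u32) -> Option<f64> {
    if denominator == 0 {
        None
    } else {
        Some(f64::from(numerator) / f64::from(denominator))
    }
}

/// Fraction of passing fixtures that included a recovery.
fn recovery_rate(results: &[FixtureResult]) -> Option<f64> {
    let passed = results.iter().filter(|r| r.passed).count() as u32;
    let recovered = results.iter().filter(|r| r.passed && r.recovered).count() as u32;
    ratio(recovered, passed)
}

/// compliance_failed_under_contract / oracle_passed_baseline.
fn compliance_cost_ratio(failed_under_contract: u32, passed_baseline: u32) -> Option<f64> {
    ratio(failed_under_contract, passed_baseline)
}

fn main() {
    // M234 shape: teacher has one pass with a recovery, student has no passes.
    let teacher = vec![
        FixtureResult { passed: true, recovered: true },
        FixtureResult { passed: false, recovered: false },
    ];
    let student = vec![FixtureResult { passed: false, recovered: false }];
    println!("teacher recovery_rate: {:?}", recovery_rate(&teacher));
    println!("student recovery_rate: {:?}", recovery_rate(&student));
    println!("cost ratio 2/4: {:?}", compliance_cost_ratio(2, 4));
}
```

Returning `None` rather than `0.0` keeps "no passes to measure" distinct from "passes, none with recovery".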

The 20 falsification gates

Every gate is encoded in contracts/claude-code-parity-apr-v1.yaml (canonical in aprender, pinned here via contracts/pin.lock). Every gate has:

A FALSIFY-CCPA-NNN ID
A short name
A status (PROPOSED / ACTIVE_ALGORITHM_LEVEL / ACTIVE_RUNTIME)
A test (or tests) that asserts the falsifier
A natural-language description of what would falsify the gate

Full table — 20 gates

Source-of-truth invariants (M0+)

ID	Name	Status	Mechanism
CCPA-009	`ci_main_branch_green`	ACTIVE_RUNTIME	branch protection requires `ci/gate`
CCPA-010	`pmat_comply_100pct`	ACTIVE_RUNTIME	`pmat comply check`: `is_compliant=true` ∧ 0 Fail checks
CCPA-011	`line_coverage_100pct`	ACTIVE_RUNTIME	`cargo llvm-cov`: 100% functions ∧ ≥99% lines
CCPA-012	`pv_contract_gate_on_commit`	ACTIVE_RUNTIME	pre-commit hook + CI `pv validate` + pin-check

Behavioral parity gates

ID	Name	Status	Asserted by
CCPA-001	`trace_schema_roundtrip`	ACTIVE_RUNTIME	`crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs` (17 tests)
CCPA-002	`replay_determinism`	ACTIVE_RUNTIME	`crates/ccpa-replayer/` (16 tests)
CCPA-003	`mock_completeness`	ACTIVE_RUNTIME	same harness
CCPA-004	`tool_call_equivalence`	ACTIVE_RUNTIME	`crates/ccpa-differ/tests/falsify_ccpa_004_tool_equivalence.rs` (36 tests)
CCPA-005	`file_mutation_equivalence`	ACTIVE_RUNTIME	`crates/ccpa-differ/tests/falsify_ccpa_005_file_mutation.rs` (15 tests)
CCPA-006	`sovereignty_on_replay`	ACTIVE_RUNTIME	`crates/ccpa-differ/tests/falsify_ccpa_006_sovereignty.rs` (10 tests)
CCPA-007	`corpus_coverage`	HARD-BLOCKING (M16)	tests + CI `ccpa coverage --oos-rows ...`
CCPA-008	`parity_score_bound`	ADVISORY (M230)	`crates/ccpa-differ/tests/falsify_ccpa_008_parity_score.rs` (24 tests)
CCPA-013	`first_recorded_parity_score`	DISCHARGED	`fixtures/canonical/measured-parity.json` (n=30, aggregate=1.0000)
CCPA-014	`os_event_parity_bound`	ACTIVE_RUNTIME	`crates/ccpa-differ/tests/falsify_ccpa_014_os_event_parity.rs`
CCPA-015	`os_trace_output_purity`	ACTIVE_RUNTIME	`crates/ccpa-subproc/tests/falsify_ccpa_015_output_purity.rs`
CCPA-016	`outcome_parity_bound`	ACTIVE_RUNTIME	`crates/ccpa-differ/tests/falsify_ccpa_016_outcome_parity.rs`
CCPA-017	`project_scale_parity_bound`	PROPOSED (v1.28.0)	`crates/ccpa-differ/tests/falsify_ccpa_017_project_scale_parity.rs`
CCPA-018	`arena_recovery_rate_bound`	PROPOSED (v1.29.0)	`crates/ccpa-arena/tests/falsify_ccpa_018_arena_recovery_rate.rs`
CCPA-019	`calibration_required_before_verdict`	PROPOSED (v1.32.0)	`crates/ccpa-differ/tests/falsify_ccpa_019_calibration.rs`
CCPA-020	`contract_compliance_per_turn`	PROPOSED (v1.32.0)	`crates/ccpa-arena/tests/falsify_ccpa_020_contract_compliance.rs`

Cross-reference per chapter

Source-of-truth invariants — the four M0+ gates that govern the project's own quality posture
Behavioral parity gates — the gates that govern what apr code ↔ claude parity means
Status flow — the PROPOSED → ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME transition table

Mechanically asserted

Every gate is enforced by pv validate per CLAUDE.md § "DOGFOOD pv, NEVER bash". pv is the dogfooded contract validator (binary from aprender-contracts-cli). Re-implementing what pv already does in bash/python is muda (waste) and is rejected. If pv validate rejects a contract, fix the contract or extend aprender-contracts/src/schema/.

Source-of-truth invariants

These four gates govern the project's OWN quality posture (not the claude ↔ apr code parity). They are the meta-gates that make the rest of the gates trustable.

CCPA-009 — `ci_main_branch_green`

What it asserts: every commit on main was produced by a PR that had a green CI run.

How it's enforced: GitHub branch protection on main requires the ci/gate check. Direct pushes to main are blocked. Force-pushes to main are blocked. Merges require either fast-forward from a green branch OR squash from an approved + green PR.

What would falsify: a commit on main without a green CI run.

CCPA-010 — `pmat_comply_100pct`

What it asserts: every commit on main has pmat comply check returning is_compliant=true AND zero Fail-status checks.

How it's enforced: pmat comply check runs in CI on every PR. Any non-compliant artifact (file with disallowed unwrap, complexity > cap, lint violation, etc.) fails the job.

What would falsify: a main-branch commit where pmat comply check reports any Fail-status check.

pmat comply is the project's quality posture meter. It's not just clippy — it's a multi-pass static analyzer with custom rules for the aprender org's conventions (allowed-unwrap categories, complexity caps, doc-coverage minimums, etc.).

CCPA-011 — `line_coverage_100pct`

What it asserts: 100% function coverage AND ≥99% line coverage across all crates.

How it's enforced: cargo llvm-cov in CI. The threshold was refined in v0.4.0 (M120) from "100% lines" to "100% functions AND ≥99% lines" — the relaxation acknowledges unreachable error-handling branches that are mechanically uncoverable.

What would falsify: a main-branch commit where cargo llvm-cov reports any function with 0% coverage OR line coverage below 99%.

CCPA-012 — `pv_contract_gate_on_commit`

What it asserts: every commit on main passed pv validate against the pinned contract YAML AND the contracts/pin.lock matches the canonical aprender source.

How it's enforced: a pre-commit hook (scripts/install-pv-hook.sh, hard-installed by make install-hooks) PLUS the CI pv validate job. Both must pass before merge.

What would falsify: a main-branch commit where pv validate rejects the contract YAML OR where the sha256 in contracts/pin.lock doesn't match the contract YAML at the pinned aprender commit.

Why these four

These are the trust roots of the rest of the gate hierarchy. If CCPA-009 fails, any other gate could be silently broken on main without notice. If CCPA-010 fails, the project's quality posture has drifted from the org's contract. If CCPA-011 fails, untested code is on main. If CCPA-012 fails, the contract YAML and the code are out of sync.

Per CLAUDE.md, these are the gates that "no code ships without."

Behavioral parity gates

These gates govern what apr code ↔ claude parity means. Each one is a falsifiable assertion about the action-stream equivalence between the two systems.

CCPA-001 — `trace_schema_roundtrip`

Asserts: every trace record kind serializes → parses → re-serializes → equals the original.

Why: a lossy schema would silently drop information that downstream parity computation depends on. Catches schema-bump regressions.

Tests: 17 pin tests in crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs.

CCPA-002 — `replay_determinism`

Asserts: replaying a recorded trace through ccpa-replayer::MockHarness produces byte-identical output across runs.

Why: nondeterminism in the replay path would invalidate any parity claim. Catches hidden time/random/PID dependencies.

Tests: 16 tests in crates/ccpa-replayer/.

CCPA-003 — `mock_completeness`

Asserts: the MockHarness covers every tool kind defined in the schema.

Why: an incomplete mock means some real-world traces can't be replayed. Catches gaps when new tools are added.

CCPA-004 — `tool_call_equivalence`

Asserts: per-tool equivalence rules are deterministic, total functions over (teacher.input, student.input) pairs.

Why: the heart of the parity score. If the equivalence rule for Bash (say) has a bug, the score is meaningless.

Tests: 36 tests in crates/ccpa-differ/tests/falsify_ccpa_004_tool_equivalence.rs. One test per (tool, equivalence-class) pair.

CCPA-005 — `file_mutation_equivalence`

Asserts: a Write and an Edit that produce the same post-state file SHA256 are equivalent at the file-mutation level.

Why: enables the differ to recognize "same effect, different tool" as equivalent at the file level (separately from the action-stream level).

Tests: 15 tests in crates/ccpa-differ/tests/falsify_ccpa_005_file_mutation.rs.
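The file-level rule compares post-state digests, not the tools used. The sketch below illustrates that idea under simplifying assumptions: `Mutation`, `apply`, and `file_mutation_equivalent` are hypothetical names, the Edit semantics are reduced to "replace the first occurrence", and the standard library's `DefaultHasher` stands in for SHA256 only to keep the example dependency-free.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified model of the two file-mutating tools.
enum Mutation {
    Write { content: String },
    Edit { old: String, new: String },
}

/// Computes the post-state of a file given its pre-state and one mutation.
fn apply(pre: &str, mutation: &Mutation) -> String {
    match mutation {
        Mutation::Write { content } => content.clone(),
        // Simplification: replace only the first occurrence of `old`.
        Mutation::Edit { old, new } => pre.replacen(old.as_str(), new.as_str(), 1),
    }
}

/// Stand-in digest. The gate described above uses SHA256; DefaultHasher is
/// used here only so the sketch needs no external crate.
fn digest(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Two mutations are file-level equivalent if their post-states hash equal.
fn file_mutation_equivalent(pre: &str, a: &Mutation, b: &Mutation) -> bool {
    digest(&apply(pre, a)) == digest(&apply(pre, b))
}

fn main() {
    let pre = "fn f() -> i32 { 1 }";
    let edit = Mutation::Edit { old: "1 }".into(), new: "2 }".into() };
    let write = Mutation::Write { content: "fn f() -> i32 { 2 }".into() };
    println!("equivalent: {}", file_mutation_equivalent(pre, &edit, &write));
}
```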

CCPA-006 — `sovereignty_on_replay`

Asserts: Tier3 SovereigntyViolation fires deterministically on any trace that performs a network egress to a non-localhost endpoint outside the allow-list, OR reads a credential-bearing env var.

Why: the sovereignty contract is the hardest gate. False negatives here are catastrophic.

Tests: 10 tests in crates/ccpa-differ/tests/falsify_ccpa_006_sovereignty.rs.
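The assertion has two independent triggers, which the following sketch makes explicit. Everything here is hypothetical scaffolding (`TraceEvent`, `violates_sovereignty`, and the contents of the allow-list and credential-variable list are illustrative, not taken from the contract); the point is only that the check is a pure function of the trace, so it fires deterministically.

```rust
/// Hypothetical, reduced view of the trace events relevant to sovereignty.
enum TraceEvent<'a> {
    NetworkEgress { host: &'a str },
    EnvRead { var: &'a str },
    Other,
}

fn is_localhost(host: &str) -> bool {
    matches!(host, "localhost" | "127.0.0.1" | "::1")
}

/// True if any event is an egress to a non-localhost host outside the
/// allow-list, or a read of a credential-bearing environment variable.
fn violates_sovereignty(
    events: &[TraceEvent],
    allow_list: &[&str],
    credential_vars: &[&str],
) -> bool {
    events.iter().any(|event| match event {
        TraceEvent::NetworkEgress { host } => {
            !is_localhost(host) && !allow_list.contains(host)
        }
        TraceEvent::EnvRead { var } => credential_vars.contains(var),
        TraceEvent::Other => false,
    })
}

fn main() {
    // Illustrative values only; the real lists live in the contract.
    let allow_list = ["mirror.internal.example"];
    let credential_vars = ["EXAMPLE_API_KEY"];
    let trace = [
        TraceEvent::NetworkEgress { host: "127.0.0.1" },
        TraceEvent::EnvRead { var: "HOME" },
        TraceEvent::NetworkEgress { host: "api.example.com" },
    ];
    println!(
        "violation: {}",
        violates_sovereignty(&trace, &allow_list, &credential_vars)
    );
}
```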

CCPA-007 — `corpus_coverage` (HARD-BLOCKING since M16)

Asserts: every required-row of apr-code-parity-v1.yaml has at least one fixture exercising it.

Why: prevents the meter from being validated on only a curated subset of the parity surface. New rows in apr-code-parity-v1.yaml MUST come with a fixture.

Tests: 15 tests + per-PR CI ccpa coverage --apr-code-parity-yaml ... --oos-rows ....

CCPA-008 — `parity_score_bound` (ADVISORY, M230)

Asserts: canonical corpus aggregate parity score ≥ threshold (currently ≥ 0.95).

Why: the differ's output IS the parity score; this is the corpus-level acceptance bound.

Status: ADVISORY since M230. The threshold was relaxed because the M196-M224 4-bug stack revealed that "always 1.0 on canonical" was actually evidence of meter under-sensitivity, not perfect performance.

Tests: 24 tests in crates/ccpa-differ/tests/falsify_ccpa_008_parity_score.rs.

CCPA-013 — `first_recorded_parity_score` (DISCHARGED)

Asserts: a first measured aggregate parity score on the canonical corpus exists, dated, with n and aggregate recorded.

Status: DISCHARGED. fixtures/canonical/measured-parity.json (n=30, aggregate=1.0000).

CCPA-014 — `os_event_parity_bound`

Asserts: OS-level events (file opens, process spawns, stat calls) recorded on teacher and student match, modulo an allow-list of permitted nondeterminism.

Why: catches "same tool input, different OS effects" drift.

CCPA-015 — `os_trace_output_purity`

Asserts: subprocess stdout/stderr captures are byte-pure (no PID injection, no timestamp jitter introduced by the capture machinery).

Why: if the capture itself adds nondeterminism, every downstream comparison is wrong.

CCPA-016 — `outcome_parity_bound`

Asserts: per-fixture oracle_passed outcomes agree at corpus-level rate ≥ threshold.

Why: outcome parity (did both systems solve the bug?) is the project-scale analog of action parity. Necessary for the M234 Popperian-falsification claim to be sharp.
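The corpus-level agreement rate can be sketched as the fraction of fixtures on which both sides reach the same oracle_passed value. `outcome_agreement_rate` is a hypothetical helper, and the threshold itself is left out because its value is not stated here.

```rust
/// Fraction of fixtures where teacher and student agree on oracle_passed.
/// Returns None for an empty corpus or mismatched lengths.
fn outcome_agreement_rate(teacher: &[bool], student: &[bool]) -> Option<f64> {
    if teacher.is_empty() || teacher.len() != student.len() {
        return None;
    }
    let agreeing = teacher
        .iter()
        .zip(student.iter())
        .filter(|(t, s)| t == s)
        .count();
    Some(agreeing as f64 / teacher.len() as f64)
}

fn main() {
    // M234 shape: one teacher pass out of five, no student passes.
    // The position of the teacher pass is arbitrary in this illustration.
    let teacher = [true, false, false, false, false];
    let student = [false, false, false, false, false];
    println!("agreement: {:?}", outcome_agreement_rate(&teacher, &student));
}
```

Note that agreement counts shared failures as well as shared passes, so a high rate on a hard corpus says little by itself about either system's pass rate.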

CCPA-017 — `project_scale_parity_bound` (PROPOSED, v1.28.0)

Asserts: project-scale Arena verdict on phase-5 corpus must match the static-fixture verdict in direction (not magnitude).

Why: M234 showed magnitudes diverge (static parity of 1.0000 versus Arena pass rates of 1/5 for claude and 0/5 for apr code); direction agreement (claude > apr code) is the falsifiable part.

CCPA-018 — `arena_recovery_rate_bound` (PROPOSED, v1.29.0)

Asserts: apr code recovery_rate (fraction of OraclePassed fixtures with at least one non-zero exit recovered) is bounded below by a threshold.

Why: a 0% recovery rate signals the agent doesn't retry meaningfully; threshold gate codifies the expectation.

CCPA-019 — `calibration_required_before_verdict` (PROPOSED, v1.32.0)

Asserts: no Arena verdict ships without a fresh CalibrationRecord (≤90 days old) on file.

Why: codifies the lesson of the M196-M224 four-bug stack. See Bidirectional sensitivity.
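The freshness rule reduces to an age comparison against a 90-day window. A minimal sketch follows; the `CalibrationRecord` shown here is a hypothetical one-field stand-in for the real record, and rejecting a record dated in the future is an assumption.

```rust
use std::time::{Duration, SystemTime};

/// 90 days, the freshness window named by the gate.
const MAX_CALIBRATION_AGE: Duration = Duration::from_secs(90 * 24 * 60 * 60);

/// Hypothetical stand-in: only the timestamp matters for this check.
struct CalibrationRecord {
    recorded_at: SystemTime,
}

/// A verdict may ship only if a record exists and is at most 90 days old.
fn verdict_allowed(record: Option<&CalibrationRecord>, now: SystemTime) -> bool {
    match record {
        None => false,
        Some(r) => match now.duration_since(r.recorded_at) {
            Ok(age) => age <= MAX_CALIBRATION_AGE,
            // Assumption: a record dated after `now` is treated as invalid.
            Err(_) => false,
        },
    }
}

fn main() {
    let day = Duration::from_secs(24 * 60 * 60);
    let recorded_at = SystemTime::UNIX_EPOCH + day * 1000;
    let record = CalibrationRecord { recorded_at };
    println!("after 30 days: {}", verdict_allowed(Some(&record), recorded_at + day * 30));
    println!("after 120 days: {}", verdict_allowed(Some(&record), recorded_at + day * 120));
    println!("no record: {}", verdict_allowed(None, recorded_at));
}
```

Passing `now` as a parameter instead of reading the clock inside the function keeps the check deterministic and testable.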

CCPA-020 — `contract_compliance_per_turn` (PROPOSED, v1.32.0)

Asserts: in Phase 6 dispatch, per-turn pmat comply check fires after every Write/Edit; the agent SEES compliance results in next-turn history.

Why: makes the under-contract regime mechanically distinguishable from the control regime. Without this gate, "under contract" could silently degrade to "same as control."

Status flow — PROPOSED → ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME

Every gate has a status. The status reflects the strength of the evidence that the gate is correctly asserting what it claims.

The three statuses

`PROPOSED`

The gate is defined in the contract YAML.
No test asserts it yet (or tests exist but don't pass deterministically).
A grep + structural search confirms the gate has a body in the YAML, but the assertion is not yet mechanical.
CI may print "WARNING: gate-X is PROPOSED" but does not block on it.

`ACTIVE_ALGORITHM_LEVEL`

A deterministic, repeatable test asserts the gate.
The test passes on every CI run.
But no measured discharge has been recorded — i.e., no operator has dispatched a real bench against real systems and stamped the result into evidence/.
The gate is algorithm-validated but not empirically validated.

`ACTIVE_RUNTIME`

A measured discharge exists in evidence/ with a date, an n, and an aggregate score.
The gate is now both algorithm-validated AND empirically validated.
This is the highest status; gates that reach ACTIVE_RUNTIME are the project's hardest evidence.

Transition rules

                +-------------+
                |  PROPOSED   |
                +------+------+
                       |
                       |  (1) write a falsifier test
                       |  (2) test passes deterministically on CI
                       |  (3) flip status in contract YAML
                       ▼
            +-------------------------+
            | ACTIVE_ALGORITHM_LEVEL  |
            +------------+------------+
                         |
                         |  (1) operator dispatches a real bench
                         |  (2) evidence/<phase>/<artifact>.json captured
                         |  (3) calibration record on file (CCPA-019)
                         |  (4) flip status in contract YAML
                         ▼
                  +----------------+
                  | ACTIVE_RUNTIME |
                  +----------------+

Every transition is a YAML-level edit reviewed in PR, gated by pv validate, and asserted by FALSIFY-CCPA-012 (pv_contract_gate_on_commit).
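The transition rules form a small one-way state machine. The sketch below encodes the preconditions from the diagram; `GateStatus`, `Evidence`, and `next_status` are hypothetical names, and the real transition is a reviewed YAML edit, not a function call.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GateStatus {
    Proposed,
    ActiveAlgorithmLevel,
    ActiveRuntime,
}

/// Hypothetical summary of the evidence available for one gate.
struct Evidence {
    falsifier_test_passes_on_ci: bool,
    bench_artifact_captured: bool,
    calibration_on_file: bool,
}

/// Advances a gate by at most one step, and only if the preconditions hold.
fn next_status(current: GateStatus, evidence: &Evidence) -> GateStatus {
    match current {
        GateStatus::Proposed if evidence.falsifier_test_passes_on_ci => {
            GateStatus::ActiveAlgorithmLevel
        }
        GateStatus::ActiveAlgorithmLevel
            if evidence.falsifier_test_passes_on_ci
                && evidence.bench_artifact_captured
                && evidence.calibration_on_file =>
        {
            GateStatus::ActiveRuntime
        }
        // No precondition met, or already at the highest status.
        other => other,
    }
}

fn main() {
    let evidence = Evidence {
        falsifier_test_passes_on_ci: true,
        bench_artifact_captured: true,
        calibration_on_file: false,
    };
    let s1 = next_status(GateStatus::Proposed, &evidence);
    let s2 = next_status(s1, &evidence);
    // Without a calibration record the gate stays at ACTIVE_ALGORITHM_LEVEL.
    println!("{:?} -> {:?}", s1, s2);
}
```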

Status distribution at v1.32.0

Status	Count	Gates
`ACTIVE_RUNTIME`	15	CCPA-001..012 and CCPA-014..016 (CCPA-007 is additionally marked HARD-BLOCKING and CCPA-008 ADVISORY in the gate table)
`PROPOSED`	4	CCPA-017, 018, 019, 020
`DISCHARGED`	1	CCPA-013 (`first_recorded_parity_score`, M150)

DISCHARGED is the terminal state — the gate's claim was empirically met, and the gate-as-assertion is preserved for historical record but no longer fires.

The V1_ gate prefix (Phase 6)

V1_001..V1_004 are distinct from CCPA-NNN. They live in aprender's contracts (qwen3_moe-serve-dispatch-v1.yaml et al.). V1_001..V1_003 gate the infrastructure that V1_004 (Phase 6 student pass rate) depends on:

V1_001 — qwen3_moe serve dispatch (ACTIVE_RUNTIME)
V1_002 — sampling (temperature/top_k/top_p) (ACTIVE_RUNTIME)
V1_003 — streaming SSE (DISCHARGED on gx10 Blackwell)
V1_004 — Phase 6 non-zero student pass rate (open as of this writing)

Once V1_004 discharges, CCPA-017 (project_scale_parity_bound) becomes eligible to flip from PROPOSED to ACTIVE_ALGORITHM_LEVEL.

The V1_004 chain

V1_004 — "Phase 6 bench non-zero student pass rate against a Qwen3-Coder-30B-A3B-Instruct GGUF" — is the open gate. The chain of work toward discharging it has produced the most empirically interesting body of findings in CCPA's history.

This chapter is the canonical record of that chain.

The chain at a glance

M-row	Date (2026)	What it shipped
M280	05-19	Phase 6 SUSPENSION declared (1.5B model below testability floor)
M286	05-20	M32d MoE KV cache shipped (19× speedup on Qwen3-MoE)
M287	05-20	Greedy baseline: uniform `driver_error` ("Human:" infinite loop)
M288	05-20	Diagnosis: 3 root causes (no EOS stop_token, no clean_chat_output, no few-shot prompt)
M289	05-20	Plumbing shipped: 3-knob HTTP wire-up (`APR_AGENT_TEMPERATURE`, etc.)
M290	05-20	5-PR snapshot: aprender#1832, #1837, #1842, #1844, #1846 all merged
M291	05-21	sub-bench B pattern shift: `driver_error` → `oracle_failed_after_max_turns` (text-only loops, 0 tool_calls)
M292	05-21	`ArenaOutcome::AgentTextLoop` detector + 7 tests (Gap 3 closure)
M293	05-21	`PHASE6_MAX_CONSECUTIVE_TEXT_TURNS` env var wiring at script level
M294	05-22	Scope doc for non-Coder Qwen3-30B-A3B-Instruct-2507 A/B; download + smoke confirmed tool_call JSON emission

The hypothesis-evolution narrative

Hypothesis 1 (start of chain): inference stack is the bottleneck

Premise: V1_004 can't discharge because the apr serve inference path for qwen3_moe is too slow / too broken to fit 20 turns × 1024 max_tokens within a 60min wall budget.

Test: ship M32d MoE KV cache (19× speedup), enable 3-knob sampling, add EOS stop_token and clean_chat_output post-strip.

Result: the M287 driver_error pattern (infinite "Human:" loop) was broken. Sub-bench B on Qwen3-Coder-30B-A3B shifted to a diverse outcome distribution.

Conclusion: inference stack was a necessary but not sufficient fix.

Hypothesis 2 (M291): few-shot prompt is the bottleneck

Premise: the model is now finite-output (M287 runaway broken), but it emits Markdown rust blocks instead of <tool_call> JSON. Adding 3 concrete <tool_call> few-shot examples in CODE_SYSTEM_PROMPT (#1849) should override the Markdown prior.

Test: sub-bench B with #1849's few-shot prompt + 3-knob sampling + EOS + clean_chat_output.

Result: fixture 1 of sub-bench B → oracle_failed_after_max_turns turns=20, ALL 20 turns text-only, tool_use_count: 0. The prompt fix didn't shift behavior.

Conclusion: refuted. Few-shot examples didn't override the model's training distribution.

Hypothesis 3 (M291): active-params count is the bottleneck

Premise: Qwen3-Coder-30B-A3B is 30B-total / 3B-active (MoE routing). Maybe 3B active params is below the agentic-code floor. A dense 7B (Qwen2.5-Coder-7B-Instruct) with 2.3× more active params should fare better.

Test: 17/20 fixtures of Qwen2.5-Coder-7B-Instruct under same 3-knob config.

Result: 12× wall_timeout, 3× oracle_failed_after_max_turns, 2× driver_error, 0 oracle_passed, 0 tool_calls across all inspected fixtures. Same Markdown-block pattern.

Conclusion: refuted. Active params count isn't the variable.

Hypothesis 4 (M294, current): Qwen-Coder finetune family is the bottleneck

Premise: both tested models (Qwen3-Coder-30B-A3B and Qwen2.5-Coder-7B-Instruct) are Qwen-Coder finetunes. Maybe the Coder finetune family specifically has a sticky Markdown-block training prior. A non-Coder Instruct variant — same Qwen3-MoE architecture, same active-param count — should fare better.

Test: smoke Qwen3-30B-A3B-Instruct-2507 (non-Coder) with same CODE_SYSTEM_PROMPT + fixture 1 prompt.

Result: the model emitted {"name": "file_read", "input": {"path": "src/lib.rs"}} + </tool_call> in 20 completion tokens, finish_reason: stop. Categorically different from Coder family (which always emitted 500+ tokens of Markdown).

Conclusion: empirically confirmed at smoke level. Full bench corpus in progress as of 2026-05-22.

What this means for V1_004

V1_004's gate text names Qwen3-Coder-30B-A3B-Instruct specifically. A successful Qwen3-30B-A3B-Instruct-2507 (non-Coder) dispatch is diagnostic evidence, not a contract-level discharge of V1_004 as written.

The path forward, post-empirical-confirmation:

(a) Amend V1_004's gate text to allow any qwen3_moe architecture (via the M22 5-step ritual: contract bump in aprender → fixture update → coverage rerun → calibration record → CCPA-side mirror PR)
(b) OR propose a new gate (V1_005?) against the non-Coder variant
(c) OR engineer a post-decode Markdown→tool_call parser in apr code to unlock Qwen-Coder family for the existing V1_004 gate

This is an operator-coordinated decision tree. The empirical work has produced the evidence; the contract-level choice is upstream.

M286 — M32d MoE KV cache shipped

Date: 2026-05-20

aprender PR: #1832

What it shipped: forward_single_qwen3_moe_with_cache — a per-token cache-aware MoE forward path for the qwen3_moe architecture.

Why it was necessary

The original qwen3_moe inference path in apr serve was per-full-prompt: every new token required re-processing the entire context from scratch. For a 1024-token max-tokens cap on a 7-turn conversation (~3000 prompt tokens accumulated), this meant O(n²) work per turn.

Empirically: a single 20-turn fixture on Qwen3-Coder-30B-A3B at this regime took ~34min per turn on CPU. The M286 cache implementation cut it to ~6min per fixture (across all turns) — a 19× speedup.

What it changed structurally

old:  prompt → embed → 48× (attention + MoE FFN) → LM head → next_token
       (re-runs entire context every token)

new:  if first_token:
        prompt → embed → 48× (attention with cache.append + MoE FFN) → LM head → next_token
      else:
        last_token_embed → 48× (attention with cache.get_k/get_v GQA + MoE FFN) → LM head → next_token
       (only the new token is processed; cache provides past K/V)

The implementation lives in crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs. The single_cache_final_output helper (final norm + LM head) was bumped from private to pub(crate) to allow the MoE module to share it with the dense path.

Falsifiers shipped with it

qwen3-moe-serve-dispatch-v1 (V1_001) → ACTIVE_RUNTIME
moe_kv_cache_equivalence — numerical-equivalence test: cache-on vs cache-off forward passes produce identical logits modulo F32 precision
m32d_perf — ≥5 tok/s floor under CPU compute mode
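The cache-on versus cache-off equivalence can be illustrated on a toy scale. The sketch below is not the aprender implementation: it models a single attention head with no learned projections (query, key, and value are all the raw token embedding), with no GQA and no MoE FFN, and every name in it is hypothetical. It shows only the structural point: appending K/V rows once per token and attending from the newest token gives the same output as rebuilding the full context each time.

```rust
/// Scaled dot-product attention for one query over the given keys and values.
fn attend(query: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = (query.len() as f32).sqrt();
    let scores: Vec<f32> = keys
        .iter()
        .map(|k| k.iter().zip(query).map(|(a, b)| a * b).sum::<f32>() / scale)
        .collect();
    // Numerically stable softmax.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();
    let mut out = vec![0.0f32; query.len()];
    for (w, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / denom) * x;
        }
    }
    out
}

/// Cache-off path: the whole context is revisited for every new token.
fn last_output_recompute(context: &[Vec<f32>]) -> Vec<f32> {
    let query = context.last().expect("context must be non-empty");
    attend(query, context, context)
}

/// Cache-on path: K/V rows are stored once and reused on later steps.
#[derive(Default)]
struct KvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl KvCache {
    /// Processes only the new token; past K/V come from the cache.
    fn step(&mut self, token_embed: &[f32]) -> Vec<f32> {
        self.keys.push(token_embed.to_vec());
        self.values.push(token_embed.to_vec());
        attend(token_embed, &self.keys, &self.values)
    }
}

fn main() {
    let tokens: Vec<Vec<f32>> = vec![vec![1.0, 0.0], vec![0.5, 0.5], vec![0.0, 2.0]];
    let mut cache = KvCache::default();
    for (i, token) in tokens.iter().enumerate() {
        let cached = cache.step(token);
        let recomputed = last_output_recompute(&tokens[..=i]);
        println!("token {}: cached={:?} recomputed={:?}", i, cached, recomputed);
    }
}
```

In a real model the cache-off path also recomputes the K/V projections for every past position at every layer, which is where the savings come from; the toy omits that cost and keeps only the equivalence.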

Why this was the unlock

Without M286, V1_004 was a wall-time problem (the test couldn't be run within reasonable wall-time on the operator's GPU/CPU budget). With M286, the wall-cost dropped 19×, enabling the empirical chain that followed (M287, M291, M294).

M286 is the load-bearing inference infrastructure for every Phase 6 dispatch.

M287 — greedy baseline pattern

Date: 2026-05-20

Bench wall: ~5hr (20 fixtures × ~15min each, with wall_seconds = 3600 per fixture)

Configuration

APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 \
bash scripts/phase-6-bench.sh

Greedy decoding (no temperature, no top_k, no top_p, no repetition penalty). The apr binary was post-M32d but pre-3-knob plumbing.

Result

20/20 fixtures: uniform outcome=driver_error.

Student pass rate: 0/20 (0.00)
Teacher pass rate: 19/20 (0.95)
Recovery rate: 0.225

What the trace showed

Inspecting fixture 10's student.bench.json (oo__05-builder-pattern, 7 turns to driver_error):

turn 1 invocation:
  "Human: I need to see the full implementation..."
  Human: I need to see the full implementation...
  Human: I need to see the full implementation...
  Human: I need to see the full implementation...
  ...

The model emitted its own user-turn boundary ("Human:") repeatedly, never stopping. The text grew until the per-turn timeout (900s) fired. The driver then exited with the timeout error, which phase-6-bench.sh recorded as driver_error.

Root cause diagnosis (three independent gaps)

1. No EOS stop_token: try_qwen3_moe_backend in apr serve didn't populate QuantizedGenerateConfig.stop_tokens with the model's <|im_end|> EOS, so the decode loop ignored the natural turn boundary.
2. No post-decode cleanup: try_qwen3_moe_backend didn't call clean_chat_output to strip leaking "Human:" / "User:" / <|im_end|> prefixes — the runaway leaked into the captured chat response verbatim.
3. No format adherence guidance: CODE_SYSTEM_PROMPT described the <tool_call> format but gave no concrete examples. The 30B-Coder model's training distribution favored Markdown code blocks; without explicit examples it didn't emit <tool_call> JSON.

The dense GGUF path in apr serve already handled the first two gaps correctly; the MoE chat-backend path (added later for qwen3_moe) lacked both.

What M287 unlocked

The uniform driver_error pattern made the failure mode legible. Before M287, the assumption was "Qwen3-Coder-30B can't do agentic coding"; M287's evidence sharpened it to "the runaway is a fixable infrastructure issue, not a fundamental model limit."

The three gaps motivated M288-M290's 5-PR fix burst:

aprender#1832 — M32d KV cache (already merged)
aprender#1837 — qwen3-moe-sampling-v1 contract
aprender#1842 — sampling impl
aprender#1844 — repetition penalty
aprender#1846 — 3-knob HTTP wire-up (the operator-facing surface)
aprender#1849 — few-shot <tool_call> examples (Gap 3)
aprender#1852 — EOS stop_token + clean_chat_output (Gaps 1 + 2)
aprender#1853 — clean_chat_output start-of-string leading-prefix strip (M291 follow-on)

M291 — sub-bench B pattern shift

Date: 2026-05-21

Source PR: CCPA#259 (merged)

What changed from M287

	M287 (greedy)	M291 (sub-bench B)
Sampling	greedy (temp=0)	temp=0.3, top_k=50, top_p=0.95
Repetition penalty	none	repeat_penalty=1.2, repeat_last_n=64
EOS stop_token	NOT plumbed	`<|im_end|>` plumbed via #1852
clean_chat_output	NOT called in MoE path	called via #1852
CODE_SYSTEM_PROMPT	no `<tool_call>` examples	3 concrete examples + anti-Markdown rule via #1849
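
To make the sampling rows concrete, here is a minimal sketch of the repetition penalty and temperature knobs. It is not the aprender implementation: the function names are hypothetical, and the penalty rule assumes the Candle convention named in the Phase 6 environment-variable table (a non-negative logit is divided by the penalty, a negative logit is multiplied by it, once per distinct token in the window). Top-k and top-p filtering are left out.

```rust
use std::collections::HashSet;

/// Hypothetical helper: penalize tokens seen in the last `last_n` context tokens.
/// Assumed Candle convention: logit >= 0 is divided by `penalty`, logit < 0 is multiplied.
fn apply_repeat_penalty(logits: &mut [f32], context: &[u32], penalty: f32, last_n: usize) {
    let start = context.len().saturating_sub(last_n);
    let mut seen = HashSet::new();
    for &tok in &context[start..] {
        // Penalize each distinct token once, even if it repeats inside the window.
        if !seen.insert(tok) {
            continue;
        }
        if let Some(l) = logits.get_mut(tok as usize) {
            *l = if *l >= 0.0 { *l / penalty } else { *l * penalty };
        }
    }
}

/// Hypothetical helper: temperature scaling. A temperature of 0 is treated as greedy (no scaling).
fn apply_temperature(logits: &mut [f32], temperature: f32) {
    if temperature > 0.0 {
        for l in logits.iter_mut() {
            *l /= temperature;
        }
    }
}

/// Greedy pick: index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best
}
```

With repeat_penalty=1.2, a token that was narrowly winning under greedy decoding can lose to its runner-up once it has already appeared in the window, which is the mechanism that breaks a verbatim repetition loop.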

Result on fixture 1 (leetcode__01-two-sum)

Before: outcome=driver_error turns_before_error=7 (M287 pattern).

After: outcome=oracle_failed_after_max_turns turns=20.

{
  "outcome": { "kind": "oracle_failed_after_max_turns", "turns": 20 },
  "history_len": 20,
  "tool_use_count": 0,
  "kinds": [ { "k": "text", "n": 20 } ]
}

Every one of the 20 turns: text-only. No tool_call. result.kind: "skipped" across all 20.

Trace excerpt (fixture 1, turn 1)

Human: Here's what I have so far:

```rust
pub fn two_sum(nums: &[i32], target: i32) -> (usize, usize) {
    for i in 0..nums.len() {
        for j in (i + 1)..nums.len() {
            if nums[i] + nums[j] == target {
                return (i, j);
            }
        }
    }
    panic!("No two sum solution found");
}
```

The model's code is functionally correct (it matches what the oracle expects: return (i, j)). But the fix is wrapped in a Markdown ```rust``` block, not in a <tool_call> JSON. The arena driver classifies it as a text-only turn, so no file edit happens and no oracle re-run is triggered.

Three independent gaps surfaced

Gap 1 — `clean_chat_output` start-of-string leak

clean_chat_output's stop sequences anchor on \nHuman: / \n\nHuman:, which require a preceding newline. When the model leaks "Human:" at start-of-string (no newline before it), the truncate-at-earliest loop misses it. Fixed in aprender#1853 (https://github.com/paiml/aprender/pull/1853).
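
A minimal sketch of the Gap 1 behavior, under stated assumptions: this is not the aprender code, the constant names are hypothetical, and it assumes the fix strips a leading role prefix and then applies the usual truncate-at-earliest pass over newline-anchored prefixes and the <|im_end|> marker.

```rust
/// Hypothetical role prefixes and end marker; the real list lives in aprender.
const ROLE_PREFIXES: [&str; 2] = ["Human:", "User:"];
const END_MARKER: &str = "<|im_end|>";

/// Sketch of a cleaner that also handles a role prefix at start-of-string.
fn clean_chat_output(raw: &str) -> &str {
    let mut text = raw;
    // Start-of-string case: there is no preceding newline, so strip the prefix directly.
    for p in ROLE_PREFIXES.iter() {
        if let Some(rest) = text.strip_prefix(*p) {
            text = rest.trim_start();
            break;
        }
    }
    // Truncate at the earliest newline-anchored role prefix or end marker.
    let mut cut = text.len();
    for p in ROLE_PREFIXES.iter() {
        let needle = format!("\n{}", p);
        if let Some(i) = text.find(&needle) {
            cut = cut.min(i);
        }
    }
    if let Some(i) = text.find(END_MARKER) {
        cut = cut.min(i);
    }
    text[..cut].trim_end()
}
```

Without the first loop, an input that begins with "Human:" passes through untouched, because every stop sequence in the second loop needs a newline in front of the prefix.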

Gap 2 — few-shot prompt insufficient to override Markdown distribution

CODE_SYSTEM_PROMPT post-#1849 contains 3 concrete <tool_call> examples plus an explicit "DO NOT use Markdown ```rust``` code blocks" rule. Empirically, on Qwen3-Coder-30B, this guidance is overridden by the model's training distribution. No PR closes this; it's a model-class-dependent finding.

Gap 3 — arena driver doesn't recover from skipped turns

Had the model emitted <tool_call> in turn 1 and the file edit succeeded, fixture 1's oracle (cargo test) would have passed, because the model's code is correct. But the arena driver doesn't recognize "0 tool_uses across 20 turns" as a stuck state: it just keeps prompting "Continue:" and the model keeps re-emitting variations of its already-correct code in Markdown form.

Fixed in CCPA#260 (M292) (https://github.com/paiml/claude-code-parity-apr/pull/260): ArenaOutcome::AgentTextLoop variant + opt-in detector.

Empirical conclusion (M291)

V1_004 is partially discharged: the M287 prerequisite-violation pattern (uniform driver_error from infinite "Human:" loop) is broken. The new pattern (oracle_failed_after_max_turns from training-distribution stickiness) is a different class of failure: finite, reproducible, debuggable.

V1_004 is not fully discharged: no fixture has yet shown outcome=oracle_passed. The bench continues; fixtures 2-20 reveal whether the pattern is uniform (training-distribution-locked across all task types) or sporadic (some fixtures elicit tool_call format).

M292 — Agent-Text-Loop detector

Date: 2026-05-21

Source PR: CCPA#260 (merged)

Companion PR: CCPA#261 (M293; env-var wiring)

What it adds

A new ArenaOutcome variant + an opt-in detector that catches the M291 failure signature (consecutive text-only turns) before the full 20-turn budget is consumed.

`ArenaOutcome::AgentTextLoop`

AgentTextLoop {
    consecutive_text_turns: u32,
    last_text_excerpt: String,    // first 200 chars of the most recent text turn
}

Captures the "talking but not acting" failure class distinctly from OracleFailedAfterMaxTurns.

`ArenaSession::with_max_consecutive_text_turns(cap)`

Builder method. cap=0 (default) disables the detector — preserves M287/M291 baseline behavior. Operators opt in per-run.

`AgentTextLoopState` rolling counter

Parallel to ComplianceTrapState. Pure logic:

Text invocation → increment counter, snapshot the excerpt.
Non-text invocation (Bash/Read/Write/Edit/etc.) → reset counter, clear excerpt.
When counter reaches cap → return AgentTextLoop outcome with current excerpt.
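
The three rules above can be sketched as a small state machine. This is an illustration, not the ccpa-arena source: the Invocation enum is a simplified stand-in, the observe method name is hypothetical, and the excerpt is cut at 200 characters without the trailing ellipsis that the tests below mention.

```rust
/// Simplified stand-in for an arena invocation; the real type has more variants.
enum Invocation {
    Text(String),
    Tool, // Bash/Read/Write/Edit/etc.
}

/// Hypothetical sketch of the rolling counter.
struct AgentTextLoopState {
    cap: u32, // 0 disables the detector
    consecutive: u32,
    excerpt: String,
}

impl AgentTextLoopState {
    fn new(cap: u32) -> Self {
        AgentTextLoopState { cap, consecutive: 0, excerpt: String::new() }
    }

    /// Returns Some((consecutive_text_turns, last_text_excerpt)) when the trap fires.
    fn observe(&mut self, inv: &Invocation) -> Option<(u32, String)> {
        match inv {
            Invocation::Text(t) => {
                self.consecutive += 1;
                // Char-based cut, so multi-byte text is never split mid-character.
                self.excerpt = t.chars().take(200).collect();
            }
            Invocation::Tool => {
                self.consecutive = 0;
                self.excerpt.clear();
            }
        }
        if self.cap > 0 && self.consecutive >= self.cap {
            Some((self.consecutive, self.excerpt.clone()))
        } else {
            None
        }
    }
}
```

Keeping the state free of I/O is what makes the reset and cap behavior testable without dispatching a real agent.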

Test coverage (7 new tests)

agent_text_loop_state_increments_on_text — counter increments, trap fires at cap
agent_text_loop_state_resets_on_non_text — Bash invocation resets the counter; subsequent text starts at 1
agent_text_loop_state_excerpt_truncates_long_text — 500-char input → excerpt ≤200 chars + ellipsis
run_agent_text_loop_disabled_by_default_preserves_baseline — cap=0 (default) → text-only turns run to max_turns → OracleFailedAfterMaxTurns
run_agent_text_loop_fires_at_cap_when_enabled — 5 text turns with cap=3 → AgentTextLoop after turn 3; history has 3 records
run_agent_text_loop_resets_counter_on_tool_use — 2 text + 1 bash + 2 text + 1 bash pattern → no trap (counter resets twice) → runs to max_turns
with_max_consecutive_text_turns_accessor_returns_configured_cap + max_consecutive_text_turns_default_is_zero_disabled

All 146 ccpa-arena lib tests still pass.

Opt-in by design

The detector defaults to cap=0 (disabled) because:

Existing benches in evidence/under-contract*/ should remain comparable to new runs — turning the detector on by default would change outcome distributions for control comparisons.
Future operators may want to test agents at the full 20-turn budget for non-V1_004 reasons (e.g., turn-cost ratio measurement).
Phase 6 compliance_cost_ratio aggregate sums over a specific set of outcome variants; adding a new one to the default execution path could silently change the aggregate.

Operator interface (M293)

scripts/phase-6-bench.sh now reads PHASE6_MAX_CONSECUTIVE_TEXT_TURNS (default 0 = disabled). When > 0, threads --max-consecutive-text-turns=N into the ccpa-arena-bench invocation.

# Default — baseline behavior, no detector
bash scripts/phase-6-bench.sh

# Opt in — bail at 5 consecutive text-only turns
PHASE6_MAX_CONSECUTIVE_TEXT_TURNS=5 bash scripts/phase-6-bench.sh
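
The real wiring lives in the bash script; the sketch below restates the same rule in Rust for illustration only. The function name is hypothetical, and treating an unset or unparsable value as 0 (disabled) is an assumption.

```rust
/// Hypothetical sketch: map the PHASE6_MAX_CONSECUTIVE_TEXT_TURNS value to an optional CLI flag.
fn text_turn_flag(env_value: Option<&str>) -> Option<String> {
    // Unset or unparsable values fall back to 0, which keeps the detector off.
    let cap: u32 = env_value
        .and_then(|v| v.trim().parse().ok())
        .unwrap_or(0);
    if cap > 0 {
        Some(format!("--max-consecutive-text-turns={}", cap))
    } else {
        None
    }
}
```

A caller could pass `std::env::var("PHASE6_MAX_CONSECUTIVE_TEXT_TURNS").ok().as_deref()` and append the flag only when the result is Some.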

Why this matters

Before M292, the M291 failure signature ("agent emits text for all 20 turns, never invokes a tool") was conflated with OracleFailedAfterMaxTurns — same outcome variant as "agent worked but produced wrong output." That conflation lost signal.

After M292, an operator inspecting scores.json can distinguish:

OracleFailedAfterMaxTurns → agent tried, wrong output
AgentTextLoop → agent didn't engage at all

This is the kind of diagnostic precision that lets the next experiment be designed correctly (the M294 finetune-A/B was scoped specifically because M291's text-loop signature is what M292 measures).

What this does NOT do

Doesn't auto-enable in scripts/phase-6-bench.sh (operator decision per-run).
Doesn't change compliance_cost_ratio / recovery_rate semantics (AgentTextLoop counts as "not oracle_passed", same as OracleFailedAfterMaxTurns).
Doesn't discharge V1_004 — student_pass_rate > 0 is still the bar.

M294 — finetune-distribution A/B

Date: 2026-05-22

Source PR: CCPA#262 (scope doc)

The hypothesis (refined to its sharpest form)

Through M286-M293 + the 17/20 Qwen2.5-Coder-7B-Instruct follow-on, four candidate variables were tested as possible root causes of the 0%-tool_call signature:

Variable	Test	Outcome
Inference stack quality	M286 KV cache + 3-knob + EOS + clean_chat_output	Necessary fix; not sufficient
Active params count	3B (30B-A3B-MoE) vs 7B (dense 7B-Coder)	Both show same 0 tool_calls — refuted
MoE vs dense	qwen3_moe (30B-A3B) vs qwen2 (7B-dense)	Both show same pattern — refuted
Few-shot prompt examples	3 concrete `<tool_call>` examples + anti-Markdown rule	No shift in pattern — refuted

The remaining variable: Qwen-Coder finetune family specifically. Both tested models (Qwen3-Coder-30B-A3B + Qwen2.5-Coder-7B-Instruct) share the Coder-specific finetune.

The hypothesis being tested at M294: hold architecture, size, inference stack constant; vary only the finetune. Specifically: swap Qwen3-Coder-30B-A3B-Instruct for Qwen3-30B-A3B-Instruct-2507 (non-Coder, same MoE arch, same size, same active params, broader instruction + tool-use training distribution).

The smoke test (one-shot, no full bench)

While downloading the 18GB Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf, the operator pointed out that waiting 40 minutes for fixture 1 was unnecessary — a single targeted smoke against the exact same system prompt + user prompt the bench would use would give the answer in 30 seconds.

The smoke payload:

System: full CODE_SYSTEM_PROMPT (the same one in apr code, with the 3 <tool_call> few-shot examples and anti-Markdown rule)
User: fixture 1 (leetcode__01-two-sum) prompt
Config: temp=0.3, top_k=50, top_p=0.95, repeat_penalty=1.2, repeat_last_n=64 (sub-bench B config)
max_tokens: 400

The response:

{"name": "file_read", "input": {"path": "src/lib.rs"}}
</tool_call>

20 completion tokens
finish_reason: "stop"
Structured JSON tool_call (missing leading <tool_call> tag, but the body is exactly what the parser expects)
No "Human:" leak, no Markdown rust block, no rambling

Empirical conclusion

The Coder-finetune-distribution hypothesis is empirically confirmed at the smoke level. The non-Coder Instruct variant emits structured tool_call JSON in 20 tokens; the Coder variant emits 500+ tokens of Markdown explanation.

Whether the full bench discharges V1_004 (i.e., oracle_passed > 0) depends on whether:

The arena parser handles the missing leading <tool_call> opening tag (bare JSON body)
The model maintains the tool_call format across all 20 turns of a fixture
The model's code is correct (a separate question from format adherence)
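
The first condition can be illustrated with a tolerant extractor. This is a sketch, not the arena parser: the function name is hypothetical, and whether the real parser accepts a bare body is exactly the open question. It only checks the shape of the body; a real parser would deserialize the JSON.

```rust
/// Hypothetical sketch: extract the JSON body of the first tool call in a response.
/// Accepts both `<tool_call>{...}</tool_call>` and a bare `{...}</tool_call>`.
fn extract_tool_call_body(response: &str) -> Option<&str> {
    const OPEN: &str = "<tool_call>";
    const CLOSE: &str = "</tool_call>";
    // No closing tag means there is no tool call to extract.
    let end = response.find(CLOSE)?;
    let head = &response[..end];
    // Prefer the text after the last opening tag; otherwise use everything before the close.
    let body = match head.rfind(OPEN) {
        Some(i) => &head[i + OPEN.len()..],
        None => head,
    }
    .trim();
    // Cheap shape check only.
    if body.starts_with('{') && body.ends_with('}') {
        Some(body)
    } else {
        None
    }
}
```

Applied to the smoke response above, the fallback branch returns the file_read body even though the opening tag is missing, while a Markdown-only answer yields None.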

What M294 unblocks

If the full bench shows ≥1 oracle_passed:

V1_004's open question is empirically answered: the bottleneck is finetune-distribution.
V1_004 as written names Qwen3-Coder-30B-A3B-Instruct specifically — a discharge requires either a contract amendment (M22 5-step ritual) or a new V1_005 gate.
M280 SUSPENSION can be lifted on a contract-level basis.

If the full bench still shows 0 oracle_passed:

The tool_call emission is necessary but not sufficient.
Code quality / correctness becomes the next variable to investigate.
A post-decode parser in apr code that converts Markdown rust blocks to file_edit calls becomes a higher-priority engineering target (which would unlock Qwen-Coder family for V1_004 as written).

CLI reference

`ccpa`

The user-facing CLI for the static path.

# Score a single teacher/student pair
ccpa diff fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl \
          fixtures/canonical/0001-edit-readme/student.ccpa-trace.jsonl

# Score the whole corpus + bidirectional-sensitivity check
ccpa corpus fixtures/canonical/             # canonical MUST PASS
ccpa corpus fixtures/regression/            # regression MUST FAIL
ccpa corpus fixtures/canonical/ --json      # machine-readable

# Walk the parity-matrix coverage gate
ccpa coverage \
  --apr-code-parity-yaml ../aprender/contracts/apr-code-parity-v1.yaml \
  --fixtures-dir fixtures/canonical/ \
  --oos-rows keyboard-shortcuts,status-line

# Validate a JSONL trace against the schema
ccpa validate fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl

`ccpa-arena-bench`

The Arena dispatcher (operator-coordinated).

ccpa-arena-bench \
  --cwd /tmp/p6-uc-leetcode__01-two-sum-student.xyz \
  --prompt-file fixtures/under-contract/leetcode/01-two-sum/prompt.txt \
  --oracle-cmd "cargo test 2>&1" \
  --oracle-pattern "test result: ok" \
  --max-turns 20 \
  --wall-seconds 3600 \
  --oracle-check-interval 3 \
  --driver-per-turn-timeout 900 \
  --compliance-enforced \
  --max-consecutive-compliance-failures 3 \
  --max-consecutive-text-turns 5 \
  --driver-binary /home/noah/.local/bin/apr \
  --driver-name apr \
  --driver-extra-arg code \
  --driver-extra-arg --model=/path/to.gguf

Outputs BenchResult JSON to stdout. Wrapped by the phase scripts.
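
For orientation, here is a minimal sketch of how an oracle command and pattern pair could be evaluated. It is not the ccpa-arena-bench implementation: the function names are hypothetical, the pattern is assumed to be a plain substring match, and the command is assumed to run through sh -c in the fixture working directory.

```rust
use std::io;
use std::process::Command;

/// Hypothetical check: the oracle passes when its captured output contains the pattern.
fn oracle_passed(output: &str, pattern: &str) -> bool {
    output.contains(pattern)
}

/// Hypothetical runner: execute the oracle command in `cwd` and check its stdout.
/// A `2>&1` inside `cmd` folds stderr into stdout, as in the invocation above.
fn run_oracle(cmd: &str, pattern: &str, cwd: &str) -> io::Result<bool> {
    let out = Command::new("sh")
        .arg("-c")
        .arg(cmd)
        .current_dir(cwd)
        .output()?;
    let stdout = String::from_utf8_lossy(&out.stdout);
    Ok(oracle_passed(&stdout, pattern))
}
```

Splitting the pure check from the process call keeps the pass/fail rule testable without a cargo workspace on disk.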

`scripts/phase-{3,5,6}-bench.sh`

Operator-facing corpus walkers.

# Phase 3 — function-scale MultiPL-E-Rust HumanEval
bash scripts/phase-3-bench.sh

# Phase 5 — project-scale Arena (3 real GitHub-issue fixtures)
bash scripts/phase-5-arena-bench.sh

# Phase 5 — calibration-and-scale (15 synthetic-deterministic fixtures, M242)
bash scripts/phase-5-calibration-bench.sh

# Phase 6 — under-contract dispatch
APR_MODEL=/home/noah/models/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf \
  PHASE6_COMPLIANCE_ENFORCED=1 \
  PHASE6_MAX_TURNS=20 \
  PHASE6_WALL_SECONDS=3600 \
  APR_AGENT_TEMPERATURE=0.3 \
  APR_AGENT_TOP_K=50 \
  APR_AGENT_TOP_P=0.95 \
  APR_AGENT_REPEAT_PENALTY=1.2 \
  APR_AGENT_REPEAT_LAST_N=64 \
  PHASE6_MAX_CONSECUTIVE_TEXT_TURNS=5 \
  bash scripts/phase-6-bench.sh

Phase 6 environment variables

Env	Default	What it controls
`APR_MODEL`	`Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf`	GGUF path passed to `apr serve`
`APR_TIMEOUT_S`	900	Per-turn driver subprocess timeout
`APR_AGENT_HTTP_TIMEOUT_S`	1500	apr code → apr serve HTTP timeout
`APR_AGENT_MAX_TOKENS_CAP`	1024	Max tokens per assistant turn
`APR_AGENT_TEMPERATURE`	unset (greedy)	Sampling temperature
`APR_AGENT_TOP_K`	unset	Top-k filter
`APR_AGENT_TOP_P`	unset	Nucleus (top-p) filter
`APR_AGENT_REPEAT_PENALTY`	unset	Repetition penalty (Candle convention)
`APR_AGENT_REPEAT_LAST_N`	unset	Window for repetition penalty
`APR_AGENT_SEED`	random	Deterministic sampling seed
`PHASE6_MAX_TURNS`	20	Multi-turn cap
`PHASE6_WALL_SECONDS`	3600	Per-fixture wall-clock budget
`PHASE6_ORACLE_INTERVAL`	3	Oracle check cadence (turns)
`PHASE6_COMPLIANCE_ENFORCED`	1	Per-Write/Edit pmat comply check
`PHASE6_MAX_CONSECUTIVE_COMPLIANCE_FAILURES`	3	Compliance-Trap cap
`PHASE6_MAX_CONSECUTIVE_TEXT_TURNS` (M293)	0 (disabled)	Agent-Text-Loop cap

Local dev tier sweeps

make tier1          # fmt + clippy + check          (<5s)
make tier2          # tier1 + tests                 (<30s)
make tier3          # tier2 + cov + comply + pv     (1-3 min)
make install-hooks  # FALSIFY-CCPA-012 pre-commit hook
make install-tools  # local tools matching CI exactly

Trace JSON Schema reference

The full schema is in contracts/claude-code-parity-apr-v1.yaml § trace_schema. This page is a quick reference; the YAML is canonical.

Record kinds

// session_start — first record of every trace
{
  "kind": "session_start",
  "session_id": "string",
  "cwd": "/absolute/path",
  "git_commit": "deadbeef..."
}

// user_prompt — user-initiated turn
{
  "kind": "user_prompt",
  "text": "Fix the failing test.",
  "attachments": [/* optional */]
}

// assistant_turn — model response
{
  "kind": "assistant_turn",
  "blocks": [
    {"type": "text", "text": "I'll start by reading the file."},
    {"type": "tool_use", "id": "tu_1", "name": "Read", "input": {"path": "src/lib.rs"}}
  ],
  "stop_reason": "tool_use"  // or "end_turn", "max_tokens", "stop_sequence"
}

// tool_result — tool execution result
{
  "kind": "tool_result",
  "tool_use_id": "tu_1",
  "content": "<file contents>",
  "is_error": false
}

// session_end — last record
{
  "kind": "session_end",
  "reason": "end_turn"  // or "max_turns", "wall_timeout", "driver_error", etc.
}

// hook_event — hook fired (schema v2, M15)
{
  "kind": "hook_event",
  "hook_name": "pre-tool-use",
  "trigger": "PreToolUse",
  "tool_use_id": "tu_1"  // optional; null if pre-session
}

// skill_invocation — skill invoked (schema v2, M15)
{
  "kind": "skill_invocation",
  "skill_name": "explain",
  "args": {"depth": "medium"}
}

Block types (inside `assistant_turn.blocks[]`)

// Text — plain text output
{"type": "text", "text": "..."}

// ToolUse — a tool call
{"type": "tool_use", "id": "tu_<n>", "name": "Bash|Read|Write|Edit|...", "input": {...}}

// Thinking — extended thinking (claude-only; optional)
{"type": "thinking", "text": "..."}

stop_reason values

Value	Meaning
`tool_use`	Model emitted a tool_call; turn ends here
`end_turn`	Model's natural turn boundary (e.g., emitted EOS)
`max_tokens`	Hit the token cap
`stop_sequence`	Hit a configured stop sequence

Rust types

The Rust-side types are in crates/ccpa-trace/src/lib.rs:

pub struct Trace { pub records: Vec<Record> }

#[serde(tag = "kind", rename_all = "snake_case")]
pub enum Record {
    SessionStart { session_id: String, cwd: PathBuf, git_commit: String },
    UserPrompt { text: String, attachments: Vec<Attachment> },
    AssistantTurn { blocks: Vec<Block>, stop_reason: StopReason },
    ToolResult { tool_use_id: String, content: String, is_error: bool },
    SessionEnd { reason: SessionEndReason },
    HookEvent { hook_name: String, trigger: HookTrigger, tool_use_id: Option<String> },
    SkillInvocation { skill_name: String, args: serde_json::Value },
}

The roundtrip falsifier (FALSIFY-CCPA-001) asserts that every value serializes → parses → re-serializes losslessly.

Contract YAML reference

The canonical contract YAML lives in aprender:

Canonical: paiml/aprender/contracts/claude-code-parity-apr-v1.yaml
Pinned here: contracts/pin.lock — sha256 + commit reference

Pin format:

[pin]
aprender_commit = "16f25af06"
aprender_pr = 1078
aprender_pr_state = "OPEN"
contract_sha256 = "..."
last_synced = "2026-05-02"

Top-level structure

schema_version: "1.32.0"
name: "claude-code-parity-apr-v1"

gates:
  FALSIFY-CCPA-001:
    name: "trace_schema_roundtrip"
    status: "ACTIVE_RUNTIME"
    description: "..."
    asserted_by:
      - "crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs"

  FALSIFY-CCPA-NNN: { ... }

trace_schema:
  version: 2
  records:
    session_start: { ... }
    # ...

per_tool_equivalence:
  Bash: { ... }
  Read: { ... }
  Write: { ... }
  # ...

sovereignty:
  allowed_network_endpoints:
    - "127.0.0.1:*"
    - "localhost:*"
  forbidden_env_vars:
    - "ANTHROPIC_API_KEY"
    - "OPENAI_API_KEY"
    # ...

Validation — `pv validate`

pv is the dogfooded contract validator (aprender-contracts-cli). It enforces:

Schema correctness (every gate has the required fields)
Cross-reference correctness (asserted_by files exist)
Pin correctness (contracts/pin.lock's sha256 matches the aprender source at the pinned commit)

pv validate contracts/claude-code-parity-apr-v1.yaml
pv pin-check contracts/pin.lock --aprender-path ../aprender

CI runs both on every PR (FALSIFY-CCPA-012).

Adding a new gate

The M22 5-step ritual:

Propose — add the gate to the canonical aprender YAML at PROPOSED status. Open an aprender PR.
Test — write the falsifier test in the corresponding crate of this repo. PR against this repo.
Mirror — update contracts/pin.lock to the new aprender commit. PR (mechanical).
Verify — CI runs pv validate + pv pin-check + the new falsifier test on every PR. All three must be green.
Promote — once the test passes deterministically, flip status to ACTIVE_ALGORITHM_LEVEL (or ACTIVE_RUNTIME if backed by a measured discharge). PR.

Adding gates without all 5 steps is rejected. The ritual is pv validate-asserted; bypassing it is mechanically impossible.

Falsification gate IDs

Quick cross-reference. See The 20 gates for full descriptions.

CCPA prefix (this repo's gates)

ID	Name	Status
CCPA-001	`trace_schema_roundtrip`	ACTIVE_RUNTIME
CCPA-002	`replay_determinism`	ACTIVE_RUNTIME
CCPA-003	`mock_completeness`	ACTIVE_RUNTIME
CCPA-004	`tool_call_equivalence`	ACTIVE_RUNTIME
CCPA-005	`file_mutation_equivalence`	ACTIVE_RUNTIME
CCPA-006	`sovereignty_on_replay`	ACTIVE_RUNTIME
CCPA-007	`corpus_coverage`	HARD-BLOCKING (M16)
CCPA-008	`parity_score_bound`	ADVISORY (M230)
CCPA-009	`ci_main_branch_green`	ACTIVE_RUNTIME
CCPA-010	`pmat_comply_100pct`	ACTIVE_RUNTIME
CCPA-011	`line_coverage_100pct`	ACTIVE_RUNTIME
CCPA-012	`pv_contract_gate_on_commit`	ACTIVE_RUNTIME
CCPA-013	`first_recorded_parity_score`	DISCHARGED
CCPA-014	`os_event_parity_bound`	ACTIVE_RUNTIME
CCPA-015	`os_trace_output_purity`	ACTIVE_RUNTIME
CCPA-016	`outcome_parity_bound`	ACTIVE_RUNTIME
CCPA-017	`project_scale_parity_bound`	PROPOSED (v1.28.0)
CCPA-018	`arena_recovery_rate_bound`	PROPOSED (v1.29.0)
CCPA-019	`calibration_required_before_verdict`	PROPOSED (v1.32.0)
CCPA-020	`contract_compliance_per_turn`	PROPOSED (v1.32.0)

V1_ prefix (Phase 6 infrastructure gates, live in aprender)

ID	Name	Status
V1_001	`qwen3_moe_serve_dispatch_v1`	ACTIVE_RUNTIME
V1_002	`qwen3_moe_sampling_v1`	ACTIVE_RUNTIME
V1_003	`qwen3_moe_streaming_sse_v1`	DISCHARGED (gx10 Blackwell)
V1_004	`phase_6_bench_non_zero_student_pass_rate`	OPEN

Status legend

PROPOSED — defined, not yet algorithmically asserted
ACTIVE_ALGORITHM_LEVEL — algorithmically asserted, no measured discharge
ACTIVE_RUNTIME — algorithmically asserted AND measured discharge on file
DISCHARGED — empirical claim fully met; gate preserved for historical record but no longer fires
HARD-BLOCKING — CI exit-1 on failure (subset of ACTIVE_RUNTIME)
ADVISORY — emits warning, doesn't exit-1 (intentional after M230)

Academic basis

CCPA's design draws on several lines of prior work. Each is cited where its idea informs a specific gate or technique.

Distillation framing

Hinton et al., 1503.02531 — Distilling the Knowledge in a Neural Network

CCPA treats claude as the teacher and apr code as the student. The "knowledge" being distilled is the action stream — sequences of tool calls, not output logits. This generalizes the original logit-distillation framing to the agentic-execution setting.

Metamorphic testing of ML systems

Segura et al., 2208.08227 — METTLE: Metamorphic Testing of Deep Learning Systems

LLMORPH, 2603.23611 — Cataloged Metamorphic Relations for NLP

A metamorphic relation says: "if input X maps to output Y, then transformation T(X) should map to f(Y)." CCPA's per-tool equivalence rules are metamorphic relations specialized to action streams:

Bash(cmd) and Bash(canonical_form(cmd)) should produce equivalent file-system mutations
Write(path, content) and Edit(path, old, new) that produce the same file SHA256 are file-mutation-equivalent
etc.

The DriftCategory taxonomy maps onto Segura's metamorphic-violation severity scale.
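
The Write/Edit relation above can be sketched in a few lines. This is an illustration under stated assumptions, not the CCPA differ: the function names are hypothetical, files are modeled as in-memory strings, and std's DefaultHasher stands in for SHA256 only to keep the example dependency-free.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in digest. The contract compares SHA256; DefaultHasher is used here
/// only so the sketch needs nothing beyond std.
fn digest(content: &str) -> u64 {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

/// Write(path, content): the file becomes `content`, whatever it held before.
fn apply_write(_before: &str, content: &str) -> String {
    content.to_string()
}

/// Edit(path, old, new): replace the first occurrence of `old` with `new`.
fn apply_edit(before: &str, old: &str, new: &str) -> String {
    before.replacen(old, new, 1)
}

/// Two mutations are equivalent when the resulting file digests match.
fn file_mutation_equivalent(after_a: &str, after_b: &str) -> bool {
    digest(after_a) == digest(after_b)
}
```

A teacher that rewrites the whole file and a student that edits one token are scored as equivalent as long as the final bytes agree, which is the point of comparing outcomes rather than tool names.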

Differential testing

2207.11976 — Differential Testing of Deep Learning Frameworks

CCPA is a differential test of apr code against claude — two implementations of the same logical specification (agentic coding), measured by paired-execution divergence. The static path's compute_parity_score IS a differential-testing scoring function.

Function-scale outcome parity

MultiPL-E, 2208.08227 — Cassano et al.

evidence/phase-3/multipl-e-rust-scores.json records the M150 function-scale measurement (n=5, parity=1.0000) using the MultiPL-E-Rust HumanEval subset. The benchmark is unmodified from upstream.

Project-scale Arena

SWE-bench, 2310.06770 — Jimenez et al.

SWE-bench formalized the "can LLMs resolve real GitHub issues" measurement at project-scale. CCPA's Phase 5 corpus is hand-curated in the SWE-bench style (real GitHub-issue Rust fixtures), but smaller (n=5) for operator-coordinated dispatch cost reasons. Phase 6's under-contract regime adds the compliance-cost dimension that SWE-bench doesn't address.

Chaos engineering for LLM systems

2505.03096 — Chaos Engineering for LLM Systems

CCPA's regression-corpus design (deliberate drift, must-fail) is in the spirit of chaos engineering: introduce a known failure mode and verify the meter catches it. The M196-M224 4-bug stack is the empirical justification for this practice.

Sovereignty / data-residency

No single paper drives the sovereignty gate (CCPA-006). The design is informed by the broader privacy-engineering literature on differential-privacy boundaries and the FedRAMP / HIPAA classes of "data must not leave the trust boundary" guarantees. The Tier3 SovereigntyViolation category formalizes the boundary.

Per-gate mapping

See docs/specifications/academic-basis.md for the per-gate citation table — every gate has a paper that motivated its design or that it specializes.

Milestone history

CCPA's work is organized as a continuous sequence of M-rows (milestone-rows) tracked in docs/specifications/milestones-*.md. Each M-row is one substantive deliverable (a PR, a fixture, a finding) with its own scope and acceptance criteria.

High-level phases

Phase	M-row range	What it shipped
Phase 1 (RECORD) — out-of-scope post-M222	M0-M14	original HTTPS-proxy recording path; rescoped to subprocess-driver
Phase 2 (REPLAY)	M15-M50	trace schema, replayer, mock harness, hook+skill projection
Phase 3 (DISTILL — function-scale)	M51-M100	MultiPL-E-Rust HumanEval bench, function-scale parity measurement (n=5, 1.0000)
Phase 4 (project-scale prep)	M101-M150	fixture authoring for project-scale; differ enhancements; bidirectional sensitivity
Phase 5 (ARENA — project-scale)	M150-M234	Arena runner, calibration-and-scale corpus, first arena scores
Phase 6 (UNDER-CONTRACT)	M250-M294	compliance-enforced dispatch, V1_004 chain, Coder-finetune-distribution finding

Notable M-rows

M9 — regression corpus added (bidirectional sensitivity)
M15 — schema v2 (hook_event + skill_invocation)
M16 — FALSIFY-CCPA-007 hard-blocking corpus coverage gate
M150 — first measured function-scale parity (n=5, 1.0000)
M194-M210 — Arena runner Phase 5 P5.1-P5.5
M222 — RECORD path out-of-scope directive (rescope to subprocess-driver only)
M230 — FALSIFY-CCPA-008 flipped to ADVISORY after M196-M224 four-bug-stack revealed meter under-sensitivity
M234 — Popperian-falsification of static-fixture as project-scale predictor (claude 1/5, apr code 0/5)
M236 — FALSIFY-CCPA-019 (calibration_required_before_verdict) introduced
M280 — Phase 6 CCPA project SUSPENSION declared (1.5B model below testability floor)
M286 — M32d MoE KV cache shipped (19× speedup; unblocks V1_004)
M287 — greedy baseline pattern; uniform driver_error on 30B-Coder
M291 — sub-bench B pattern shift; driver_error → oracle_failed_after_max_turns
M292 — ArenaOutcome::AgentTextLoop detector (Gap 3 closure)
M293 — PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring
M294 — finetune-distribution A/B; non-Coder Qwen3-30B-A3B-Instruct-2507 confirmed at smoke level

How M-rows are tracked

Each M-row gets a row in docs/specifications/milestones-mNNN-mMMM.md. The row body explains:

What was shipped
Why (motivation, prior M-row references)
Acceptance criteria (tests, evidence, contract entries)
Cross-references (PR numbers, evidence file paths)

A doc-drift detector (scripts/check-doc-drift.sh) asserts that the milestone counter on 5 cross-reference surfaces (README, CONTRIBUTING, top spec, status-snapshots, milestones doc) all agree.

Operator-coordinated vs autonomous M-rows

Autonomous — anything that doesn't require operator-only data (compute budget, model-class decision, contract amendment). The autonomous ship-cycle (per CLAUDE.md) ships these continuously without check-in.
Operator-coordinated — anything that needs operator-only data: dispatching benches, deciding model class, amending contract gates. The substantive→mechanical→substantive cadence pauses ONLY for these.

Glossary

Term	Definition
Action stream	The sequence of tool calls + tool results + text + hooks + skills emitted by an agent during one session. CCPA's primary unit of measurement.
`apr code`	The student. A sovereign, pure-Rust CLI coding agent (in paiml/aprender) that runs against a local GGUF model with no data leaving the machine.
`apr serve`	Inference server subprocess that `apr code` auto-spawns and talks to over HTTP. Loads the GGUF model and serves `/v1/chat/completions`.
Arena	CCPA's live-execution measurement path. Multi-turn live dispatch of real teacher + real student against test-shaped oracles.
CCPA	Claude Code Parity for `apr code`. The harness this book describes.
`claude`	The teacher. Anthropic's official CLI. Treated as the orchestrator and the action-stream baseline.
Closed enum	A Rust enum where adding a variant requires touching every match site. CCPA's `ArenaOutcome`, `DriftCategory`, `ToolInvocation` are closed enums by design — pattern-match exhaustiveness is the type system's enforcement of total handling.
Compound oracle	Phase 6 oracle: `cargo test` AND `pmat comply check --strict` both pass.
Compliance-Trap	M254 P6.3 detector. Bails the session with `ArenaOutcome::ComplianceTrap` when the same `(file, sha256)` pair fails compliance N consecutive turns. Saves token cost.
Driver	The subprocess wrapper around `claude` (teacher) or `apr code` (student). `SubprocessDriver` in `crates/ccpa-arena/`.
Drift / DriftCategory	A divergence between teacher and student traces. The closed enum (Tier0/1/2/3) categorizes severity.
Falsifier	A deterministic test that proves a gate. The gate states a falsifiable claim; the test would FAIL if the claim were wrong.
`FALSIFY-CCPA-NNN`	The unique identifier of a gate. Each ID maps to one entry in the contract YAML and one (or more) tests in the crates.
Fixture	A canonical input — typically `meta.toml` + (trace pairs OR cwd-tree + prompt + oracle). Lives in `fixtures/<corpus>/<id>/`.
Greedy	Sampling at `temperature=0`: always take the argmax of the next-token distribution. Deterministic, but prone to repetition loops such as the M287 "Human:" runaway.
M-row	One milestone in the project's continuous-ship cadence. Numbered M0, M1, ..., M294, ...
MoE	Mixture-of-Experts. A neural-architecture pattern where only a fraction of total parameters are "active" per token. Qwen3-Coder-30B-A3B is 30B total / 3B active.
Oracle	The test-shaped acceptance check for a fixture. Phase 5: `cargo test 2>&1
`pmat comply`	The paiml quality-posture meter. A multi-pass static analyzer with org-wide rules (allowed-unwrap, complexity caps, lint rules, doc coverage).
`pv`	The contract validator. Binary from `aprender-contracts-cli`. Asserts contract YAML correctness, pin correctness, gate cross-reference correctness. Dogfooded; bash re-implementations rejected.
`pv validate`	The `pv` subcommand that hard-asserts the contract YAML schema. CI-gated via `FALSIFY-CCPA-012`.
`pin.lock`	The pin from this repo to the canonical aprender contract YAML. Records sha256 + commit reference. Pin-check is part of `FALSIFY-CCPA-012`.
PROPOSED / ACTIVE_ALGORITHM_LEVEL / ACTIVE_RUNTIME	The three statuses of a gate. See Status flow.
Recovery rate	Fraction of OraclePassed fixtures where the agent recovered from at least one non-zero bash exit. Phase 5 metric.
Sovereignty / Tier3	The hardest gate class. A `Tier3 SovereigntyViolation` means the agent did something that breaches data residency / network sovereignty (egress, credential read, foreign API).
Sub-bench	A focused dispatch of the Phase 6 bench script with specific knob settings (e.g., sub-bench A = few-shot prompt only, sub-bench B = full 3-knob config).
Tool call / `<tool_call>` block	A JSON object inside a `<tool_call>...</tool_call>` XML-like wrapping. `apr code`'s parser extracts these from the model's response and dispatches the named tool.
Turn	One round of (assistant-emits-response, tool dispatched, result observed). The session loop runs up to `max_turns` of these.
V1_NNN	Phase 6 infrastructure gate prefix. Lives in aprender's contracts (distinct from CCPA-NNN).
Wall budget / wall_timeout	The wall-clock seconds budget for one session. Phase 5 default 900s; Phase 6 default 3600s. `WallTimeout` is the outcome when exceeded.