CCPA — The Claude Code Parity Harness

A record-replay-distill harness measuring apr code against Claude Code at the action-stream level.

This book is the reference companion to the claude-code-parity-apr repository. It explains the methodology, the falsifier gates, the empirical findings, and the architectural decisions that shape every measurement.

Why this exists

A sovereign, locally-hosted coding agent (apr code) needs an honest, falsifiable yardstick to measure itself against the industry baseline (Claude Code). Without a rigorous yardstick:

  • "It works" claims drift from "it works like the reference"
  • Regressions hide behind narrative
  • The compliance posture of code an agent emits has no contract gate

CCPA closes that gap with three commitments:

  1. Contract-first. Every behavior gate (FALSIFY-CCPA-001..020) is encoded as a falsifiable assertion in a YAML contract before code lands. Tests prove the gate; pv validate proves the contract; pmat comply proves the project's compliance posture. No code ships without a contract.

  2. Two complementary measurement paths. A static path — authored teacher/student trace pairs scored by a deterministic differ — validates the meter. An Arena path — multi-turn live dispatches of real claude + real apr code against real Rust fixtures with test-shaped oracles — validates the system. The two paths cross-falsify each other.

  3. Empirical calibration. Every Arena verdict requires a fresh bidirectional-sensitivity calibration on file (FALSIFY-CCPA-019). Static-fixture parity is calibrated against project-scale Arena reality; any drift between them is recorded and explained.

Honest framing

At function-scale (single-prompt code generation on HumanEval-style fixtures), claude and apr code are functionally interchangeable — both pass each other's tests (1.0000 parity, n=5, M150).

At project-scale (multi-turn Arena with real GitHub-issue fixtures), the static-fixture approach is Popperian-falsified as a project-scale predictor: claude solves 1/5 and apr code solves 0/5 on the phase-5 corpus (M234). The direction agrees with the static verdict; the magnitudes diverge.

The empirical chain in this book — M1 → M294 — is the honest record of what we measured, when, and how confident we are. Negative results are evidence; this book treats them as such.

Status as of writing

  • Contract v1.32.0 — 20 gates registered (16 ACTIVE_RUNTIME, 4 PROPOSED)
  • M0 → M294 all SHIPPED
  • Phase 6 under-contract dispatch in active operator-coordinated bench cycles against Qwen3-30B-A3B-Instruct-2507
  • V1_004 (Phase 6 non-zero student pass rate) is the open gate

How to read this book

License

Apache-2.0 OR MIT. See the repository root.

What is CCPA?

CCPA — the Claude Code Parity for apr code harness — is a measurement system. It does one job: produce a falsifiable, contract-gated parity score between two AI coding agents.

  • Teacher (the reference): Claude Code — Anthropic's official CLI, treated as the orchestrator and the action-stream baseline.
  • Student (the sovereign system under test): apr code — a locally-hosted, pure-Rust coding agent that runs against a local GGUF model with no data leaving the machine.

What "parity" means here

Parity is not "the two systems produce identical bytes." Parity is action-stream semantic equivalence under a per-tool rule set.

For each pair of trace records — teacher and student — the differ asks:

  • Did they invoke the same logical tool? (Bash↔Bash, Write↔Write, etc.)
  • Did the tool inputs differ in ways that matter? (commands semantically equivalent? file paths normalized? content byte-equal or text-equivalent?)
  • Did the resulting file-system mutations agree? (hash-checked)
  • Did the OS-event trace agree, modulo allowed nondeterminism?

A parity score in [0.0, 1.0] plus a closed enum of DriftCategory for any mismatch is the output. The score and category are mechanically asserted by FALSIFY-CCPA-004 through FALSIFY-CCPA-008.

What CCPA is NOT

  • Not a benchmark suite for general LLMs. The corpus is curated for the apr code↔claude parity question. SWE-bench, HumanEval, and similar exist for general benchmarking.
  • Not a record-from-API tool. The original HTTPS-proxy recording path is intentionally out of scope post-M222 directive. claude is driven as a subprocess via session-based auth (claude login); CCPA does not use ANTHROPIC_API_KEY and does not call the Anthropic API directly.
  • Not a unit-test framework for claude. It's a parity harness — the meter between two systems.

Three deliverables, one repository

| Deliverable | What it is | Where it lives |
| --- | --- | --- |
| The differ | ccpa-differ crate + ccpa diff / ccpa corpus CLI | crates/ccpa-differ/ |
| The Arena runner | ccpa-arena crate + ccpa-arena-bench binary | crates/ccpa-arena/ |
| The fixtures | Canonical, regression, project-scale, calibration-and-scale, under-contract | fixtures/ |

All three are governed by one contract YAML — see Methodology.

Methodology — contract-first + falsifier-driven

CCPA is governed by a single methodology, applied uniformly: every behavior gate is an assertion in a YAML contract; the assertion exists before the code that proves it; CI mechanically validates both.

The cycle

1. Behavior identified              →  written prose
2. Falsifier composed               →  "this is exactly the assertion that would
                                       prove the gate WRONG if it failed"
3. Contract entry added             →  contracts/claude-code-parity-apr-v1.yaml
                                       (status: PROPOSED at first)
4. pv validate the contract         →  syntax + schema gate
5. Test that exercises the falsifier→  crates/ccpa-{differ,arena,...}/tests/
                                       (links the gate ID by name)
6. CI hard-blocks                   →  status flips ACTIVE_ALGORITHM_LEVEL
                                       once the test passes deterministically
7. Empirical evidence on file       →  flips ACTIVE_RUNTIME once a real
                                       measured discharge is recorded

No step is optional. No step happens in a different order. The cycle is enforced by FALSIFY-CCPA-012 (pre-commit + CI pv validate) and FALSIFY-CCPA-007 (corpus coverage).

Status flow for any gate

PROPOSED  ──── algorithm-level test passes deterministically ────→  ACTIVE_ALGORITHM_LEVEL
                                                                              │
                                                              measured discharge on file
                                                                              ▼
                                                                       ACTIVE_RUNTIME
  • PROPOSED: defined in the YAML, not yet asserted by a passing test.
  • ACTIVE_ALGORITHM_LEVEL: a deterministic test asserts the gate, but no real-world measurement has been recorded yet.
  • ACTIVE_RUNTIME: a real measured bench run (operator-dispatched, evidence captured) discharged the gate.

See Status flow for the exhaustive transition table.
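
The two forward transitions in the diagram can be sketched as a small state machine. This is an illustration only: the names GateStatus, Evidence, and advance are hypothetical and not taken from the repository, and the exhaustive transition table may allow moves that this sketch rejects.

```rust
/// Gate lifecycle as drawn in the status flow diagram.
/// Type and function names are illustrative, not the repository's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateStatus {
    Proposed,
    ActiveAlgorithmLevel,
    ActiveRuntime,
}

/// Evidence that can move a gate forward by exactly one step.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Evidence {
    /// An algorithm-level test passes deterministically.
    DeterministicTestPasses,
    /// A real measured discharge is on file.
    MeasuredDischargeOnFile,
}

/// Returns the next status, or an error when the evidence does not match
/// the single forward transition allowed from the current status.
pub fn advance(status: GateStatus, evidence: Evidence) -> Result<GateStatus, String> {
    match (status, evidence) {
        (GateStatus::Proposed, Evidence::DeterministicTestPasses) => {
            Ok(GateStatus::ActiveAlgorithmLevel)
        }
        (GateStatus::ActiveAlgorithmLevel, Evidence::MeasuredDischargeOnFile) => {
            Ok(GateStatus::ActiveRuntime)
        }
        (from, with) => Err(format!("no transition from {from:?} with {with:?}")),
    }
}
```

Encoding the flow as a match over (status, evidence) pairs makes skipping a step, such as PROPOSED straight to ACTIVE_RUNTIME, an explicit error rather than a silent promotion.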

Three sources of truth

| Concern | Lives in | Why |
| --- | --- | --- |
| Contract YAML | paiml/aprender/contracts/claude-code-parity-apr-v1.yaml (canonical), pinned here via contracts/pin.lock | aprender is the org-wide single-source-of-truth for paiml contracts |
| Spec text | docs/specifications/claude-code-parity-apr-poc.md | This repo since M1 |
| Implementation, fixtures, CI, coverage, pmat-comply | this repo | The harness IS the implementation |

The split mirrors aprender's monorepo policy: aprender stays canonical for contract TEXT (the shared schema across all paiml contracts), while this repo is canonical for runtime ENFORCEMENT (the tests, fixtures, CI, and pmat comply posture).

Forbidden tools

  • cargo tarpaulin — slow, unreliable. Use cargo llvm-cov only.
  • bash re-implementations of pv / pmat / cargo-llvm-cov checks — if pv validate rejects a contract, fix the contract or extend aprender-contracts/src/schema/; do not duplicate validation logic in bash.

Code search policy

Prefer pmat query over grep for any Rust code search. pmat query returns quality-annotated, semantically ranked results (TDG grades, complexity, fault patterns), whereas grep / rg returns only matching lines.

grep is acceptable only for non-Rust files (TOML, YAML, Markdown) or quick one-off debugging.

The two measurement paths

CCPA's parity score is the output of two complementary measurement paths that cross-falsify each other.

Path 1 — Static (the meter)

fixtures/canonical/<id>/teacher.ccpa-trace.jsonl  ◄── AUTHORED
                                ▲
                                │  per-tool equivalence rules
                                │  + hook + skill projections
                                ▼
fixtures/canonical/<id>/student.ccpa-trace.jsonl  ◄── AUTHORED
                        │
                        ▼
            ccpa-differ::compute_parity_score
                        │
                        ▼
                    ParityReport
                  { score, drifts[] }
  • What it validates: the meter. Does the differ recognize equivalent actions? Does it catch the kinds of drift we care about? Does it ignore the noise we choose to ignore?
  • How it's wired: 30 canonical fixtures + a regression corpus (bidirectional sensitivity proof, M9) + per-PR CI hard-blocker (FALSIFY-CCPA-007 since M16).
  • What it cannot do: tell you whether apr code actually solves real tasks. Trace pairs are AUTHORED; they prove the differ logic, not the real-world capability gap.

Path 2 — Arena (the system)

fixtures/project-scale/<id>/{prompt.txt, cwd-tree/}
                        │
                        ▼
       Arena runner: live claude + live apr code
        (multi-turn, max_turns=20, wall=900s default)
                        │
                        ▼
            per-fixture oracle (cargo test 2>&1 | grep "test result: ok")
                        │
                        ▼
                    ArenaOutcome
            { OraclePassed | OracleFailedAfterMaxTurns
              | WallTimeout | DriverError | ComplianceFailed
              | ComplianceTrap | AgentTextLoop (M292) }
                        │
                        ▼
              evidence/phase-{5,6}/arena-scores.json
  • What it validates: the system. Does apr code solve real Rust bugs the way claude does?
  • How it's wired: multi-turn live subprocess dispatch. Operator-coordinated (requires claude login + a local GGUF model + GPU/CPU compute budget). Phase 5 (M194-M210) shipped the project-scale corpus; Phase 6 (M250+) adds the under-contract dispatch (per-turn pmat comply check --strict to measure compliance cost).
  • What it cannot do: tell you that the differ logic is right. Arena measures end-to-end behavior, not action-stream equivalence.

Why both?

Each path has a different failure mode that the other catches:

  • Static path alone would let apr code "pass" by producing traces that look like claude's but cover none of the real-world capability surface. A perfect 1.0 parity score on a curated corpus means nothing if apr code can't solve a real bug.
  • Arena path alone would let apr code "pass" by producing solutions that happen to work but via wildly different action sequences (e.g., a single 5000-line file_write vs. claude's careful read-edit-test loop). Outcome parity ≠ action parity; both matter.

FALSIFY-CCPA-019 (calibration_required_before_verdict) and FALSIFY-CCPA-016 (outcome_parity_bound) jointly enforce that the two paths' verdicts must agree, or the disagreement must be calibrated and explained.

When the paths disagree — the Popperian discipline

The M234 finding (phase-5 results) was a clean Popperian-falsification of the static-fixture approach as a project-scale predictor:

  • Static path: 1.0000 parity on canonical corpus (n=30, M150-M161)
  • Arena path: claude 1/5, apr code 0/5 on phase-5 project-scale corpus (M234)

Direction agrees (claude > apr code), but magnitude diverges: the Arena oracle pass rates are 0.20 (1/5) and 0.00 (0/5), against a static parity of 1.0000. The static result over-predicts at project scale. The finding is recorded in docs/specifications/completeness-assessment.md, and the Arena scores are the ground truth for project-scale claims.

Architecture at a glance

Workspace layout

claude-code-parity-apr/
├── contracts/                 # pin.lock + smoke YAML; canonical YAML lives in aprender
├── crates/
│   ├── ccpa-trace/            # JSONL trace schema, types, validators
│   ├── ccpa-differ/           # per-tool equivalence rules, parity score
│   ├── ccpa-recorder/         # stream-json parser (claude side)
│   ├── ccpa-subproc/          # subprocess driver (deterministic stdout/stderr capture)
│   ├── ccpa-replayer/         # mock harness for replay determinism
│   ├── ccpa-arena/            # multi-turn live runner + bench binary
│   └── ccpa-cli/              # `ccpa` user-facing binary
├── docs/specifications/       # 25 spec files (all <500 LOC, doc-drift gated)
├── evidence/                  # per-phase measured-output snapshots
├── fixtures/                  # canonical, regression, project-scale, calibration-and-scale, under-contract
└── scripts/                   # bench dispatch + drift detectors

Crate dependency graph

                       ccpa-cli
                          │
            ┌─────────────┼─────────────┐
            ▼             ▼             ▼
       ccpa-differ    ccpa-arena   ccpa-recorder
            │             │             │
            └─────────────┼─────────────┘
                          ▼
                     ccpa-trace
                          │
                          ▼
                     ccpa-subproc

ccpa-trace is the schema root: every crate consumes its Trace, Record, ToolUse, and ToolResult types. A new trace record kind is added here first; the schema bump then propagates to every dependent crate through compile-time type checks.

How ccpa diff produces a parity score

  1. Load both JSONL files via ccpa-trace::parse::parse_file. The parser hard-enforces schema v2 (hook_event + skill_invocation records added at M15).
  2. Pair records by index. Lengths must match exactly (a record-count imbalance is a hard error; see the tool_call_equivalence falsifier).
  3. Project hook events and skill invocations onto their target tool record (M15 hook/skill semantics).
  4. Match each paired record under its per-tool equivalence rule:
    • Bash: command tokenization + whitelist of allowed nondeterminism
    • Write/Edit: post-state file SHA256 must agree
    • Read: path + range + content excerpt
    • Skill: invocation site + arguments
    • Hook: trigger + target tool's invocation
  5. Score: count matches, divide by total. Score ∈ [0.0, 1.0].
  6. Categorize drifts: any mismatch is classified into a closed DriftCategory enum. Tier 0 = no drift; Tier 1 = cosmetic; Tier 2 = semantic; Tier 3 = sovereignty violation (see crates/ccpa-differ/src/sovereignty.rs).
  7. Report: ParityReport { score, drifts[] } — JSON-serializable, the unit of measurement.
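
Steps 2, 5, and 6 can be sketched as a generic function. This is a minimal sketch under stated assumptions, not the differ's actual code: PairVerdict is a simplified stand-in for DriftCategory, score_by_index is a hypothetical name, the per-tool rule is passed in by the caller, and rejecting empty traces is an assumption the text does not state.

```rust
/// Outcome of comparing one teacher/student record pair.
/// A simplified stand-in for the real DriftCategory enum.
#[derive(Debug, Clone, PartialEq)]
pub enum PairVerdict {
    Match,
    Drift(String),
}

/// Pairs records by index and scores them with a caller-supplied rule.
/// Returns an error when the traces differ in length, mirroring the
/// "lengths must match exactly" rule in step 2.
pub fn score_by_index<R>(
    teacher: &[R],
    student: &[R],
    rule: impl Fn(&R, &R) -> PairVerdict,
) -> Result<(f64, Vec<PairVerdict>), String> {
    if teacher.len() != student.len() {
        return Err(format!(
            "record count mismatch: teacher={} student={}",
            teacher.len(),
            student.len()
        ));
    }
    if teacher.is_empty() {
        // Assumption: an empty pair of traces is rejected rather than scored.
        return Err("empty traces".to_string());
    }
    // Step 4: apply the rule to each index-aligned pair.
    let verdicts: Vec<PairVerdict> = teacher
        .iter()
        .zip(student.iter())
        .map(|(t, s)| rule(t, s))
        .collect();
    // Step 5: matches divided by total pairs.
    let matches = verdicts.iter().filter(|v| **v == PairVerdict::Match).count();
    // Step 6: keep only the mismatches for the report.
    let drifts: Vec<PairVerdict> = verdicts
        .into_iter()
        .filter(|v| *v != PairVerdict::Match)
        .collect();
    Ok((matches as f64 / teacher.len() as f64, drifts))
}
```

Checking the length before any pair is compared keeps a truncated student trace from scoring well on the prefix it happens to share with the teacher.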

How ccpa-arena-bench runs a fixture

1. Copy fixture's cwd-tree to /tmp/p6-uc-<fixture>-<side>.<rand>
2. Read prompt.txt
3. Launch driver subprocess:
     - teacher: claude --output-format=stream-json --print "<prompt>"
     - student: apr code --model=<path> -p "<prompt>" + apr serve auto-spawned
4. Multi-turn loop (max_turns=20 default; wall=900s Phase 5 default, 3600s Phase 6 default):
   a. Render history into prompt suffix
   b. driver.next_turn(prompt + history) → NextTurn { blocks, stop_reason }
   c. Extract first ToolUse block → dispatch in fixture cwd
   d. Append TurnRecord to history
   e. Every K turns (K = oracle_check_interval; default 5 in Phase 5, 3 in Phase 6) OR on EndTurn:
      - Run oracle: cargo test 2>&1 | grep "test result: ok"
      - Pass → return OraclePassed
   f. Phase 6 only: if compliance_enforced, per-Write/Edit run pmat comply check
   g. Trap detectors: ComplianceTrap (N consecutive same-(file,sha) failures),
      AgentTextLoop (N consecutive text-only turns, M292, opt-in)
5. On max_turns / wall / driver_error / compliance_trap → return the appropriate ArenaOutcome
6. Emit BenchResult JSON to evidence/<phase>/captures/<fixture>/<side>.bench.json

The cleanly-typed outcome enum lets aggregate scoring (recovery_rate, oracle_passed_rate, compliance_cost_ratio) pattern-match without parsing strings.
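
As a rough illustration of that pattern-matching, the sketch below lists the outcome variants named in this book without their payload fields and computes a pass rate over them. The exact definition of oracle_passed_rate is an assumption here (passes divided by fixtures run), as is returning None for an empty run.

```rust
/// Outcome variants as named in the Arena diagram. Payload fields are
/// omitted to keep the sketch short.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArenaOutcome {
    OraclePassed,
    OracleFailedAfterMaxTurns,
    WallTimeout,
    DriverError,
    ComplianceFailed,
    ComplianceTrap,
    AgentTextLoop,
}

/// Fraction of fixtures whose oracle passed; None for an empty run.
/// Assumed definition: passes / fixtures run.
pub fn oracle_passed_rate(outcomes: &[ArenaOutcome]) -> Option<f64> {
    if outcomes.is_empty() {
        return None;
    }
    let passed = outcomes
        .iter()
        .filter(|o| matches!(o, ArenaOutcome::OraclePassed))
        .count();
    Some(passed as f64 / outcomes.len() as f64)
}
```

Because the enum is closed, adding a new outcome forces every exhaustive match in the scoring code to be revisited at compile time.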

Two binaries, one config space

  • ccpa — user-facing CLI for the static path (diff, corpus, coverage, validate)
  • ccpa-arena-bench — Arena dispatcher (operator-coordinated)

Both consume the same Trace/ArenaOutcome types and emit the same JSON shapes downstream tools depend on.

Trace schema

The trace schema is the language CCPA speaks. Everything — the differ, the Arena runner, the replayer — operates on Trace objects: a sequence of Record types each describing one observable action.

The 7 record kinds (schema v2)

| Kind | Fields | When emitted |
| --- | --- | --- |
| session_start | session_id, cwd, git_commit | First record of every trace |
| user_prompt | text, attachments[] | User-initiated turn |
| assistant_turn | text, blocks[], stop_reason | Model response |
| tool_result | tool_use_id, content, is_error | Tool execution result |
| session_end | reason | Last record (clean shutdown or interrupt) |
| hook_event | hook_name, trigger, tool_use_id? | Hook fired (schema v2, M15) |
| skill_invocation | skill_name, args | Skill invoked (schema v2, M15) |

assistant_turn.blocks[] is a polymorphic array — each block is one of:

  • Text { text } — model output text
  • ToolUse { id, name, input } — a tool call (Bash, Read, Write, Edit, Glob, Grep, Shell, ...)
  • Thinking { text } — extended thinking (claude-only; optional)

The Rust types are mirrored in crates/ccpa-trace/src/lib.rs; the JSON-schema is in contracts/claude-code-parity-apr-v1.yaml § trace_schema.
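
A reduced sketch of the block shapes follows, to show how a consumer might pick out the first tool call in a turn. It is not the ccpa-trace definition: the input field is kept as a raw JSON string for brevity, and first_tool_use is a hypothetical helper name.

```rust
/// Content blocks of an assistant turn, following the three block shapes
/// listed above. `input` is kept as a raw JSON string in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Block {
    Text { text: String },
    ToolUse { id: String, name: String, input: String },
    Thinking { text: String },
}

/// Returns the (id, name) of the first tool call in a turn, if any.
pub fn first_tool_use(blocks: &[Block]) -> Option<(&str, &str)> {
    blocks.iter().find_map(|b| match b {
        Block::ToolUse { id, name, .. } => Some((id.as_str(), name.as_str())),
        _ => None,
    })
}
```

Applied to the assistant_turn shown in the JSONL sample in this chapter, the helper would yield the id tu_1 and the tool name Read.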

File format — JSONL (one record per line)

{"kind":"session_start","session_id":"abc-123","cwd":"/tmp/fixture-0001","git_commit":"deadbeef"}
{"kind":"user_prompt","text":"Fix the failing test."}
{"kind":"assistant_turn","blocks":[{"type":"text","text":"I'll start by reading the file."},{"type":"tool_use","id":"tu_1","name":"Read","input":{"path":"src/lib.rs"}}],"stop_reason":"tool_use"}
{"kind":"tool_result","tool_use_id":"tu_1","content":"<file contents>","is_error":false}
...
{"kind":"session_end","reason":"end_turn"}

JSONL means line-oriented, append-only, streamable. The parser at ccpa-trace::parse::parse_file is O(n) and emits structured errors with line numbers.
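
The single-pass, line-numbered error behaviour can be illustrated without a JSON library. The sketch below is deliberately crude and is not the real parser: it finds the kind field by plain string search (assuming no whitespace around the colon), skips blank lines (an assumption), and the names LineError, kind_of, and scan_kinds are hypothetical.

```rust
/// The seven record kinds of schema v2.
const KINDS: [&str; 7] = [
    "session_start", "user_prompt", "assistant_turn", "tool_result",
    "session_end", "hook_event", "skill_invocation",
];

/// A parse failure tagged with the 1-based line it occurred on.
#[derive(Debug, PartialEq)]
pub struct LineError {
    pub line: usize,
    pub message: String,
}

/// Extracts the value of a `"kind":"..."` field by plain string search.
/// A stand-in for real JSON parsing; assumes no whitespace around the colon.
fn kind_of(line: &str) -> Option<&str> {
    let marker = "\"kind\":\"";
    let start = line.find(marker)? + marker.len();
    let end = line[start..].find('"')? + start;
    Some(&line[start..end])
}

/// One pass over the input: one record per line, blank lines skipped.
pub fn scan_kinds(input: &str) -> Result<Vec<&str>, LineError> {
    let mut kinds = Vec::new();
    for (idx, line) in input.lines().enumerate() {
        if line.trim().is_empty() {
            continue;
        }
        let kind = kind_of(line).ok_or_else(|| LineError {
            line: idx + 1,
            message: "missing \"kind\" field".to_string(),
        })?;
        if !KINDS.contains(&kind) {
            return Err(LineError {
                line: idx + 1,
                message: format!("unknown kind {kind:?}"),
            });
        }
        kinds.push(kind);
    }
    Ok(kinds)
}
```

Each line is visited once and the first failure aborts the scan, which is what gives the O(n) bound and a precise line number in the error.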

Roundtrip falsifier — FALSIFY-CCPA-001

Every record kind has a roundtrip test: serialize → parse → re-serialize → compare. If any field is lossy or any field re-orders, the roundtrip falsifier catches it.

The gate is pinned by 17 tests in crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs.

Schema versioning

  • v1 (M0-M14): 5 record kinds (session_start, user_prompt, assistant_turn, tool_result, session_end).
  • v2 (M15+): adds hook_event and skill_invocation. The differ's hook/skill projection rules require these.

Schema bumps follow the Methodology cycle — contract YAML first, then tests, then code.

The differ

ccpa-differ is the heart of the static path. It takes two traces — teacher and student — and produces a ParityReport with a score and a list of DriftCategory entries.

Entry point — compute_parity_score

use ccpa_differ::{compute_parity_score, ParityReport};
use ccpa_trace::Trace;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse both traces; the parser enforces the trace schema.
    let teacher: Trace = ccpa_trace::parse_file("teacher.ccpa-trace.jsonl")?;
    let student: Trace = ccpa_trace::parse_file("student.ccpa-trace.jsonl")?;

    // Pair, match, and score the two action streams.
    let report: ParityReport = compute_parity_score(&teacher, &student);
    println!("score = {}, drifts = {}", report.score, report.drifts.len());
    Ok(())
}

Per-tool equivalence rules

The differ's behavior is dispatched on ToolUse.name:

| Tool | Rule |
| --- | --- |
| Bash / Shell | Tokenize command; whitelist allowed nondeterminism (mktemp -p paths, ISO-8601 timestamps, PID); compare token sequences |
| Read | Path equal (after canonicalization) + range overlap; content excerpt SHA256 equal |
| Write | Path equal; post-state file SHA256 equal (the file mutation IS the equivalence claim) |
| Edit | Path equal; old/new strings equal; post-state file SHA256 equal |
| Glob | Pattern equal; result-count equal modulo cwd; result-paths SHA256-equal |
| Grep | Pattern equal; flag equivalence; result line-count equal |
| Hook | Trigger equal; target tool's invocation equal |
| Skill | Name equal; args structurally equal |

Each rule is one Rust function in crates/ccpa-differ/src/; adding a tool requires (1) the rule, (2) a falsifier test, (3) a contract YAML entry.
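
To make the shape of one such rule concrete, here is a hedged sketch of a Bash rule that handles a single whitelist entry, ISO-8601 timestamps. It is not the function in crates/ccpa-differ/src/: the names normalize and bash_equivalent and the <TIMESTAMP> placeholder are hypothetical, and whitespace splitting ignores shell quoting.

```rust
/// Replaces a token that is allowed to vary between runs with a placeholder.
/// Only one illustrative case is handled: a token that starts with an
/// ISO-8601 date (YYYY-MM-DD). The placeholder string is an assumption.
fn normalize(token: &str) -> String {
    let b = token.as_bytes();
    let is_date = b.len() >= 10
        && b[..4].iter().all(u8::is_ascii_digit)
        && b[4] == b'-'
        && b[5..7].iter().all(u8::is_ascii_digit)
        && b[7] == b'-'
        && b[8..10].iter().all(u8::is_ascii_digit);
    if is_date {
        "<TIMESTAMP>".to_string()
    } else {
        token.to_string()
    }
}

/// Two commands are equivalent when their normalized token sequences are
/// equal. Whitespace splitting is a simplification that ignores quoting.
pub fn bash_equivalent(teacher: &str, student: &str) -> bool {
    let t: Vec<String> = teacher.split_whitespace().map(normalize).collect();
    let s: Vec<String> = student.split_whitespace().map(normalize).collect();
    t == s
}
```

Comparing full token sequences, rather than a prefix or a set, is what keeps a command with an extra flag such as --release from being treated as equivalent to the bare command.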

DriftCategory — the closed enum

pub enum DriftCategory {
    Tier0NoDrift,
    Tier1Cosmetic { detail: String },        // whitespace, timestamp jitter
    Tier2Semantic { detail: String },        // different file content, different command
    Tier3SovereigntyViolation { detail: String },  // network egress, foreign-API call
}

Tier3 is the hardest gate. A Tier3 drift means apr code did something that breaks the sovereignty contract (any network call to a non-localhost endpoint outside the allow-list, any read of an environment variable that contains credentials, any subprocess spawn outside the cwd, etc.). Even one Tier3 drift hard-fails CI.
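
One way to express "even one Tier3 drift hard-fails" is to look for a Tier3 entry before any score is computed. The sketch below assumes one DriftCategory per record pair; the Verdict type and verdict function are hypothetical names, and scoring an empty input as 0.0 is an assumption.

```rust
/// Mirror of the closed enum shown above, repeated so the sketch compiles
/// on its own.
#[derive(Debug, Clone, PartialEq)]
pub enum DriftCategory {
    Tier0NoDrift,
    Tier1Cosmetic { detail: String },
    Tier2Semantic { detail: String },
    Tier3SovereigntyViolation { detail: String },
}

/// Hypothetical gate result.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Pass { score: f64 },
    BelowThreshold { score: f64 },
    SovereigntyViolation { detail: String },
}

/// Takes one category per record pair. Tier3 is checked before any score is
/// computed, so a violation is never reported as merely a low score.
pub fn verdict(pairs: &[DriftCategory], threshold: f64) -> Verdict {
    let violation = pairs.iter().find_map(|d| match d {
        DriftCategory::Tier3SovereigntyViolation { detail } => Some(detail.clone()),
        _ => None,
    });
    if let Some(detail) = violation {
        return Verdict::SovereigntyViolation { detail };
    }
    let matches = pairs
        .iter()
        .filter(|d| **d == DriftCategory::Tier0NoDrift)
        .count();
    // Assumption: an empty input scores 0.0 rather than panicking.
    let score = if pairs.is_empty() {
        0.0
    } else {
        matches as f64 / pairs.len() as f64
    };
    if score >= threshold {
        Verdict::Pass { score }
    } else {
        Verdict::BelowThreshold { score }
    }
}
```

With this ordering a Tier3 drift fails the gate even if the threshold were set to 0.0.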

How the score is computed

total_pairs = teacher.records.len()                  # must equal student.records.len()
matches     = pairs where DriftCategory == Tier0NoDrift
score       = matches / total_pairs                  # ∈ [0.0, 1.0]

The threshold for FALSIFY-CCPA-008 (parity_score_bound) is configured in the contract YAML; current canonical-corpus threshold is ≥ 0.95 (with 30 fixtures, this means at most 1 fixture can have any drift).
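
The "at most 1 fixture" figure can be checked with a few lines of arithmetic. The sketch assumes the aggregate is the fraction of drift-free fixtures over all fixtures, which is one reading of the sentence above; max_drifting_fixtures is a hypothetical helper name.

```rust
/// Largest number of drifting fixtures that still keeps the fraction of
/// drift-free fixtures at or above `threshold`.
/// Assumes the aggregate is "drift-free fixtures / total fixtures".
pub fn max_drifting_fixtures(total: u32, threshold: f64) -> u32 {
    (0..=total)
        .take_while(|drifting| f64::from(total - drifting) / f64::from(total) >= threshold)
        .last()
        .unwrap_or(0)
}
```

For 30 fixtures and a 0.95 threshold, one drifting fixture leaves 29/30 ≈ 0.967, which passes, while two leave 28/30 ≈ 0.933, which does not.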

Corpus driver — ccpa corpus

ccpa corpus fixtures/canonical/                 # walks every fixture, computes per-fixture + aggregate score
ccpa corpus fixtures/regression/                # MUST FAIL (bidirectional sensitivity proof)
ccpa corpus fixtures/canonical/ --json          # machine-readable for CI

Aggregate scoring respects FALSIFY-CCPA-007 (corpus coverage): every required-row of the apr-code-parity-v1.yaml parity matrix must have at least one fixture exercising it. Missing coverage → exit 2 with a structured error pointing at the gap.

What the differ does NOT do

  • Does not run code. It reads two traces; that's it. The Arena runner is for live execution.
  • Does not infer intent. "Same effect, different tool" is not equivalence under CCPA. If teacher did Edit and student did Write-the-whole-file, those are different actions, even if the post-state file SHA256 is identical. The contract gates the action stream, not just the file system.
  • Does not allow nondeterminism by default. Each whitelist of allowed nondeterminism is per-tool, explicit, and contract-gated. Adding a new whitelist entry requires a contract bump.

Fixtures

CCPA has five distinct fixture corpora, each measuring a different thing.

1. fixtures/canonical/ — the meter

  • 30 fixtures, every required-row of apr-code-parity-v1.yaml exercised at least once.
  • AUTHORED teacher/student trace pairs.
  • MUST score ≥ threshold in ccpa corpus. Per-PR CI hard-blocker via FALSIFY-CCPA-007.
  • Aggregate parity = 1.0000 at canonical corpus (M150, fixtures/canonical/measured-parity.json).

2. fixtures/regression/ — bidirectional sensitivity proof

  • Fixtures with deliberate drift — teacher and student diverge in known ways.
  • MUST FAIL ccpa corpus. If a regression fixture passes, the differ has lost sensitivity to that drift class.
  • Catches "the meter agrees on everything" bugs (M9 introduced this corpus).

3. fixtures/project-scale/ — Phase 5 Arena corpus

  • 5 real GitHub-issue Rust fixtures with full cwd-tree/, prompt.txt, oracle.
  • Each fixture is a real Rust bug or feature request that an agent must solve in a multi-turn session.
  • M234 finding: claude 1/5, apr code 0/5. Direction agrees with static verdict; magnitudes diverge.

4. fixtures/calibration-and-scale/ — synthetic-deterministic project-scale

  • 15 hand-authored Rust bug fixtures.
  • Deterministic seed; reproducible from clean clone.
  • Bridges the static path (controlled) and project-scale Arena (real-world) via a controlled Arena-style measurement.

5. fixtures/under-contract/ — Phase 6 corpus

  • 20 fixtures across 4 classes: leetcode, oo (OO patterns), transpile (format converters), unix (CLI utilities).
  • Each runs under the Phase 6 compound oracle: cargo test AND pmat comply check --strict.
  • The corpus that V1_004 dispatches against.

Fixture file layout

fixtures/canonical/0001-edit-readme/
├── meta.toml                       # fixture id, covers[], description
├── teacher.ccpa-trace.jsonl        # AUTHORED teacher action stream
└── student.ccpa-trace.jsonl        # AUTHORED student action stream
fixtures/under-contract/leetcode/01-two-sum/
├── prompt.txt                      # the task description shown to both agents
├── meta.toml                       # oracle_cmd, expected_pattern
└── cwd-tree/
    ├── Cargo.toml
    ├── src/lib.rs                  # the buggy code
    └── tests/...

Adding a fixture

mkdir fixtures/canonical/00XX-my-scenario

cat > fixtures/canonical/00XX-my-scenario/meta.toml <<EOF
[fixture]
id = "00XX-my-scenario"
covers = ["builtin-tools-rwegs"]
description = "What this fixture exercises and why."
EOF

# Author teacher.ccpa-trace.jsonl + student.ccpa-trace.jsonl

ccpa corpus fixtures/canonical/                            # MUST exit 0
ccpa coverage --apr-code-parity-yaml ... --oos-rows ...    # MUST exit 0
make tier3                                                 # full local gate sweep

Coverage gates fail if a fixture is added without a covers[] claim or if covers[] contains a row not in apr-code-parity-v1.yaml. The contract YAML drives fixture validation, not the other way around.
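
The three coverage failures described here (an empty covers[] claim, an unknown row, and a required row with no fixture) can be sketched with set operations. This is not the ccpa coverage implementation: FixtureMeta and check_coverage are hypothetical names, and the sketch treats every known row as required, which ignores any out-of-scope rows.

```rust
use std::collections::BTreeSet;

/// One fixture's coverage claim, as it might be read from meta.toml.
pub struct FixtureMeta {
    pub id: String,
    pub covers: Vec<String>,
}

/// Returns every coverage problem found, or Ok(()) when the corpus is covered.
/// `required_rows` stands in for the required rows of the parity matrix.
pub fn check_coverage(
    fixtures: &[FixtureMeta],
    required_rows: &[&str],
) -> Result<(), Vec<String>> {
    let required: BTreeSet<&str> = required_rows.iter().copied().collect();
    let mut covered: BTreeSet<&str> = BTreeSet::new();
    let mut problems = Vec::new();

    for fixture in fixtures {
        if fixture.covers.is_empty() {
            problems.push(format!("{}: no covers[] claim", fixture.id));
        }
        for row in &fixture.covers {
            if required.contains(row.as_str()) {
                covered.insert(row.as_str());
            } else {
                problems.push(format!("{}: unknown row {row}", fixture.id));
            }
        }
    }
    // Any required row that no fixture claimed is a coverage gap.
    for row in required.difference(&covered) {
        problems.push(format!("row {row} has no fixture"));
    }
    if problems.is_empty() { Ok(()) } else { Err(problems) }
}
```

Collecting all problems instead of stopping at the first one gives the author of a new fixture a complete list to fix in one pass.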

Bidirectional sensitivity

A parity meter has two failure modes:

  1. False positive — declaring drift when traces are actually equivalent. Caught by the canonical corpus (fixtures/canonical/ MUST PASS).
  2. False negative — declaring equivalence when traces actually diverge. Caught by the regression corpus (fixtures/regression/ MUST FAIL).

A meter that passes only the canonical corpus is not validated. It may be passing everything trivially. The regression corpus is the falsifier for the differ itself.

What "bidirectional" means here

The differ must be sensitive in both directions:

                   teacher == student (equivalent)
                              │
                              ▼
                       parity_score == 1.0
                              │
                       (canonical corpus
                        proves this direction)


                   teacher != student (deliberate drift)
                              │
                              ▼
                       parity_score < threshold
                              │
                       (regression corpus
                        proves this direction)

If either direction breaks, the meter is broken. The regression corpus exists because in M9 we caught a class of drift the differ wasn't sensitive to — the canonical corpus passed, but a known-bad pair also passed. That's a Tier 2 meter bug. Bidirectional sensitivity is the falsifier for it.
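
Both directions can be checked with one function over already-scored fixtures. The sketch is illustrative only: Scored and sensitivity_failures are hypothetical names, and it assumes a regression fixture "fails" when its score is below the same threshold the canonical corpus must meet.

```rust
/// Result of scoring one fixture: its id and parity score.
pub struct Scored<'a> {
    pub id: &'a str,
    pub score: f64,
}

/// Returns the fixture ids that break the meter in each direction:
/// canonical fixtures scoring below the threshold (false positives) and
/// regression fixtures scoring at or above it (false negatives).
pub fn sensitivity_failures<'a>(
    canonical: &[Scored<'a>],
    regression: &[Scored<'a>],
    threshold: f64,
) -> (Vec<&'a str>, Vec<&'a str>) {
    let false_positives: Vec<&'a str> = canonical
        .iter()
        .filter(|f| f.score < threshold)
        .map(|f| f.id)
        .collect();
    let false_negatives: Vec<&'a str> = regression
        .iter()
        .filter(|f| f.score >= threshold)
        .map(|f| f.id)
        .collect();
    (false_positives, false_negatives)
}
```

The meter is validated only when both returned lists are empty; a non-empty second list is the case the canonical corpus alone can never reveal.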

The M196-M224 bug stack

Through M196-M224 the team encountered four meter bugs in a row, each caught only by bidirectional sensitivity:

  1. Bash command tokenization: cargo test --release and cargo test tokenized identically (the regression fixture for this case exposed it).
  2. Glob result-set hashing: glob.results[] was being compared as a set, not a sequence, allowing reordered results to slip through.
  3. Hook trigger projection: PreToolUse and PostToolUse hooks were collapsing onto the same target.
  4. Sovereignty check ordering: Tier3 detection ran AFTER score computation, so a sovereignty violation could silently lower the score below threshold without being categorically flagged.

Each was caught by a regression fixture; the canonical corpus alone did not expose it. The four-bug stack is the empirical justification for FALSIFY-CCPA-019 (calibration_required_before_verdict): every Arena verdict requires a fresh bidirectional sensitivity record on file.

The calibration contract — FALSIFY-CCPA-019

Shipped at M236. Codifies the M196-M224 lesson as a permanent gate:

no Arena verdict ships without a CalibrationRecord stamped within the last 90 days

The CalibrationRecord JSON shape lives in crates/ccpa-differ/src/calibration.rs. Each record contains: (a) canonical-corpus passes, (b) regression-corpus fails, (c) Tier3 sovereignty exercises, (d) cross-tool equivalence spot-checks. A stale record fails CI on the next Arena dispatch.
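
The staleness rule itself reduces to a timestamp comparison. The following sketch uses only the standard library and does not reflect the real CalibrationRecord type: calibration_is_fresh is a hypothetical name, and treating exactly 90 days as fresh and a future stamp as invalid are both assumptions.

```rust
use std::time::{Duration, SystemTime};

/// Maximum age of a calibration record: 90 days, per the rule above.
const MAX_AGE: Duration = Duration::from_secs(90 * 24 * 60 * 60);

/// Returns true when a record stamped at `stamped_at` is still usable at `now`.
/// A stamp in the future is treated as invalid (an assumption of this sketch).
pub fn calibration_is_fresh(stamped_at: SystemTime, now: SystemTime) -> bool {
    match now.duration_since(stamped_at) {
        Ok(age) => age <= MAX_AGE,
        Err(_) => false,
    }
}
```

Passing now in as a parameter, rather than reading the clock inside the function, keeps the check deterministic under test.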

This is the only FALSIFY-CCPA- gate that fires on a measured artifact (a JSON file with a timestamp), not on a code-level test. It's the closest thing CCPA has to a runtime-only contract — and it's there for a hard-earned reason.

Arena runner overview

The Arena is CCPA's live-execution path. It dispatches real claude and real apr code subprocesses against real Rust bugs in real cwd-trees, and scores each via a test-shaped oracle.

The Arena loop (per fixture, per side)

1. Copy fixture's cwd-tree to /tmp/p6-uc-<fixture>-<side>.<rand>
2. Read prompt.txt
3. Launch driver subprocess via SubprocessDriver:
     teacher: claude --output-format=stream-json --print "<prompt>"
     student: apr code --model=<path> -p "<prompt>"  (apr serve auto-spawned)
4. Multi-turn ArenaSession::run loop:
   for turn in 1..=max_turns:
     a. Check wall-clock budget
     b. Render history into prompt suffix:
          "<prompt>\n\n<rendered_history>### Continue:\n"
     c. driver.next_turn(prompt) → NextTurn { blocks, stop_reason }
     d. Extract first ToolUse block from blocks:
          some → dispatch the tool in cwd, record ToolResult
          none → record ToolInvocation::Text
     e. Phase 6 only: ComplianceTrap detector observes ToolResult::FileMutated
     f. M292: AgentTextLoop detector observes ToolInvocation::Text
     g. Append TurnRecord to history
     h. Every oracle_check_interval turns OR on StopReason::EndTurn:
          run_oracle_compound → OracleOutcome { Passed | FailedDueToCompliance | NonZeroExit | ExitZeroNoPatternMatch }
          Passed → return ArenaOutcome::OraclePassed
          FailedDueToCompliance (Phase 6) → return ArenaOutcome::ComplianceFailed
   end for
5. Loop exit → ArenaOutcome::OracleFailedAfterMaxTurns
6. Wall-time exit → ArenaOutcome::WallTimeout
7. Driver error → ArenaOutcome::DriverError { reason, turns_before_error }
8. Compliance trap → ArenaOutcome::ComplianceTrap { file, last_reason, consecutive_count }
9. Text loop (M292) → ArenaOutcome::AgentTextLoop { consecutive_text_turns, last_text_excerpt }
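
The text-loop detector in step f is essentially a consecutive-run counter. A minimal sketch, assuming the counter resets on any turn that contains a tool call and that a limit of 0 disables detection; TextLoopDetector and observe are hypothetical names.

```rust
/// Counts consecutive text-only turns and trips once the limit is reached.
/// A limit of 0 disables the detector.
pub struct TextLoopDetector {
    limit: u32,
    consecutive: u32,
}

impl TextLoopDetector {
    pub fn new(limit: u32) -> Self {
        Self { limit, consecutive: 0 }
    }

    /// Feed one turn; `is_text_only` is true when no ToolUse block was found.
    /// Returns Some(count) when the run should stop with a text-loop outcome.
    pub fn observe(&mut self, is_text_only: bool) -> Option<u32> {
        if !is_text_only {
            // Any tool call breaks the run of text-only turns.
            self.consecutive = 0;
            return None;
        }
        self.consecutive += 1;
        if self.limit != 0 && self.consecutive >= self.limit {
            Some(self.consecutive)
        } else {
            None
        }
    }
}
```

The ComplianceTrap detector could follow the same shape, keyed on a repeated (file, sha) pair instead of a text-only flag.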

Default knobs

| Knob | Default | Set by |
| --- | --- | --- |
| max_turns | 20 | PHASE6_MAX_TURNS env / --max-turns flag |
| max_wall_seconds | 900 (phase 5) / 3600 (phase 6) | PHASE6_WALL_SECONDS / --wall-seconds |
| oracle_check_interval | 5 (phase 5) / 3 (phase 6) | PHASE6_ORACLE_INTERVAL / --oracle-check-interval |
| compliance_enforced | false (phase 5) / true (phase 6) | PHASE6_COMPLIANCE_ENFORCED / --compliance-enforced |
| max_consecutive_compliance_failures | 3 | PHASE6_MAX_CONSECUTIVE_COMPLIANCE_FAILURES |
| max_consecutive_text_turns (M292) | 0 (disabled) | PHASE6_MAX_CONSECUTIVE_TEXT_TURNS |
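
The per-phase defaults can be captured in a small config struct. This is a sketch, not the runner's configuration type: Phase, ArenaKnobs, defaults, and with_max_turns are hypothetical names, and ignoring an unparseable override is an assumption, since the text does not say how bad values are handled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Five,
    Six,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ArenaKnobs {
    pub max_turns: u32,
    pub max_wall_seconds: u64,
    pub oracle_check_interval: u32,
    pub compliance_enforced: bool,
    pub max_consecutive_compliance_failures: u32,
    pub max_consecutive_text_turns: u32,
}

impl ArenaKnobs {
    /// Defaults as listed in the knob table.
    pub fn defaults(phase: Phase) -> Self {
        let six = phase == Phase::Six;
        Self {
            max_turns: 20,
            max_wall_seconds: if six { 3600 } else { 900 },
            oracle_check_interval: if six { 3 } else { 5 },
            compliance_enforced: six,
            max_consecutive_compliance_failures: 3,
            max_consecutive_text_turns: 0, // 0 means disabled
        }
    }

    /// Applies one override given as a raw string, such as an env var value.
    /// Unparseable values leave the default in place (an assumption).
    pub fn with_max_turns(mut self, raw: Option<&str>) -> Self {
        if let Some(n) = raw.and_then(|s| s.trim().parse::<u32>().ok()) {
            self.max_turns = n;
        }
        self
    }
}
```

A caller might feed with_max_turns the value of std::env::var("PHASE6_MAX_TURNS").ok().as_deref(); how the env var and the --max-turns flag are prioritized against each other is not specified here.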

Two binaries

  • ccpa-arena-bench (in crates/ccpa-arena/src/bin/) — one-fixture dispatcher. Reads prompt, oracle config, driver config from flags; emits BenchResult JSON.
  • scripts/phase-{5,6}-bench.sh — corpus-walker that calls ccpa-arena-bench per-fixture, aggregates per-side scores.json.

The shell script handles model pre-warming, evidence directory layout, and per-fixture orchestration; the Rust binary handles the loop.

Phase 5 — project-scale Arena

Phase 5 (M194-M210) was the first Arena dispatch against real GitHub-issue Rust fixtures. It produced the Popperian-falsification result that established project-scale measurement as the ground truth.

Corpus

fixtures/project-scale/ — 5 real Rust bug fixtures hand-curated from GitHub issues:

  • Each fixture has a cwd-tree/ (a snapshot of the repo at the buggy commit), a prompt.txt (the issue text or a derived task), and a test-shaped oracle (cargo test + an expected pattern).
  • Fixtures span error-handling, async edge cases, FFI boundaries, lifetime issues, and macro-related bugs.

Headline result (M234)

| Side | Oracle pass | Recovery (one bash-fail then pass) | Recovery rate |
| --- | --- | --- | --- |
| claude (teacher) | 1/5 | 1 | 1.00 (1 of 1 passes had recovery) |
| apr code (student) | 0/5 | 0 | undefined (0/0) |

apr code's 0/5 was uniform OracleFailedAfterMaxTurns — the agent engaged but couldn't solve the bugs within the 20-turn / 900s budget.

What M234 falsified

The static-fixture parity score of 1.0000 on the canonical corpus (fixtures/canonical/, n=30, M150) does NOT predict project-scale Arena performance. The two systems are functionally interchangeable on single-prompt code generation (HumanEval-class) but diverge on multi-turn project-scale work.

Per the Popperian discipline, this is a clean falsification, not a contradiction. Both measurements are valid; they measure different things. The static path measures the meter; the Arena path measures the system.

docs/specifications/completeness-assessment.md is the honest record of this. The README's "honest framing" paragraph quotes the same finding.

Why the Arena bench is operator-coordinated

A full Arena run consumes:

  • claude API costs (one paid claude --print invocation per turn × up to 20 turns × 5 fixtures × 2 dispatches per measurement)
  • Local GPU/CPU compute for apr code's apr serve (GGUF model loaded into VRAM/RAM)
  • A claude login session that must not be reused across machines or breached by intermediate proxies

These costs are externalized — CI dispatches static-path tests only. Arena dispatches are operator-dispatched, evidence-captured, and stamped into evidence/phase-5/arena-scores.json. This is contract-gated by FALSIFY-CCPA-019 (calibration_required_before_verdict).

Sub-deliverables (P5.1-P5.5)

  • P5.1 (M194-M196) — ArenaSession scaffolding type
  • P5.2 (M197-M210) — multi-turn loop body, tool dispatch, oracle integration, MockDriver for tests
  • P5.3 (M211-M222) — corpus walker (ccpa-arena-bench), aggregate scoring, recovery_rate
  • P5.4 (M223-M228) — bidirectional sensitivity calibration + the M196-M224 4-bug stack closure
  • P5.5 (M229-M234) — first end-to-end Arena dispatch + scores.json + Popperian-falsification finding

Phase 6 — under-contract dispatch

Phase 6 (M250+) extends the Arena to measure not just "did the agent solve the bug?" but "did the agent solve the bug in a compliance-respecting way?"

What "under contract" means

In Phase 5, the only oracle is cargo test. An agent can pass that oracle while emitting code that violates pmat comply check --strict (the project's quality posture: complexity caps, lint rules, allowed-unwrap policy, etc.).

In Phase 6, the oracle is compound:

oracle_passed iff (cargo_test_exit_code == 0
                   AND grep "test result: ok" in test output
                   AND pmat comply check --strict exit_code == 0)

pmat comply runs at the end of the session AND after every Write / Edit if --compliance-enforced is set (per-turn compliance gating).
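
The compound oracle can be written as a pure function over the three observations. This is a sketch under stated assumptions: the function name `evaluate_compound_oracle` and its parameter list are hypothetical, and passing `None` for the compliance exit code to model the Phase 5 oracle (no compliance check) is an illustrative choice. The variant names follow the OracleOutcome listing earlier in this chapter.

```rust
/// Variant names follow the OracleOutcome listing in the loop description.
#[derive(Debug, Clone, PartialEq)]
pub enum OracleOutcome {
    Passed,
    FailedDueToCompliance,
    NonZeroExit,
    ExitZeroNoPatternMatch,
}

/// Hypothetical pure evaluator for the compound oracle.
/// `comply_exit_code` is `None` when compliance is not part of the oracle (Phase 5).
pub fn evaluate_compound_oracle(
    cargo_test_exit_code: i32,
    test_output: &str,
    expected_pattern: &str,
    comply_exit_code: Option<i32>,
) -> OracleOutcome {
    // Conjunct 1: cargo test must exit 0.
    if cargo_test_exit_code != 0 {
        return OracleOutcome::NonZeroExit;
    }
    // Conjunct 2: the expected pattern must appear in the test output.
    if !test_output.contains(expected_pattern) {
        return OracleOutcome::ExitZeroNoPatternMatch;
    }
    // Conjunct 3 (Phase 6 only): the compliance check must also exit 0.
    match comply_exit_code {
        Some(code) if code != 0 => OracleOutcome::FailedDueToCompliance,
        _ => OracleOutcome::Passed,
    }
}
```

Checking the test exit code first means a compliance failure is only reported when the tests themselves passed, which matches the description of ComplianceFailed as "cargo test passed, but compliance rejected".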

The four Phase-6-specific outcomes

| Outcome | When |
| --- | --- |
| ComplianceFailed { check, turn } | Cargo test passed, but final-state compliance check rejected. Distinct from OracleFailedAfterMaxTurns. |
| ComplianceTrap { file, last_reason, consecutive_count } | Same (file, sha256) failed compliance N turns in a row (default 3). Saves token cost. |
| AgentTextLoop { consecutive_text_turns, last_text_excerpt } (M292) | N consecutive text-only turns (no tool_call). Opt-in via PHASE6_MAX_CONSECUTIVE_TEXT_TURNS > 0. |
| OraclePassed (Phase 6 sense) | BOTH cargo test AND pmat comply check --strict pass. |
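
The ComplianceTrap condition (the same (file, sha256) pair failing compliance N turns in a row) can be sketched as a small streak counter. The `ComplianceTrapDetector` type and its methods are hypothetical names for illustration; whether the real detector resets on a passing check is an assumption.

```rust
/// Hypothetical tracker for consecutive compliance failures on one (file, sha256) pair.
#[derive(Debug, Default)]
pub struct ComplianceTrapDetector {
    last_key: Option<(String, String)>,
    consecutive: u32,
    threshold: u32,
}

impl ComplianceTrapDetector {
    /// `threshold` corresponds to max_consecutive_compliance_failures (default 3).
    pub fn new(threshold: u32) -> Self {
        ComplianceTrapDetector { last_key: None, consecutive: 0, threshold }
    }

    /// Record a failing compliance check.
    /// Returns `Some(consecutive_count)` once the streak reaches the threshold.
    pub fn record_failure(&mut self, file: &str, sha256: &str) -> Option<u32> {
        let key = (file.to_string(), sha256.to_string());
        if self.last_key.as_ref() == Some(&key) {
            self.consecutive += 1;
        } else {
            // A different file or different content starts a new streak.
            self.last_key = Some(key);
            self.consecutive = 1;
        }
        if self.threshold > 0 && self.consecutive >= self.threshold {
            Some(self.consecutive)
        } else {
            None
        }
    }

    /// Assumption: a passing check breaks the streak.
    pub fn record_pass(&mut self) {
        self.last_key = None;
        self.consecutive = 0;
    }
}
```

Keying on the content hash as well as the path is what separates "the agent keeps writing the identical violating file" from "the agent is iterating on the file and still failing".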

The V1 falsifiers added at Phase 6

| ID | Name | Status | Asserted by |
| --- | --- | --- | --- |
| V1_001 | qwen3_moe_serve_dispatch_v1 | ACTIVE_RUNTIME | aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs |
| V1_002 | qwen3_moe_sampling_v1 | ACTIVE_RUNTIME | sampling integration tests |
| V1_003 | qwen3_moe_streaming_sse_v1 | DISCHARGED on gx10 Blackwell | streaming SSE test + evidence |
| V1_004 | phase_6_bench_non_zero_student_pass_rate | open | per-fixture student_pass_rate > 0 |

Current state of V1_004

V1_004 is the OPEN gate. The bar: "ANY single Phase 6 fixture passes the compound oracle on the student side."

The M286-M294 chain has shipped 6 aprender PRs + 4 CCPA PRs working toward V1_004 discharge:

  • M286 — M32d MoE KV cache (19× speedup; the load-bearing inference infrastructure)
  • M287 — greedy baseline establishes the uniform driver_error pattern (model entered an infinite "Human:" loop)
  • M288-M290 — diagnosed 3 root causes; shipped sampling (temperature/top_k/top_p), repetition penalty, EOS stop_token, clean_chat_output, few-shot CODE_SYSTEM_PROMPT
  • M291 — sub-bench B on Qwen3-Coder-30B-A3B with all fixes: pattern shifted from driver_error to oracle_failed_after_max_turns with tool_use_count: 0
  • M292 — ArenaOutcome::AgentTextLoop detector + opt-in cap (Gap 3 closure)
  • M293 — PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring
  • M294 — scope doc for the non-Coder Qwen3-30B-A3B-Instruct-2507 A/B

See The V1_004 chain for the empirical narrative.

Phase 6 corpus — fixtures/under-contract/

20 fixtures across 4 classes:

  • leetcode (5) — algorithmic bugs: two_sum, valid-parentheses, longest-common-prefix, merge-sorted-arrays, binary-search
  • oo (5) — object-oriented Rust patterns: bank-account, library-borrowing, shape-hierarchy, observer-pattern, builder-pattern
  • transpile (5) — format converters: json-to-toml, csv-to-jsonl, markdown-to-html, ini-to-yaml, regex-to-glob
  • unix (5) — CLI utility reimplementations: wc, head, tail, cut, sort

Each fixture's meta.toml includes oracle_cmd = "cargo test 2>&1" and expected_pattern = "test result: ok". The compound oracle adds pmat comply check --strict on the post-mutation tree.

Outcome variants

ArenaOutcome is the closed enum capturing every way an Arena session can end. It's the unit aggregate scoring pattern-matches on.

The full enum (post-M292)

#[serde(tag = "kind", rename_all = "snake_case")]
pub enum ArenaOutcome {
    OraclePassed                  { turns: u32, wall_seconds: u64 },
    OracleFailedAfterMaxTurns     { turns: u32, partial_pass_rate: Option<f64> },
    WallTimeout                   { turns_at_timeout: u32, max_wall_seconds: u64 },
    DriverError                   { reason: String, turns_before_error: u32 },
    ComplianceFailed              { check: ComplianceCheck, turn: u32 },
    ComplianceTrap                { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop                 { consecutive_text_turns: u32, last_text_excerpt: String },
}

Decision matrix

| Outcome | Means | What aggregate score should treat it as |
| --- | --- | --- |
| OraclePassed | Agent fully solved the fixture. (Phase 6: AND compliance passed.) | oracle_passed = true |
| OracleFailedAfterMaxTurns | Agent engaged, but didn't solve within 20 turns. | oracle_passed = false |
| WallTimeout | Agent ran out of wall-clock budget mid-session. | oracle_passed = false |
| DriverError | Driver subprocess crashed / hung / lost connection. | oracle_passed = false, count as infrastructure failure |
| ComplianceFailed (Phase 6) | cargo test passed, pmat comply check rejected. | oracle_passed = false, count toward compliance_cost_ratio numerator |
| ComplianceTrap (Phase 6) | Same (file, sha256) failed N consecutive turns. | oracle_passed = false, count toward token-cost-avoidance |
| AgentTextLoop (M292, opt-in) | N consecutive text-only turns (no tool_call). | oracle_passed = false, agent didn't engage |

Why this many variants

Each variant captures a distinct failure mode that the team has empirically observed and decided is worth distinguishing. Conflating them loses signal:

  • OracleFailedAfterMaxTurns says "the agent worked but produced wrong output." Diagnostic action: inspect history for off-by-one fixes, missing edge cases.
  • WallTimeout says "the agent worked too slowly." Diagnostic action: check inference tok/s, max_tokens cap, network latency.
  • DriverError says "the infrastructure broke." Diagnostic action: check apr serve crash logs, network, ports, GPU OOM.
  • ComplianceTrap says "the agent is stuck making the same violating edit." Diagnostic action: check whether the agent has the compliance rules in context.
  • AgentTextLoop says "the agent talked but didn't act." Diagnostic action: check tool_call format adherence (this is the M291 finding signature).

Before M292, all the "talked but didn't act" cases were OracleFailedAfterMaxTurns — conflated with "did real work but wrong answer." Adding the AgentTextLoop variant let us measure the difference cleanly.
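
The distinction AgentTextLoop draws can be sketched as a streak check over per-turn kinds. This is an illustration, not the shipped detector: `TurnKind` and `detect_text_loop` are hypothetical names. The only behaviours taken from the text are that the cap counts consecutive text-only turns and that a cap of 0 disables the check.

```rust
/// Hypothetical per-turn classification.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TurnKind {
    Text,
    ToolCall,
}

/// Returns `Some(consecutive_text_turns)` at the first turn where the text-only
/// streak reaches the cap. A cap of 0 disables the detector.
pub fn detect_text_loop(turns: &[TurnKind], max_consecutive_text_turns: u32) -> Option<u32> {
    if max_consecutive_text_turns == 0 {
        return None;
    }
    let mut streak = 0u32;
    for kind in turns {
        match kind {
            TurnKind::Text => streak += 1,
            // Any tool call counts as engagement and resets the streak.
            TurnKind::ToolCall => streak = 0,
        }
        if streak >= max_consecutive_text_turns {
            return Some(streak);
        }
    }
    None
}
```

With the default cap of 0, a session of 20 text-only turns still runs to OracleFailedAfterMaxTurns; with a positive cap it would end early as AgentTextLoop.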

How aggregate scoring uses outcomes

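impl ArenaOutcome {
    /// True only for a full pass (in Phase 6: tests AND compliance).
    fn passed(&self) -> bool {
        matches!(self, Self::OraclePassed { .. })
    }

    /// True for either compliance-related failure variant.
    fn compliance_failed(&self) -> bool {
        matches!(
            self,
            Self::ComplianceFailed { .. } | Self::ComplianceTrap { .. }
        )
    }
}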

recovery_rate (Phase 5) is the fraction of OraclePassed fixtures in which the agent recovered from at least one non-zero exit. compliance_cost_ratio (Phase 6) is compliance_failed_under_contract / oracle_passed_baseline (i.e., the fraction of fixtures that pass without the contract but fail under contract).
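
Both ratios have a denominator that can be zero, which is why the M234 student row reports an undefined recovery rate. A minimal sketch, assuming a hypothetical `FixtureResult` summary struct and representing "undefined" as `None`:

```rust
/// Hypothetical per-fixture summary; field names are illustrative.
pub struct FixtureResult {
    pub oracle_passed: bool,
    pub recovered_from_nonzero_exit: bool,
}

/// Fraction of passing fixtures that recovered from at least one non-zero exit.
/// Returns `None` when no fixture passed (the 0/0 case).
pub fn recovery_rate(results: &[FixtureResult]) -> Option<f64> {
    let passed = results.iter().filter(|r| r.oracle_passed).count();
    if passed == 0 {
        return None;
    }
    let recovered = results
        .iter()
        .filter(|r| r.oracle_passed && r.recovered_from_nonzero_exit)
        .count();
    Some(recovered as f64 / passed as f64)
}

/// compliance_failed_under_contract / oracle_passed_baseline.
/// Returns `None` when the baseline has no passing fixtures.
pub fn compliance_cost_ratio(
    compliance_failed_under_contract: usize,
    oracle_passed_baseline: usize,
) -> Option<f64> {
    if oracle_passed_baseline == 0 {
        None
    } else {
        Some(compliance_failed_under_contract as f64 / oracle_passed_baseline as f64)
    }
}
```

Returning `Option<f64>` instead of 0.0 keeps "no passes to measure" distinct from "passes with no recovery", which a gate that bounds the rate from below would otherwise conflate.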

The 20 falsification gates

Every gate is encoded in contracts/claude-code-parity-apr-v1.yaml (canonical in aprender, pinned here via contracts/pin.lock). Every gate has:

  1. A FALSIFY-CCPA-NNN ID
  2. A short name
  3. A status (PROPOSED / ACTIVE_ALGORITHM_LEVEL / ACTIVE_RUNTIME)
  4. A test (or tests) that asserts the falsifier
  5. A natural-language description of what would falsify the gate

Full table — 20 gates

Source-of-truth invariants (M0+)

| ID | Name | Status | Mechanism |
| --- | --- | --- | --- |
| CCPA-009 | ci_main_branch_green | ACTIVE_RUNTIME | branch protection requires ci/gate |
| CCPA-010 | pmat_comply_100pct | ACTIVE_RUNTIME | pmat comply check: is_compliant=true ∧ 0 Fail checks |
| CCPA-011 | line_coverage_100pct | ACTIVE_RUNTIME | cargo llvm-cov: 100% functions ∧ ≥99% lines |
| CCPA-012 | pv_contract_gate_on_commit | ACTIVE_RUNTIME | pre-commit hook + CI pv validate + pin-check |

Behavioral parity gates

| ID | Name | Status | Asserted by |
| --- | --- | --- | --- |
| CCPA-001 | trace_schema_roundtrip | ACTIVE_RUNTIME | crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs (17 tests) |
| CCPA-002 | replay_determinism | ACTIVE_RUNTIME | crates/ccpa-replayer/ (16 tests) |
| CCPA-003 | mock_completeness | ACTIVE_RUNTIME | same harness |
| CCPA-004 | tool_call_equivalence | ACTIVE_RUNTIME | crates/ccpa-differ/tests/falsify_ccpa_004_tool_equivalence.rs (36 tests) |
| CCPA-005 | file_mutation_equivalence | ACTIVE_RUNTIME | crates/ccpa-differ/tests/falsify_ccpa_005_file_mutation.rs (15 tests) |
| CCPA-006 | sovereignty_on_replay | ACTIVE_RUNTIME | crates/ccpa-differ/tests/falsify_ccpa_006_sovereignty.rs (10 tests) |
| CCPA-007 | corpus_coverage | HARD-BLOCKING (M16) | tests + CI ccpa coverage --oos-rows ... |
| CCPA-008 | parity_score_bound | ADVISORY (M230) | crates/ccpa-differ/tests/falsify_ccpa_008_parity_score.rs (24 tests) |
| CCPA-013 | first_recorded_parity_score | DISCHARGED | fixtures/canonical/measured-parity.json (n=30, aggregate=1.0000) |
| CCPA-014 | os_event_parity_bound | ACTIVE_RUNTIME | crates/ccpa-differ/tests/falsify_ccpa_014_os_event_parity.rs |
| CCPA-015 | os_trace_output_purity | ACTIVE_RUNTIME | crates/ccpa-subproc/tests/falsify_ccpa_015_output_purity.rs |
| CCPA-016 | outcome_parity_bound | ACTIVE_RUNTIME | crates/ccpa-differ/tests/falsify_ccpa_016_outcome_parity.rs |
| CCPA-017 | project_scale_parity_bound | PROPOSED (v1.28.0) | crates/ccpa-differ/tests/falsify_ccpa_017_project_scale_parity.rs |
| CCPA-018 | arena_recovery_rate_bound | PROPOSED (v1.29.0) | crates/ccpa-arena/tests/falsify_ccpa_018_arena_recovery_rate.rs |
| CCPA-019 | calibration_required_before_verdict | PROPOSED (v1.32.0) | crates/ccpa-differ/tests/falsify_ccpa_019_calibration.rs |
| CCPA-020 | contract_compliance_per_turn | PROPOSED (v1.32.0) | crates/ccpa-arena/tests/falsify_ccpa_020_contract_compliance.rs |

Cross-reference per chapter

Mechanically asserted

Every gate is enforced by pv validate per CLAUDE.md § "DOGFOOD pv, NEVER bash". pv is the dogfooded contract validator (binary from aprender-contracts-cli). Re-implementing what pv already does in bash/python is muda and is rejected. If pv validate rejects a contract, fix the contract or extend aprender-contracts/src/schema/.

Source-of-truth invariants

These four gates govern the project's OWN quality posture (not the claude ↔ apr code parity). They are the meta-gates that make the rest of the gates trustable.

CCPA-009 — ci_main_branch_green

What it asserts: every commit on main was produced by a PR that had a green CI run.

How it's enforced: GitHub branch protection on main requires the ci/gate check. Direct pushes to main are blocked. Force-pushes to main are blocked. Merges require either fast-forward from a green branch OR squash from an approved + green PR.

What would falsify: a commit on main without a green CI run.

CCPA-010 — pmat_comply_100pct

What it asserts: every commit on main has pmat comply check returning is_compliant=true AND zero Fail-status checks.

How it's enforced: pmat comply check runs in CI on every PR. Any non-compliant artifact (file with disallowed unwrap, complexity > cap, lint violation, etc.) fails the job.

What would falsify: a main-branch commit where pmat comply check reports any Fail-status check.

pmat comply is the project's quality posture meter. It's not just clippy — it's a multi-pass static analyzer with custom rules for the aprender org's conventions (allowed-unwrap categories, complexity caps, doc-coverage minimums, etc.).

CCPA-011 — line_coverage_100pct

What it asserts: 100% function coverage AND ≥99% line coverage across all crates.

How it's enforced: cargo llvm-cov in CI. The threshold was refined in v0.4.0 (M120) from "100% lines" to "100% functions AND ≥99% lines" — the relaxation acknowledges unreachable error-handling branches that are mechanically uncoverable.

What would falsify: a main-branch commit where cargo llvm-cov reports any function with 0% coverage OR line coverage below 99%.

CCPA-012 — pv_contract_gate_on_commit

What it asserts: every commit on main passed pv validate against the pinned contract YAML AND the contracts/pin.lock matches the canonical aprender source.

How it's enforced: a pre-commit hook (scripts/install-pv-hook.sh, hard-installed by make install-hooks) PLUS the CI pv validate job. Both must pass before merge.

What would falsify: a main-branch commit where pv validate rejects the contract YAML OR where contracts/pin.lock's sha256 doesn't match the aprender commit's contract YAML at the pinned commit.

Why these four

These are the trust roots of the rest of the gate hierarchy. If CCPA-009 fails, any other gate could be silently broken on main without notice. If CCPA-010 fails, the project's quality posture has drifted from the org's contract. If CCPA-011 fails, untested code is on main. If CCPA-012 fails, the contract YAML and the code are out of sync.

Per CLAUDE.md, these are the gates that "no code ships without."

Behavioral parity gates

These gates govern what apr code ↔ claude parity means. Each one is a falsifiable assertion about the action-stream equivalence between the two systems.

CCPA-001 — trace_schema_roundtrip

Asserts: every trace record kind serializes → parses → re-serializes → equals the original.

Why: a lossy schema would silently drop information that downstream parity computation depends on. Catches schema-bump regressions.

Tests: 17 pin tests in crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs.

CCPA-002 — replay_determinism

Asserts: replaying a recorded trace through ccpa-replayer::MockHarness produces byte-identical output across runs.

Why: nondeterminism in the replay path would invalidate any parity claim. Catches hidden time/random/PID dependencies.

Tests: 16 tests in crates/ccpa-replayer/.

CCPA-003 — mock_completeness

Asserts: the MockHarness covers every tool kind defined in the schema.

Why: an incomplete mock means some real-world traces can't be replayed. Catches gaps when new tools are added.

CCPA-004 — tool_call_equivalence

Asserts: per-tool equivalence rules are deterministic, total functions over (teacher.input, student.input) pairs.

Why: the heart of the parity score. If the equivalence rule for Bash (say) has a bug, the score is meaningless.

Tests: 36 tests in crates/ccpa-differ/tests/falsify_ccpa_004_tool_equivalence.rs. One test per (tool, equivalence-class) pair.

CCPA-005 — file_mutation_equivalence

Asserts: a Write and an Edit that produce the same post-state file SHA256 are equivalent at the file-mutation level.

Why: enables the differ to recognize "same effect, different tool" as equivalent at the file level (separately from the action-stream level).

Tests: 15 tests in crates/ccpa-differ/tests/falsify_ccpa_005_file_mutation.rs.
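
The "same effect, different tool" idea behind CCPA-005 can be sketched by applying each mutation to the same pre-state and comparing the results. This is an illustration only: the `Mutation` enum, the single-occurrence `Edit` semantics, and the function names are assumptions, and the sketch compares post-state text directly where the gate compares SHA256 digests of the post-state file.

```rust
/// Hypothetical model of the two file-mutating tools.
#[derive(Debug, Clone)]
pub enum Mutation {
    /// Replace the whole file.
    Write { content: String },
    /// Replace the first occurrence of `old` with `new` (assumed semantics).
    Edit { old: String, new: String },
}

/// Post-state after applying `m` to `pre_state`, or `None` if an Edit cannot apply.
pub fn apply(pre_state: &str, m: &Mutation) -> Option<String> {
    match m {
        Mutation::Write { content } => Some(content.clone()),
        Mutation::Edit { old, new } => {
            if old.is_empty() || !pre_state.contains(old.as_str()) {
                return None;
            }
            Some(pre_state.replacen(old.as_str(), new, 1))
        }
    }
}

/// Two mutations are file-level equivalent when both apply and yield the same post-state.
/// Direct comparison stands in for comparing SHA256 digests of the post-state.
pub fn file_mutation_equivalent(pre_state: &str, teacher: &Mutation, student: &Mutation) -> bool {
    match (apply(pre_state, teacher), apply(pre_state, student)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}
```

For example, a teacher `Write` of the full fixed file and a student `Edit` that patches one expression are equivalent here as long as the resulting file contents match byte for byte.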

CCPA-006 — sovereignty_on_replay

Asserts: Tier3 SovereigntyViolation fires deterministically on any trace that performs a network egress to a non-localhost endpoint outside the allow-list, OR reads a credential-bearing env var.

Why: the sovereignty contract is the hardest gate. False negatives here are catastrophic.

Tests: 10 tests in crates/ccpa-differ/tests/falsify_ccpa_006_sovereignty.rs.
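
The two trigger conditions in CCPA-006 can be sketched as a predicate over trace events. Everything here is illustrative: the `TraceEvent` shape, the set of hosts treated as localhost, and the caller-supplied lists are assumptions, not the contract's actual schema.

```rust
/// Hypothetical subset of trace events relevant to sovereignty.
pub enum TraceEvent {
    NetworkEgress { host: String },
    EnvRead { var: String },
    Other,
}

/// True when an event should raise a sovereignty violation:
/// egress to a non-localhost host outside the allow-list, or a read of a
/// credential-bearing environment variable.
pub fn violates_sovereignty(
    event: &TraceEvent,
    allow_list: &[&str],
    credential_vars: &[&str],
) -> bool {
    match event {
        TraceEvent::NetworkEgress { host } => {
            // Assumption: these three spellings are treated as localhost.
            let is_local = matches!(host.as_str(), "localhost" | "127.0.0.1" | "::1");
            !is_local && !allow_list.contains(&host.as_str())
        }
        TraceEvent::EnvRead { var } => credential_vars.contains(&var.as_str()),
        TraceEvent::Other => false,
    }
}
```

Because the predicate depends only on the event and the two lists, it returns the same answer on every replay, which is the determinism property the gate asserts.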

CCPA-007 — corpus_coverage (HARD-BLOCKING since M16)

Asserts: every required-row of apr-code-parity-v1.yaml has at least one fixture exercising it.

Why: prevents the meter from being valid on a curated subset of the parity surface only. New rows in apr-code-parity-v1.yaml MUST come with a fixture.

Tests: 15 tests + per-PR CI ccpa coverage --apr-code-parity-yaml ... --oos-rows ....

CCPA-008 — parity_score_bound (ADVISORY, M230)

Asserts: canonical corpus aggregate parity score ≥ threshold (currently ≥ 0.95).

Why: the differ's output IS the parity score; this is the corpus-level acceptance bound.

Status: ADVISORY since M230 — the threshold was relaxed because the M196-M224 4-bug stack revealed that "always 1.0 on canonical" was actually evidence of meter under-sensitivity, not perfect performance.

Tests: 24 tests in crates/ccpa-differ/tests/falsify_ccpa_008_parity_score.rs.

CCPA-013 — first_recorded_parity_score (DISCHARGED)

Asserts: a first measured aggregate parity score on the canonical corpus exists, dated, with n and aggregate recorded.

Status: DISCHARGED. fixtures/canonical/measured-parity.json (n=30, aggregate=1.0000).

CCPA-014 — os_event_parity_bound

Asserts: OS-level events (file opens, process spawns, stat calls) recorded on teacher and student match, modulo an allow-list of permitted nondeterminism.

Why: catches "same tool input, different OS effects" drift.

CCPA-015 — os_trace_output_purity

Asserts: subprocess stdout/stderr captures are byte-pure (no PID injection, no timestamp jitter introduced by the capture machinery).

Why: if the capture itself adds nondeterminism, every downstream comparison is wrong.

CCPA-016 — outcome_parity_bound

Asserts: per-fixture oracle_passed outcomes agree at corpus-level rate ≥ threshold.

Why: outcome parity (did both systems solve the bug?) is the project-scale analog of action parity. Necessary for the M234 Popperian-falsification claim to be sharp.

CCPA-017 — project_scale_parity_bound (PROPOSED, v1.28.0)

Asserts: project-scale Arena verdict on phase-5 corpus must match the static-fixture verdict in direction (not magnitude).

Why: M234 showed magnitudes diverge (1.0 vs 0.0 / 0.0); direction agreement (claude > apr code) is the falsifiable part.

CCPA-018 — arena_recovery_rate_bound (PROPOSED, v1.29.0)

Asserts: apr code's recovery_rate (the fraction of OraclePassed fixtures with at least one non-zero exit recovered) is bounded below by a threshold.

Why: a 0% recovery rate signals the agent doesn't retry meaningfully; threshold gate codifies the expectation.

CCPA-019 — calibration_required_before_verdict (PROPOSED, v1.32.0)

Asserts: no Arena verdict ships without a fresh CalibrationRecord (≤90 days old) on file.

Why: codifies the lesson of the M196-M224 four-bug stack. See Bidirectional sensitivity.
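
The freshness rule in CCPA-019 reduces to an age check against a 90-day window. A minimal sketch, assuming dates are represented as day counts and that a missing or future-dated record never counts as fresh; the `CalibrationRecord` field shown here is hypothetical, not the real record layout.

```rust
/// Hypothetical minimal view of a calibration record: the day it was recorded,
/// expressed as days since an arbitrary epoch.
pub struct CalibrationRecord {
    pub recorded_on_day: u64,
}

/// Maximum age from the gate text (at most 90 days old).
pub const MAX_CALIBRATION_AGE_DAYS: u64 = 90;

/// True when a record exists and is at most 90 days old on `today_day`.
/// Assumption: a record dated in the future is rejected rather than trusted.
pub fn calibration_is_fresh(record: Option<&CalibrationRecord>, today_day: u64) -> bool {
    match record {
        Some(r) if r.recorded_on_day <= today_day => {
            today_day - r.recorded_on_day <= MAX_CALIBRATION_AGE_DAYS
        }
        _ => false,
    }
}
```

A verdict pipeline would call this before emitting any Arena score and refuse to proceed on `false`.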

CCPA-020 — contract_compliance_per_turn (PROPOSED, v1.32.0)

Asserts: in Phase 6 dispatch, per-turn pmat comply check fires after every Write/Edit; the agent SEES compliance results in next-turn history.

Why: makes the under-contract regime mechanically distinguishable from the control regime. Without this gate, "under contract" could silently degrade to "same as control."

Status flow — PROPOSED → ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME

Every gate has a status. The status reflects the strength of the evidence that the gate is correctly asserting what it claims.

The three statuses

PROPOSED

  • The gate is defined in the contract YAML.
  • No test asserts it yet (or tests exist but don't pass deterministically).
  • A grep + structural search confirms the gate has a body in the YAML, but the assertion is not yet mechanical.
  • CI may print "WARNING: gate-X is PROPOSED" but does not block on it.

ACTIVE_ALGORITHM_LEVEL

  • A deterministic, repeatable test asserts the gate.
  • The test passes on every CI run.
  • But no measured discharge has been recorded — i.e., no operator has dispatched a real bench against real systems and stamped the result into evidence/.
  • The gate is algorithm-validated but not empirically validated.

ACTIVE_RUNTIME

  • A measured discharge exists in evidence/ with a date, an n, and an aggregate score.
  • The gate is now both algorithm-validated AND empirically validated.
  • This is the highest status; gates that reach ACTIVE_RUNTIME are the project's hardest evidence.

Transition rules

                +-------------+
                |  PROPOSED   |
                +------+------+
                       |
                       |  (1) write a falsifier test
                       |  (2) test passes deterministically on CI
                       |  (3) flip status in contract YAML
                       ▼
            +-------------------------+
            | ACTIVE_ALGORITHM_LEVEL  |
            +------------+------------+
                         |
                         |  (1) operator dispatches a real bench
                         |  (2) evidence/<phase>/<artifact>.json captured
                         |  (3) calibration record on file (CCPA-019)
                         |  (4) flip status in contract YAML
                         ▼
                  +----------------+
                  | ACTIVE_RUNTIME |
                  +----------------+

Every transition is a YAML-level edit reviewed in PR, gated by pv validate, and asserted by FALSIFY-CCPA-012 (pv_contract_gate_on_commit).
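
The transition rules can be sketched as a small state machine in which each promotion is allowed only when its evidence is present. The `GateStatus` and `PromotionEvidence` types and the `next_status` function are hypothetical names for illustration; the real transitions are YAML edits reviewed in PR, not code.

```rust
/// The three statuses in the promotion flow.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateStatus {
    Proposed,
    ActiveAlgorithmLevel,
    ActiveRuntime,
}

/// Hypothetical summary of the evidence named in the transition diagram.
pub struct PromotionEvidence {
    pub falsifier_test_passes_on_ci: bool,
    pub bench_evidence_captured: bool,
    pub calibration_on_file: bool,
}

/// Next status a gate may be promoted to, or `None` if no promotion is justified.
/// Promotion is one step at a time; there is no Proposed -> ActiveRuntime jump.
pub fn next_status(current: GateStatus, ev: &PromotionEvidence) -> Option<GateStatus> {
    match current {
        GateStatus::Proposed if ev.falsifier_test_passes_on_ci => {
            Some(GateStatus::ActiveAlgorithmLevel)
        }
        GateStatus::ActiveAlgorithmLevel
            if ev.bench_evidence_captured && ev.calibration_on_file =>
        {
            Some(GateStatus::ActiveRuntime)
        }
        _ => None,
    }
}
```

Modelling promotion as a single-step function makes the ordering explicit: even with every kind of evidence on file, a PROPOSED gate first becomes ACTIVE_ALGORITHM_LEVEL.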

Status distribution at v1.32.0

| Status | Count | Gates |
| --- | --- | --- |
| ACTIVE_RUNTIME | 16 | CCPA-001..006, 008..016 (minus DISCHARGED), 009..012 |
| PROPOSED | 4 | CCPA-017, 018, 019, 020 |
| DISCHARGED | 1 | CCPA-013 (first_recorded_parity_score, M150) |

DISCHARGED is the terminal state — the gate's claim was empirically met, and the gate-as-assertion is preserved for historical record but no longer fires.

The V1_ gate prefix (Phase 6)

V1_001..V1_004 are distinct from CCPA-NNN. They live in aprender's contracts (qwen3_moe-serve-dispatch-v1.yaml et al.) and gate the infrastructure that V1_004 (Phase 6 student pass rate) depends on:

  • V1_001 — qwen3_moe serve dispatch (ACTIVE_RUNTIME)
  • V1_002 — sampling (temperature/top_k/top_p) (ACTIVE_RUNTIME)
  • V1_003 — streaming SSE (DISCHARGED on gx10 Blackwell)
  • V1_004 — Phase 6 non-zero student pass rate (open as of this writing)

Once V1_004 discharges, CCPA-017 (project_scale_parity_bound) becomes eligible to flip from PROPOSED to ACTIVE_ALGORITHM_LEVEL.

The V1_004 chain

V1_004 — "Phase 6 bench non-zero student pass rate against a Qwen3-Coder-30B-A3B-Instruct GGUF" — is the open gate. The chain of work toward discharging it has produced the most empirically interesting body of findings in CCPA's history.

This chapter is the canonical record of that chain.

The chain at a glance

| M-row | Date (2026) | What it shipped |
| --- | --- | --- |
| M280 | 05-19 | Phase 6 SUSPENSION declared (1.5B model below testability floor) |
| M286 | 05-20 | M32d MoE KV cache shipped (19× speedup on Qwen3-MoE) |
| M287 | 05-20 | Greedy baseline: uniform driver_error ("Human:" infinite loop) |
| M288 | 05-20 | Diagnosis: 3 root causes (no EOS stop_token, no clean_chat_output, no few-shot prompt) |
| M289 | 05-20 | Plumbing shipped: 3-knob HTTP wire-up (APR_AGENT_TEMPERATURE, etc.) |
| M290 | 05-20 | 5-PR snapshot: aprender#1832, #1837, #1842, #1844, #1846 all merged |
| M291 | 05-21 | sub-bench B pattern shift: driver_error → oracle_failed_after_max_turns (text-only loops, 0 tool_calls) |
| M292 | 05-21 | ArenaOutcome::AgentTextLoop detector + 7 tests (Gap 3 closure) |
| M293 | 05-21 | PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring at script level |
| M294 | 05-22 | Scope doc for non-Coder Qwen3-30B-A3B-Instruct-2507 A/B; download + smoke confirmed tool_call JSON emission |

The hypothesis-evolution narrative

Hypothesis 1 (start of chain): inference stack is the bottleneck

Premise: V1_004 can't discharge because the apr serve inference path for qwen3_moe is too slow / too broken to fit 20 turns × 1024 max_tokens within a 60min wall budget.

Test: ship M32d MoE KV cache (19× speedup), enable 3-knob sampling, add EOS stop_token and clean_chat_output post-strip.

Result: the M287 driver_error pattern (infinite "Human:" loop) was broken. Sub-bench B on Qwen3-Coder-30B-A3B shifted to a diverse outcome distribution.

Conclusion: inference stack was a necessary but not sufficient fix.

Hypothesis 2 (M291): few-shot prompt is the bottleneck

Premise: the model is now finite-output (M287 runaway broken), but it emits Markdown rust blocks instead of <tool_call> JSON. Adding 3 concrete <tool_call> few-shot examples in CODE_SYSTEM_PROMPT (#1849) should override the Markdown prior.

Test: sub-bench B with #1849's few-shot prompt + 3-knob sampling + EOS + clean_chat_output.

Result: fixture 1 of sub-bench B → oracle_failed_after_max_turns turns=20, ALL 20 turns text-only, tool_use_count: 0. The prompt fix didn't shift behavior.

Conclusion: refuted. Few-shot examples didn't override the model's training distribution.

Hypothesis 3 (M291): active-params count is the bottleneck

Premise: Qwen3-Coder-30B-A3B is 30B-total / 3B-active (MoE routing). Maybe 3B active params is below the agentic-code floor. A dense 7B (Qwen2.5-Coder-7B-Instruct) with roughly 2.3× the active params (7B vs. 3B) should fare better.

Test: 17/20 fixtures of Qwen2.5-Coder-7B-Instruct under same 3-knob config.

Result: 12× wall_timeout, 3× oracle_failed_after_max_turns, 2× driver_error, 0 oracle_passed, 0 tool_calls across all inspected fixtures. Same Markdown-block pattern.

Conclusion: refuted. Active params count isn't the variable.

Hypothesis 4 (M294, current): Qwen-Coder finetune family is the bottleneck

Premise: both tested models (Qwen3-Coder-30B-A3B and Qwen2.5-Coder-7B-Instruct) are Qwen-Coder finetunes. Maybe the Coder finetune family specifically has a sticky Markdown-block training prior. A non-Coder Instruct variant — same Qwen3-MoE architecture, same active-param count — should fare better.

Test: smoke Qwen3-30B-A3B-Instruct-2507 (non-Coder) with same CODE_SYSTEM_PROMPT + fixture 1 prompt.

Result: the model emitted {"name": "file_read", "input": {"path": "src/lib.rs"}} + </tool_call> in 20 completion tokens, finish_reason: stop. Categorically different from Coder family (which always emitted 500+ tokens of Markdown).

Conclusion: empirically confirmed at smoke level. Full bench corpus in progress as of 2026-05-22.

What this means for V1_004

V1_004's gate text names Qwen3-Coder-30B-A3B-Instruct specifically. A successful Qwen3-30B-A3B-Instruct-2507 (non-Coder) dispatch is diagnostic evidence, not a contract-level discharge of V1_004 as written.

The path forward, post-empirical-confirmation:

  • (a) Amend V1_004's gate text to allow any qwen3_moe architecture (via the M22 5-step ritual: contract bump in aprender → fixture update → coverage rerun → calibration record → CCPA-side mirror PR)
  • (b) OR propose a new gate (V1_005?) against the non-Coder variant
  • (c) OR engineer a post-decode Markdown→tool_call parser in apr code to unlock Qwen-Coder family for the existing V1_004 gate

This is an operator-coordinated decision tree. The empirical work has produced the evidence; the contract-level choice is upstream.

M286 — M32d MoE KV cache shipped

Date: 2026-05-20

aprender PR: #1832

What it shipped: forward_single_qwen3_moe_with_cache — a per-token cache-aware MoE forward path for the qwen3_moe architecture.

Why it was necessary

The original qwen3_moe inference path in apr serve was per-full-prompt: every new token required re-processing the entire context from scratch. For a 1024-token max-tokens cap on a 7-turn conversation (~3000 prompt tokens accumulated), this meant O(n²) work per turn.

Empirically: a single 20-turn fixture on Qwen3-Coder-30B-A3B at this regime took ~34min per turn on CPU. The M286 cache implementation cut it to ~6min per fixture (across all turns) — a 19× speedup.

What it changed structurally

old:  prompt → embed → 48× (attention + MoE FFN) → LM head → next_token
       (re-runs entire context every token)

new:  if first_token:
        prompt → embed → 48× (attention with cache.append + MoE FFN) → LM head → next_token
      else:
        last_token_embed → 48× (attention with cache.get_k/get_v GQA + MoE FFN) → LM head → next_token
       (only the new token is processed; cache provides past K/V)

The implementation lives in crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs. The single_cache_final_output helper (final norm + LM head) was bumped from private to pub(crate) to allow the MoE module to share it with the dense path.

Falsifiers shipped with it

  • qwen3-moe-serve-dispatch-v1 (V1_001) → ACTIVE_RUNTIME
  • moe_kv_cache_equivalence — numerical-equivalence test: cache-on vs cache-off forward passes produce identical logits modulo F32 precision
  • m32d_perf — ≥5 tok/s floor under CPU compute mode

Why this was the unlock

Without M286, V1_004 was a wall-time-cost problem (the test couldn't be run within reasonable wall-time on the operator's GPU/CPU budget). With M286, the wall-cost dropped 19×, enabling the empirical chain that followed (M287, M291, M294).

M286 is the load-bearing inference infrastructure for every Phase 6 dispatch.

M287 — greedy baseline pattern

Date: 2026-05-20

Bench wall: ~5hr (20 fixtures × ~15min each, with wall_seconds = 3600 per fixture)

Configuration

APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 \
bash scripts/phase-6-bench.sh

Greedy decoding (no temperature, no top_k, no top_p, no repetition penalty). Apr binary post-M32d but pre-3-knob plumbing.

Result

20/20 fixtures: uniform outcome=driver_error.

  • Student pass rate: 0/20 (0.00)
  • Teacher pass rate: 19/20 (0.95)
  • Recovery rate: 0.225

What the trace showed

Inspecting fixture 10's student.bench.json (oo__05-builder-pattern, 7 turns to driver_error):

turn 1 invocation:
  "Human: I need to see the full implementation..."
  Human: I need to see the full implementation...
  Human: I need to see the full implementation...
  Human: I need to see the full implementation...
  ...

The model emitted its own user-turn boundary ("Human:") repeatedly, never stopping. The text grew until the per-turn timeout (900s) fired. The driver then exited with the timeout error, which phase-6-bench.sh recorded as driver_error.

Root cause diagnosis (three independent gaps)

  1. No EOS stop_token: try_qwen3_moe_backend in apr serve didn't populate QuantizedGenerateConfig.stop_tokens with the model's <|im_end|> EOS, so the decode loop ignored the natural turn boundary.

  2. No post-decode cleanup: try_qwen3_moe_backend didn't call clean_chat_output to strip leaking "Human:" / "User:" / <|im_end|> prefixes — the runaway leaked into the captured chat response verbatim.

  3. No format adherence guidance: CODE_SYSTEM_PROMPT described the <tool_call> format but gave no concrete examples. The 30B-Coder model's training distribution favored Markdown code blocks; without explicit examples it didn't emit <tool_call> JSON.

The dense GGUF path in apr serve handled (1) and (2) correctly; the MoE chat-backend path (added later for qwen3_moe) had a gap.
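
Gaps 1 and 2 can be illustrated with two small functions: a decode loop that honours stop tokens, and a post-decode cleanup that removes leaked role markers. This is a sketch under stated assumptions, not the aprender implementation: `decode_until_stop` and `clean_chat_output_sketch` are hypothetical names, the token ids are made up, and the exact stripping rules of the real clean_chat_output are not documented here.

```rust
/// Collect generated token ids until a stop token appears or `max_tokens` is reached.
/// Without a populated stop list (Gap 1) the loop only ends at `max_tokens`.
pub fn decode_until_stop(
    generated: impl Iterator<Item = u32>,
    stop_tokens: &[u32],
    max_tokens: usize,
) -> Vec<u32> {
    let mut out = Vec::new();
    for tok in generated.take(max_tokens) {
        if stop_tokens.contains(&tok) {
            break;
        }
        out.push(tok);
    }
    out
}

/// Hypothetical cleanup in the spirit of Gap 2: strip one leaked marker at the
/// start of the string, then truncate at the first marker that appears later.
pub fn clean_chat_output_sketch(raw: &str) -> String {
    const MARKERS: [&str; 3] = ["Human:", "User:", "<|im_end|>"];
    let mut text = raw.trim_start();
    for m in MARKERS {
        if let Some(rest) = text.strip_prefix(m) {
            text = rest.trim_start();
            break;
        }
    }
    let cut = MARKERS
        .iter()
        .filter_map(|m| text.find(*m))
        .min()
        .unwrap_or(text.len());
    text[..cut].trim_end().to_string()
}
```

Applied to the M287 runaway, the stop list would end generation at the first end-of-turn token, and the cleanup would reduce a repeated "Human: ..." transcript to its first line of content.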

What M287 unlocked

The uniform driver_error pattern made the failure mode legible. Before M287, the assumption was "Qwen3-Coder-30B can't do agentic coding"; M287's evidence sharpened it to "the runaway is a fixable infrastructure issue, not a fundamental model limit."

The three gaps motivated M288-M290's 5-PR fix burst:

  • aprender#1832 — M32d KV cache (already merged)
  • aprender#1837 — qwen3-moe-sampling-v1 contract
  • aprender#1842 — sampling impl
  • aprender#1844 — repetition penalty
  • aprender#1846 — 3-knob HTTP wire-up (the operator-facing surface)
  • aprender#1849 — few-shot <tool_call> examples (Gap 3)
  • aprender#1852 — EOS stop_token + clean_chat_output (Gaps 1 + 2)
  • aprender#1853 — clean_chat_output start-of-string leading-prefix strip (M291 follow-on)

M291 — sub-bench B pattern shift

Date: 2026-05-21

Source PR: CCPA#259 (merged)

What changed from M287

| | M287 (greedy) | M291 (sub-bench B) |
|---|---|---|
| Sampling | greedy (temp=0) | temp=0.3, top_k=50, top_p=0.95 |
| Repetition penalty | none | repeat_penalty=1.2, repeat_last_n=64 |
| EOS stop_token | NOT plumbed | `< |
| clean_chat_output | NOT called in MoE path | called via #1852 |
| CODE_SYSTEM_PROMPT | no <tool_call> examples | 3 concrete examples + anti-Markdown anti-rule via #1849 |

Result on fixture 1 (leetcode__01-two-sum)

Before: outcome=driver_error turns_before_error=7 (M287 pattern).

After: outcome=oracle_failed_after_max_turns turns=20.

{
  "outcome": { "kind": "oracle_failed_after_max_turns", "turns": 20 },
  "history_len": 20,
  "tool_use_count": 0,
  "kinds": [ { "k": "text", "n": 20 } ]
}

Every one of the 20 turns: text-only. No tool_call. result.kind: "skipped" across all 20.

Trace excerpt (fixture 1, turn 1)

Human: Here's what I have so far:

```rust
pub fn two_sum(nums: &[i32], target: i32) -> (usize, usize) {
    for i in 0..nums.len() {
        for j in (i + 1)..nums.len() {
            if nums[i] + nums[j] == target {
                return (i, j);
            }
        }
    }
    panic!("No two sum solution found");
}
```

The model's **code is functionally correct** (it matches what the oracle expects: `return (i, j)`). But the fix is wrapped in a Markdown ```rust``` block, not in `<tool_call>` JSON. The arena driver therefore classifies the turn as text-only: no file edit happens, and no oracle re-run is triggered.

## Three independent gaps surfaced

### Gap 1 — `clean_chat_output` start-of-string leak

`clean_chat_output`'s stop sequences anchor on `\nHuman:` / `\n\nHuman:`, both of which require a preceding newline. When the model leaks "Human:" at start-of-string (no newline before it), the truncate-at-earliest loop misses it. Fixed in [aprender#1853](https://github.com/paiml/aprender/pull/1853).
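
The two cases can be sketched as follows. This is a hedged illustration, not the aprender source: `clean_chat_output_sketch` and `ROLE_MARKERS` are hypothetical names, and the real function's signature and marker list are not shown in this book. The sketch first strips a role marker that sits at start-of-string, then truncates at the earliest newline-anchored marker.

```rust
/// Hypothetical marker list; the real one may differ.
const ROLE_MARKERS: [&str; 2] = ["Human:", "User:"];

/// Sketch of a cleanup pass covering both leak positions.
pub fn clean_chat_output_sketch(raw: &str) -> String {
    let mut text = raw.trim_start();

    // Case 1: marker at start-of-string, with no newline in front of it.
    // A newline-anchored search can never match here, so strip it explicitly.
    for marker in ROLE_MARKERS {
        if let Some(rest) = text.strip_prefix(marker) {
            text = rest.trim_start();
            break;
        }
    }

    // Case 2: truncate at the earliest marker that follows a newline.
    // Matching "\nHuman:" also covers "\n\nHuman:" after trailing trim.
    let mut cut = text.len();
    for marker in ROLE_MARKERS {
        let anchored = format!("\n{marker}");
        if let Some(idx) = text.find(anchored.as_str()) {
            cut = cut.min(idx);
        }
    }
    text[..cut].trim_end().to_string()
}
```

Whether the real fix keeps or discards the text after a leading marker is not stated here; the sketch keeps it, on the assumption that "leading-prefix strip" means only the prefix is removed.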

### Gap 2 — few-shot prompt insufficient to override Markdown distribution

`CODE_SYSTEM_PROMPT` post-#1849 contains 3 concrete `<tool_call>` examples + an explicit "DO NOT use Markdown ```rust``` code blocks" rule. Empirically, on Qwen3-Coder-30B, this guidance is overridden by the model's training distribution. **No PR closes this; it's a model-class-dependent finding.**

### Gap 3 — arena driver doesn't recover from skipped turns

Had the model emitted `<tool_call>` in turn 1 and the file edit succeeded, fixture 1's oracle (cargo test) would have passed, because the model's code is correct. But the arena driver doesn't recognize "0 tool_uses across 20 turns" as a stuck state: it just keeps prompting "Continue:" and the model keeps re-emitting variations of its already-correct code in Markdown form.

Fixed in [CCPA#260 (M292)](https://github.com/paiml/claude-code-parity-apr/pull/260): `ArenaOutcome::AgentTextLoop` variant + opt-in detector.

## Empirical conclusion (M291)

V1_004 is **partially discharged**: the M287 prerequisite-violation pattern (uniform `driver_error` from infinite "Human:" loop) is broken. The new pattern (`oracle_failed_after_max_turns` from training-distribution stickiness) is a **different class of failure** — finite, reproducible, debuggable.

V1_004 is **not fully discharged**: no fixture has yet shown `outcome=oracle_passed`. The bench continues; fixtures 2-20 reveal whether the pattern is uniform (training-distribution-locked across all task types) or sporadic (some fixtures elicit tool_call format).

M292 — Agent-Text-Loop detector

Date: 2026-05-21

Source PR: CCPA#260 (merged)

Companion PR: CCPA#261 (M293; env-var wiring)

What it adds

A new ArenaOutcome variant + an opt-in detector that catches the M291 failure signature (consecutive text-only turns) before the full 20-turn budget is consumed.

ArenaOutcome::AgentTextLoop

AgentTextLoop {
    consecutive_text_turns: u32,
    last_text_excerpt: String,    // first 200 chars of the most recent text turn
}

Captures the "talking but not acting" failure class distinctly from OracleFailedAfterMaxTurns.

ArenaSession::with_max_consecutive_text_turns(cap)

Builder method. cap=0 (default) disables the detector — preserves M287/M291 baseline behavior. Operators opt in per-run.

AgentTextLoopState rolling counter

Parallel to ComplianceTrapState. Pure logic:

  • Text invocation → increment counter, snapshot the excerpt.
  • Non-text invocation (Bash/Read/Write/Edit/etc.) → reset counter, clear excerpt.
  • When counter reaches cap → return AgentTextLoop outcome with current excerpt.
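
The three rules above can be sketched as a small state machine. This is an illustration under assumptions, not the ccpa-arena source: TextLoopCounter, TextLoopTrip, and observe are hypothetical names, and the excerpt is cut at 200 characters without the ellipsis handling the tests mention.

```rust
/// Returned when the cap is reached (mirrors the AgentTextLoop fields).
#[derive(Debug, PartialEq)]
pub struct TextLoopTrip {
    pub consecutive_text_turns: u32,
    pub last_text_excerpt: String,
}

/// Hypothetical rolling counter; `cap == 0` disables the detector.
#[derive(Debug)]
pub struct TextLoopCounter {
    cap: u32,
    consecutive: u32,
    excerpt: String,
}

impl TextLoopCounter {
    pub fn new(cap: u32) -> Self {
        Self { cap, consecutive: 0, excerpt: String::new() }
    }

    /// Observe one turn: `Some(text)` for a text-only invocation,
    /// `None` for any tool invocation (Bash, Read, Write, Edit, ...).
    pub fn observe(&mut self, text_turn: Option<&str>) -> Option<TextLoopTrip> {
        match text_turn {
            Some(text) => {
                self.consecutive += 1;
                // Snapshot the first 200 characters of the latest text turn.
                self.excerpt = text.chars().take(200).collect();
            }
            None => {
                self.consecutive = 0;
                self.excerpt.clear();
            }
        }
        if self.cap > 0 && self.consecutive >= self.cap {
            Some(TextLoopTrip {
                consecutive_text_turns: self.consecutive,
                last_text_excerpt: self.excerpt.clone(),
            })
        } else {
            None
        }
    }
}
```

Keeping the counter free of I/O is what makes it testable in isolation, which matches the "pure logic" description above.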

Test coverage (7 new tests)

  • agent_text_loop_state_increments_on_text: counter increments, trap fires at cap
  • agent_text_loop_state_resets_on_non_text: Bash invocation resets the counter; subsequent text starts at 1
  • agent_text_loop_state_excerpt_truncates_long_text: 500-char input → excerpt ≤200 chars + ellipsis
  • run_agent_text_loop_disabled_by_default_preserves_baseline: cap=0 (default) → text-only turns run to max_turns → OracleFailedAfterMaxTurns
  • run_agent_text_loop_fires_at_cap_when_enabled: 5 text turns with cap=3 → AgentTextLoop after turn 3; history has 3 records
  • run_agent_text_loop_resets_counter_on_tool_use: 2 text + 1 bash + 2 text + 1 bash pattern → no trap (counter resets twice) → runs to max_turns
  • with_max_consecutive_text_turns_accessor_returns_configured_cap + max_consecutive_text_turns_default_is_zero_disabled

All 146 ccpa-arena lib tests still pass.

Opt-in by design

The detector defaults to cap=0 (disabled) because:

  1. Existing benches in evidence/under-contract*/ should remain comparable to new runs — turning the detector on by default would change outcome distributions for control comparisons.
  2. Future operators may want to test agents at the full 20-turn budget for non-V1_004 reasons (e.g., turn-cost ratio measurement).
  3. Phase 6 compliance_cost_ratio aggregate sums over a specific set of outcome variants; adding a new one to the default execution path could silently change the aggregate.

Operator interface (M293)

scripts/phase-6-bench.sh now reads PHASE6_MAX_CONSECUTIVE_TEXT_TURNS (default 0 = disabled). When > 0, threads --max-consecutive-text-turns=N into the ccpa-arena-bench invocation.

# Default — baseline behavior, no detector
bash scripts/phase-6-bench.sh

# Opt in — bail at 5 consecutive text-only turns
PHASE6_MAX_CONSECUTIVE_TEXT_TURNS=5 bash scripts/phase-6-bench.sh
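
The script itself is bash; the Rust sketch below only mirrors the stated rule (unset or 0 means no flag, a positive value becomes the flag). text_turn_flag is a hypothetical helper, and treating an unparseable value as "disabled" is an assumption, not documented behavior.

```rust
/// Hypothetical mirror of the env-var-to-flag rule in phase-6-bench.sh.
/// Returns the CLI flag to pass through, or None when the detector stays off.
pub fn text_turn_flag(env_value: Option<&str>) -> Option<String> {
    // Assumption: unset or non-numeric values behave like the default (0).
    let cap = env_value?.trim().parse::<u32>().ok()?;
    (cap > 0).then(|| format!("--max-consecutive-text-turns={cap}"))
}
```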

Why this matters

Before M292, the M291 failure signature ("agent emits text for all 20 turns, never invokes a tool") was conflated with OracleFailedAfterMaxTurns — same outcome variant as "agent worked but produced wrong output." That conflation lost signal.

After M292, an operator inspecting scores.json can distinguish:

  • OracleFailedAfterMaxTurns → agent tried, wrong output
  • AgentTextLoop → agent didn't engage at all

This is the kind of diagnostic precision that lets the next experiment be designed correctly (the M294 finetune-A/B was scoped specifically because M291's text-loop signature is what M292 measures).

What this does NOT do

  • Doesn't auto-enable in scripts/phase-6-bench.sh (operator decision per-run).
  • Doesn't change compliance_cost_ratio / recovery_rate semantics (AgentTextLoop counts as "not oracle_passed", same as OracleFailedAfterMaxTurns).
  • Doesn't discharge V1_004 — student_pass_rate > 0 is still the bar.
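
A minimal sketch of the second point, assuming a pass rate defined as oracle_passed over total fixtures. The Outcome enum here is a hypothetical four-variant subset, not the real ArenaOutcome; the point is only that reclassifying a failure as AgentTextLoop leaves the rate unchanged.

```rust
/// Hypothetical subset of outcome variants, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    OraclePassed,
    OracleFailedAfterMaxTurns,
    AgentTextLoop,
    DriverError,
}

/// Fraction of fixtures whose outcome is OraclePassed.
/// Every other variant, including AgentTextLoop, counts as not passed.
pub fn student_pass_rate(outcomes: &[Outcome]) -> f64 {
    if outcomes.is_empty() {
        return 0.0;
    }
    let passed = outcomes
        .iter()
        .filter(|o| matches!(o, Outcome::OraclePassed))
        .count();
    passed as f64 / outcomes.len() as f64
}
```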

M294 — finetune-distribution A/B

Date: 2026-05-22

Source PR: CCPA#262 (scope doc)

The hypothesis (refined to its sharpest form)

Through M286-M293 + the 17/20 Qwen2.5-Coder-7B-Instruct follow-on, four candidate variables were tested as the load-bearing one behind the 0%-tool_call signature:

| Variable | Test | Outcome |
|---|---|---|
| Inference stack quality | M286 KV cache + 3-knob + EOS + clean_chat_output | Necessary fix; not sufficient |
| Active params count | 3B (30B-A3B-MoE) vs 7B (dense 7B-Coder) | Both show same 0 tool_calls; refuted |
| MoE vs dense | qwen3_moe (30B-A3B) vs qwen2 (7B-dense) | Both show same pattern; refuted |
| Few-shot prompt examples | 3 concrete <tool_call> examples + anti-Markdown rule | No shift in pattern; refuted |

The remaining variable: Qwen-Coder finetune family specifically. Both tested models (Qwen3-Coder-30B-A3B + Qwen2.5-Coder-7B-Instruct) share the Coder-specific finetune.

The hypothesis being tested at M294: hold architecture, size, inference stack constant; vary only the finetune. Specifically: swap Qwen3-Coder-30B-A3B-Instruct for Qwen3-30B-A3B-Instruct-2507 (non-Coder, same MoE arch, same size, same active params, broader instruction + tool-use training distribution).

The smoke test (one-shot, no full bench)

While downloading the 18GB Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf, the operator pointed out that waiting 40 minutes for fixture 1 was unnecessary — a single targeted smoke against the exact same system prompt + user prompt the bench would use would give the answer in 30 seconds.

The smoke payload:

  • System: full CODE_SYSTEM_PROMPT (the same one in apr code, with the 3 <tool_call> few-shot examples and anti-Markdown rule)
  • User: fixture 1 (leetcode__01-two-sum) prompt
  • Config: temp=0.3, top_k=50, top_p=0.95, repeat_penalty=1.2, repeat_last_n=64 (sub-bench B config)
  • max_tokens: 400

The response:

{"name": "file_read", "input": {"path": "src/lib.rs"}}
</tool_call>

  • 20 completion tokens
  • finish_reason: "stop"
  • Structured JSON tool_call (missing leading <tool_call> tag, but the body is exactly what the parser expects)
  • No "Human:" leak, no Markdown rust block, no rambling

Empirical conclusion

The Coder-finetune-distribution hypothesis is empirically confirmed at the smoke level. The non-Coder Instruct variant emits structured tool_call JSON in 20 tokens; the Coder variant emits 500+ tokens of Markdown explanation.

Whether the full bench discharges V1_004 (i.e., oracle_passed > 0) depends on whether:

  1. The arena parser handles the missing leading <tool_call> opening tag (bare JSON body)
  2. The model maintains the tool_call format across all 20 turns of a fixture
  3. The model's code is correct (a question separate from format adherence)
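
For the first condition, one possible lenient extraction is sketched below. It is an assumption about how such tolerance could work, not a description of the arena parser or of apr code: extract_tool_call_body is a hypothetical helper, and it does string slicing only, leaving JSON parsing to the caller.

```rust
const OPEN: &str = "<tool_call>";
const CLOSE: &str = "</tool_call>";

/// Hypothetical lenient extractor: returns the text that looks like a JSON
/// object before the first closing tag, whether or not an opening tag exists.
pub fn extract_tool_call_body(response: &str) -> Option<&str> {
    // Without a closing tag there is nothing to anchor on.
    let end = response.find(CLOSE)?;
    let head = &response[..end];
    let body = match head.rfind(OPEN) {
        // Normal case: take what sits between the tags.
        Some(start) => &head[start + OPEN.len()..],
        // Bare-body case (as in the smoke response): take everything before
        // the closing tag. Prose preceding the JSON makes the check below fail.
        None => head,
    };
    let body = body.trim();
    (body.starts_with('{') && body.ends_with('}')).then_some(body)
}
```

The brace check is deliberately shallow; a real parser would still need to validate the JSON and the tool name.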

What M294 unblocks

If the full bench shows ≥1 oracle_passed:

  • V1_004's open question is empirically answered: the bottleneck is finetune-distribution.
  • V1_004 as written names Qwen3-Coder-30B-A3B-Instruct specifically — a discharge requires either a contract amendment (M22 5-step ritual) or a new V1_005 gate.
  • M280 SUSPENSION can be lifted on a contract-level basis.

If the full bench still shows 0 oracle_passed:

  • The tool_call emission is necessary but not sufficient.
  • Code quality / correctness becomes the next variable to investigate.
  • A post-decode parser in apr code that converts Markdown rust blocks to file_edit calls becomes a higher-priority engineering target (which would unlock Qwen-Coder family for V1_004 as written).

CLI reference

ccpa

The user-facing CLI for the static path.

# Score a single teacher/student pair
ccpa diff fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl \
          fixtures/canonical/0001-edit-readme/student.ccpa-trace.jsonl

# Score the whole corpus + bidirectional-sensitivity check
ccpa corpus fixtures/canonical/             # canonical MUST PASS
ccpa corpus fixtures/regression/            # regression MUST FAIL
ccpa corpus fixtures/canonical/ --json      # machine-readable

# Walk the parity-matrix coverage gate
ccpa coverage \
  --apr-code-parity-yaml ../aprender/contracts/apr-code-parity-v1.yaml \
  --fixtures-dir fixtures/canonical/ \
  --oos-rows keyboard-shortcuts,status-line

# Validate a JSONL trace against the schema
ccpa validate fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl

ccpa-arena-bench

The Arena dispatcher (operator-coordinated).

ccpa-arena-bench \
  --cwd /tmp/p6-uc-leetcode__01-two-sum-student.xyz \
  --prompt-file fixtures/under-contract/leetcode/01-two-sum/prompt.txt \
  --oracle-cmd "cargo test 2>&1" \
  --oracle-pattern "test result: ok" \
  --max-turns 20 \
  --wall-seconds 3600 \
  --oracle-check-interval 3 \
  --driver-per-turn-timeout 900 \
  --compliance-enforced \
  --max-consecutive-compliance-failures 3 \
  --max-consecutive-text-turns 5 \
  --driver-binary /home/noah/.local/bin/apr \
  --driver-name apr \
  --driver-extra-arg code \
  --driver-extra-arg --model=/path/to.gguf

Outputs BenchResult JSON to stdout. Wrapped by the phase scripts.

scripts/phase-{3,5,6}-bench.sh

Operator-facing corpus walkers.

# Phase 3 — function-scale MultiPL-E-Rust HumanEval
bash scripts/phase-3-bench.sh

# Phase 5 — project-scale Arena (3 real GitHub-issue fixtures)
bash scripts/phase-5-arena-bench.sh

# Phase 5 — calibration-and-scale (15 synthetic-deterministic fixtures, M242)
bash scripts/phase-5-calibration-bench.sh

# Phase 6 — under-contract dispatch
APR_MODEL=/home/noah/models/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf \
  PHASE6_COMPLIANCE_ENFORCED=1 \
  PHASE6_MAX_TURNS=20 \
  PHASE6_WALL_SECONDS=3600 \
  APR_AGENT_TEMPERATURE=0.3 \
  APR_AGENT_TOP_K=50 \
  APR_AGENT_TOP_P=0.95 \
  APR_AGENT_REPEAT_PENALTY=1.2 \
  APR_AGENT_REPEAT_LAST_N=64 \
  PHASE6_MAX_CONSECUTIVE_TEXT_TURNS=5 \
  bash scripts/phase-6-bench.sh

Phase 6 environment variables

| Env | Default | What it controls |
|---|---|---|
| APR_MODEL | Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf | GGUF path passed to apr serve |
| APR_TIMEOUT_S | 900 | Per-turn driver subprocess timeout |
| APR_AGENT_HTTP_TIMEOUT_S | 1500 | apr code → apr serve HTTP timeout |
| APR_AGENT_MAX_TOKENS_CAP | 1024 | Max tokens per assistant turn |
| APR_AGENT_TEMPERATURE | unset (greedy) | Sampling temperature |
| APR_AGENT_TOP_K | unset | Top-k filter |
| APR_AGENT_TOP_P | unset | Nucleus (top-p) filter |
| APR_AGENT_REPEAT_PENALTY | unset | Repetition penalty (Candle convention) |
| APR_AGENT_REPEAT_LAST_N | unset | Window for repetition penalty |
| APR_AGENT_SEED | random | Deterministic sampling seed |
| PHASE6_MAX_TURNS | 20 | Multi-turn cap |
| PHASE6_WALL_SECONDS | 3600 | Per-fixture wall-clock budget |
| PHASE6_ORACLE_INTERVAL | 3 | Oracle check cadence (turns) |
| PHASE6_COMPLIANCE_ENFORCED | 1 | Per-Write/Edit pmat comply check |
| PHASE6_MAX_CONSECUTIVE_COMPLIANCE_FAILURES | 3 | Compliance-Trap cap |
| PHASE6_MAX_CONSECUTIVE_TEXT_TURNS (M293) | 0 (disabled) | Agent-Text-Loop cap |

Local dev tier sweeps

make tier1          # fmt + clippy + check          (<5s)
make tier2          # tier1 + tests                 (<30s)
make tier3          # tier2 + cov + comply + pv     (1-3 min)
make install-hooks  # FALSIFY-CCPA-012 pre-commit hook
make install-tools  # local tools matching CI exactly

Trace JSON Schema reference

The full schema is in contracts/claude-code-parity-apr-v1.yaml § trace_schema. This page is a quick reference; the YAML is canonical.

Record kinds

// session_start — first record of every trace
{
  "kind": "session_start",
  "session_id": "string",
  "cwd": "/absolute/path",
  "git_commit": "deadbeef..."
}

// user_prompt — user-initiated turn
{
  "kind": "user_prompt",
  "text": "Fix the failing test.",
  "attachments": [/* optional */]
}

// assistant_turn — model response
{
  "kind": "assistant_turn",
  "blocks": [
    {"type": "text", "text": "I'll start by reading the file."},
    {"type": "tool_use", "id": "tu_1", "name": "Read", "input": {"path": "src/lib.rs"}}
  ],
  "stop_reason": "tool_use"  // or "end_turn", "max_tokens", "stop_sequence"
}

// tool_result — tool execution result
{
  "kind": "tool_result",
  "tool_use_id": "tu_1",
  "content": "<file contents>",
  "is_error": false
}

// session_end — last record
{
  "kind": "session_end",
  "reason": "end_turn"  // or "max_turns", "wall_timeout", "driver_error", etc.
}

// hook_event — hook fired (schema v2, M15)
{
  "kind": "hook_event",
  "hook_name": "pre-tool-use",
  "trigger": "PreToolUse",
  "tool_use_id": "tu_1"  // optional; null if pre-session
}

// skill_invocation — skill invoked (schema v2, M15)
{
  "kind": "skill_invocation",
  "skill_name": "explain",
  "args": {"depth": "medium"}
}

Block types (inside assistant_turn.blocks[])

// Text — plain text output
{"type": "text", "text": "..."}

// ToolUse — a tool call
{"type": "tool_use", "id": "tu_<n>", "name": "Bash|Read|Write|Edit|...", "input": {...}}

// Thinking — extended thinking (claude-only; optional)
{"type": "thinking", "text": "..."}

stop_reason values

| Value | Meaning |
|---|---|
| tool_use | Model emitted a tool_call; turn ends here |
| end_turn | Model's natural turn boundary (e.g., emitted EOS) |
| max_tokens | Hit the token cap |
| stop_sequence | Hit a configured stop sequence |
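
As a quick illustration of how these four strings might map onto a closed enum, here is a std-only sketch. It is not the ccpa-trace definition (which this book shows only by name and presumably derives its string forms through serde); the ALL constant, as_str, and the FromStr impl are assumptions made for the example.

```rust
use std::str::FromStr;

/// Hypothetical mirror of the four documented stop_reason values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopReason {
    ToolUse,
    EndTurn,
    MaxTokens,
    StopSequence,
}

impl StopReason {
    pub const ALL: [StopReason; 4] = [
        StopReason::ToolUse,
        StopReason::EndTurn,
        StopReason::MaxTokens,
        StopReason::StopSequence,
    ];

    /// The snake_case wire form used in the trace JSON.
    pub fn as_str(self) -> &'static str {
        match self {
            StopReason::ToolUse => "tool_use",
            StopReason::EndTurn => "end_turn",
            StopReason::MaxTokens => "max_tokens",
            StopReason::StopSequence => "stop_sequence",
        }
    }
}

impl FromStr for StopReason {
    type Err = String;

    /// Rejects anything outside the four documented values.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Self::ALL
            .iter()
            .copied()
            .find(|r| r.as_str() == s)
            .ok_or_else(|| format!("unknown stop_reason: {s}"))
    }
}
```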

Rust types

The Rust-side types are in crates/ccpa-trace/src/lib.rs:

pub struct Trace { pub records: Vec<Record> }

#[serde(tag = "kind", rename_all = "snake_case")]
pub enum Record {
    SessionStart { session_id: String, cwd: PathBuf, git_commit: String },
    UserPrompt { text: String, attachments: Vec<Attachment> },
    AssistantTurn { blocks: Vec<Block>, stop_reason: StopReason },
    ToolResult { tool_use_id: String, content: String, is_error: bool },
    SessionEnd { reason: SessionEndReason },
    HookEvent { hook_name: String, trigger: HookTrigger, tool_use_id: Option<String> },
    SkillInvocation { skill_name: String, args: serde_json::Value },
}

The roundtrip falsifier (FALSIFY-CCPA-001) asserts that every value serializes → parses → re-serializes losslessly.

Contract YAML reference

The canonical contract YAML lives in aprender:

Pin format:

[pin]
aprender_commit = "16f25af06"
aprender_pr = 1078
aprender_pr_state = "OPEN"
contract_sha256 = "..."
last_synced = "2026-05-02"

Top-level structure

schema_version: "1.32.0"
name: "claude-code-parity-apr-v1"

gates:
  FALSIFY-CCPA-001:
    name: "trace_schema_roundtrip"
    status: "ACTIVE_RUNTIME"
    description: "..."
    asserted_by:
      - "crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs"

  FALSIFY-CCPA-NNN: { ... }

trace_schema:
  version: 2
  records:
    session_start: { ... }
    # ...

per_tool_equivalence:
  Bash: { ... }
  Read: { ... }
  Write: { ... }
  # ...

sovereignty:
  allowed_network_endpoints:
    - "127.0.0.1:*"
    - "localhost:*"
  forbidden_env_vars:
    - "ANTHROPIC_API_KEY"
    - "OPENAI_API_KEY"
    # ...

Validation — pv validate

pv is the dogfooded contract validator (aprender-contracts-cli). It enforces:

  • Schema correctness (every gate has the required fields)
  • Cross-reference correctness (asserted_by files exist)
  • Pin correctness (contracts/pin.lock's sha256 matches the aprender source at the pinned commit)

pv validate contracts/claude-code-parity-apr-v1.yaml
pv pin-check contracts/pin.lock --aprender-path ../aprender

CI runs both on every PR (FALSIFY-CCPA-012).

Adding a new gate

The M22 5-step ritual:

  1. Propose: add the gate to the canonical aprender YAML at PROPOSED status. Open an aprender PR.
  2. Test: write the falsifier test in the corresponding crate of this repo. PR against this repo.
  3. Mirror: update contracts/pin.lock to the new aprender commit. PR (mechanical).
  4. Verify: CI runs pv validate + pv pin-check + the new falsifier test on every PR. All three must be green.
  5. Promote: once the test passes deterministically, flip status to ACTIVE_ALGORITHM_LEVEL (or ACTIVE_RUNTIME if backed by a measured discharge). PR.

Adding gates without all 5 steps is rejected. The ritual is pv validate-asserted; bypassing it is mechanically impossible.

Falsification gate IDs

Quick cross-reference. See The 20 gates for full descriptions.

CCPA prefix (this repo's gates)

| ID | Name | Status |
|---|---|---|
| CCPA-001 | trace_schema_roundtrip | ACTIVE_RUNTIME |
| CCPA-002 | replay_determinism | ACTIVE_RUNTIME |
| CCPA-003 | mock_completeness | ACTIVE_RUNTIME |
| CCPA-004 | tool_call_equivalence | ACTIVE_RUNTIME |
| CCPA-005 | file_mutation_equivalence | ACTIVE_RUNTIME |
| CCPA-006 | sovereignty_on_replay | ACTIVE_RUNTIME |
| CCPA-007 | corpus_coverage | HARD-BLOCKING (M16) |
| CCPA-008 | parity_score_bound | ADVISORY (M230) |
| CCPA-009 | ci_main_branch_green | ACTIVE_RUNTIME |
| CCPA-010 | pmat_comply_100pct | ACTIVE_RUNTIME |
| CCPA-011 | line_coverage_100pct | ACTIVE_RUNTIME |
| CCPA-012 | pv_contract_gate_on_commit | ACTIVE_RUNTIME |
| CCPA-013 | first_recorded_parity_score | DISCHARGED |
| CCPA-014 | os_event_parity_bound | ACTIVE_RUNTIME |
| CCPA-015 | os_trace_output_purity | ACTIVE_RUNTIME |
| CCPA-016 | outcome_parity_bound | ACTIVE_RUNTIME |
| CCPA-017 | project_scale_parity_bound | PROPOSED (v1.28.0) |
| CCPA-018 | arena_recovery_rate_bound | PROPOSED (v1.29.0) |
| CCPA-019 | calibration_required_before_verdict | PROPOSED (v1.32.0) |
| CCPA-020 | contract_compliance_per_turn | PROPOSED (v1.32.0) |

V1_ prefix (Phase 6 infrastructure gates, live in aprender)

| ID | Name | Status |
|---|---|---|
| V1_001 | qwen3_moe_serve_dispatch_v1 | ACTIVE_RUNTIME |
| V1_002 | qwen3_moe_sampling_v1 | ACTIVE_RUNTIME |
| V1_003 | qwen3_moe_streaming_sse_v1 | DISCHARGED (gx10 Blackwell) |
| V1_004 | phase_6_bench_non_zero_student_pass_rate | OPEN |

Status legend

  • PROPOSED — defined, not yet algorithmically asserted
  • ACTIVE_ALGORITHM_LEVEL — algorithmically asserted, no measured discharge
  • ACTIVE_RUNTIME — algorithmically asserted AND measured discharge on file
  • DISCHARGED — empirical claim fully met; gate preserved for historical record but no longer fires
  • HARD-BLOCKING — CI exit-1 on failure (subset of ACTIVE_RUNTIME)
  • ADVISORY — emits warning, doesn't exit-1 (intentional after M230)

Academic basis

CCPA's design draws on several lines of prior work. Each is cited where its idea informs a specific gate or technique.

Distillation framing

Hinton et al., 1503.02531: Distilling the Knowledge in a Neural Network

CCPA treats claude as the teacher and apr code as the student. The "knowledge" being distilled is the action stream — sequences of tool calls, not output logits. This generalizes the original logit-distillation framing to the agentic-execution setting.

Metamorphic testing of ML systems

Segura et al., 2208.08227: METTLE: Metamorphic Testing of Deep Learning Systems

LLMORPH, 2603.23611: Cataloged Metamorphic Relations for NLP

A metamorphic relation says: "if input X maps to output Y, then transformation T(X) should map to f(Y)." CCPA's per-tool equivalence rules are metamorphic relations specialized to action streams:

  • Bash(cmd) and Bash(canonical_form(cmd)) should produce equivalent file-system mutations
  • Write(path, content) and Edit(path, old, new) that produce the same file SHA256 are file-mutation-equivalent
  • etc.
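
The second relation can be sketched as an in-memory check: apply each mutation to the same starting content and compare a fingerprint of the result. This is a hedged illustration, not the differ's code. Mutation, apply, and file_mutation_equivalent are hypothetical names, and std's DefaultHasher stands in for SHA256 so that the block needs no external crate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified mutations on a single file's content.
pub enum Mutation<'a> {
    Write { content: &'a str },
    Edit { old: &'a str, new: &'a str },
}

/// Apply a mutation; an Edit whose `old` text is absent fails with None.
pub fn apply(before: &str, m: &Mutation<'_>) -> Option<String> {
    match m {
        Mutation::Write { content } => Some((*content).to_string()),
        Mutation::Edit { old, new } => {
            let (old, new) = (*old, *new);
            before.contains(old).then(|| before.replacen(old, new, 1))
        }
    }
}

/// Stand-in fingerprint (NOT SHA256; used only to keep the sketch std-only).
pub fn fingerprint(content: &str) -> u64 {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

/// Two mutations are equivalent when both succeed and yield the same content.
pub fn file_mutation_equivalent(before: &str, a: &Mutation<'_>, b: &Mutation<'_>) -> bool {
    match (apply(before, a), apply(before, b)) {
        (Some(x), Some(y)) => fingerprint(&x) == fingerprint(&y),
        _ => false,
    }
}
```

Comparing the resulting state rather than the tool name is what lets a teacher Write and a student Edit count as the same action.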

The DriftCategory taxonomy maps onto Segura's metamorphic-violation severity scale.

Differential testing

2207.11976: Differential Testing of Deep Learning Frameworks

CCPA is a differential test of apr code against claude — two implementations of the same logical specification (agentic coding), measured by paired-execution divergence. The static path's compute_parity_score IS a differential-testing scoring function.

Function-scale outcome parity

MultiPL-E, 2208.08227 — Cassano et al.

evidence/phase-3/multipl-e-rust-scores.json records the M150 function-scale measurement (n=5, parity=1.0000) using the MultiPL-E-Rust HumanEval subset. The benchmark is unmodified from upstream.

Project-scale Arena

SWE-bench, 2310.06770 — Jimenez et al.

SWE-bench formalized the "can LLMs resolve real GitHub issues" measurement at project-scale. CCPA's Phase 5 corpus is hand-curated in the SWE-bench style (real GitHub-issue Rust fixtures), but smaller (n=5) for operator-coordinated dispatch cost reasons. Phase 6's under-contract regime adds the compliance-cost dimension that SWE-bench doesn't address.

Chaos engineering for LLM systems

2505.03096: Chaos Engineering for LLM Systems

CCPA's regression-corpus design (deliberate drift, must-fail) is in the spirit of chaos engineering: introduce a known failure mode and verify the meter catches it. The M196-M224 4-bug stack is the empirical justification for this practice.

Sovereignty / data-residency

No single paper drives the sovereignty gate (CCPA-006). The design is informed by the broader privacy-engineering literature on differential-privacy boundaries and the FedRAMP / HIPAA classes of "data must not leave the trust boundary" guarantees. The Tier3 SovereigntyViolation category formalizes the boundary.

Per-gate mapping

See docs/specifications/academic-basis.md for the per-gate citation table — every gate has a paper that motivated its design or that it specializes.

Milestone history

CCPA's work is organized as a continuous sequence of M-rows (milestone-rows) tracked in docs/specifications/milestones-*.md. Each M-row is one substantive deliverable (a PR, a fixture, a finding) with its own scope and acceptance criteria.

High-level phases

| Phase | M-row range | What it shipped |
|---|---|---|
| Phase 1 (RECORD), out-of-scope post-M222 | M0-M14 | original HTTPS-proxy recording path; rescoped to subprocess-driver |
| Phase 2 (REPLAY) | M15-M50 | trace schema, replayer, mock harness, hook+skill projection |
| Phase 3 (DISTILL, function-scale) | M51-M100 | MultiPL-E-Rust HumanEval bench, function-scale parity measurement (n=5, 1.0000) |
| Phase 4 (project-scale prep) | M101-M150 | fixture authoring for project-scale; differ enhancements; bidirectional sensitivity |
| Phase 5 (ARENA, project-scale) | M150-M234 | Arena runner, calibration-and-scale corpus, first arena scores |
| Phase 6 (UNDER-CONTRACT) | M250-M294 | compliance-enforced dispatch, V1_004 chain, Coder-finetune-distribution finding |

Notable M-rows

  • M9: regression corpus added (bidirectional sensitivity)
  • M15: schema v2 (hook_event + skill_invocation)
  • M16: FALSIFY-CCPA-007 hard-blocking corpus coverage gate
  • M150: first measured function-scale parity (n=5, 1.0000)
  • M194-M210: Arena runner Phase 5 P5.1-P5.5
  • M222: RECORD path out-of-scope directive (rescope to subprocess-driver only)
  • M230: FALSIFY-CCPA-008 flipped to ADVISORY after M196-M224 four-bug-stack revealed meter under-sensitivity
  • M234: Popperian-falsification of static-fixture as project-scale predictor (claude 1/5, apr code 0/5)
  • M236: FALSIFY-CCPA-019 (calibration_required_before_verdict) introduced
  • M280: Phase 6 CCPA project SUSPENSION declared (1.5B model below testability floor)
  • M286: M32d MoE KV cache shipped (19× speedup; unblocks V1_004)
  • M287: greedy baseline pattern; uniform driver_error on 30B-Coder
  • M291: sub-bench B pattern shift; driver_error → oracle_failed_after_max_turns
  • M292: ArenaOutcome::AgentTextLoop detector (Gap 3 closure)
  • M293: PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring
  • M294: finetune-distribution A/B; non-Coder Qwen3-30B-A3B-Instruct-2507 confirmed at smoke level

How M-rows are tracked

Each M-row gets a row in docs/specifications/milestones-mNNN-mMMM.md. The row body explains:

  • What was shipped
  • Why (motivation, prior M-row references)
  • Acceptance criteria (tests, evidence, contract entries)
  • Cross-references (PR numbers, evidence file paths)

A doc-drift detector (scripts/check-doc-drift.sh) asserts that the milestone counter on 5 cross-reference surfaces (README, CONTRIBUTING, top spec, status-snapshots, milestones doc) all agree.

Operator-coordinated vs autonomous M-rows

  • Autonomous — anything that doesn't require operator-only data (compute budget, model-class decision, contract amendment). The autonomous ship-cycle (per CLAUDE.md) ships these continuously without check-in.
  • Operator-coordinated — anything that needs operator-only data: dispatching benches, deciding model class, amending contract gates. The substantive→mechanical→substantive cadence pauses ONLY for these.

Glossary

| Term | Definition |
|---|---|
| Action stream | The sequence of tool calls + tool results + text + hooks + skills emitted by an agent during one session. CCPA's primary unit of measurement. |
| apr code | The student. A sovereign, pure-Rust CLI coding agent (in paiml/aprender) that runs against a local GGUF model with no data leaving the machine. |
| apr serve | Inference server subprocess that apr code auto-spawns and talks to over HTTP. Loads the GGUF model and serves /v1/chat/completions. |
| Arena | CCPA's live-execution measurement path. Multi-turn live dispatch of real teacher + real student against test-shaped oracles. |
| CCPA | Claude Code Parity for apr code. The harness this book describes. |
| claude | The teacher. Anthropic's official CLI (docs). Treated as the orchestrator and the action-stream baseline. |
| Closed enum | A Rust enum where adding a variant requires touching every match site. CCPA's ArenaOutcome, DriftCategory, ToolInvocation are closed enums by design; pattern-match exhaustiveness is the type system's enforcement of total handling. |
| Compound oracle | Phase 6 oracle: cargo test AND pmat comply check --strict both pass. |
| Compliance-Trap | M254 P6.3 detector. Bails the session with ArenaOutcome::ComplianceTrap when the same (file, sha256) pair fails compliance N consecutive turns. Saves token cost. |
| Driver | The subprocess wrapper around claude (teacher) or apr code (student). SubprocessDriver in crates/ccpa-arena/. |
| Drift / DriftCategory | A divergence between teacher and student traces. The closed enum (Tier0/1/2/3) categorizes severity. |
| Falsifier | A deterministic test that proves a gate. The gate states a falsifiable claim; the test would FAIL if the claim were wrong. |
| FALSIFY-CCPA-NNN | The unique identifier of a gate. Each ID maps to one entry in the contract YAML and one (or more) tests in the crates. |
| Fixture | A canonical input, typically meta.toml + (trace pairs OR cwd-tree + prompt + oracle). Lives in fixtures/<corpus>/<id>/. |
| Greedy | Sampling at temperature=0: always take the argmax of the next-token distribution. Deterministic but boring; can cause infinite loops. |
| M-row | One milestone in the project's continuous-ship cadence. Numbered M0, M1, ..., M294, ... |
| MoE | Mixture-of-Experts. A neural-architecture pattern where only a fraction of total parameters are "active" per token. Qwen3-Coder-30B-A3B is 30B total / 3B active. |
| Oracle | The test-shaped acceptance check for a fixture. Phase 5: `cargo test 2>&1 |
| pmat comply | The paiml quality-posture meter. A multi-pass static analyzer with org-wide rules (allowed-unwrap, complexity caps, lint rules, doc coverage). |
| pv | The contract validator. Binary from aprender-contracts-cli. Asserts contract YAML correctness, pin correctness, gate cross-reference correctness. Dogfooded; bash re-implementations rejected. |
| pv validate | The pv subcommand that hard-asserts the contract YAML schema. CI-gated via FALSIFY-CCPA-012. |
| pin.lock | The pin from this repo to the canonical aprender contract YAML. Records sha256 + commit reference. Pin-check is part of FALSIFY-CCPA-012. |
| PROPOSED / ACTIVE_ALGORITHM_LEVEL / ACTIVE_RUNTIME | The three statuses of a gate. See Status flow. |
| Recovery rate | Fraction of OraclePassed fixtures where the agent recovered from at least one non-zero bash exit. Phase 5 metric. |
| Sovereignty / Tier3 | The hardest gate class. A Tier3 SovereigntyViolation means the agent did something that breaches data residency / network sovereignty (egress, credential read, foreign API). |
| Sub-bench | A focused dispatch of the Phase 6 bench script with specific knob settings (e.g., sub-bench A = few-shot prompt only, sub-bench B = full 3-knob config). |
| Tool call / <tool_call> block | A JSON object inside a <tool_call>...</tool_call> XML-like wrapping. apr code's parser extracts these from the model's response and dispatches the named tool. |
| Turn | One round of (assistant-emits-response, tool dispatched, result observed). The session loop runs up to max_turns of these. |
| V1_NNN | Phase 6 infrastructure gate prefix. Lives in aprender's contracts (distinct from CCPA-NNN). |
| Wall budget / wall_timeout | The wall-clock seconds budget for one session. Phase 5 default 900s; Phase 6 default 3600s. WallTimeout is the outcome when exceeded. |