CCPA — The Claude Code Parity Harness
A record-replay-distill harness measuring apr code against Claude Code at the action-stream level.
This book is the reference companion to the claude-code-parity-apr repository. It explains the methodology, the falsifier gates, the empirical findings, and the architectural decisions that shape every measurement.
Why this exists
A sovereign, locally-hosted coding agent (apr code) needs an honest, falsifiable yardstick to measure itself against the industry baseline (Claude Code). Without a rigorous yardstick:
- "It works" claims drift from "it works like the reference"
- Regressions hide behind narrative
- The compliance posture of code an agent emits has no contract gate
CCPA closes that gap with three commitments:
- Contract-first. Every behavior gate (FALSIFY-CCPA-001..020) is encoded as a falsifiable assertion in a YAML contract before code lands. Tests prove the gate; pv validate proves the contract; pmat comply proves the project's compliance posture. No code ships without a contract.
- Two complementary measurement paths. A static path (authored teacher/student trace pairs scored by a deterministic differ) validates the meter. An Arena path (multi-turn live dispatches of real claude + real apr code against real Rust fixtures with test-shaped oracles) validates the system. The two paths cross-falsify each other.
- Empirical calibration. Every Arena verdict requires a fresh bidirectional-sensitivity calibration on file (FALSIFY-CCPA-019). Static-fixture parity is calibrated against project-scale Arena reality; any drift between them is recorded and explained.
Honest framing
At function-scale (single-prompt code generation on HumanEval-style fixtures), claude and apr code are functionally interchangeable — both pass each other's tests (1.0000 parity, n=5, M150).
At project-scale (multi-turn Arena with real GitHub-issue fixtures), the static-fixture approach is Popperian-falsified as a project-scale predictor: claude solves 1/5, apr code 0/5 on phase-5 corpus (M234). Direction agrees with static verdict, magnitudes diverge.
The empirical chain in this book — M1 → M294 — is the honest record of what we measured, when, and how confident we are. Negative results are evidence; this book treats them as such.
Status as of writing
- Contract v1.32.0 — 20 gates registered (16 ACTIVE_RUNTIME, 4 PROPOSED)
- M0 → M294 all SHIPPED
- Phase 6 under-contract dispatch in active operator-coordinated bench cycles against Qwen3-30B-A3B-Instruct-2507
- V1_004 (Phase 6 non-zero student pass rate) is the open gate
How to read this book
- Want the methodology in 10 minutes? → What is CCPA? + Methodology
- Want to add a fixture or run a bench? → CLI reference
- Want the empirical story (the interesting part)? → V1_004 chain
- Want the academic basis? → Academic basis
License
Apache-2.0 OR MIT. See the repository root.
What is CCPA?
CCPA — the Claude Code Parity for apr code harness — is a measurement system. It does one job: produce a falsifiable, contract-gated parity score between two AI coding agents.
- Teacher (the reference): Claude Code — Anthropic's official CLI, treated as the orchestrator and the action-stream baseline.
- Student (the sovereign system under test): apr code, a locally-hosted, pure-Rust coding agent that runs against a local GGUF model with no data leaving the machine.
What "parity" means here
Parity is not "the two systems produce identical bytes." Parity is action-stream semantic equivalence under a per-tool rule set.
For each pair of trace records — teacher and student — the differ asks:
- Did they invoke the same logical tool? (Bash↔Bash, Write↔Write, etc.)
- Did the tool inputs differ in ways that matter? (commands semantically equivalent? file paths normalized? content byte-equal or text-equivalent?)
- Did the resulting file-system mutations agree? (hash-checked)
- Did the OS-event trace agree, modulo allowed nondeterminism?
A parity score in [0.0, 1.0] plus a closed enum of DriftCategory for any mismatch is the output. The score and category are mechanically asserted by FALSIFY-CCPA-004 through FALSIFY-CCPA-008.
What CCPA is NOT
- Not a benchmark suite for general LLMs. The corpus is curated for the apr code ↔ claude parity question. SWE-bench, HumanEval, and similar exist for general benchmarking.
- Not a record-from-API tool. The original HTTPS-proxy recording path is intentionally out of scope post-M222 directive. claude is driven as a subprocess via session-based auth (claude login); CCPA does not use ANTHROPIC_API_KEY and does not call the Anthropic API directly.
- Not a unit-test framework for claude. It's a parity harness: the meter between two systems.
Three deliverables, one repository
| Deliverable | What it is | Where it lives |
|---|---|---|
| The differ | ccpa-differ crate + ccpa diff / ccpa corpus CLI | crates/ccpa-differ/ |
| The Arena runner | ccpa-arena crate + ccpa-arena-bench binary | crates/ccpa-arena/ |
| The fixtures | Canonical, regression, project-scale, calibration-and-scale, under-contract | fixtures/ |
All three are governed by one contract YAML — see Methodology.
Methodology — contract-first + falsifier-driven
CCPA is governed by a single methodology, applied uniformly: every behavior gate is an assertion in a YAML contract; the assertion exists before the code that proves it; CI mechanically validates both.
The cycle
1. Behavior identified → written prose
2. Falsifier composed → "this is exactly the assertion that would prove the gate WRONG if it failed"
3. Contract entry added → contracts/claude-code-parity-apr-v1.yaml (status: PROPOSED at first)
4. pv validate the contract → syntax + schema gate
5. Test that exercises the falsifier → crates/ccpa-{differ,arena,...}/tests/ (links the gate ID by name)
6. CI hard-blocks → status flips ACTIVE_ALGORITHM_LEVEL once the test passes deterministically
7. Empirical evidence on file → flips ACTIVE_RUNTIME once a real measured discharge is recorded
No step is optional. No step happens in a different order. The cycle is enforced by FALSIFY-CCPA-012 (pre-commit + CI pv validate) and FALSIFY-CCPA-007 (corpus coverage).
Status flow for any gate
PROPOSED
   │  algorithm-level test passes deterministically
   ▼
ACTIVE_ALGORITHM_LEVEL
   │  measured discharge on file
   ▼
ACTIVE_RUNTIME
- PROPOSED: defined in the YAML, not yet asserted by a passing test.
- ACTIVE_ALGORITHM_LEVEL: a deterministic test asserts the gate, but no real-world measurement has been recorded yet.
- ACTIVE_RUNTIME: a real measured bench run (operator-dispatched, evidence captured) discharged the gate.
See Status flow for the exhaustive transition table.
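The two transitions described above can be sketched as a small state machine. This is an illustrative sketch only: `GateStatus`, `Evidence`, and `advance` are hypothetical names, not types from the repository, and the exhaustive transition table may define transitions that are not modelled here.

```rust
/// Gate statuses named in the contract (Rust variant names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateStatus {
    Proposed,
    ActiveAlgorithmLevel,
    ActiveRuntime,
}

/// Hypothetical evidence kinds that can move a gate forward.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Evidence {
    /// An algorithm-level test passed deterministically.
    DeterministicTestPassed,
    /// A real measured discharge was recorded.
    MeasuredDischargeOnFile,
}

/// Returns the next status, or `None` when the evidence does not apply
/// to the current status (no step may be skipped).
pub fn advance(status: GateStatus, evidence: Evidence) -> Option<GateStatus> {
    match (status, evidence) {
        (GateStatus::Proposed, Evidence::DeterministicTestPassed) => {
            Some(GateStatus::ActiveAlgorithmLevel)
        }
        (GateStatus::ActiveAlgorithmLevel, Evidence::MeasuredDischargeOnFile) => {
            Some(GateStatus::ActiveRuntime)
        }
        _ => None,
    }
}
```

In this sketch a PROPOSED gate cannot jump straight to ACTIVE_RUNTIME: a measured discharge without a passing deterministic test yields `None`.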
Three sources of truth
| Concern | Lives in | Why |
|---|---|---|
| Contract YAML | paiml/aprender/contracts/claude-code-parity-apr-v1.yaml (canonical), pinned here via contracts/pin.lock | aprender is the org-wide single-source-of-truth for paiml contracts |
| Spec text | docs/specifications/claude-code-parity-apr-poc.md | This repo since M1 |
| Implementation, fixtures, CI, coverage, pmat-comply | this repo | The harness IS the implementation |
The split mirrors aprender's monorepo policy: aprender stays canonical for contract TEXT (the shared schema across all paiml contracts), while this repo is canonical for runtime ENFORCEMENT (the tests, fixtures, CI, and pmat comply posture).
Forbidden tools
- cargo tarpaulin: slow, unreliable. Use cargo llvm-cov only.
- bash re-implementations of pv / pmat / cargo-llvm-cov checks: if pv validate rejects a contract, fix the contract or extend aprender-contracts/src/schema/; do not duplicate validation logic in bash.
Code search policy
Prefer pmat query over grep for any Rust code search. pmat query returns quality-annotated, semantically ranked results (TDG grades, complexity, fault patterns), whereas grep / rg returns only matching lines.
grep is acceptable only for non-Rust files (TOML, YAML, Markdown) or quick one-off debugging.
The two measurement paths
CCPA's parity score is the output of two complementary measurement paths that cross-falsify each other.
Path 1 — Static (the meter)
fixtures/canonical/<id>/teacher.ccpa-trace.jsonl ◄── AUTHORED
▲
│ per-tool equivalence rules
│ + hook + skill projections
▼
fixtures/canonical/<id>/student.ccpa-trace.jsonl ◄── AUTHORED
│
▼
ccpa-differ::compute_parity_score
│
▼
ParityReport
{ score, drifts[] }
- What it validates: the meter. Does the differ recognize equivalent actions? Does it catch the kinds of drift we care about? Does it ignore the noise we choose to ignore?
- How it's wired: 30 canonical fixtures + a regression corpus (bidirectional sensitivity proof, M9) + per-PR CI hard-blocker (FALSIFY-CCPA-007 since M16).
- What it cannot do: tell you whether apr code actually solves real tasks. Trace pairs are AUTHORED; they prove the differ logic, not the real-world capability gap.
Path 2 — Arena (the system)
fixtures/project-scale/<id>/{prompt.txt, cwd-tree/}
│
▼
Arena runner: live claude + live apr code
(multi-turn, max_turns=20, wall=900s default)
│
▼
per-fixture oracle (cargo test 2>&1 | grep "test result: ok")
│
▼
ArenaOutcome
{ OraclePassed | OracleFailedAfterMaxTurns
| WallTimeout | DriverError | ComplianceFailed
| ComplianceTrap | AgentTextLoop (M292) }
│
▼
evidence/phase-{5,6}/arena-scores.json
- What it validates: the system. Does apr code solve real Rust bugs the way claude does?
- How it's wired: multi-turn live subprocess dispatch. Operator-coordinated (requires claude login + a local GGUF model + GPU/CPU compute budget). Phase 5 (M194-M210) shipped the project-scale corpus; Phase 6 (M250+) adds the under-contract dispatch (per-turn pmat comply check --strict to measure compliance cost).
- What it cannot do: tell you that the differ logic is right. Arena measures end-to-end behavior, not action-stream equivalence.
Why both?
Each path has a different failure mode that the other catches:
- Static path alone would let apr code "pass" by producing traces that look like claude's but cover none of the real-world capability surface. A perfect 1.0 parity score on a curated corpus means nothing if apr code can't solve a real bug.
- Arena path alone would let apr code "pass" by producing solutions that happen to work but via wildly different action sequences (e.g., a single 5000-line file_write vs. claude's careful read-edit-test loop). Outcome parity ≠ action parity; both matter.
FALSIFY-CCPA-019 (calibration_required_before_verdict) and FALSIFY-CCPA-016 (outcome_parity_bound) jointly enforce that the two paths' verdicts must agree, or the disagreement must be calibrated and explained.
When the paths disagree — the Popperian discipline
The M234 finding (phase-5 results) was a clean Popperian-falsification of the static-fixture approach as a project-scale predictor:
- Static path: 1.0000 parity on canonical corpus (n=30, M150-M161)
- Arena path: claude 1/5, apr code 0/5 on phase-5 project-scale corpus (M234)
Direction agrees (claude > apr code), but magnitude diverges: a static parity of 1.0000 sits next to Arena pass rates of 1/5 for claude and 0/5 for apr code. The static result over-predicts at project scale. This is recorded in docs/specifications/completeness-assessment.md, and the Arena scores are the ground truth for project-scale claims.
Architecture at a glance
Workspace layout
claude-code-parity-apr/
├── contracts/ # pin.lock + smoke YAML; canonical YAML lives in aprender
├── crates/
│ ├── ccpa-trace/ # JSONL trace schema, types, validators
│ ├── ccpa-differ/ # per-tool equivalence rules, parity score
│ ├── ccpa-recorder/ # stream-json parser (claude side)
│ ├── ccpa-subproc/ # subprocess driver (deterministic stdout/stderr capture)
│ ├── ccpa-replayer/ # mock harness for replay determinism
│ ├── ccpa-arena/ # multi-turn live runner + bench binary
│ └── ccpa-cli/ # `ccpa` user-facing binary
├── docs/specifications/ # 25 spec files (all <500 LOC, doc-drift gated)
├── evidence/ # per-phase measured-output snapshots
├── fixtures/ # canonical, regression, project-scale, calibration-and-scale, under-contract
└── scripts/ # bench dispatch + drift detectors
Crate dependency graph
ccpa-cli
│
┌─────────────┼─────────────┐
▼ ▼ ▼
ccpa-differ ccpa-arena ccpa-recorder
│ │ │
└─────────────┼─────────────┘
▼
ccpa-trace
│
▼
ccpa-subproc
ccpa-trace is the schema root: every crate consumes its Trace, Record, ToolUse, and ToolResult types. A new trace record kind is added here first; the schema bump then propagates to every consuming crate through compile-time type checks.
How ccpa diff produces a parity score
- Load both JSONL files via ccpa-trace::parse::parse_file. The parser hard-enforces schema v2 (hook_event + skill_invocation records added at M15).
- Pair records by index. Length must match exactly (a record-count imbalance is a hard error; see the tool_call_equivalence falsifier).
- Project hook events and skill invocations onto their target tool record (M15 hook/skill semantics).
- Match each paired record under its per-tool equivalence rule:
  - Bash: command tokenization + whitelist of allowed nondeterminism
  - Write / Edit: post-state file SHA256 must agree
  - Read: path + range + content excerpt
  - Skill: invocation site + arguments
  - Hook: trigger + target tool's invocation
- Score: count matches, divide by total. Score ∈ [0.0, 1.0].
- Categorize drifts: any mismatch is classified into a closed DriftCategory enum. Tier 0 = no drift; Tier 1 = cosmetic; Tier 2 = semantic; Tier 3 = sovereignty violation (see crates/ccpa-differ/src/sovereignty.rs).
- Report: ParityReport { score, drifts[] }, JSON-serializable, the unit of measurement.
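The pairing and scoring steps above can be sketched in a few lines. This is a simplified illustration, not the ccpa-differ implementation: `Action`, `PairError`, and `parity_score` are hypothetical names, a plain equality check stands in for the per-tool rules, and the empty-trace case returning 1.0 is an assumption.

```rust
/// Hypothetical stand-in for a trace record: tool name plus a normalized input.
#[derive(Debug, Clone, PartialEq)]
pub struct Action {
    pub tool: String,
    pub input: String,
}

#[derive(Debug, PartialEq)]
pub enum PairError {
    /// Record counts differ, so index pairing is impossible.
    LengthMismatch { teacher: usize, student: usize },
}

/// Pairs records by index and returns matches / total.
pub fn parity_score(teacher: &[Action], student: &[Action]) -> Result<f64, PairError> {
    if teacher.len() != student.len() {
        return Err(PairError::LengthMismatch {
            teacher: teacher.len(),
            student: student.len(),
        });
    }
    if teacher.is_empty() {
        return Ok(1.0); // assumption: two empty traces trivially agree
    }
    // Equality stands in for the per-tool equivalence rule.
    let matches = teacher.iter().zip(student).filter(|(t, s)| t == s).count();
    Ok(matches as f64 / teacher.len() as f64)
}
```

With four paired records where the teacher used Edit and the student used Write on the same path, this sketch reports 3 matches out of 4, which mirrors the rule that a different tool is a different action.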
How ccpa-arena-bench runs a fixture
1. Copy fixture's cwd-tree to /tmp/p6-uc-<fixture>-<side>.<rand>
2. Read prompt.txt
3. Launch driver subprocess:
- teacher: claude --output-format=stream-json --print "<prompt>"
- student: apr code --model=<path> -p "<prompt>" + apr serve auto-spawned
4. Multi-turn loop (max_turns=20 default, wall=900s default):
a. Render history into prompt suffix
b. driver.next_turn(prompt + history) → NextTurn { blocks, stop_reason }
c. Extract first ToolUse block → dispatch in fixture cwd
d. Append TurnRecord to history
e. Every K turns (oracle_check_interval=3 default) OR on EndTurn:
- Run oracle: cargo test 2>&1 | grep "test result: ok"
- Pass → return OraclePassed
f. Phase 6 only: if compliance_enforced, per-Write/Edit run pmat comply check
g. Trap detectors: ComplianceTrap (N consecutive same-(file,sha) failures),
AgentTextLoop (N consecutive text-only turns, M292, opt-in)
5. On max_turns / wall / driver_error / compliance_trap → return the appropriate ArenaOutcome
6. Emit BenchResult JSON to evidence/<phase>/captures/<fixture>/<side>.bench.json
The cleanly-typed outcome enum lets aggregate scoring (recovery_rate, oracle_passed_rate, compliance_cost_ratio) pattern-match without parsing strings.
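A minimal sketch of why a typed outcome helps aggregate scoring. The variant names follow the outcome list above, but the payload fields are omitted for brevity, and `oracle_passed_rate` here is a hypothetical helper rather than the crate's function.

```rust
/// Outcome variants as listed above; payload fields are left out in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum ArenaOutcome {
    OraclePassed,
    OracleFailedAfterMaxTurns,
    WallTimeout,
    DriverError,
    ComplianceFailed,
    ComplianceTrap,
    AgentTextLoop,
}

/// Fraction of fixtures whose outcome is `OraclePassed`.
/// Returns `None` for an empty corpus instead of dividing by zero.
pub fn oracle_passed_rate(outcomes: &[ArenaOutcome]) -> Option<f64> {
    if outcomes.is_empty() {
        return None;
    }
    let passed = outcomes
        .iter()
        .filter(|outcome| matches!(outcome, ArenaOutcome::OraclePassed))
        .count();
    Some(passed as f64 / outcomes.len() as f64)
}
```

Because the match is on enum variants, adding a new outcome is a compile-time change rather than a new string to parse.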
Two binaries, one config space
- ccpa: user-facing CLI for the static path (diff, corpus, coverage, validate)
- ccpa-arena-bench: Arena dispatcher (operator-coordinated)
Both consume the same Trace/ArenaOutcome types and emit the same JSON shapes downstream tools depend on.
Trace schema
The trace schema is the language CCPA speaks. Everything — the differ, the Arena runner, the replayer — operates on Trace objects: a sequence of Record types each describing one observable action.
The 7 record kinds (schema v2)
| Kind | Fields | When emitted |
|---|---|---|
| session_start | session_id, cwd, git_commit | First record of every trace |
| user_prompt | text, attachments[] | User-initiated turn |
| assistant_turn | text, blocks[], stop_reason | Model response |
| tool_result | tool_use_id, content, is_error | Tool execution result |
| session_end | reason | Last record (clean shutdown or interrupt) |
| hook_event | hook_name, trigger, tool_use_id? | Hook fired (schema v2, M15) |
| skill_invocation | skill_name, args | Skill invoked (schema v2, M15) |
assistant_turn.blocks[] is a polymorphic array — each block is one of:
- Text { text }: model output text
- ToolUse { id, name, input }: a tool call (Bash, Read, Write, Edit, Glob, Grep, Shell, ...)
- Thinking { text }: extended thinking (claude-only; optional)
The Rust types are mirrored in crates/ccpa-trace/src/lib.rs; the JSON-schema is in contracts/claude-code-parity-apr-v1.yaml § trace_schema.
File format — JSONL (one record per line)
{"kind":"session_start","session_id":"abc-123","cwd":"/tmp/fixture-0001","git_commit":"deadbeef"}
{"kind":"user_prompt","text":"Fix the failing test."}
{"kind":"assistant_turn","blocks":[{"type":"text","text":"I'll start by reading the file."},{"type":"tool_use","id":"tu_1","name":"Read","input":{"path":"src/lib.rs"}}],"stop_reason":"tool_use"}
{"kind":"tool_result","tool_use_id":"tu_1","content":"<file contents>","is_error":false}
...
{"kind":"session_end","reason":"end_turn"}
JSONL means line-oriented, append-only, streamable. The parser at ccpa-trace::parse::parse_file is O(n) and emits structured errors with line numbers.
Roundtrip falsifier — FALSIFY-CCPA-001
Every record kind has a roundtrip test: serialize → parse → re-serialize → compare. If any field is lossy or any field re-orders, the roundtrip falsifier catches it.
17 pin tests in crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs.
Schema versioning
- v1 (M0-M14): 5 record kinds (session_start, user_prompt, assistant_turn, tool_result, session_end).
- v2 (M15+): adds hook_event and skill_invocation. The differ's hook/skill projection rules require these.
Schema bumps follow the Methodology cycle — contract YAML first, then tests, then code.
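The version split can be illustrated as a lookup from schema version to permitted record kinds. This is a sketch under the assumption that a parser rejects kinds unknown to the declared version; `kind_allowed` and the two constants are hypothetical names.

```rust
/// The five record kinds present since schema v1.
pub const V1_KINDS: [&str; 5] = [
    "session_start",
    "user_prompt",
    "assistant_turn",
    "tool_result",
    "session_end",
];

/// The two record kinds added by schema v2 (M15).
pub const V2_ONLY_KINDS: [&str; 2] = ["hook_event", "skill_invocation"];

/// Returns true when `kind` is a valid record kind under `schema_version`.
/// Unknown versions accept nothing (an assumption of this sketch).
pub fn kind_allowed(schema_version: u32, kind: &str) -> bool {
    let in_v1 = V1_KINDS.iter().any(|k| *k == kind);
    let in_v2_only = V2_ONLY_KINDS.iter().any(|k| *k == kind);
    match schema_version {
        1 => in_v1,
        2 => in_v1 || in_v2_only,
        _ => false,
    }
}
```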
The differ
ccpa-differ is the heart of the static path. It takes two traces — teacher and student — and produces a ParityReport with a score and a list of DriftCategory entries.
Entry point — compute_parity_score
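use ccpa_differ::{compute_parity_score, ParityReport};
use ccpa_trace::Trace;
// Wrapped in main so that `?` has a Result to return into.
// Assumes the parse error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let teacher: Trace = ccpa_trace::parse_file("teacher.ccpa-trace.jsonl")?;
    let student: Trace = ccpa_trace::parse_file("student.ccpa-trace.jsonl")?;
    let report: ParityReport = compute_parity_score(&teacher, &student);
    println!("score = {}, drifts = {}", report.score, report.drifts.len());
    Ok(())
}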
Per-tool equivalence rules
The differ's behavior is dispatched on ToolUse.name:
| Tool | Rule |
|---|---|
| Bash / Shell | Tokenize command; whitelist allowed nondeterminism (mktemp -p paths, ISO-8601 timestamps, PID); compare token sequences |
| Read | Path equal (after canonicalization) + range overlap; content excerpt SHA256 equal |
| Write | Path equal; post-state file SHA256 equal (the file mutation IS the equivalence claim) |
| Edit | Path equal; old/new strings equal; post-state file SHA256 equal |
| Glob | Pattern equal; result-count equal modulo cwd; result-paths SHA256-equal |
| Grep | Pattern equal; flag equivalence; result line-count equal |
| Hook | Trigger equal; target tool's invocation equal |
| Skill | Name equal; args structurally equal |
Each rule is one Rust function in crates/ccpa-differ/src/; adding a tool requires (1) the rule, (2) a falsifier test, (3) a contract YAML entry.
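As an illustration of the Bash row, the sketch below masks whitelisted nondeterminism before comparing token sequences. It is not the differ's rule: whitespace splitting stands in for real shell tokenization, the two detectors are rough heuristics (the `tmp.` path-component check assumes mktemp's default template), PID masking is left out, and all function names are hypothetical.

```rust
/// Heuristic: token starts with YYYY-MM-DDTHH:MM:SS.
fn looks_like_iso8601(token: &str) -> bool {
    let bytes = token.as_bytes();
    if bytes.len() < 19 {
        return false;
    }
    bytes[..19].iter().enumerate().all(|(i, c)| match i {
        4 | 7 => *c == b'-',
        10 => *c == b'T',
        13 | 16 => *c == b':',
        _ => c.is_ascii_digit(),
    })
}

/// Heuristic: some path component looks like mktemp's default `tmp.XXXXXXXXXX`.
fn looks_like_mktemp_path(token: &str) -> bool {
    token
        .split('/')
        .any(|part| part.starts_with("tmp.") && part.len() > 4)
}

/// Splits on whitespace and replaces whitelisted tokens with placeholders.
pub fn normalize(command: &str) -> Vec<String> {
    command
        .split_whitespace()
        .map(|token| {
            if looks_like_iso8601(token) {
                "<TIMESTAMP>".to_string()
            } else if looks_like_mktemp_path(token) {
                "<TMPPATH>".to_string()
            } else {
                token.to_string()
            }
        })
        .collect()
}

/// Two commands are equivalent when their normalized token sequences agree.
pub fn bash_equivalent(teacher: &str, student: &str) -> bool {
    normalize(teacher) == normalize(student)
}
```

Only the whitelisted noise is masked; every other token, including flags such as `--release`, must match exactly.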
DriftCategory — the closed enum
pub enum DriftCategory {
Tier0NoDrift,
Tier1Cosmetic { detail: String }, // whitespace, timestamp jitter
Tier2Semantic { detail: String }, // different file content, different command
Tier3SovereigntyViolation { detail: String }, // network egress, foreign-API call
}
Tier3 is the hardest gate. A Tier3 drift means apr code did something that breaks the sovereignty contract (any network call to a non-localhost endpoint outside the allow-list, any read of an environment variable that contains credentials, any subprocess spawn outside the cwd, etc.). Even one Tier3 drift hard-fails CI.
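The "even one Tier3 drift hard-fails" rule can be sketched as a check that runs before the score is consulted. The enum repeats the definition above; `Verdict` and `verdict` are hypothetical names introduced for this illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum DriftCategory {
    Tier0NoDrift,
    Tier1Cosmetic { detail: String },
    Tier2Semantic { detail: String },
    Tier3SovereigntyViolation { detail: String },
}

/// Hypothetical CI verdict for one report.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Pass,
    BelowThreshold,
    SovereigntyViolation,
}

/// Checks for Tier3 first, so a high score can never mask a violation.
pub fn verdict(drifts: &[DriftCategory], score: f64, threshold: f64) -> Verdict {
    let has_tier3 = drifts
        .iter()
        .any(|d| matches!(d, DriftCategory::Tier3SovereigntyViolation { .. }));
    if has_tier3 {
        return Verdict::SovereigntyViolation;
    }
    if score >= threshold {
        Verdict::Pass
    } else {
        Verdict::BelowThreshold
    }
}
```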
How the score is computed
total_pairs = teacher.records.len() # must equal student.records.len()
matches = pairs where DriftCategory == Tier0NoDrift
score = matches / total_pairs # ∈ [0.0, 1.0]
The threshold for FALSIFY-CCPA-008 (parity_score_bound) is configured in the contract YAML; current canonical-corpus threshold is ≥ 0.95 (with 30 fixtures, this means at most 1 fixture can have any drift).
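The parenthetical arithmetic can be checked directly. The sketch below assumes the aggregate is read as the fraction of drift-free fixtures, which is one reading of the sentence above rather than a confirmed detail of the contract; `max_failing_fixtures` is a hypothetical helper.

```rust
/// Largest number of failing fixtures that still keeps
/// (total - failing) / total at or above `threshold`.
pub fn max_failing_fixtures(total: u32, threshold: f64) -> u32 {
    (0..=total)
        .take_while(|failing| f64::from(total - failing) / f64::from(total) >= threshold)
        .last()
        .unwrap_or(0)
}
```

For 30 fixtures and a 0.95 threshold, 29/30 ≈ 0.967 passes and 28/30 ≈ 0.933 does not, so the budget is one fixture.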
Corpus driver — ccpa corpus
ccpa corpus fixtures/canonical/ # walks every fixture, computes per-fixture + aggregate score
ccpa corpus fixtures/regression/ # MUST FAIL (bidirectional sensitivity proof)
ccpa corpus fixtures/canonical/ --json # machine-readable for CI
Aggregate scoring respects FALSIFY-CCPA-007 (corpus coverage): every required-row of the apr-code-parity-v1.yaml parity matrix must have at least one fixture exercising it. Missing coverage → exit 2 with a structured error pointing at the gap.
What the differ does NOT do
- Does not run code. It reads two traces; that's it. The Arena runner is for live execution.
- Does not infer intent. "Same effect, different tool" is not equivalence under CCPA. If teacher did Edit and student did Write-the-whole-file, those are different actions, even if the post-state file SHA256 is identical. The contract gates the action stream, not just the file system.
- Does not allow nondeterminism by default. Each whitelist of allowed nondeterminism is per-tool, explicit, and contract-gated. Adding a new whitelist entry requires a contract bump.
Fixtures
CCPA has five distinct fixture corpora, each measuring a different thing.
1. fixtures/canonical/ — the meter
- 30 fixtures, every required-row of apr-code-parity-v1.yaml exercised at least once.
- AUTHORED teacher/student trace pairs.
- MUST score ≥ threshold in ccpa corpus. Per-PR CI hard-blocker via FALSIFY-CCPA-007.
- Aggregate parity = 1.0000 at canonical corpus (M150, fixtures/canonical/measured-parity.json).
2. fixtures/regression/ — bidirectional sensitivity proof
- Fixtures with deliberate drift — teacher and student diverge in known ways.
- MUST FAIL ccpa corpus. If a regression fixture passes, the differ has lost sensitivity to that drift class.
- Catches "the meter agrees on everything" bugs (M9 introduced this corpus).
3. fixtures/project-scale/ — Phase 5 Arena corpus
- 5 real GitHub-issue Rust fixtures with full cwd-tree/, prompt.txt, oracle.
- Each fixture is a real Rust bug or feature request that an agent must solve in a multi-turn session.
- M234 finding: claude 1/5, apr code 0/5. Direction agrees with static verdict; magnitudes diverge.
4. fixtures/calibration-and-scale/ — synthetic-deterministic project-scale
- 15 hand-authored Rust bug fixtures.
- Deterministic seed; reproducible from clean clone.
- Bridges the static path (controlled) and project-scale Arena (real-world) via a controlled Arena-style measurement.
5. fixtures/under-contract/ — Phase 6 corpus
- 20 fixtures across 4 classes: leetcode, oo (OO patterns), transpile (format converters), unix (CLI utilities).
- Each runs under the Phase 6 compound oracle: cargo test AND pmat comply check --strict.
- The corpus that V1_004 dispatches against.
Fixture file layout
fixtures/canonical/0001-edit-readme/
├── meta.toml # fixture id, covers[], description
├── teacher.ccpa-trace.jsonl # AUTHORED teacher action stream
└── student.ccpa-trace.jsonl # AUTHORED student action stream
fixtures/under-contract/leetcode/01-two-sum/
├── prompt.txt # the task description shown to both agents
├── meta.toml # oracle_cmd, expected_pattern
└── cwd-tree/
├── Cargo.toml
├── src/lib.rs # the buggy code
└── tests/...
Adding a fixture
mkdir fixtures/canonical/00XX-my-scenario
cat > fixtures/canonical/00XX-my-scenario/meta.toml <<EOF
[fixture]
id = "00XX-my-scenario"
covers = ["builtin-tools-rwegs"]
description = "What this fixture exercises and why."
EOF
# Author teacher.ccpa-trace.jsonl + student.ccpa-trace.jsonl
ccpa corpus fixtures/canonical/ # MUST exit 0
ccpa coverage --apr-code-parity-yaml ... --oos-rows ... # MUST exit 0
make tier3 # full local gate sweep
Coverage gates fail if a fixture is added without a covers[] claim or if covers[] contains a row not in apr-code-parity-v1.yaml. The contract YAML drives fixture validation, not the other way around.
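The three coverage failures described in this chapter (an empty covers[] claim, a claim naming an unknown row, and a required row with no fixture) can be sketched as one check. This is an illustration, not the ccpa coverage implementation: `CoverageError` and `check_coverage` are hypothetical, and the row name "hooks" in the usage below is invented for the example.

```rust
#[derive(Debug, PartialEq)]
pub enum CoverageError {
    /// Fixture declares no covers[] claim.
    EmptyCovers { fixture: String },
    /// Fixture claims a row that the parity matrix does not contain.
    UnknownRow { fixture: String, row: String },
    /// A required row has no fixture exercising it.
    UncoveredRow { row: String },
}

/// `fixtures` pairs each fixture id with its covers[] list.
pub fn check_coverage(
    known_rows: &[&str],
    required_rows: &[&str],
    fixtures: &[(&str, Vec<&str>)],
) -> Result<(), CoverageError> {
    for (id, covers) in fixtures {
        if covers.is_empty() {
            return Err(CoverageError::EmptyCovers { fixture: id.to_string() });
        }
        for row in covers {
            if !known_rows.iter().any(|known| *known == *row) {
                return Err(CoverageError::UnknownRow {
                    fixture: id.to_string(),
                    row: row.to_string(),
                });
            }
        }
    }
    for required in required_rows {
        let covered = fixtures
            .iter()
            .any(|(_, covers)| covers.iter().any(|row| *row == *required));
        if !covered {
            return Err(CoverageError::UncoveredRow { row: required.to_string() });
        }
    }
    Ok(())
}
```

The check is driven entirely by the row lists, which matches the direction stated above: the contract validates fixtures, not the reverse.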
Bidirectional sensitivity
A parity meter has two failure modes:
- False positive: declaring drift when traces are actually equivalent. Caught by the canonical corpus (fixtures/canonical/ MUST PASS).
- False negative: declaring equivalence when traces actually diverge. Caught by the regression corpus (fixtures/regression/ MUST FAIL).
A meter that passes only the canonical corpus is not validated. It may be passing everything trivially. The regression corpus is the falsifier for the differ itself.
What "bidirectional" means here
The differ must be sensitive in both directions:
teacher == student (equivalent)
        │
        ▼
parity_score == 1.0
(canonical corpus proves this direction)

teacher != student (deliberate drift)
        │
        ▼
parity_score < threshold
(regression corpus proves this direction)
If either direction breaks, the meter is broken. The regression corpus exists because in M9 we caught a class of drift the differ wasn't sensitive to — the canonical corpus passed, but a known-bad pair also passed. That's a Tier 2 meter bug. Bidirectional sensitivity is the falsifier for it.
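The two-direction requirement reduces to a pair of checks over per-fixture scores. The sketch below is an assumption-laden simplification: `meter_validated` is a hypothetical helper, and it treats an empty corpus as unvalidated.

```rust
/// A meter is validated only if every canonical fixture scores at or above
/// the threshold AND every regression fixture scores below it.
pub fn meter_validated(
    canonical_scores: &[f64],
    regression_scores: &[f64],
    threshold: f64,
) -> bool {
    let canonical_ok = !canonical_scores.is_empty()
        && canonical_scores.iter().all(|score| *score >= threshold);
    let regression_ok = !regression_scores.is_empty()
        && regression_scores.iter().all(|score| *score < threshold);
    canonical_ok && regression_ok
}
```

A differ that trivially returns 1.0 for every pair passes the canonical half and fails the regression half, which is exactly the case the regression corpus exists to expose.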
The M196-M224 bug stack
Through M196-M224 the team encountered four meter bugs in a row, each caught only by bidirectional sensitivity:
- Bash command tokenization: cargo test --release and cargo test tokenized identically (the regression fixture for this case exposed it).
- Glob result-set hashing: glob.results[] was being compared as a set, not a sequence, allowing reordered results to slip through.
- Hook trigger projection: PreToolUse and PostToolUse hooks were collapsing onto the same target.
- Sovereignty check ordering: Tier3 detection ran AFTER score computation, so a sovereignty violation could silently lower the score below threshold without being categorically flagged.
Each was caught by a regression fixture that the canonical corpus didn't catch. The four-bug stack is the empirical justification for FALSIFY-CCPA-019 (calibration_required_before_verdict) — every Arena verdict requires a fresh bidirectional sensitivity record on file.
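The Glob bug in the list above is easy to reproduce in miniature. The sketch below contrasts the two comparison strategies; both function names are hypothetical, and the real rule hashes result paths rather than comparing strings directly.

```rust
use std::collections::BTreeSet;

/// The buggy comparison: order is discarded, so reordered results look equal.
pub fn equal_as_set(teacher: &[&str], student: &[&str]) -> bool {
    let t: BTreeSet<&str> = teacher.iter().copied().collect();
    let s: BTreeSet<&str> = student.iter().copied().collect();
    t == s
}

/// The order-sensitive comparison: position matters.
pub fn equal_as_sequence(teacher: &[&str], student: &[&str]) -> bool {
    teacher == student
}
```

A regression fixture whose only drift is result order passes the set comparison and fails the sequence comparison, which is how this class of bug surfaces.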
The calibration contract — FALSIFY-CCPA-019
Shipped at M236. Codifies the M196-M224 lesson as a permanent gate:
no Arena verdict ships without a CalibrationRecord stamped within the last 90 days
The CalibrationRecord JSON shape lives in crates/ccpa-differ/src/calibration.rs. Each record contains: (a) canonical-corpus passes, (b) regression-corpus fails, (c) Tier3 sovereignty exercises, (d) cross-tool equivalence spot-checks. A stale record fails CI on the next Arena dispatch.
This is the only FALSIFY-CCPA-* gate that fires on a measured artifact (a JSON file with a timestamp), not on a code-level test. It's the closest thing CCPA has to a runtime-only contract, and it's there for a hard-earned reason.
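The 90-day freshness rule amounts to a timestamp comparison. The sketch below assumes Unix-second timestamps, an inclusive 90-day bound, and rejection of future-dated stamps; none of these details are confirmed by the CalibrationRecord definition, and `calibration_is_fresh` is a hypothetical helper.

```rust
use std::time::Duration;

/// Maximum age of a calibration record: 90 days.
pub const MAX_CALIBRATION_AGE: Duration = Duration::from_secs(90 * 24 * 60 * 60);

/// Returns true when the record was stamped within the last 90 days.
pub fn calibration_is_fresh(stamped_at_unix: u64, now_unix: u64) -> bool {
    match now_unix.checked_sub(stamped_at_unix) {
        Some(age_secs) => Duration::from_secs(age_secs) <= MAX_CALIBRATION_AGE,
        // Stamped in the future: treated as invalid in this sketch.
        None => false,
    }
}
```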
Arena runner overview
The Arena is CCPA's live-execution path. It dispatches real claude and real apr code subprocesses against real Rust bugs in real cwd-trees, and scores each via a test-shaped oracle.
The Arena loop (per fixture, per side)
1. Copy fixture's cwd-tree to /tmp/p6-uc-<fixture>-<side>.<rand>
2. Read prompt.txt
3. Launch driver subprocess via SubprocessDriver:
teacher: claude --output-format=stream-json --print "<prompt>"
student: apr code --model=<path> -p "<prompt>" (apr serve auto-spawned)
4. Multi-turn ArenaSession::run loop:
for turn in 1..=max_turns:
a. Check wall-clock budget
b. Render history into prompt suffix:
"<prompt>\n\n<rendered_history>### Continue:\n"
c. driver.next_turn(prompt) → NextTurn { blocks, stop_reason }
d. Extract first ToolUse block from blocks:
some → dispatch the tool in cwd, record ToolResult
none → record ToolInvocation::Text
e. Phase 6 only: ComplianceTrap detector observes ToolResult::FileMutated
f. M292: AgentTextLoop detector observes ToolInvocation::Text
g. Append TurnRecord to history
h. Every oracle_check_interval turns OR on StopReason::EndTurn:
run_oracle_compound → OracleOutcome { Passed | FailedDueToCompliance | NonZeroExit | ExitZeroNoPatternMatch }
Passed → return ArenaOutcome::OraclePassed
FailedDueToCompliance (Phase 6) → return ArenaOutcome::ComplianceFailed
end for
5. Loop exit → ArenaOutcome::OracleFailedAfterMaxTurns
6. Wall-time exit → ArenaOutcome::WallTimeout
7. Driver error → ArenaOutcome::DriverError { reason, turns_before_error }
8. Compliance trap → ArenaOutcome::ComplianceTrap { file, last_reason, consecutive_count }
9. Text loop (M292) → ArenaOutcome::AgentTextLoop { consecutive_text_turns, last_text_excerpt }
Default knobs
| Knob | Default | Set by |
|---|---|---|
| max_turns | 20 | PHASE6_MAX_TURNS env / --max-turns flag |
| max_wall_seconds | 900 (phase 5) / 3600 (phase 6) | PHASE6_WALL_SECONDS / --wall-seconds |
| oracle_check_interval | 5 (phase 5) / 3 (phase 6) | PHASE6_ORACLE_INTERVAL / --oracle-check-interval |
| compliance_enforced | false (phase 5) / true (phase 6) | PHASE6_COMPLIANCE_ENFORCED / --compliance-enforced |
| max_consecutive_compliance_failures | 3 | PHASE6_MAX_CONSECUTIVE_COMPLIANCE_FAILURES |
| max_consecutive_text_turns (M292) | 0 (disabled) | PHASE6_MAX_CONSECUTIVE_TEXT_TURNS |
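The "0 (disabled)" default for max_consecutive_text_turns suggests a counter that only arms when the limit is positive. The sketch below illustrates that behavior; `TextLoopDetector` is a hypothetical type, and resetting the counter on any tool-using turn is an assumption about what "consecutive" means here.

```rust
/// Counts consecutive text-only turns; a limit of 0 disables the detector.
pub struct TextLoopDetector {
    limit: u32,
    consecutive: u32,
}

impl TextLoopDetector {
    pub fn new(limit: u32) -> Self {
        Self { limit, consecutive: 0 }
    }

    /// Observes one turn. Returns true when the trap should fire.
    pub fn observe(&mut self, turn_was_text_only: bool) -> bool {
        if self.limit == 0 {
            return false; // opt-in: disabled by default
        }
        if turn_was_text_only {
            self.consecutive += 1;
        } else {
            self.consecutive = 0; // a tool call breaks the streak
        }
        self.consecutive >= self.limit
    }
}
```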
Two binaries
- ccpa-arena-bench (in crates/ccpa-arena/src/bin/): one-fixture dispatcher. Reads prompt, oracle config, driver config from flags; emits BenchResult JSON.
- scripts/phase-{5,6}-bench.sh: corpus-walker that calls ccpa-arena-bench per-fixture, aggregates per-side scores.json.
The shell script handles model pre-warming, evidence directory layout, and per-fixture orchestration; the Rust binary handles the loop.
Phase 5 — project-scale Arena
Phase 5 (M194-M210) was the first Arena dispatch against real GitHub-issue Rust fixtures. It produced the Popperian-falsification result that established project-scale measurement as the ground truth.
Corpus
fixtures/project-scale/ — 5 real Rust bug fixtures hand-curated from GitHub issues:
- Each fixture has a cwd-tree/ (a snapshot of the repo at the buggy commit), a prompt.txt (the issue text or a derived task), and a test-shaped oracle (cargo test + an expected pattern).
- Fixtures span error-handling, async edge cases, FFI boundaries, lifetime issues, and macro-related bugs.
Headline result (M234)
| Side | Oracle pass | Recovery (one bash-fail then pass) | Recovery rate |
|---|---|---|---|
| claude (teacher) | 1/5 | 1 | 1.00 (1 of 1 passes had recovery) |
| apr code (student) | 0/5 | 0 | undefined (0/0) |
apr code's 0/5 was uniform OracleFailedAfterMaxTurns — the agent engaged but couldn't solve the bugs within the 20-turn / 900s budget.
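The "undefined (0/0)" cell in the table can be represented explicitly instead of as NaN or 0. A minimal sketch, with `recovery_rate` as a hypothetical helper whose definition (recovered passes divided by total passes) is read off the table's parentheticals:

```rust
/// Recovery rate = passes that involved a recovery / total passes.
/// Returns `None` when there are no passes, since 0/0 is undefined.
pub fn recovery_rate(passes: u32, passes_with_recovery: u32) -> Option<f64> {
    if passes == 0 {
        None
    } else {
        Some(f64::from(passes_with_recovery) / f64::from(passes))
    }
}
```

Using `Option` keeps the student's missing rate distinct from a genuine rate of zero when the two sides are compared.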
What M234 falsified
The static-fixture parity score of 1.0000 on the canonical corpus (fixtures/canonical/, n=30, M150) does NOT predict project-scale Arena performance. The two systems are functionally interchangeable on single-prompt code generation (HumanEval-class) but diverge on multi-turn project-scale work.
Per the Popperian discipline, this is a clean falsification, not a contradiction. Both measurements are valid; they measure different things. The static path measures the meter; the Arena path measures the system.
docs/specifications/completeness-assessment.md is the honest record of this. The README's "honest framing" paragraph quotes the same finding.
Why the Arena bench is operator-coordinated
A full Arena run consumes:
- claude API costs (one paid claude --print invocation per turn × up to 20 turns × 5 fixtures × 2 dispatches per measurement)
- Local GPU/CPU compute for apr code's apr serve (GGUF model loaded into VRAM/RAM)
- A claude login session that must not be reused across machines or breached by intermediate proxies
These costs are externalized — CI dispatches static-path tests only. Arena dispatches are operator-dispatched, evidence-captured, and stamped into evidence/phase-5/arena-scores.json. This is contract-gated by FALSIFY-CCPA-019 (calibration_required_before_verdict).
Sub-deliverables (P5.1-P5.5)
- P5.1 (M194-M196): ArenaSession scaffolding type
- P5.2 (M197-M210): multi-turn loop body, tool dispatch, oracle integration, MockDriver for tests
- P5.3 (M211-M222): corpus walker (ccpa-arena-bench), aggregate scoring, recovery_rate
- P5.4 (M223-M228): bidirectional sensitivity calibration + the M196-M224 4-bug stack closure
- P5.5 (M229-M234): first end-to-end Arena dispatch + scores.json + Popperian-falsification finding
Phase 6 — under-contract dispatch
Phase 6 (M250+) extends the Arena to measure not just "did the agent solve the bug?" but "did the agent solve the bug in a compliance-respecting way?"
What "under contract" means
In Phase 5, the only oracle is cargo test. An agent can pass that oracle while emitting code that violates pmat comply check --strict (the project's quality posture: complexity caps, lint rules, allowed-unwrap policy, etc.).
In Phase 6, the oracle is compound:
oracle_passed iff (cargo_test_exit_code == 0
AND grep "test result: ok" in test output
AND pmat comply check --strict exit_code == 0)
pmat comply runs at the end of the session AND after every Write / Edit if --compliance-enforced is set (per-turn compliance gating).
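The compound oracle above can be sketched as a pure predicate over three captured values. This is a minimal illustration, not the harness's code: the `OracleInputs` struct, its field names, and both function names are assumptions made for this example. Only the three conditions themselves come from the oracle definition.

```rust
/// Values captured after the final turn of a session.
/// Struct and field names are hypothetical, chosen for this sketch.
pub struct OracleInputs<'a> {
    pub cargo_test_exit_code: i32,
    pub cargo_test_output: &'a str,
    pub pmat_comply_exit_code: i32,
}

/// The cargo-test half of the oracle: exit code 0 AND the expected pattern in the output.
pub fn cargo_oracle_passed(inputs: &OracleInputs<'_>) -> bool {
    inputs.cargo_test_exit_code == 0 && inputs.cargo_test_output.contains("test result: ok")
}

/// Phase 6 compound oracle: the cargo-test conditions AND a clean strict compliance check.
pub fn compound_oracle_passed(inputs: &OracleInputs<'_>) -> bool {
    cargo_oracle_passed(inputs) && inputs.pmat_comply_exit_code == 0
}
```

Keeping the predicate free of process spawning means the same function can score a live session or a replayed one, as long as the three values were recorded.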
The four Phase-6-specific outcomes
| Outcome | When |
|---|---|
| `ComplianceFailed { check, turn }` | Cargo test passed, but final-state compliance check rejected. Distinct from OracleFailedAfterMaxTurns. |
| `ComplianceTrap { file, last_reason, consecutive_count }` | Same (file, sha256) failed compliance N turns in a row (default 3). Saves token cost. |
| `AgentTextLoop { consecutive_text_turns, last_text_excerpt }` (M292) | N consecutive text-only turns (no tool_call). Opt-in via PHASE6_MAX_CONSECUTIVE_TEXT_TURNS > 0. |
| `OraclePassed` (Phase 6 sense) | BOTH cargo test AND pmat comply check --strict pass. |
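The `ComplianceTrap` condition in the table (the same (file, sha256) failing compliance N turns in a row) can be tracked with a small counter. The sketch below is one way to do it, assuming a hypothetical `ComplianceTrapDetector` type and `observe` method. It also assumes a passing check or a change in file content resets the run, and that the threshold is at least 1.

```rust
/// Hypothetical tracker for the ComplianceTrap condition (not the harness's actual type).
pub struct ComplianceTrapDetector {
    threshold: u32,
    /// (file, sha256) of the most recent failing edit, if the last check failed.
    last_failure: Option<(String, String)>,
    consecutive_count: u32,
}

impl ComplianceTrapDetector {
    /// `threshold` is the N in "N turns in a row"; the table above gives 3 as the default.
    pub fn new(threshold: u32) -> Self {
        Self { threshold, last_failure: None, consecutive_count: 0 }
    }

    /// Record one per-turn compliance result.
    /// Returns Some(count) once the same (file, sha256) has failed `threshold` times in a row.
    pub fn observe(&mut self, file: &str, sha256: &str, passed: bool) -> Option<u32> {
        if passed {
            // Assumption: any passing check ends the current run.
            self.last_failure = None;
            self.consecutive_count = 0;
            return None;
        }
        let key = (file.to_string(), sha256.to_string());
        if self.last_failure.as_ref() == Some(&key) {
            self.consecutive_count += 1;
        } else {
            // A different file or different content starts a new run.
            self.last_failure = Some(key);
            self.consecutive_count = 1;
        }
        if self.consecutive_count >= self.threshold {
            Some(self.consecutive_count)
        } else {
            None
        }
    }
}
```

Keying on the content hash rather than the file name alone separates an agent that keeps rewriting the same bytes from one that is making different, still non-compliant, attempts.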
The V1 falsifiers added at Phase 6
| ID | Name | Status | Asserted by |
|---|---|---|---|
| `V1_001` | qwen3_moe_serve_dispatch_v1 | ACTIVE_RUNTIME | aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs |
| `V1_002` | qwen3_moe_sampling_v1 | ACTIVE_RUNTIME | sampling integration tests |
| `V1_003` | qwen3_moe_streaming_sse_v1 | DISCHARGED on gx10 Blackwell | streaming SSE test + evidence |
| `V1_004` | phase_6_bench_non_zero_student_pass_rate | open | per-fixture student_pass_rate > 0 |
Current state of V1_004
V1_004 is the OPEN gate. The bar: "ANY single Phase 6 fixture passes the compound oracle on the student side."
The M286-M294 chain has shipped 6 aprender PRs + 4 CCPA PRs working toward V1_004 discharge:
- M286 — M32d MoE KV cache (19× speedup; the load-bearing inference infrastructure)
- M287 — greedy baseline surfaces the uniform driver_error pattern (model entered "Human:" infinite loop)
- M288-M290 — diagnosed 3 root causes; shipped sampling (temperature/top_k/top_p), repetition penalty, EOS stop_token, clean_chat_output, few-shot CODE_SYSTEM_PROMPT
- M291 — sub-bench B on Qwen3-Coder-30B-A3B with all fixes: pattern shifted from `driver_error` to `oracle_failed_after_max_turns` with `tool_use_count: 0`
- M292 — `ArenaOutcome::AgentTextLoop` detector + opt-in cap (Gap 3 closure)
- M293 — `PHASE6_MAX_CONSECUTIVE_TEXT_TURNS` env var wiring
- M294 — scope doc for the non-Coder Qwen3-30B-A3B-Instruct-2507 A/B
See The V1_004 chain for the empirical narrative.
Phase 6 corpus — fixtures/under-contract/
20 fixtures across 4 classes:
- leetcode (5) — algorithmic bugs: two-sum, valid-parentheses, longest-common-prefix, merge-sorted-arrays, binary-search
- oo (5) — object-oriented Rust patterns: bank-account, library-borrowing, shape-hierarchy, observer-pattern, builder-pattern
- transpile (5) — format converters: json-to-toml, csv-to-jsonl, markdown-to-html, ini-to-yaml, regex-to-glob
- unix (5) — CLI utility reimplementations: wc, head, tail, cut, sort
Each fixture's meta.toml includes oracle_cmd = "cargo test 2>&1" and expected_pattern = "test result: ok". The compound oracle adds pmat comply check --strict on the post-mutation tree.
Outcome variants
ArenaOutcome is the closed enum capturing every way an Arena session can end. It's the unit aggregate scoring pattern-matches on.
The full enum (post-M292)
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum ArenaOutcome {
    OraclePassed { turns: u32, wall_seconds: u64 },
    OracleFailedAfterMaxTurns { turns: u32, partial_pass_rate: Option<f64> },
    WallTimeout { turns_at_timeout: u32, max_wall_seconds: u64 },
    DriverError { reason: String, turns_before_error: u32 },
    ComplianceFailed { check: ComplianceCheck, turn: u32 },
    ComplianceTrap { file: String, last_reason: String, consecutive_count: u32 },
    AgentTextLoop { consecutive_text_turns: u32, last_text_excerpt: String },
}
Decision matrix
| Outcome | Means | What aggregate score should treat it as |
|---|---|---|
| `OraclePassed` | Agent fully solved the fixture. (Phase 6: AND compliance passed.) | oracle_passed = true |
| `OracleFailedAfterMaxTurns` | Agent engaged, but didn't solve within 20 turns. | oracle_passed = false |
| `WallTimeout` | Agent ran out of wall-clock budget mid-session. | oracle_passed = false |
| `DriverError` | Driver subprocess crashed / hung / lost connection. | oracle_passed = false, count as infrastructure failure |
| `ComplianceFailed` (Phase 6) | cargo test passed, pmat comply check rejected. | oracle_passed = false, count toward compliance_cost_ratio numerator |
| `ComplianceTrap` (Phase 6) | Same (file, sha256) failed N consecutive turns. | oracle_passed = false, count toward token-cost-avoidance |
| `AgentTextLoop` (M292, opt-in) | N consecutive text-only turns (no tool_call). | oracle_passed = false, agent didn't engage |
Why this many variants
Each variant captures a distinct failure mode that the team has empirically observed and decided is worth distinguishing. Conflating them loses signal:
- `OracleFailedAfterMaxTurns` says "the agent worked but produced wrong output." Diagnostic action: inspect history for off-by-one fixes, missing edge cases.
- `WallTimeout` says "the agent worked too slowly." Diagnostic action: check inference tok/s, max_tokens cap, network latency.
- `DriverError` says "the infrastructure broke." Diagnostic action: check apr serve crash logs, network, ports, GPU OOM.
- `ComplianceTrap` says "the agent is stuck making the same violating edit." Diagnostic action: check whether the agent has the compliance rules in context.
- `AgentTextLoop` says "the agent talked but didn't act." Diagnostic action: check tool_call format adherence (this is the M291 finding signature).
Before M292, all the "talked but didn't act" cases were OracleFailedAfterMaxTurns — conflated with "did real work but wrong answer." Adding the AgentTextLoop variant let us measure the difference cleanly.
How aggregate scoring uses outcomes
impl ArenaOutcome {
    fn passed(&self) -> bool {
        matches!(self, Self::OraclePassed { .. })
    }

    fn compliance_failed(&self) -> bool {
        matches!(
            self,
            Self::ComplianceFailed { .. } | Self::ComplianceTrap { .. }
        )
    }
}
recovery_rate (Phase 5) counts OraclePassed fixtures where the agent recovered from at least one non-zero exit. compliance_cost_ratio (Phase 6) is compliance_failed_under_contract / oracle_passed_baseline (i.e., the fraction of fixtures that pass without a contract but fail under contract).
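As a worked illustration of compliance_cost_ratio, the sketch below pairs each fixture's baseline outcome with its under-contract outcome and divides. It uses a trimmed-down, payload-free `Outcome` enum and a hypothetical `FixturePair` struct so the example stays self-contained. Restricting the numerator to fixtures that passed at baseline is this sketch's reading of the definition above, not a confirmed detail of the scorer.

```rust
/// Trimmed-down stand-in for ArenaOutcome (payloads dropped for brevity).
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    OraclePassed,
    OracleFailedAfterMaxTurns,
    ComplianceFailed,
    ComplianceTrap,
}

impl Outcome {
    fn passed(&self) -> bool {
        matches!(self, Self::OraclePassed)
    }

    fn compliance_failed(&self) -> bool {
        matches!(self, Self::ComplianceFailed | Self::ComplianceTrap)
    }
}

/// One fixture measured twice: without the contract and under contract (hypothetical type).
pub struct FixturePair {
    pub baseline: Outcome,
    pub under_contract: Outcome,
}

/// Fraction of baseline-passing fixtures that fail compliance under contract.
/// Returns None when no fixture passed at baseline, since the ratio is undefined there.
pub fn compliance_cost_ratio(pairs: &[FixturePair]) -> Option<f64> {
    let baseline_passed: Vec<&FixturePair> =
        pairs.iter().filter(|p| p.baseline.passed()).collect();
    if baseline_passed.is_empty() {
        return None;
    }
    let failed_under_contract = baseline_passed
        .iter()
        .filter(|p| p.under_contract.compliance_failed())
        .count();
    Some(failed_under_contract as f64 / baseline_passed.len() as f64)
}
```

With three baseline passes of which two fail compliance under contract, the ratio is 2/3. A fixture that already failed at baseline does not enter either count in this reading.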
The 20 falsification gates
Every gate is encoded in contracts/claude-code-parity-apr-v1.yaml (canonical in aprender, pinned here via contracts/pin.lock). Every gate has:
- A `FALSIFY-CCPA-NNN` ID
- A short name
- A status (`PROPOSED` / `ACTIVE_ALGORITHM_LEVEL` / `ACTIVE_RUNTIME`)
- A test (or tests) that asserts the falsifier
- A natural-language description of what would falsify the gate
Full table — 20 gates
Source-of-truth invariants (M0+)
| ID | Name | Status | Mechanism |
|---|---|---|---|
| CCPA-009 | ci_main_branch_green | ACTIVE_RUNTIME | branch protection requires ci/gate |
| CCPA-010 | pmat_comply_100pct | ACTIVE_RUNTIME | pmat comply check: is_compliant=true ∧ 0 Fail checks |
| CCPA-011 | line_coverage_100pct | ACTIVE_RUNTIME | cargo llvm-cov: 100% functions ∧ ≥99% lines |
| CCPA-012 | pv_contract_gate_on_commit | ACTIVE_RUNTIME | pre-commit hook + CI pv validate + pin-check |
Behavioral parity gates
| ID | Name | Status | Asserted by |
|---|---|---|---|
| CCPA-001 | trace_schema_roundtrip | ACTIVE_RUNTIME | crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs (17 tests) |
| CCPA-002 | replay_determinism | ACTIVE_RUNTIME | crates/ccpa-replayer/ (16 tests) |
| CCPA-003 | mock_completeness | ACTIVE_RUNTIME | same harness |
| CCPA-004 | tool_call_equivalence | ACTIVE_RUNTIME | crates/ccpa-differ/tests/falsify_ccpa_004_tool_equivalence.rs (36 tests) |
| CCPA-005 | file_mutation_equivalence | ACTIVE_RUNTIME | crates/ccpa-differ/tests/falsify_ccpa_005_file_mutation.rs (15 tests) |
| CCPA-006 | sovereignty_on_replay | ACTIVE_RUNTIME | crates/ccpa-differ/tests/falsify_ccpa_006_sovereignty.rs (10 tests) |
| CCPA-007 | corpus_coverage | HARD-BLOCKING (M16) | tests + CI ccpa coverage --oos-rows ... |
| CCPA-008 | parity_score_bound | ADVISORY (M230) | crates/ccpa-differ/tests/falsify_ccpa_008_parity_score.rs (24 tests) |
| CCPA-013 | first_recorded_parity_score | DISCHARGED | fixtures/canonical/measured-parity.json (n=30, aggregate=1.0000) |
| CCPA-014 | os_event_parity_bound | ACTIVE_RUNTIME | crates/ccpa-differ/tests/falsify_ccpa_014_os_event_parity.rs |
| CCPA-015 | os_trace_output_purity | ACTIVE_RUNTIME | crates/ccpa-subproc/tests/falsify_ccpa_015_output_purity.rs |
| CCPA-016 | outcome_parity_bound | ACTIVE_RUNTIME | crates/ccpa-differ/tests/falsify_ccpa_016_outcome_parity.rs |
| CCPA-017 | project_scale_parity_bound | PROPOSED (v1.28.0) | crates/ccpa-differ/tests/falsify_ccpa_017_project_scale_parity.rs |
| CCPA-018 | arena_recovery_rate_bound | PROPOSED (v1.29.0) | crates/ccpa-arena/tests/falsify_ccpa_018_arena_recovery_rate.rs |
| CCPA-019 | calibration_required_before_verdict | PROPOSED (v1.32.0) | crates/ccpa-differ/tests/falsify_ccpa_019_calibration.rs |
| CCPA-020 | contract_compliance_per_turn | PROPOSED (v1.32.0) | crates/ccpa-arena/tests/falsify_ccpa_020_contract_compliance.rs |
Cross-reference per chapter
- Source-of-truth invariants — the four M0+ gates that govern the project's own quality posture
- Behavioral parity gates — the gates that govern what `apr code` ↔ `claude` parity means
- Status flow — the PROPOSED → ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME transition table
Mechanically asserted
Every gate is enforced by pv validate per CLAUDE.md § "DOGFOOD pv, NEVER bash". pv is the dogfooded contract validator (binary from aprender-contracts-cli). Re-implementing what pv already does in bash/python is muda and is rejected. If pv validate rejects a contract, fix the contract or extend aprender-contracts/src/schema/.
Source-of-truth invariants
These four gates govern the project's OWN quality posture (not the claude ↔ apr code parity). They are the meta-gates that make the rest of the gates trustable.
CCPA-009 — ci_main_branch_green
What it asserts: every commit on main was produced by a PR that had a green CI run.
How it's enforced: GitHub branch protection on main requires the ci/gate check. Direct pushes to main are blocked. Force-pushes to main are blocked. Merges require either fast-forward from a green branch OR squash from an approved + green PR.
What would falsify: a commit on main without a green CI run.
CCPA-010 — pmat_comply_100pct
What it asserts: every commit on main has pmat comply check returning is_compliant=true AND zero Fail-status checks.
How it's enforced: pmat comply check runs in CI on every PR. Any non-compliant artifact (file with disallowed unwrap, complexity > cap, lint violation, etc.) fails the job.
What would falsify: a main-branch commit where pmat comply check reports any Fail-status check.
pmat comply is the project's quality posture meter. It's not just clippy — it's a multi-pass static analyzer with custom rules for the aprender org's conventions (allowed-unwrap categories, complexity caps, doc-coverage minimums, etc.).
CCPA-011 — line_coverage_100pct
What it asserts: 100% function coverage AND ≥99% line coverage across all crates.
How it's enforced: cargo llvm-cov in CI. The threshold was refined in v0.4.0 (M120) from "100% lines" to "100% functions AND ≥99% lines" — the relaxation acknowledges unreachable error-handling branches that are mechanically uncoverable.
What would falsify: a main-branch commit where cargo llvm-cov reports any function with 0% coverage OR line coverage below 99%.
CCPA-012 — pv_contract_gate_on_commit
What it asserts: every commit on main passed pv validate against the pinned contract YAML AND the contracts/pin.lock matches the canonical aprender source.
How it's enforced: a pre-commit hook (scripts/install-pv-hook.sh, hard-installed by make install-hooks) PLUS the CI pv validate job. Both must pass before merge.
What would falsify: a main-branch commit where pv validate rejects the contract YAML OR where contracts/pin.lock's sha256 doesn't match the aprender commit's contract YAML at the pinned commit.
Why these four
These are the trust roots of the rest of the gate hierarchy. If CCPA-009 fails, any other gate could be silently broken on main without notice. If CCPA-010 fails, the project's quality posture has drifted from the org's contract. If CCPA-011 fails, untested code is on main. If CCPA-012 fails, the contract YAML and the code are out of sync.
Per CLAUDE.md, these are the gates that "no code ships without."
Behavioral parity gates
These gates govern what apr code ↔ claude parity means. Each one is a falsifiable assertion about the action-stream equivalence between the two systems.
CCPA-001 — trace_schema_roundtrip
Asserts: every trace record kind serializes → parses → re-serializes → equals the original.
Why: a lossy schema would silently drop information that downstream parity computation depends on. Catches schema-bump regressions.
Tests: 17 pin tests in crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs.
CCPA-002 — replay_determinism
Asserts: replaying a recorded trace through ccpa-replayer::MockHarness produces byte-identical output across runs.
Why: nondeterminism in the replay path would invalidate any parity claim. Catches hidden time/random/PID dependencies.
Tests: 16 tests in crates/ccpa-replayer/.
CCPA-003 — mock_completeness
Asserts: the MockHarness covers every tool kind defined in the schema.
Why: an incomplete mock means some real-world traces can't be replayed. Catches gaps when new tools are added.
CCPA-004 — tool_call_equivalence
Asserts: per-tool equivalence rules are deterministic, total functions over (teacher.input, student.input) pairs.
Why: the heart of the parity score. If the equivalence rule for Bash (say) has a bug, the score is meaningless.
Tests: 36 tests in crates/ccpa-differ/tests/falsify_ccpa_004_tool_equivalence.rs. One test per (tool, equivalence-class) pair.
CCPA-005 — file_mutation_equivalence
Asserts: a Write and an Edit that produce the same post-state file SHA256 are equivalent at the file-mutation level.
Why: enables the differ to recognize "same effect, different tool" as equivalent at the file level (separately from the action-stream level).
Tests: 15 tests in crates/ccpa-differ/tests/falsify_ccpa_005_file_mutation.rs.
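The distinction between file-level and action-level equivalence can be made concrete with a small sketch. The types and function names below are hypothetical, the digests are assumed to be hex strings computed elsewhere, and requiring matching paths is an added assumption of this example rather than something the gate text states.

```rust
/// Which tool produced the mutation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MutationTool {
    Write,
    Edit,
}

/// One file mutation as seen by a differ (hypothetical shape).
pub struct FileMutation<'a> {
    pub tool: MutationTool,
    pub path: &'a str,
    /// Hex SHA256 of the file AFTER the mutation, computed outside this sketch.
    pub post_state_sha256: &'a str,
}

/// File-level equivalence: same path (an assumption) and same post-state digest.
/// The tool that produced the state is deliberately ignored.
pub fn file_level_equivalent(teacher: &FileMutation<'_>, student: &FileMutation<'_>) -> bool {
    teacher.path == student.path
        && teacher
            .post_state_sha256
            .eq_ignore_ascii_case(student.post_state_sha256)
}

/// Action-stream view: did both sides reach for the same tool?
pub fn same_tool(teacher: &FileMutation<'_>, student: &FileMutation<'_>) -> bool {
    teacher.tool == student.tool
}
```

A teacher `Write` and a student `Edit` that leave identical bytes on disk satisfy `file_level_equivalent` while `same_tool` is false, which is the "same effect, different tool" case the gate describes.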
CCPA-006 — sovereignty_on_replay
Asserts: Tier3 SovereigntyViolation fires deterministically on any trace that performs a network egress to a non-localhost endpoint outside the allow-list, OR reads a credential-bearing env var.
Why: the sovereignty contract is the hardest gate. False negatives here are catastrophic.
Tests: 10 tests in crates/ccpa-differ/tests/falsify_ccpa_006_sovereignty.rs.
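The two triggers named in the assertion (non-localhost egress outside the allow-list, or a read of a credential-bearing env var) can be sketched as a scan over trace events. Everything below is illustrative: the `TraceEvent` shape, the localhost spellings, and especially the substring heuristic for "credential-bearing" are assumptions of this sketch, not the differ's actual rules.

```rust
/// Simplified trace events (hypothetical shape for this sketch).
pub enum TraceEvent<'a> {
    NetworkEgress { host: &'a str },
    EnvRead { name: &'a str },
    Other,
}

fn is_localhost(host: &str) -> bool {
    matches!(host, "localhost" | "127.0.0.1" | "::1")
}

/// Assumption: a variable is credential-bearing if its name contains one of these markers.
fn is_credential_var(name: &str) -> bool {
    let upper = name.to_ascii_uppercase();
    ["API_KEY", "TOKEN", "SECRET"]
        .iter()
        .any(|marker| upper.contains(*marker))
}

/// Index of the first event that would count as a sovereignty violation, if any.
/// The result depends only on the inputs, so repeated replays agree.
pub fn first_violation(events: &[TraceEvent<'_>], allow_list: &[&str]) -> Option<usize> {
    events.iter().position(|event| match event {
        TraceEvent::NetworkEgress { host } => {
            !is_localhost(host) && !allow_list.iter().any(|allowed| allowed == host)
        }
        TraceEvent::EnvRead { name } => is_credential_var(name),
        TraceEvent::Other => false,
    })
}
```

Returning the index of the first offending event, rather than a bare boolean, makes a violation easy to point at in the recorded trace.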
CCPA-007 — corpus_coverage (HARD-BLOCKING since M16)
Asserts: every required-row of apr-code-parity-v1.yaml has at least one fixture exercising it.
Why: prevents the meter from being valid on a curated subset of the parity surface only. New rows in apr-code-parity-v1.yaml MUST come with a fixture.
Tests: 15 tests + per-PR CI ccpa coverage --apr-code-parity-yaml ... --oos-rows ....
CCPA-008 — parity_score_bound (ADVISORY, M230)
Asserts: canonical corpus aggregate parity score ≥ threshold (currently ≥ 0.95).
Why: the differ's output IS the parity score; this is the corpus-level acceptance bound.
Status: ADVISORY since M230. The threshold was relaxed because the M196-M224 4-bug stack revealed that "always 1.0 on canonical" was actually evidence of meter under-sensitivity, not perfect performance.
Tests: 24 tests in crates/ccpa-differ/tests/falsify_ccpa_008_parity_score.rs.
CCPA-013 — first_recorded_parity_score (DISCHARGED)
Asserts: a first measured aggregate parity score on the canonical corpus exists, dated, with n and aggregate recorded.
Status: DISCHARGED. fixtures/canonical/measured-parity.json (n=30, aggregate=1.0000).
CCPA-014 — os_event_parity_bound
Asserts: OS-level events (file opens, process spawns, stat calls) recorded on teacher and student match, modulo an allow-list of permitted nondeterminism.
Why: catches "same tool input, different OS effects" drift.
CCPA-015 — os_trace_output_purity
Asserts: subprocess stdout/stderr captures are byte-pure (no PID injection, no timestamp jitter introduced by the capture machinery).
Why: if the capture itself adds nondeterminism, every downstream comparison is wrong.
CCPA-016 — outcome_parity_bound
Asserts: per-fixture oracle_passed outcomes agree at corpus-level rate ≥ threshold.
Why: outcome parity (did both systems solve the bug?) is the project-scale analog of action parity. Necessary for the M234 Popperian-falsification claim to be sharp.
CCPA-017 — project_scale_parity_bound (PROPOSED, v1.28.0)
Asserts: project-scale Arena verdict on phase-5 corpus must match the static-fixture verdict in direction (not magnitude).
Why: M234 showed magnitudes diverge (static parity 1.0000 vs. Arena pass rates of 1/5 for claude and 0/5 for apr code); direction agreement (claude > apr code) is the falsifiable part.
CCPA-018 — arena_recovery_rate_bound (PROPOSED, v1.29.0)
Asserts: apr code recovery_rate (fraction of OraclePassed fixtures with at least one non-zero exit recovered) is bounded below by a threshold.
Why: a 0% recovery rate signals the agent doesn't retry meaningfully; threshold gate codifies the expectation.
CCPA-019 — calibration_required_before_verdict (PROPOSED, v1.32.0)
Asserts: no Arena verdict ships without a fresh CalibrationRecord (≤90 days old) on file.
Why: codifies the lesson of the M196-M224 four-bug stack. See Bidirectional sensitivity.
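The freshness rule reduces to an age comparison. The sketch below is an assumption-laden illustration: `CalibrationRecord` is named in the gate text, but its `recorded_at` field, the `verdict_allowed` function, and the choice to reject future-dated records are all hypothetical.

```rust
use std::time::{Duration, SystemTime};

/// The 90-day freshness window from the gate text.
pub const MAX_CALIBRATION_AGE: Duration = Duration::from_secs(90 * 24 * 60 * 60);

/// Hypothetical minimal shape; a real record presumably carries more than a timestamp.
pub struct CalibrationRecord {
    pub recorded_at: SystemTime,
}

/// An Arena verdict may ship only if a record exists and is at most 90 days old.
pub fn verdict_allowed(record: Option<&CalibrationRecord>, now: SystemTime) -> bool {
    match record {
        None => false,
        Some(r) => match now.duration_since(r.recorded_at) {
            Ok(age) => age <= MAX_CALIBRATION_AGE,
            // A record dated in the future is treated as invalid (an assumption).
            Err(_) => false,
        },
    }
}
```

Passing `now` as a parameter instead of reading the clock inside the function keeps the check deterministic and testable.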
CCPA-020 — contract_compliance_per_turn (PROPOSED, v1.32.0)
Asserts: in Phase 6 dispatch, per-turn pmat comply check fires after every Write/Edit; the agent SEES compliance results in next-turn history.
Why: makes the under-contract regime mechanically distinguishable from the control regime. Without this gate, "under contract" could silently degrade to "same as control."
Status flow — PROPOSED → ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME
Every gate has a status. The status reflects the strength of the evidence that the gate is correctly asserting what it claims.
The three statuses
PROPOSED
- The gate is defined in the contract YAML.
- No test asserts it yet (or tests exist but don't pass deterministically).
- A grep + structural search confirms the gate has a body in the YAML, but the assertion is not yet mechanical.
- CI may print "WARNING: gate-X is PROPOSED" but does not block on it.
ACTIVE_ALGORITHM_LEVEL
- A deterministic, repeatable test asserts the gate.
- The test passes on every CI run.
- But no measured discharge has been recorded — i.e., no operator has dispatched a real bench against real systems and stamped the result into `evidence/`.
- The gate is algorithm-validated but not empirically validated.
ACTIVE_RUNTIME
- A measured discharge exists in `evidence/` with a date, an `n`, and an aggregate score.
- The gate is now both algorithm-validated AND empirically validated.
- This is the highest status; gates that reach `ACTIVE_RUNTIME` are the project's hardest evidence.
Transition rules
        +-------------+
        |  PROPOSED   |
        +------+------+
               |
               | (1) write a falsifier test
               | (2) test passes deterministically on CI
               | (3) flip status in contract YAML
               ▼
  +-------------------------+
  | ACTIVE_ALGORITHM_LEVEL  |
  +------------+------------+
               |
               | (1) operator dispatches a real bench
               | (2) evidence/<phase>/<artifact>.json captured
               | (3) calibration record on file (CCPA-019)
               | (4) flip status in contract YAML
               ▼
      +----------------+
      | ACTIVE_RUNTIME |
      +----------------+
Every transition is a YAML-level edit reviewed in PR, gated by pv validate, and asserted by FALSIFY-CCPA-012 (pv_contract_gate_on_commit).
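The transition rules can be read as a small state machine in which each promotion has preconditions. The following sketch encodes that reading. The `GateStatus` enum mirrors the three statuses, while the `Evidence` struct and `promote` function are hypothetical names for this example. In the project itself the transition is a reviewed YAML edit, not a function call.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateStatus {
    Proposed,
    ActiveAlgorithmLevel,
    ActiveRuntime,
}

/// Preconditions named in the transition rules (field names are illustrative).
pub struct Evidence {
    pub falsifier_test_passes_on_ci: bool,
    pub bench_artifact_captured: bool,
    pub calibration_on_file: bool,
}

/// Return the next status if the preconditions for the step are met.
pub fn promote(current: GateStatus, ev: &Evidence) -> Result<GateStatus, &'static str> {
    match current {
        GateStatus::Proposed if ev.falsifier_test_passes_on_ci => {
            Ok(GateStatus::ActiveAlgorithmLevel)
        }
        GateStatus::Proposed => Err("falsifier test must pass deterministically on CI"),
        GateStatus::ActiveAlgorithmLevel
            if ev.bench_artifact_captured && ev.calibration_on_file =>
        {
            Ok(GateStatus::ActiveRuntime)
        }
        GateStatus::ActiveAlgorithmLevel => {
            Err("need captured bench evidence and a calibration record on file")
        }
        GateStatus::ActiveRuntime => Err("already at the highest status"),
    }
}
```

Note that a gate cannot skip a level in this model: even with full bench evidence, a `Proposed` gate only advances to `ActiveAlgorithmLevel`.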
Status distribution at v1.32.0
| Status | Count | Gates |
|---|---|---|
| `ACTIVE_RUNTIME` | 16 | CCPA-001..006, 008..016 (minus DISCHARGED), 009..012 |
| `PROPOSED` | 4 | CCPA-017, 018, 019, 020 |
| `DISCHARGED` | 1 | CCPA-013 (first_recorded_parity_score, M150) |
DISCHARGED is the terminal state — the gate's claim was empirically met, and the gate-as-assertion is preserved for historical record but no longer fires.
The V1_ gate prefix (Phase 6)
V1_001..V1_004 are distinct from CCPA-NNN. They live in aprender's contracts (qwen3_moe-serve-dispatch-v1.yaml et al.) and gate the infrastructure that V1_004 (Phase 6 student pass rate) depends on:
- V1_001 — qwen3_moe serve dispatch (ACTIVE_RUNTIME)
- V1_002 — sampling (temperature/top_k/top_p) (ACTIVE_RUNTIME)
- V1_003 — streaming SSE (DISCHARGED on gx10 Blackwell)
- V1_004 — Phase 6 non-zero student pass rate (open as of this writing)
Once V1_004 discharges, CCPA-017 (project_scale_parity_bound) becomes eligible to flip from PROPOSED to ACTIVE_ALGORITHM_LEVEL.
The V1_004 chain
V1_004 — "Phase 6 bench non-zero student pass rate against a Qwen3-Coder-30B-A3B-Instruct GGUF" — is the open gate. The chain of work toward discharging it has produced the most empirically interesting body of findings in CCPA's history.
This chapter is the canonical record of that chain.
The chain at a glance
| M-row | Date (2026) | What it shipped |
|---|---|---|
| M280 | 05-19 | Phase 6 SUSPENSION declared (1.5B model below testability floor) |
| M286 | 05-20 | M32d MoE KV cache shipped (19× speedup on Qwen3-MoE) |
| M287 | 05-20 | Greedy baseline: uniform driver_error ("Human:" infinite loop) |
| M288 | 05-20 | Diagnosis: 3 root causes (no EOS stop_token, no clean_chat_output, no few-shot prompt) |
| M289 | 05-20 | Plumbing shipped: 3-knob HTTP wire-up (APR_AGENT_TEMPERATURE, etc.) |
| M290 | 05-20 | 5-PR snapshot: aprender#1832, #1837, #1842, #1844, #1846 all merged |
| M291 | 05-21 | sub-bench B pattern shift: driver_error → oracle_failed_after_max_turns (text-only loops, 0 tool_calls) |
| M292 | 05-21 | ArenaOutcome::AgentTextLoop detector + 7 tests (Gap 3 closure) |
| M293 | 05-21 | PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring at script level |
| M294 | 05-22 | Scope doc for non-Coder Qwen3-30B-A3B-Instruct-2507 A/B; download + smoke confirmed tool_call JSON emission |
The hypothesis-evolution narrative
Hypothesis 1 (start of chain): inference stack is the bottleneck
Premise: V1_004 can't discharge because the apr serve inference path for qwen3_moe is too slow / too broken to fit 20 turns × 1024 max_tokens within a 60min wall budget.
Test: ship M32d MoE KV cache (19× speedup), enable 3-knob sampling, add EOS stop_token and clean_chat_output post-strip.
Result: the M287 driver_error pattern (infinite "Human:" loop) was broken. Sub-bench B on Qwen3-Coder-30B-A3B shifted to a diverse outcome distribution.
Conclusion: inference stack was a necessary but not sufficient fix.
Hypothesis 2 (M291): few-shot prompt is the bottleneck
Premise: the model is now finite-output (M287 runaway broken), but it emits Markdown rust blocks instead of <tool_call> JSON. Adding 3 concrete <tool_call> few-shot examples in CODE_SYSTEM_PROMPT (#1849) should override the Markdown prior.
Test: sub-bench B with #1849's few-shot prompt + 3-knob sampling + EOS + clean_chat_output.
Result: fixture 1 of sub-bench B → oracle_failed_after_max_turns turns=20, ALL 20 turns text-only, tool_use_count: 0. The prompt fix didn't shift behavior.
Conclusion: refuted. Few-shot examples didn't override the model's training distribution.
Hypothesis 3 (M291): active-params count is the bottleneck
Premise: Qwen3-Coder-30B-A3B is 30B-total / 3B-active (MoE routing). Maybe 3B active params is below the agentic-code floor. A dense 7B (Qwen2.5-Coder-7B-Instruct) with 2.3× more active params should fare better.
Test: 17/20 fixtures of Qwen2.5-Coder-7B-Instruct under same 3-knob config.
Result: 12× wall_timeout, 3× oracle_failed_after_max_turns, 2× driver_error, 0 oracle_passed, 0 tool_calls across all inspected fixtures. Same Markdown-block pattern.
Conclusion: refuted. Active params count isn't the variable.
Hypothesis 4 (M294, current): Qwen-Coder finetune family is the bottleneck
Premise: both tested models (Qwen3-Coder-30B-A3B and Qwen2.5-Coder-7B-Instruct) are Qwen-Coder finetunes. Maybe the Coder finetune family specifically has a sticky Markdown-block training prior. A non-Coder Instruct variant — same Qwen3-MoE architecture, same active-param count — should fare better.
Test: smoke Qwen3-30B-A3B-Instruct-2507 (non-Coder) with same CODE_SYSTEM_PROMPT + fixture 1 prompt.
Result: the model emitted {"name": "file_read", "input": {"path": "src/lib.rs"}} + </tool_call> in 20 completion tokens, finish_reason: stop. Categorically different from Coder family (which always emitted 500+ tokens of Markdown).
Conclusion: empirically confirmed at smoke level. Full bench corpus in progress as of 2026-05-22.
What this means for V1_004
V1_004's gate text names Qwen3-Coder-30B-A3B-Instruct specifically. A successful Qwen3-30B-A3B-Instruct-2507 (non-Coder) dispatch is diagnostic evidence, not a contract-level discharge of V1_004 as written.
The path forward, post-empirical-confirmation:
- (a) Amend V1_004's gate text to allow any qwen3_moe architecture (via the M22 5-step ritual: contract bump in aprender → fixture update → coverage rerun → calibration record → CCPA-side mirror PR)
- (b) OR propose a new gate (V1_005?) against the non-Coder variant
- (c) OR engineer a post-decode Markdown→tool_call parser in
apr codeto unlock Qwen-Coder family for the existing V1_004 gate
This is an operator-coordinated decision tree. The empirical work has produced the evidence; the contract-level choice is upstream.
M286 — M32d MoE KV cache shipped
Date: 2026-05-20
aprender PR: #1832
What it shipped: forward_single_qwen3_moe_with_cache — a per-token cache-aware MoE forward path for the qwen3_moe architecture.
Why it was necessary
The original qwen3_moe inference path in apr serve was per-full-prompt: every new token required re-processing the entire context from scratch. For a 1024-token max-tokens cap on a 7-turn conversation (~3000 prompt tokens accumulated), this meant O(n²) work per turn.
Empirically: a single 20-turn fixture on Qwen3-Coder-30B-A3B at this regime took ~34min per turn on CPU. The M286 cache implementation cut it to ~6min per fixture (across all turns) — a 19× speedup.
What it changed structurally
old: prompt → embed → 48× (attention + MoE FFN) → LM head → next_token
     (re-runs entire context every token)

new: if first_token:
         prompt → embed → 48× (attention with cache.append + MoE FFN) → LM head → next_token
     else:
         last_token_embed → 48× (attention with cache.get_k/get_v GQA + MoE FFN) → LM head → next_token
     (only the new token is processed; cache provides past K/V)
The implementation lives in crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs. The single_cache_final_output helper (final norm + LM head) was bumped from private to pub(crate) to allow the MoE module to share it with the dense path.
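The difference between the old and new paths can be shown with a toy single-head attention that has nothing to do with the real Qwen3-MoE kernels. In this sketch `project` is a made-up stand-in for the K/V/Q weight matrices, vectors are two-dimensional, and tokens are fed one at a time, so the prompt-prefill branch is omitted. The point is only that caching past K/V yields the same output as recomputing them.

```rust
/// Toy projection standing in for learned weight matrices (an assumption of this sketch).
fn project(token: u32, scale: f32) -> [f32; 2] {
    let t = token as f32;
    [scale * t.sin(), scale * t.cos()]
}

fn dot(a: &[f32; 2], b: &[f32; 2]) -> f32 {
    a[0] * b[0] + a[1] * b[1]
}

/// Softmax-weighted sum of values, with weights from query-key dot products.
fn attend(query: &[f32; 2], keys: &[[f32; 2]], values: &[[f32; 2]]) -> [f32; 2] {
    let scores: Vec<f32> = keys.iter().map(|k| dot(query, k).exp()).collect();
    let total: f32 = scores.iter().sum();
    let mut out = [0.0f32; 2];
    for (w, v) in scores.iter().zip(values) {
        out[0] += w / total * v[0];
        out[1] += w / total * v[1];
    }
    out
}

/// Old path: re-project every token in the context on each call.
pub fn forward_full(context: &[u32]) -> [f32; 2] {
    let keys: Vec<[f32; 2]> = context.iter().map(|&t| project(t, 0.5)).collect();
    let values: Vec<[f32; 2]> = context.iter().map(|&t| project(t, 2.0)).collect();
    let query = project(*context.last().expect("non-empty context"), 1.0);
    attend(&query, &keys, &values)
}

/// Past keys and values, appended to once per token.
#[derive(Default)]
pub struct KvCache {
    keys: Vec<[f32; 2]>,
    values: Vec<[f32; 2]>,
}

/// New path: project only the newest token, append it, attend over the cached K/V.
pub fn forward_cached(cache: &mut KvCache, new_token: u32) -> [f32; 2] {
    cache.keys.push(project(new_token, 0.5));
    cache.values.push(project(new_token, 2.0));
    attend(&project(new_token, 1.0), &cache.keys, &cache.values)
}
```

Calling `forward_cached` once per token and comparing each result with `forward_full` on the prefix up to that token is the toy analog of a cache-on vs cache-off equivalence check. The cached path performs one projection per new token, where the full path performs one per context token.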
Falsifiers shipped with it
- `qwen3-moe-serve-dispatch-v1` (V1_001) → ACTIVE_RUNTIME
- `moe_kv_cache_equivalence` — numerical-equivalence test: cache-on vs cache-off forward passes produce identical logits modulo F32 precision
- `m32d_perf` — ≥5 tok/s floor under CPU compute mode
Why this was the unlock
Without M286, V1_004 was a wall-time-cost problem (the test couldn't be run within reasonable wall-time on the operator's GPU/CPU budget). With M286, the wall-cost dropped 19×, enabling the empirical chain that followed (M287, M291, M294).
M286 is the load-bearing inference infrastructure for every Phase 6 dispatch.
M287 — greedy baseline pattern
Date: 2026-05-20
Bench wall: ~5hr (20 fixtures × ~15min each, with wall_seconds = 3600 per fixture)
Configuration
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 \
bash scripts/phase-6-bench.sh
Greedy decoding (no temperature, no top_k, no top_p, no repetition penalty). Apr binary post-M32d but pre-3-knob plumbing.
Result
20/20 fixtures: uniform outcome=driver_error.
- Student pass rate: 0/20 (0.00)
- Teacher pass rate: 19/20 (0.95)
- Recovery rate: 0.225
What the trace showed
Inspecting fixture 10's student.bench.json (oo__05-builder-pattern, 7 turns to driver_error):
turn 1 invocation:
"Human: I need to see the full implementation..."
Human: I need to see the full implementation...
Human: I need to see the full implementation...
Human: I need to see the full implementation...
...
The model emitted its own user-turn boundary ("Human:") repeatedly, never stopping. The text grew until the per-turn timeout (900s) fired. The driver then exited with the timeout error, which phase-6-bench.sh recorded as driver_error.
Root cause diagnosis (three independent gaps)
1. No EOS stop_token: `try_qwen3_moe_backend` in `apr serve` didn't populate `QuantizedGenerateConfig.stop_tokens` with the model's `<|im_end|>` EOS, so the decode loop ignored the natural turn boundary.
2. No post-decode cleanup: `try_qwen3_moe_backend` didn't call `clean_chat_output` to strip leaking "Human:" / "User:" / `<|im_end|>` prefixes — the runaway leaked into the captured chat response verbatim.
3. No format adherence guidance: `CODE_SYSTEM_PROMPT` described the `<tool_call>` format but gave no concrete examples. The 30B-Coder model's training distribution favored Markdown code blocks; without explicit examples it didn't emit `<tool_call>` JSON.
The dense GGUF path in apr serve handled (1) and (2) correctly; the MoE chat-backend path (added later for qwen3_moe) had a gap.
What M287 unlocked
The uniform driver_error pattern made the failure mode legible. Before M287, the assumption was "Qwen3-Coder-30B can't do agentic coding"; M287's evidence sharpened it to "the runaway is a fixable infrastructure issue, not a fundamental model limit."
The three gaps motivated the M288-M290 fix burst: the five PRs in the M290 snapshot, followed by three follow-on PRs:
- aprender#1832 — M32d KV cache (already merged)
- aprender#1837 — qwen3-moe-sampling-v1 contract
- aprender#1842 — sampling impl
- aprender#1844 — repetition penalty
- aprender#1846 — 3-knob HTTP wire-up (the operator-facing surface)
- aprender#1849 — few-shot `<tool_call>` examples (Gap 3)
- aprender#1852 — EOS stop_token + clean_chat_output (Gaps 1 + 2)
- aprender#1853 — clean_chat_output start-of-string leading-prefix strip (M291 follow-on)
M291 — sub-bench B pattern shift
Date: 2026-05-21
Source PR: CCPA#259 (merged)
What changed from M287
| | M287 (greedy) | M291 (sub-bench B) |
|---|---|---|
| Sampling | greedy (temp=0) | temp=0.3, top_k=50, top_p=0.95 |
| Repetition penalty | none | repeat_penalty=1.2, repeat_last_n=64 |
| EOS stop_token | NOT plumbed | `<\|im_end\|>` plumbed via #1852 |
| clean_chat_output | NOT called in MoE path | called via #1852 |
| CODE_SYSTEM_PROMPT | no <tool_call> examples | 3 concrete examples + anti-Markdown anti-rule via #1849 |
Result on fixture 1 (leetcode__01-two-sum)
Before: outcome=driver_error turns_before_error=7 (M287 pattern).
After: outcome=oracle_failed_after_max_turns turns=20.
{
"outcome": { "kind": "oracle_failed_after_max_turns", "turns": 20 },
"history_len": 20,
"tool_use_count": 0,
"kinds": [ { "k": "text", "n": 20 } ]
}
Every one of the 20 turns: text-only. No tool_call. result.kind: "skipped" across all 20.
Trace excerpt (fixture 1, turn 1)
Human: Here's what I have so far:
```rust
pub fn two_sum(nums: &[i32], target: i32) -> (usize, usize) {
    for i in 0..nums.len() {
        for j in (i + 1)..nums.len() {
            if nums[i] + nums[j] == target {
                return (i, j);
            }
        }
    }
    panic!("No two sum solution found");
}
```
The model's **code is functionally correct** (matches what the oracle expects: `return (i, j)`). But the fix is wrapped in a Markdown ```rust``` block, NOT in a `<tool_call>` JSON. The arena driver classifies it as a text-only turn, no file edit happens, no oracle re-runs.
## Three independent gaps surfaced
### Gap 1 — `clean_chat_output` start-of-string leak
`clean_chat_output`'s stop sequences anchor on `\nHuman:` / `\n\nHuman:`, which requires a preceding newline. When the model leaks "Human:" at start-of-string (no newline before it), the truncate-at-earliest loop misses it. Fixed in [aprender#1853](https://github.com/paiml/aprender/pull/1853).
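A minimal sketch of the two behaviors, assuming a hypothetical `clean_chat_output_sketch` function and an illustrative marker list (the real `clean_chat_output` in aprender may differ in both): step 1 handles the start-of-string case by stripping one leading marker, and step 2 is the newline-anchored truncate-at-earliest pass.

```rust
/// Turn-boundary markers that should never appear in a cleaned response (illustrative list).
const TURN_MARKERS: [&str; 2] = ["Human:", "User:"];

/// Sketch only: strip one leading marker, then cut at the earliest newline-anchored marker.
pub fn clean_chat_output_sketch(text: &str) -> &str {
    // Step 1: the start-of-string case, which a newline-anchored search cannot see.
    let mut body = text;
    for marker in TURN_MARKERS {
        if let Some(rest) = body.strip_prefix(marker) {
            body = rest.trim_start();
            break;
        }
    }
    // Step 2: truncate at the earliest "\n<marker>" occurrence, if any.
    let mut cut = body.len();
    for marker in TURN_MARKERS {
        let anchored = format!("\n{marker}");
        if let Some(idx) = body.find(&anchored) {
            cut = cut.min(idx);
        }
    }
    &body[..cut]
}
```

Without step 1, an input such as `"Human: Here's a fix\nHuman: again"` would keep its leading `Human:` because only the second occurrence is preceded by a newline.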
### Gap 2 — few-shot prompt insufficient to override Markdown distribution
`CODE_SYSTEM_PROMPT` post-#1849 contains 3 concrete `<tool_call>` examples plus an explicit "DO NOT use Markdown ```rust``` code blocks" rule. Empirically, on Qwen3-Coder-30B, this guidance is overridden by the model's training distribution. **No PR closes this; it's a model-class-dependent finding.**
### Gap 3 — arena driver doesn't recover from skipped turns
Had the model emitted `<tool_call>` in turn 1 and the file edit succeeded, fixture 1's oracle (`cargo test`) would have passed, because the model's code is correct. But the arena driver doesn't recognize "0 tool_uses across 20 turns" as a stuck state: it just keeps prompting "Continue:" and the model keeps re-emitting variations of its already-correct code in Markdown form.
Fixed in [CCPA#260 (M292)](https://github.com/paiml/claude-code-parity-apr/pull/260): `ArenaOutcome::AgentTextLoop` variant + opt-in detector.
## Empirical conclusion (M291)
V1_004 is **partially discharged**: the M287 prerequisite-violation pattern (uniform `driver_error` from infinite "Human:" loop) is broken. The new pattern (`oracle_failed_after_max_turns` from training-distribution stickiness) is a **different class of failure** — finite, reproducible, debuggable.
V1_004 is **not fully discharged**: no fixture has yet shown `outcome=oracle_passed`. The bench continues; fixtures 2-20 will reveal whether the pattern is uniform (training-distribution-locked across all task types) or sporadic (some fixtures elicit tool_call format).
M292 — Agent-Text-Loop detector
Date: 2026-05-21
Source PR: CCPA#260 (merged)
Companion PR: CCPA#261 (M293; env-var wiring)
What it adds
A new ArenaOutcome variant + an opt-in detector that catches the M291 failure signature (consecutive text-only turns) before the full 20-turn budget is consumed.
ArenaOutcome::AgentTextLoop
AgentTextLoop {
    consecutive_text_turns: u32,
    last_text_excerpt: String, // first 200 chars of the most recent text turn
}
Captures the "talking but not acting" failure class distinctly from OracleFailedAfterMaxTurns.
ArenaSession::with_max_consecutive_text_turns(cap)
Builder method. cap=0 (default) disables the detector — preserves M287/M291 baseline behavior. Operators opt in per-run.
AgentTextLoopState rolling counter
Parallel to ComplianceTrapState. Pure logic:
- Text invocation → increment counter, snapshot the excerpt.
- Non-text invocation (Bash/Read/Write/Edit/etc.) → reset counter, clear excerpt.
- When counter reaches cap → return AgentTextLoop outcome with current excerpt.
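
The three rules above can be sketched as a small state machine. This is an illustrative stand-in, not the code in `ccpa-arena`: the `Invocation` enum, the `observe` method, and the field names on the state struct are assumptions, and the ellipsis that the real excerpt appends is omitted.

```rust
/// Excerpt length, taken from the "first 200 chars" comment on the variant.
const EXCERPT_CHARS: usize = 200;

/// Simplified stand-in for what the driver observes each turn (hypothetical).
enum Invocation {
    Text(String),
    Tool, // Bash/Read/Write/Edit/etc.
}

/// Payload mirroring the fields of ArenaOutcome::AgentTextLoop.
struct AgentTextLoop {
    consecutive_text_turns: u32,
    last_text_excerpt: String,
}

struct AgentTextLoopState {
    cap: u32, // 0 disables the detector
    count: u32,
    excerpt: String,
}

impl AgentTextLoopState {
    fn new(cap: u32) -> Self {
        Self { cap, count: 0, excerpt: String::new() }
    }

    /// Feed one invocation; returns Some(..) once `cap` consecutive text turns are seen.
    fn observe(&mut self, invocation: &Invocation) -> Option<AgentTextLoop> {
        if self.cap == 0 {
            return None; // opt-in: the default never fires
        }
        match invocation {
            Invocation::Text(text) => {
                self.count += 1;
                self.excerpt = text.chars().take(EXCERPT_CHARS).collect();
                if self.count >= self.cap {
                    Some(AgentTextLoop {
                        consecutive_text_turns: self.count,
                        last_text_excerpt: self.excerpt.clone(),
                    })
                } else {
                    None
                }
            }
            Invocation::Tool => {
                self.count = 0;
                self.excerpt.clear();
                None
            }
        }
    }
}
```

Keeping the counter free of I/O is what makes the "pure logic" claim testable: each rule maps to one branch.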
Test coverage (7 new tests)
- agent_text_loop_state_increments_on_text: counter increments, trap fires at cap
- agent_text_loop_state_resets_on_non_text: Bash invocation resets the counter; subsequent text starts at 1
- agent_text_loop_state_excerpt_truncates_long_text: 500-char input → excerpt ≤200 chars + ellipsis
- run_agent_text_loop_disabled_by_default_preserves_baseline: cap=0 (default) → text-only turns run to max_turns → OracleFailedAfterMaxTurns
- run_agent_text_loop_fires_at_cap_when_enabled: 5 text turns with cap=3 → AgentTextLoop after turn 3; history has 3 records
- run_agent_text_loop_resets_counter_on_tool_use: 2 text + 1 bash + 2 text + 1 bash pattern → no trap (counter resets twice) → runs to max_turns
- with_max_consecutive_text_turns_accessor_returns_configured_cap + max_consecutive_text_turns_default_is_zero_disabled
All 146 ccpa-arena lib tests still pass.
Opt-in by design
The detector defaults to cap=0 (disabled) because:
- Existing benches in evidence/under-contract*/ should remain comparable to new runs: turning the detector on by default would change outcome distributions for control comparisons.
- Future operators may want to test agents at the full 20-turn budget for non-V1_004 reasons (e.g., turn-cost ratio measurement).
- Phase 6 compliance_cost_ratio aggregate sums over a specific set of outcome variants; adding a new one to the default execution path could silently change the aggregate.
Operator interface (M293)
scripts/phase-6-bench.sh now reads PHASE6_MAX_CONSECUTIVE_TEXT_TURNS (default 0 = disabled). When > 0, threads --max-consecutive-text-turns=N into the ccpa-arena-bench invocation.
# Default — baseline behavior, no detector
bash scripts/phase-6-bench.sh
# Opt in — bail at 5 consecutive text-only turns
PHASE6_MAX_CONSECUTIVE_TEXT_TURNS=5 bash scripts/phase-6-bench.sh
Why this matters
Before M292, the M291 failure signature ("agent emits text for all 20 turns, never invokes a tool") was conflated with OracleFailedAfterMaxTurns — same outcome variant as "agent worked but produced wrong output." That conflation lost signal.
After M292, an operator inspecting scores.json can distinguish:
- OracleFailedAfterMaxTurns → agent tried, wrong output
- AgentTextLoop → agent didn't engage at all
This is the kind of diagnostic precision that lets the next experiment be designed correctly (the M294 finetune-A/B was scoped specifically because M291's text-loop signature is what M292 measures).
What this does NOT do
- Doesn't auto-enable in scripts/phase-6-bench.sh (operator decision per-run).
- Doesn't change compliance_cost_ratio / recovery_rate semantics (AgentTextLoop counts as "not oracle_passed", same as OracleFailedAfterMaxTurns).
- Doesn't discharge V1_004: student_pass_rate > 0 is still the bar.
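
The "counts as not oracle_passed" point can be sketched with a cut-down enum. This is an assumption-laden illustration: the real `ArenaOutcome` has more variants and fields, and the pass-rate definition (passed fixtures over total fixtures) is assumed here rather than taken from the contract.

```rust
/// Three-variant subset of ArenaOutcome, for illustration only.
enum ArenaOutcome {
    OraclePassed,
    OracleFailedAfterMaxTurns { turns: u32 },
    AgentTextLoop { consecutive_text_turns: u32, last_text_excerpt: String },
}

/// Exhaustive match: adding a variant forces this site to be revisited.
fn counts_as_pass(outcome: &ArenaOutcome) -> bool {
    match outcome {
        ArenaOutcome::OraclePassed => true,
        ArenaOutcome::OracleFailedAfterMaxTurns { .. }
        | ArenaOutcome::AgentTextLoop { .. } => false,
    }
}

/// Assumed definition: fraction of fixtures whose outcome is OraclePassed.
fn student_pass_rate(outcomes: &[ArenaOutcome]) -> f64 {
    if outcomes.is_empty() {
        return 0.0;
    }
    let passed = outcomes.iter().filter(|o| counts_as_pass(o)).count();
    passed as f64 / outcomes.len() as f64
}
```

Because both failure variants map to `false`, swapping one for the other leaves the rate unchanged, which is why the detector cannot discharge V1_004 on its own.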
M294 — finetune-distribution A/B
Date: 2026-05-22
Source PR: CCPA#262 (scope doc)
The hypothesis (refined to its sharpest form)
Through M286-M293 + the 17/20 Qwen2.5-Coder-7B-Instruct follow-on, four candidate variables were tested as the load-bearing one behind the 0%-tool_call signature:
| Variable | Test | Outcome |
|---|---|---|
| Inference stack quality | M286 KV cache + 3-knob + EOS + clean_chat_output | Necessary fix; not sufficient |
| Active params count | 3B (30B-A3B-MoE) vs 7B (dense 7B-Coder) | Both show same 0 tool_calls — refuted |
| MoE vs dense | qwen3_moe (30B-A3B) vs qwen2 (7B-dense) | Both show same pattern — refuted |
| Few-shot prompt examples | 3 concrete <tool_call> examples + anti-Markdown rule | No shift in pattern — refuted |
The remaining variable: Qwen-Coder finetune family specifically. Both tested models (Qwen3-Coder-30B-A3B + Qwen2.5-Coder-7B-Instruct) share the Coder-specific finetune.
The hypothesis being tested at M294: hold architecture, size, inference stack constant; vary only the finetune. Specifically: swap Qwen3-Coder-30B-A3B-Instruct for Qwen3-30B-A3B-Instruct-2507 (non-Coder, same MoE arch, same size, same active params, broader instruction + tool-use training distribution).
The smoke test (one-shot, no full bench)
While downloading the 18GB Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf, the operator pointed out that waiting 40 minutes for fixture 1 was unnecessary — a single targeted smoke against the exact same system prompt + user prompt the bench would use would give the answer in 30 seconds.
The smoke payload:
- System: full CODE_SYSTEM_PROMPT (the same one in apr code, with the 3 <tool_call> few-shot examples and anti-Markdown rule)
- User: fixture 1 (leetcode__01-two-sum) prompt
- Config: temp=0.3, top_k=50, top_p=0.95, repeat_penalty=1.2, repeat_last_n=64 (sub-bench B config)
- max_tokens: 400
The response:
{"name": "file_read", "input": {"path": "src/lib.rs"}}
</tool_call>
- 20 completion tokens
- finish_reason: "stop"
- Structured JSON tool_call (missing leading <tool_call> tag, but the body is exactly what the parser expects)
- No "Human:" leak, no Markdown rust block, no rambling
Empirical conclusion
The Coder-finetune-distribution hypothesis is empirically confirmed at the smoke level. The non-Coder Instruct variant emits structured tool_call JSON in 20 tokens; the Coder variant emits 500+ tokens of Markdown explanation.
Whether the full bench discharges V1_004 (i.e., oracle_passed > 0) depends on whether:
- The arena parser handles the missing leading <tool_call> opening tag (bare JSON body)
- The model maintains the tool_call format across all 20 turns of a fixture
- The model's code is correct (independently of format adherence)
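
The first condition, tolerating a bare JSON body with only the closing tag, can be sketched as follows. `extract_tool_call_body` is a hypothetical helper, not the arena parser; it only locates the candidate body and leaves JSON validation to whatever JSON library the caller uses.

```rust
const OPEN_TAG: &str = "<tool_call>";
const CLOSE_TAG: &str = "</tool_call>";

/// Returns the text that should hold the tool-call JSON, or None if the
/// response has no closing tag or nothing brace-delimited before it.
fn extract_tool_call_body(response: &str) -> Option<&str> {
    let end = response.find(CLOSE_TAG)?;
    let before_close = &response[..end];
    let start = match before_close.rfind(OPEN_TAG) {
        // Well-formed case: body starts right after the opening tag.
        Some(i) => i + OPEN_TAG.len(),
        // Smoke-test case: opening tag missing, fall back to the first '{'.
        None => before_close.find('{')?,
    };
    let body = before_close[start..].trim();
    if body.starts_with('{') && body.ends_with('}') {
        Some(body)
    } else {
        None
    }
}
```

Applied to the smoke response above, this yields the same body string with or without the leading tag.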
What M294 unblocks
If the full bench shows ≥1 oracle_passed:
- V1_004's open question is empirically answered: the bottleneck is finetune-distribution.
- V1_004 as written names Qwen3-Coder-30B-A3B-Instruct specifically — a discharge requires either a contract amendment (M22 5-step ritual) or a new V1_005 gate.
- M280 SUSPENSION can be lifted on a contract-level basis.
If the full bench still shows 0 oracle_passed:
- The tool_call emission is necessary but not sufficient.
- Code quality / correctness becomes the next variable to investigate.
- A post-decode parser in apr code that converts Markdown rust blocks to file_edit calls becomes a higher-priority engineering target (which would unlock the Qwen-Coder family for V1_004 as written).
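
A minimal sketch of the extraction half of such a post-decode parser is shown below. `extract_rust_blocks` is hypothetical and not part of apr code; mapping each extracted body onto a file_edit call (in particular, choosing the target path) is left out because the source does not specify it.

```rust
/// Collects the bodies of Markdown code blocks whose opening fence is tagged `rust`.
/// Unterminated blocks are dropped.
fn extract_rust_blocks(response: &str) -> Vec<String> {
    // Build the fence at runtime so this example nests cleanly inside Markdown.
    let fence = "`".repeat(3);
    let open = format!("{fence}rust");

    let mut blocks = Vec::new();
    let mut current: Option<Vec<&str>> = None; // Some(..) while inside a block
    for line in response.lines() {
        let trimmed = line.trim();
        match current.take() {
            None if trimmed == open => current = Some(Vec::new()),
            None => {}
            Some(body) if trimmed == fence => blocks.push(body.join("\n")),
            Some(mut body) => {
                body.push(line);
                current = Some(body);
            }
        }
    }
    blocks
}
```

On the fixture 1 trace, this would recover the `two_sum` body that the driver currently classifies as plain text.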
CLI reference
ccpa
The user-facing CLI for the static path.
# Score a single teacher/student pair
ccpa diff fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl \
fixtures/canonical/0001-edit-readme/student.ccpa-trace.jsonl
# Score the whole corpus + bidirectional-sensitivity check
ccpa corpus fixtures/canonical/ # canonical MUST PASS
ccpa corpus fixtures/regression/ # regression MUST FAIL
ccpa corpus fixtures/canonical/ --json # machine-readable
# Walk the parity-matrix coverage gate
ccpa coverage \
--apr-code-parity-yaml ../aprender/contracts/apr-code-parity-v1.yaml \
--fixtures-dir fixtures/canonical/ \
--oos-rows keyboard-shortcuts,status-line
# Validate a JSONL trace against the schema
ccpa validate fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl
ccpa-arena-bench
The Arena dispatcher (operator-coordinated).
ccpa-arena-bench \
--cwd /tmp/p6-uc-leetcode__01-two-sum-student.xyz \
--prompt-file fixtures/under-contract/leetcode/01-two-sum/prompt.txt \
--oracle-cmd "cargo test 2>&1" \
--oracle-pattern "test result: ok" \
--max-turns 20 \
--wall-seconds 3600 \
--oracle-check-interval 3 \
--driver-per-turn-timeout 900 \
--compliance-enforced \
--max-consecutive-compliance-failures 3 \
--max-consecutive-text-turns 5 \
--driver-binary /home/noah/.local/bin/apr \
--driver-name apr \
--driver-extra-arg code \
--driver-extra-arg --model=/path/to.gguf
Outputs BenchResult JSON to stdout. Wrapped by the phase scripts.
scripts/phase-{3,5,6}-bench.sh
Operator-facing corpus walkers.
# Phase 3 — function-scale MultiPL-E-Rust HumanEval
bash scripts/phase-3-bench.sh
# Phase 5 — project-scale Arena (3 real GitHub-issue fixtures)
bash scripts/phase-5-arena-bench.sh
# Phase 5 — calibration-and-scale (15 synthetic-deterministic fixtures, M242)
bash scripts/phase-5-calibration-bench.sh
# Phase 6 — under-contract dispatch
APR_MODEL=/home/noah/models/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 \
PHASE6_WALL_SECONDS=3600 \
APR_AGENT_TEMPERATURE=0.3 \
APR_AGENT_TOP_K=50 \
APR_AGENT_TOP_P=0.95 \
APR_AGENT_REPEAT_PENALTY=1.2 \
APR_AGENT_REPEAT_LAST_N=64 \
PHASE6_MAX_CONSECUTIVE_TEXT_TURNS=5 \
bash scripts/phase-6-bench.sh
Phase 6 environment variables
| Env | Default | What it controls |
|---|---|---|
| `APR_MODEL` | Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf | GGUF path passed to apr serve |
| `APR_TIMEOUT_S` | 900 | Per-turn driver subprocess timeout |
| `APR_AGENT_HTTP_TIMEOUT_S` | 1500 | apr code → apr serve HTTP timeout |
| `APR_AGENT_MAX_TOKENS_CAP` | 1024 | Max tokens per assistant turn |
| `APR_AGENT_TEMPERATURE` | unset (greedy) | Sampling temperature |
| `APR_AGENT_TOP_K` | unset | Top-k filter |
| `APR_AGENT_TOP_P` | unset | Nucleus (top-p) filter |
| `APR_AGENT_REPEAT_PENALTY` | unset | Repetition penalty (Candle convention) |
| `APR_AGENT_REPEAT_LAST_N` | unset | Window for repetition penalty |
| `APR_AGENT_SEED` | random | Deterministic sampling seed |
| `PHASE6_MAX_TURNS` | 20 | Multi-turn cap |
| `PHASE6_WALL_SECONDS` | 3600 | Per-fixture wall-clock budget |
| `PHASE6_ORACLE_INTERVAL` | 3 | Oracle check cadence (turns) |
| `PHASE6_COMPLIANCE_ENFORCED` | 1 | Per-Write/Edit pmat comply check |
| `PHASE6_MAX_CONSECUTIVE_COMPLIANCE_FAILURES` | 3 | Compliance-Trap cap |
| `PHASE6_MAX_CONSECUTIVE_TEXT_TURNS` (M293) | 0 (disabled) | Agent-Text-Loop cap |
Local dev tier sweeps
make tier1 # fmt + clippy + check (<5s)
make tier2 # tier1 + tests (<30s)
make tier3 # tier2 + cov + comply + pv (1-3 min)
make install-hooks # FALSIFY-CCPA-012 pre-commit hook
make install-tools # local tools matching CI exactly
Trace JSON Schema reference
The full schema is in contracts/claude-code-parity-apr-v1.yaml § trace_schema. This page is a quick reference; the YAML is canonical.
Record kinds
// session_start — first record of every trace
{
"kind": "session_start",
"session_id": "string",
"cwd": "/absolute/path",
"git_commit": "deadbeef..."
}
// user_prompt — user-initiated turn
{
"kind": "user_prompt",
"text": "Fix the failing test.",
"attachments": [/* optional */]
}
// assistant_turn — model response
{
"kind": "assistant_turn",
"blocks": [
{"type": "text", "text": "I'll start by reading the file."},
{"type": "tool_use", "id": "tu_1", "name": "Read", "input": {"path": "src/lib.rs"}}
],
"stop_reason": "tool_use" // or "end_turn", "max_tokens", "stop_sequence"
}
// tool_result — tool execution result
{
"kind": "tool_result",
"tool_use_id": "tu_1",
"content": "<file contents>",
"is_error": false
}
// session_end — last record
{
"kind": "session_end",
"reason": "end_turn" // or "max_turns", "wall_timeout", "driver_error", etc.
}
// hook_event — hook fired (schema v2, M15)
{
"kind": "hook_event",
"hook_name": "pre-tool-use",
"trigger": "PreToolUse",
"tool_use_id": "tu_1" // optional; null if pre-session
}
// skill_invocation — skill invoked (schema v2, M15)
{
"kind": "skill_invocation",
"skill_name": "explain",
"args": {"depth": "medium"}
}
Block types (inside assistant_turn.blocks[])
// Text — plain text output
{"type": "text", "text": "..."}
// ToolUse — a tool call
{"type": "tool_use", "id": "tu_<n>", "name": "Bash|Read|Write|Edit|...", "input": {...}}
// Thinking — extended thinking (claude-only; optional)
{"type": "thinking", "text": "..."}
stop_reason values
| Value | Meaning |
|---|---|
| `tool_use` | Model emitted a tool_call; turn ends here |
| `end_turn` | Model's natural turn boundary (e.g., emitted EOS) |
| `max_tokens` | Hit the token cap |
| `stop_sequence` | Hit a configured stop sequence |
Rust types
The Rust-side types are in crates/ccpa-trace/src/lib.rs:
pub struct Trace { pub records: Vec<Record> }
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum Record {
    SessionStart { session_id: String, cwd: PathBuf, git_commit: String },
    UserPrompt { text: String, attachments: Vec<Attachment> },
    AssistantTurn { blocks: Vec<Block>, stop_reason: StopReason },
    ToolResult { tool_use_id: String, content: String, is_error: bool },
    SessionEnd { reason: SessionEndReason },
    HookEvent { hook_name: String, trigger: HookTrigger, tool_use_id: Option<String> },
    SkillInvocation { skill_name: String, args: serde_json::Value },
}
The roundtrip falsifier (FALSIFY-CCPA-001) asserts that every value serializes → parses → re-serializes losslessly.
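
The shape of that assertion can be illustrated on the smallest type in the schema. The sketch below is a std-only stand-in: the real types go through serde, whereas here `Display` and `FromStr` play the serializer and parser, and `roundtrips` is a hypothetical helper. Only the four `stop_reason` strings come from the table above.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StopReason {
    ToolUse,
    EndTurn,
    MaxTokens,
    StopSequence,
}

impl StopReason {
    const ALL: [StopReason; 4] = [
        StopReason::ToolUse,
        StopReason::EndTurn,
        StopReason::MaxTokens,
        StopReason::StopSequence,
    ];
}

impl fmt::Display for StopReason {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            StopReason::ToolUse => "tool_use",
            StopReason::EndTurn => "end_turn",
            StopReason::MaxTokens => "max_tokens",
            StopReason::StopSequence => "stop_sequence",
        })
    }
}

impl FromStr for StopReason {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "tool_use" => Ok(StopReason::ToolUse),
            "end_turn" => Ok(StopReason::EndTurn),
            "max_tokens" => Ok(StopReason::MaxTokens),
            "stop_sequence" => Ok(StopReason::StopSequence),
            other => Err(format!("unknown stop_reason: {other}")),
        }
    }
}

/// serialize -> parse -> re-serialize must reproduce both the value and the text.
fn roundtrips(value: StopReason) -> bool {
    let first = value.to_string();
    match first.parse::<StopReason>() {
        Ok(parsed) => parsed == value && parsed.to_string() == first,
        Err(_) => false,
    }
}
```

Checking every member of `StopReason::ALL` is the closed-enum analogue of "every value": a variant added without a matching parse arm fails the check.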
Contract YAML reference
The canonical contract YAML lives in aprender:
- Canonical: paiml/aprender/contracts/claude-code-parity-apr-v1.yaml
- Pinned here: contracts/pin.lock (sha256 + commit reference)
Pin format:
[pin]
aprender_commit = "16f25af06"
aprender_pr = 1078
aprender_pr_state = "OPEN"
contract_sha256 = "..."
last_synced = "2026-05-02"
Top-level structure
schema_version: "1.32.0"
name: "claude-code-parity-apr-v1"
gates:
  FALSIFY-CCPA-001:
    name: "trace_schema_roundtrip"
    status: "ACTIVE_RUNTIME"
    description: "..."
    asserted_by:
      - "crates/ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs"
  FALSIFY-CCPA-NNN: { ... }
trace_schema:
  version: 2
  records:
    session_start: { ... }
    # ...
per_tool_equivalence:
  Bash: { ... }
  Read: { ... }
  Write: { ... }
  # ...
sovereignty:
  allowed_network_endpoints:
    - "127.0.0.1:*"
    - "localhost:*"
  forbidden_env_vars:
    - "ANTHROPIC_API_KEY"
    - "OPENAI_API_KEY"
    # ...
Validation — pv validate
pv is the dogfooded contract validator (aprender-contracts-cli). It enforces:
- Schema correctness (every gate has the required fields)
- Cross-reference correctness (asserted_by files exist)
- Pin correctness (contracts/pin.lock's sha256 matches the aprender source at the pinned commit)
pv validate contracts/claude-code-parity-apr-v1.yaml
pv pin-check contracts/pin.lock --aprender-path ../aprender
CI runs both on every PR (FALSIFY-CCPA-012).
Adding a new gate
The M22 5-step ritual:
- Propose: add the gate to the canonical aprender YAML at PROPOSED status. Open an aprender PR.
- Test: write the falsifier test in the corresponding crate of this repo. PR against this repo.
- Mirror: update contracts/pin.lock to the new aprender commit. PR (mechanical).
- Verify: CI runs pv validate + pv pin-check + the new falsifier test on every PR. All three must be green.
- Promote: once the test passes deterministically, flip status to ACTIVE_ALGORITHM_LEVEL (or ACTIVE_RUNTIME if backed by a measured discharge). PR.
Adding gates without all 5 steps is rejected. The ritual is pv validate-asserted; bypassing it is mechanically impossible.
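
The Promote step can be read as a small transition function. The sketch below is an interpretation of the ritual as described, not code from `pv`; the `promote` function, its parameters, and its error strings are hypothetical, and it deliberately covers only promotion out of PROPOSED.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GateStatus {
    Proposed,
    ActiveAlgorithmLevel,
    ActiveRuntime,
}

/// Step 5 of the ritual: a PROPOSED gate whose falsifier passes deterministically
/// becomes ACTIVE_ALGORITHM_LEVEL, or ACTIVE_RUNTIME when a measured discharge backs it.
fn promote(
    current: GateStatus,
    falsifier_passes: bool,
    measured_discharge: bool,
) -> Result<GateStatus, &'static str> {
    match (current, falsifier_passes) {
        (GateStatus::Proposed, true) if measured_discharge => Ok(GateStatus::ActiveRuntime),
        (GateStatus::Proposed, true) => Ok(GateStatus::ActiveAlgorithmLevel),
        (GateStatus::Proposed, false) => Err("falsifier test must pass before promotion"),
        _ => Err("this step only promotes PROPOSED gates"),
    }
}
```

A measured discharge without a passing falsifier is still rejected, which keeps the test as the precondition for any status flip.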
Falsification gate IDs
Quick cross-reference. See The 20 gates for full descriptions.
CCPA prefix (this repo's gates)
| ID | Name | Status |
|---|---|---|
| CCPA-001 | trace_schema_roundtrip | ACTIVE_RUNTIME |
| CCPA-002 | replay_determinism | ACTIVE_RUNTIME |
| CCPA-003 | mock_completeness | ACTIVE_RUNTIME |
| CCPA-004 | tool_call_equivalence | ACTIVE_RUNTIME |
| CCPA-005 | file_mutation_equivalence | ACTIVE_RUNTIME |
| CCPA-006 | sovereignty_on_replay | ACTIVE_RUNTIME |
| CCPA-007 | corpus_coverage | HARD-BLOCKING (M16) |
| CCPA-008 | parity_score_bound | ADVISORY (M230) |
| CCPA-009 | ci_main_branch_green | ACTIVE_RUNTIME |
| CCPA-010 | pmat_comply_100pct | ACTIVE_RUNTIME |
| CCPA-011 | line_coverage_100pct | ACTIVE_RUNTIME |
| CCPA-012 | pv_contract_gate_on_commit | ACTIVE_RUNTIME |
| CCPA-013 | first_recorded_parity_score | DISCHARGED |
| CCPA-014 | os_event_parity_bound | ACTIVE_RUNTIME |
| CCPA-015 | os_trace_output_purity | ACTIVE_RUNTIME |
| CCPA-016 | outcome_parity_bound | ACTIVE_RUNTIME |
| CCPA-017 | project_scale_parity_bound | PROPOSED (v1.28.0) |
| CCPA-018 | arena_recovery_rate_bound | PROPOSED (v1.29.0) |
| CCPA-019 | calibration_required_before_verdict | PROPOSED (v1.32.0) |
| CCPA-020 | contract_compliance_per_turn | PROPOSED (v1.32.0) |
V1_ prefix (Phase 6 infrastructure gates, live in aprender)
| ID | Name | Status |
|---|---|---|
| V1_001 | qwen3_moe_serve_dispatch_v1 | ACTIVE_RUNTIME |
| V1_002 | qwen3_moe_sampling_v1 | ACTIVE_RUNTIME |
| V1_003 | qwen3_moe_streaming_sse_v1 | DISCHARGED (gx10 Blackwell) |
| V1_004 | phase_6_bench_non_zero_student_pass_rate | OPEN |
Status legend
- PROPOSED — defined, not yet algorithmically asserted
- ACTIVE_ALGORITHM_LEVEL — algorithmically asserted, no measured discharge
- ACTIVE_RUNTIME — algorithmically asserted AND measured discharge on file
- DISCHARGED — empirical claim fully met; gate preserved for historical record but no longer fires
- HARD-BLOCKING — CI exit-1 on failure (subset of ACTIVE_RUNTIME)
- ADVISORY — emits warning, doesn't exit-1 (intentional after M230)
Academic basis
CCPA's design draws on several lines of prior work. Each is cited where its idea informs a specific gate or technique.
Distillation framing
Hinton et al., 1503.02531 — Distilling the Knowledge in a Neural Network
CCPA treats claude as the teacher and apr code as the student. The "knowledge" being distilled is the action stream — sequences of tool calls, not output logits. This generalizes the original logit-distillation framing to the agentic-execution setting.
Metamorphic testing of ML systems
Segura et al., 2208.08227 — METTLE: Metamorphic Testing of Deep Learning Systems
LLMORPH, 2603.23611 — Cataloged Metamorphic Relations for NLP
A metamorphic relation says: "if input X maps to output Y, then transformation T(X) should map to f(Y)." CCPA's per-tool equivalence rules are metamorphic relations specialized to action streams:
- Bash(cmd) and Bash(canonical_form(cmd)) should produce equivalent file-system mutations
- Write(path, content) and Edit(path, old, new) that produce the same file SHA256 are file-mutation-equivalent
- etc.
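
The Write/Edit relation can be made concrete with an in-memory sketch. Everything here is illustrative: the `Tree` type, the `Action` enum, and both functions are hypothetical, and direct content comparison stands in for the SHA256 comparison so that the example needs only the standard library.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for a working tree: path -> file contents.
type Tree = BTreeMap<String, String>;

enum Action {
    Write { path: String, content: String },
    Edit { path: String, old: String, new: String },
}

/// Applies one action; an Edit replaces the first occurrence of `old`.
fn apply(tree: &mut Tree, action: &Action) {
    match action {
        Action::Write { path, content } => {
            tree.insert(path.clone(), content.clone());
        }
        Action::Edit { path, old, new } => {
            if let Some(existing) = tree.get_mut(path) {
                *existing = existing.replacen(old.as_str(), new.as_str(), 1);
            }
        }
    }
}

/// Two actions are file-mutation-equivalent on `start` if they leave identical trees.
fn file_mutation_equivalent(start: &Tree, a: &Action, b: &Action) -> bool {
    let (mut left, mut right) = (start.clone(), start.clone());
    apply(&mut left, a);
    apply(&mut right, b);
    left == right
}
```

The relation is judged on the resulting state rather than on the tool name, so a teacher `Write` and a student `Edit` that converge on the same file do not count as drift under this rule.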
The DriftCategory taxonomy maps onto Segura's metamorphic-violation severity scale.
Differential testing
2207.11976 — Differential Testing of Deep Learning Frameworks
CCPA is a differential test of apr code against claude — two implementations of the same logical specification (agentic coding), measured by paired-execution divergence. The static path's compute_parity_score IS a differential-testing scoring function.
Function-scale outcome parity
MultiPL-E, 2208.08227 — Cassano et al.
evidence/phase-3/multipl-e-rust-scores.json records the M150 function-scale measurement (n=5, parity=1.0000) using the MultiPL-E-Rust HumanEval subset. The benchmark is unmodified from upstream.
Project-scale Arena
SWE-bench, 2310.06770 — Jimenez et al.
SWE-bench formalized the "can LLMs resolve real GitHub issues" measurement at project-scale. CCPA's Phase 5 corpus is hand-curated in the SWE-bench style (real GitHub-issue Rust fixtures), but smaller (n=5) for operator-coordinated dispatch cost reasons. Phase 6's under-contract regime adds the compliance-cost dimension that SWE-bench doesn't address.
Chaos engineering for LLM systems
2505.03096 — Chaos Engineering for LLM Systems
CCPA's regression-corpus design (deliberate drift, must-fail) is in the spirit of chaos engineering: introduce a known failure mode and verify the meter catches it. The M196-M224 4-bug stack is the empirical justification for this practice.
Sovereignty / data-residency
No single paper drives the sovereignty gate (CCPA-006). The design is informed by the broader privacy-engineering literature on differential-privacy boundaries and the FedRAMP / HIPAA classes of "data must not leave the trust boundary" guarantees. The Tier3 SovereigntyViolation category formalizes the boundary.
Per-gate mapping
See docs/specifications/academic-basis.md for the per-gate citation table — every gate has a paper that motivated its design or that it specializes.
Milestone history
CCPA's work is organized as a continuous sequence of M-rows (milestone-rows) tracked in docs/specifications/milestones-*.md. Each M-row is one substantive deliverable (a PR, a fixture, a finding) with its own scope and acceptance criteria.
High-level phases
| Phase | M-row range | What it shipped |
|---|---|---|
| Phase 1 (RECORD) — out-of-scope post-M222 | M0-M14 | original HTTPS-proxy recording path; rescoped to subprocess-driver |
| Phase 2 (REPLAY) | M15-M50 | trace schema, replayer, mock harness, hook+skill projection |
| Phase 3 (DISTILL — function-scale) | M51-M100 | MultiPL-E-Rust HumanEval bench, function-scale parity measurement (n=5, 1.0000) |
| Phase 4 (project-scale prep) | M101-M150 | fixture authoring for project-scale; differ enhancements; bidirectional sensitivity |
| Phase 5 (ARENA — project-scale) | M150-M234 | Arena runner, calibration-and-scale corpus, first arena scores |
| Phase 6 (UNDER-CONTRACT) | M250-M294 | compliance-enforced dispatch, V1_004 chain, Coder-finetune-distribution finding |
Notable M-rows
- M9: regression corpus added (bidirectional sensitivity)
- M15: schema v2 (hook_event + skill_invocation)
- M16: FALSIFY-CCPA-007 hard-blocking corpus coverage gate
- M150: first measured function-scale parity (n=5, 1.0000)
- M194-M210: Arena runner Phase 5 P5.1-P5.5
- M222: RECORD path out-of-scope directive (rescope to subprocess-driver only)
- M230: FALSIFY-CCPA-008 flipped to ADVISORY after M196-M224 four-bug-stack revealed meter under-sensitivity
- M234: Popperian-falsification of static-fixture as project-scale predictor (claude 1/5, apr code 0/5)
- M236: FALSIFY-CCPA-019 (calibration_required_before_verdict) introduced
- M280: Phase 6 CCPA project SUSPENSION declared (1.5B model below testability floor)
- M286: M32d MoE KV cache shipped (19× speedup; unblocks V1_004)
- M287: greedy baseline pattern; uniform driver_error on 30B-Coder
- M291: sub-bench B pattern shift; driver_error → oracle_failed_after_max_turns
- M292: ArenaOutcome::AgentTextLoop detector (Gap 3 closure)
- M293: PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring
- M294: finetune-distribution A/B; non-Coder Qwen3-30B-A3B-Instruct-2507 confirmed at smoke level
How M-rows are tracked
Each M-row gets a row in docs/specifications/milestones-mNNN-mMMM.md. The row body explains:
- What was shipped
- Why (motivation, prior M-row references)
- Acceptance criteria (tests, evidence, contract entries)
- Cross-references (PR numbers, evidence file paths)
A doc-drift detector (scripts/check-doc-drift.sh) asserts that the milestone counter on 5 cross-reference surfaces (README, CONTRIBUTING, top spec, status-snapshots, milestones doc) all agree.
Operator-coordinated vs autonomous M-rows
- Autonomous: anything that doesn't require operator-only data (compute budget, model-class decision, contract amendment). The autonomous ship-cycle (per CLAUDE.md) ships these continuously without check-in.
- Operator-coordinated: anything that needs operator-only data: dispatching benches, deciding model class, amending contract gates. The substantive→mechanical→substantive cadence pauses ONLY for these.
Glossary
| Term | Definition |
|---|---|
| Action stream | The sequence of tool calls + tool results + text + hooks + skills emitted by an agent during one session. CCPA's primary unit of measurement. |
| `apr code` | The student. A sovereign, pure-Rust CLI coding agent (in paiml/aprender) that runs against a local GGUF model with no data leaving the machine. |
| `apr serve` | Inference server subprocess that apr code auto-spawns and talks to over HTTP. Loads the GGUF model and serves /v1/chat/completions. |
| Arena | CCPA's live-execution measurement path. Multi-turn live dispatch of real teacher + real student against test-shaped oracles. |
| CCPA | Claude Code Parity for apr code. The harness this book describes. |
| `claude` | The teacher. Anthropic's official CLI. Treated as the orchestrator and the action-stream baseline. |
| Closed enum | A Rust enum where adding a variant requires touching every match site. CCPA's ArenaOutcome, DriftCategory, ToolInvocation are closed enums by design — pattern-match exhaustiveness is the type system's enforcement of total handling. |
| Compound oracle | Phase 6 oracle: cargo test AND pmat comply check --strict both pass. |
| Compliance-Trap | M254 P6.3 detector. Bails the session with ArenaOutcome::ComplianceTrap when the same (file, sha256) pair fails compliance N consecutive turns. Saves token cost. |
| Driver | The subprocess wrapper around claude (teacher) or apr code (student). SubprocessDriver in crates/ccpa-arena/. |
| Drift / DriftCategory | A divergence between teacher and student traces. The closed enum (Tier0/1/2/3) categorizes severity. |
| Falsifier | A deterministic test that proves a gate. The gate states a falsifiable claim; the test would FAIL if the claim were wrong. |
| `FALSIFY-CCPA-NNN` | The unique identifier of a gate. Each ID maps to one entry in the contract YAML and one (or more) tests in the crates. |
| Fixture | A canonical input — typically meta.toml + (trace pairs OR cwd-tree + prompt + oracle). Lives in fixtures/<corpus>/<id>/. |
| Greedy | Sampling at temperature=0: always take the argmax of the next-token distribution. Deterministic, but prone to repetition: it can lock the model into an infinite loop (the M287 pattern). |
| M-row | One milestone in the project's continuous-ship cadence. Numbered M0, M1, ..., M294, ... |
| MoE | Mixture-of-Experts. A neural-architecture pattern where only a fraction of total parameters are "active" per token. Qwen3-Coder-30B-A3B is 30B total / 3B active. |
| Oracle | The test-shaped acceptance check for a fixture. Phase 5: `cargo test 2>&1 |
| `pmat comply` | The paiml quality-posture meter. A multi-pass static analyzer with org-wide rules (allowed-unwrap, complexity caps, lint rules, doc coverage). |
| `pv` | The contract validator. Binary from aprender-contracts-cli. Asserts contract YAML correctness, pin correctness, gate cross-reference correctness. Dogfooded; bash re-implementations rejected. |
| `pv validate` | The pv subcommand that hard-asserts the contract YAML schema. CI-gated via FALSIFY-CCPA-012. |
| `pin.lock` | The pin from this repo to the canonical aprender contract YAML. Records sha256 + commit reference. Pin-check is part of FALSIFY-CCPA-012. |
| PROPOSED / ACTIVE_ALGORITHM_LEVEL / ACTIVE_RUNTIME | The three statuses of a gate. See Status flow. |
| Recovery rate | Fraction of OraclePassed fixtures where the agent recovered from at least one non-zero bash exit. Phase 5 metric. |
| Sovereignty / Tier3 | The hardest gate class. A Tier3 SovereigntyViolation means the agent did something that breaches data residency / network sovereignty (egress, credential read, foreign API). |
| Sub-bench | A focused dispatch of the Phase 6 bench script with specific knob settings (e.g., sub-bench A = few-shot prompt only, sub-bench B = full 3-knob config). |
| Tool call / `<tool_call>` block | A JSON object inside a `<tool_call>...</tool_call>` XML-like wrapper. apr code's parser extracts these from the model's response and dispatches the named tool. |
| Turn | One round of (assistant-emits-response, tool dispatched, result observed). The session loop runs up to max_turns of these. |
| V1_NNN | Phase 6 infrastructure gate prefix. Lives in aprender's contracts (distinct from CCPA-NNN). |
| Wall budget / wall_timeout | The wall-clock seconds budget for one session. Phase 5 default 900s; Phase 6 default 3600s. WallTimeout is the outcome when exceeded. |