Milestone history

CCPA's work is organized as a continuous sequence of M-rows (milestone rows) tracked in docs/specifications/milestones-*.md. Each M-row is one substantive deliverable (a PR, a fixture, or a finding) with its own scope and acceptance criteria.

High-level phases

Phase | M-row range | What it shipped
Phase 1 (RECORD; out of scope post-M222) | M0-M14 | original HTTPS-proxy recording path; rescoped to subprocess-driver
Phase 2 (REPLAY) | M15-M50 | trace schema, replayer, mock harness, hook+skill projection
Phase 3 (DISTILL — function-scale) | M51-M100 | MultiPL-E-Rust HumanEval bench, function-scale parity measurement (n=5, 1.0000)
Phase 4 (project-scale prep) | M101-M150 | fixture authoring for project-scale; differ enhancements; bidirectional sensitivity
Phase 5 (ARENA — project-scale) | M150-M234 | Arena runner, calibration-and-scale corpus, first arena scores
Phase 6 (UNDER-CONTRACT) | M250-M294 | compliance-enforced dispatch, V1_004 chain, Coder-finetune-distribution finding

Notable M-rows

  • M9 — regression corpus added (bidirectional sensitivity)
  • M15 — schema v2 (hook_event + skill_invocation)
  • M16 — FALSIFY-CCPA-007 hard-blocking corpus coverage gate
  • M150 — first measured function-scale parity (n=5, 1.0000)
  • M194-M210 — Arena runner Phase 5 P5.1-P5.5
  • M222 — RECORD path out-of-scope directive (rescope to subprocess-driver only)
  • M230 — FALSIFY-CCPA-008 flipped to ADVISORY after the M196-M224 four-bug stack revealed meter under-sensitivity
  • M234 — Popperian falsification of the static fixture as a project-scale predictor (claude 1/5, apr code 0/5)
  • M236 — FALSIFY-CCPA-019 (calibration_required_before_verdict) introduced
  • M280 — Phase 6 CCPA project SUSPENSION declared (1.5B model below testability floor)
  • M286 — M32d MoE KV cache shipped (19× speedup; unblocks V1_004)
  • M287 — greedy baseline pattern; uniform driver_error on 30B-Coder
  • M291 — sub-bench B pattern shift; driver_error → oracle_failed_after_max_turns
  • M292 — ArenaOutcome::AgentTextLoop detector (Gap 3 closure)
  • M293 — PHASE6_MAX_CONSECUTIVE_TEXT_TURNS env var wiring
  • M294 — finetune-distribution A/B; non-Coder Qwen3-30B-A3B-Instruct-2507 confirmed at smoke level
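The M292/M293 pair can be pictured as a counter over agent turns: a run is classified as a text loop once the agent produces some threshold number of consecutive text-only turns (turns with no tool call), with the threshold read from PHASE6_MAX_CONSECUTIVE_TEXT_TURNS. The Python below is an illustrative sketch only. The Turn type, the classify_run and max_text_turns_from_env functions, the outcome strings, the fallback of 3, and the "reaches the limit" trigger are all assumptions, not the project's actual Rust ArenaOutcome implementation.

```python
import os
from dataclasses import dataclass
from typing import List

# Hypothetical fallback; the real default for the env var is not documented here.
DEFAULT_MAX_TEXT_TURNS = 3


@dataclass
class Turn:
    """One agent turn; has_tool_call is False for a text-only reply."""
    has_tool_call: bool


def max_text_turns_from_env() -> int:
    """Read PHASE6_MAX_CONSECUTIVE_TEXT_TURNS, falling back to the hypothetical default."""
    raw = os.environ.get("PHASE6_MAX_CONSECUTIVE_TEXT_TURNS")
    if raw is None:
        return DEFAULT_MAX_TEXT_TURNS
    value = int(raw)
    if value < 1:
        raise ValueError("PHASE6_MAX_CONSECUTIVE_TEXT_TURNS must be >= 1")
    return value


def classify_run(turns: List[Turn], max_text_turns: int) -> str:
    """Return "agent_text_loop" as soon as max_text_turns text-only turns occur in a row."""
    streak = 0
    for turn in turns:
        if turn.has_tool_call:
            streak = 0  # a tool call breaks the text-only streak
        else:
            streak += 1
            if streak >= max_text_turns:
                return "agent_text_loop"
    return "completed"
```

For example, classify_run([Turn(False), Turn(False), Turn(True)], max_text_turns=3) returns "completed", because the tool call resets the streak before it reaches 3. Resetting on any tool call means a long run that still uses tools now and then is not flagged; only an unbroken run of text replies is.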

How M-rows are tracked

Each M-row gets a row in docs/specifications/milestones-mNNN-mMMM.md. The row body explains:

  • What was shipped
  • Why (motivation, prior M-row references)
  • Acceptance criteria (tests, evidence, contract entries)
  • Cross-references (PR numbers, evidence file paths)

A doc-drift detector (scripts/check-doc-drift.sh) asserts that the milestone counters on five cross-reference surfaces (README, CONTRIBUTING, top spec, status-snapshots, milestones doc) all agree.
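The invariant behind the drift detector can be sketched as: extract a milestone counter from each surface, then fail if the extracted values are not all equal. The real check is a shell script whose extraction logic is not described here, so the Python below is a hypothetical stand-in. In particular, the rule "the counter is the highest M<digits> token on the surface" and the names milestone_counter and find_drift are assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# Hypothetical extraction rule: a surface's counter is the highest "M<digits>" token in it.
# The trailing \b means tokens like "M32d" are not counted.
M_ROW_TOKEN = re.compile(r"\bM(\d+)\b")


def milestone_counter(text: str) -> Optional[int]:
    """Return the highest M-row number mentioned in text, or None if none appears."""
    numbers = [int(n) for n in M_ROW_TOKEN.findall(text)]
    return max(numbers) if numbers else None


def find_drift(surfaces: Dict[str, str]) -> Dict[str, Optional[int]]:
    """Return {} when every surface reports the same counter, else the per-surface counters."""
    counters = {name: milestone_counter(text) for name, text in surfaces.items()}
    if len(set(counters.values())) <= 1:
        return {}
    return counters
```

Called with the five surfaces' contents keyed by name, an empty result means the surfaces agree; otherwise the returned mapping shows which surface lags. A surface with no counter at all maps to None and therefore also counts as drift. The max-token rule is only a stand-in: a forward reference to a future M-row would fool it, which is one reason a real check might read a dedicated marker line instead.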

Operator-coordinated vs autonomous M-rows

  • Autonomous — anything that doesn't require operator-only data (compute budget, model-class decision, contract amendment). The autonomous ship-cycle (per CLAUDE.md) ships these continuously without operator check-in.
  • Operator-coordinated — anything that needs operator-only data: dispatching benches, deciding model class, amending contract gates. The substantive→mechanical→substantive cadence pauses ONLY for these.
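The split above amounts to one predicate over each planned M-row's required inputs. The following is a minimal Python sketch, assuming each row's operator-only needs can be listed up front. OperatorInput, is_operator_coordinated, and partition are hypothetical names, and the enum members just mirror the examples in the two bullets. In this reading, the ship-cycle proceeds through the autonomous list and the cadence pauses on the coordinated list.

```python
from enum import Enum, auto
from typing import Dict, List, Set, Tuple


class OperatorInput(Enum):
    """Operator-only inputs named in this section (hypothetical encoding)."""
    COMPUTE_BUDGET = auto()      # e.g. dispatching benches
    MODEL_CLASS = auto()         # deciding which model class to target
    CONTRACT_AMENDMENT = auto()  # amending contract gates


def is_operator_coordinated(needs: Set[OperatorInput]) -> bool:
    """An M-row is operator-coordinated iff it needs at least one operator-only input."""
    return len(needs) > 0


def partition(rows: Dict[str, Set[OperatorInput]]) -> Tuple[List[str], List[str]]:
    """Split planned M-rows into (autonomous, operator_coordinated), preserving order."""
    autonomous: List[str] = []
    coordinated: List[str] = []
    for name, needs in rows.items():
        (coordinated if is_operator_coordinated(needs) else autonomous).append(name)
    return autonomous, coordinated
```

For instance, partition({"row-a": set(), "row-b": {OperatorInput.MODEL_CLASS}}) yields (["row-a"], ["row-b"]): row-a can ship autonomously, while row-b waits for an operator decision.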