Quality Gates (pmat comply)
Every pipeline step and every commit MUST pass the pmat comply quality gates. This is the enforcement mechanism for the claims in this spec.
17.1 Specification Compliance
This spec itself is validated by pmat comply:
# Score this specification (must achieve ≥95/100)
pmat spec score docs/specifications/leaderboard-spec.md --verbose
# Extract falsifiable claims and generate review checklist
pmat comply review docs/specifications/leaderboard-spec.md --format markdown
# Full compliance audit with signed evidence
pmat comply audit -o audit.json
17.2 Mandatory Pre-Commit Checks
# Full compliance check (blocks commit on failure)
pmat comply check --strict --format json
# Key checks enforced:
# CB-200 TDG Grade Gate — no function below grade A
# CB-303 Equation-Driven Development — contract bindings present
# CB-125 Coverage quality — ≥95% with no exclusion gaming
# CB-304 Dead code — 0% tolerance
# CB-120 OIP Tarantula — no NaN, no unwrap in production paths
17.3 Pipeline Quality Gates
Each recipe step has a pmat comply gate:
| Pipeline Step | pmat Gate | Blocks On |
|---|---|---|
| Import | apr check model.apr + pmat comply check | Format validation failure, contract binding gaps |
| Distill | pv proof-status for attention/softmax contracts | Unverified kernel obligations |
| Finetune | pmat comply check --strict + coverage ≥95% | TDG regression, coverage drop |
| Merge | pv audit for merge strategy contracts | Unbound merge kernel |
| Prune | apr eval before/after + pmat comply baseline | Quality regression beyond threshold |
| Quantize | pv proof-status for Q4K/Q6K contracts | Kani proof failure |
| Eval | pmat comply review extracts claims → validates | Untested falsifiable claims |
| Submit | pmat comply audit signed evidence | Incomplete audit trail |
17.4 Cross-Crate Consistency
The sovereign stack (aprender, entrenar, trueno) MUST maintain cross-crate consistency:
# Detect API divergence and copy-paste duplication across stack
pmat comply cross-crate \
--crates ../aprender ../entrenar ../trueno . \
--similarity-threshold 0.80 \
--strict
# Verify no contract drift between crates
pv diff ../provable-contracts/contracts/old/ ../provable-contracts/contracts/
17.5 Documentation Publishing
This specification is published as an mdBook via GitHub Actions. On every push to main that modifies docs/ or book.toml, the workflow builds and deploys to GitHub Pages at:
https://paiml.github.io/apr-leaderboard/
The mdBook source lives in docs/src/ with chapters split from the canonical spec at docs/specifications/leaderboard-spec.md. The build output (docs/book/) is gitignored.
# Local preview
mdbook serve # http://localhost:3000
# Build only
mdbook build # outputs to docs/book/
17.6 Contract Falsification Gate
make check-contracts runs all provable contract falsification tests as a single gate. This is the primary automated quality check for the project.
make check-contracts # runs all falsification tests + contract structure validation
Test categories (67/68 passing, 2026-04-04):
| Category | Count | What it checks |
|---|---|---|
| pass@k estimator | 5 | Chen et al. boundary conditions, monotonicity |
| throughput bounds | 2 | tok/s >= 1.0, TTFT < 500ms |
| benchmark data | 3 | HumanEval/MBPP/BigCodeBench problem counts |
| decontamination | 1 | Zero HE/MBPP prompt overlap |
| eval results | 3 | Best pass@1, run count, latest score |
| distillation | 2 | Teacher > student, category coverage |
| MBPP eval | 1 | Best MBPP pass@1 >= 70% |
| AC-022 gate | 1 | HE >= 85% AND MBPP >= 80% (compound) |
| quantization | 3 | Q4K size, apr check, golden ordering |
| distillation data | 3 | Teacher completions count + JSONL validity |
| oracle analysis | 2 | Oracle upper bound, never-solved count |
| pipeline | 3 | Script count, config count, Make target count |
| compile | 1 | apr compile subcommand available |
| data catalog | 2 | Contract bindings, dataset documentation |
| leaderboard coverage | 2 | Eval run count, benchmark coverage |
| HF parity | 1 | HumanEval gap < 5pp vs HF reference |
| contract coverage | 1 | >= 25 contract YAMLs |
| data quality | 2 | Zero duplicate instructions, no short responses |
| quantization quality | 1 | 32B Q4K gap < 2pp vs HF FP16 |
| contract structure | 29 | All YAMLs have metadata/equations/proof_obligations/falsification_tests |
Single known failure: FT-GATE-001 (AC-022 compound gate) — MBPP at 76.2% vs 80% target. Closing via PMAT-008 (DPO) + PMAT-007 (distillation).
pv proof-status: Validates contract YAML schema via provable-contracts tooling. 28/28 contracts parsed, 98 proof obligations, 10 Kani harnesses. See §16.5.