Quality Gates (pmat comply)

Every pipeline step and every commit MUST pass the pmat comply quality gates. This is the enforcement mechanism for the claims in this spec.

17.1 Specification Compliance

This spec itself is validated by pmat comply:

# Score this specification (must achieve ≥95/100)
pmat spec score docs/specifications/leaderboard-spec.md --verbose

# Extract falsifiable claims and generate review checklist
pmat comply review docs/specifications/leaderboard-spec.md --format markdown

# Full compliance audit with signed evidence
pmat comply audit -o audit.json

17.2 Mandatory Pre-Commit Checks

# Full compliance check (blocks commit on failure)
pmat comply check --strict --format json

# Key checks enforced:
#   CB-200  TDG Grade Gate — no function below grade A
#   CB-303  Equation-Driven Development — contract bindings present
#   CB-125  Coverage quality — ≥95% with no exclusion gaming
#   CB-304  Dead code — 0% tolerance
#   CB-120  OIP Tarantula — no NaN, no unwrap in production paths

17.3 Pipeline Quality Gates

Each recipe step has a pmat comply gate:

Pipeline Steppmat GateBlocks On
Importapr check model.apr + pmat comply checkFormat validation failure, contract binding gaps
Distillpv proof-status for attention/softmax contractsUnverified kernel obligations
Finetunepmat comply check --strict + coverage ≥95%TDG regression, coverage drop
Mergepv audit for merge strategy contractsUnbound merge kernel
Pruneapr eval before/after + pmat comply baselineQuality regression beyond threshold
Quantizepv proof-status for Q4K/Q6K contractsKani proof failure
Evalpmat comply review extracts claims → validatesUntested falsifiable claims
Submitpmat comply audit signed evidenceIncomplete audit trail

17.4 Cross-Crate Consistency

The sovereign stack (aprender, entrenar, trueno) MUST maintain cross-crate consistency:

# Detect API divergence and copy-paste duplication across stack
pmat comply cross-crate \
    --crates ../aprender ../entrenar ../trueno . \
    --similarity-threshold 0.80 \
    --strict

# Verify no contract drift between crates
pv diff ../provable-contracts/contracts/old/ ../provable-contracts/contracts/

17.5 Documentation Publishing

This specification is published as an mdBook via GitHub Actions. On every push to main that modifies docs/ or book.toml, the workflow builds and deploys to GitHub Pages at:

https://paiml.github.io/apr-leaderboard/

The mdBook source lives in docs/src/ with chapters split from the canonical spec at docs/specifications/leaderboard-spec.md. The build output (docs/book/) is gitignored.

# Local preview
mdbook serve    # http://localhost:3000

# Build only
mdbook build    # outputs to docs/book/

17.6 Contract Falsification Gate

make check-contracts runs all provable contract falsification tests as a single gate. This is the primary automated quality check for the project.

make check-contracts    # runs all falsification tests + contract structure validation

Test categories (67/68 passing, 2026-04-04):

CategoryCountWhat it checks
pass@k estimator5Chen et al. boundary conditions, monotonicity
throughput bounds2tok/s >= 1.0, TTFT < 500ms
benchmark data3HumanEval/MBPP/BigCodeBench problem counts
decontamination1Zero HE/MBPP prompt overlap
eval results3Best pass@1, run count, latest score
distillation2Teacher > student, category coverage
MBPP eval1Best MBPP pass@1 >= 70%
AC-022 gate1HE >= 85% AND MBPP >= 80% (compound)
quantization3Q4K size, apr check, golden ordering
distillation data3Teacher completions count + JSONL validity
oracle analysis2Oracle upper bound, never-solved count
pipeline3Script count, config count, Make target count
compile1apr compile subcommand available
data catalog2Contract bindings, dataset documentation
leaderboard coverage2Eval run count, benchmark coverage
HF parity1HumanEval gap < 5pp vs HF reference
contract coverage1>= 25 contract YAMLs
data quality2Zero duplicate instructions, no short responses
quantization quality132B Q4K gap < 2pp vs HF FP16
contract structure29All YAMLs have metadata/equations/proof_obligations/falsification_tests

Single known failure: FT-GATE-001 (AC-022 compound gate) — MBPP at 76.2% vs 80% target. Closing via PMAT-008 (DPO) + PMAT-007 (distillation).

pv proof-status: Validates contract YAML schema via provable-contracts tooling. 28/28 contracts parsed, 98 proof obligations, 10 Kani harnesses. See §16.5.