Success Criteria

15.1 Primary Metrics

| Metric | Target | Stretch | Measurement | Notes |
|---|---|---|---|---|
| HumanEval pass@1 | ≥ apr baseline | ≥ HF reference | make eval-humaneval | Relative to Step 0 baseline |
| MBPP pass@1 | ≥ apr baseline | ≥ HF reference | make eval-mbpp | Relative to Step 0 baseline |
| BigCodeBench pass@1 | > 0 (eval works) | ≥ HF reference | make eval-bigcodebench | Stretch: competitive |
| Inference parity | < 5% gap vs HF | < 2% gap vs HF | apr compare-hf | Perplexity gap on WikiText-2 |
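
The inference-parity row can be read as a threshold check on a perplexity gap. The sketch below is illustrative only: it assumes "gap" means the relative difference |ppl_apr - ppl_hf| / ppl_hf, which the spec does not define, and the perplexity values in `main` are hypothetical inputs chosen to exercise the thresholds, not measurements. The names `perplexity_gap_percent`, `Parity`, and `classify` are invented for this example and are not part of `apr compare-hf`.

```rust
/// Relative perplexity gap of the apr run versus the HF reference, in percent.
/// ASSUMPTION: "gap" is |ppl_apr - ppl_hf| / ppl_hf; the spec does not define it.
fn perplexity_gap_percent(ppl_apr: f64, ppl_hf: f64) -> f64 {
    (ppl_apr - ppl_hf).abs() / ppl_hf * 100.0
}

/// Outcome of comparing a gap against the Target (< 5%) and Stretch (< 2%) columns.
#[derive(Debug, PartialEq)]
enum Parity {
    Stretch,
    Target,
    Fail,
}

fn classify(gap_percent: f64) -> Parity {
    if gap_percent < 2.0 {
        Parity::Stretch
    } else if gap_percent < 5.0 {
        Parity::Target
    } else {
        Parity::Fail
    }
}

fn main() {
    // Hypothetical perplexities, not measured values.
    let gap = perplexity_gap_percent(6.80, 6.60);
    println!("gap = {gap:.2}% -> {:?}", classify(gap));
}
```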

15.2 Infrastructure Metrics

| Metric | Target | Stretch | Notes |
|---|---|---|---|
| Makefile targets | 58 | | Config-driven: make pipeline RECIPE=... wraps multi-stage pipeline. Includes proof-status, status, check-contracts. |
| Total binary size (compiled, 7B INT4) | < 5GB | < 4GB | 3.5GB weights + runtime |
| Wall-clock (import → submit) | < 24h (GPU) | < 8h (GPU) | CPU-only: much longer |
| Python dependencies | 0 | 0 | External sandbox for eval only |
| CUDA toolkit | Not required | Not required | wgpu handles GPU compute (any vendor) |
| GPU hardware | Recommended (any vendor) | Optional (≤7B) | Required for distill/finetune 32B teacher; NVIDIA, AMD, Intel, or Apple Silicon |

15.3 Quality Metrics

| Metric | Target | Measurement |
|---|---|---|
| Test coverage | ≥ 95% | cargo llvm-cov (project source only; exclude path deps, see §19.7.1) |
| Clippy warnings | 0 | cargo clippy -- -D warnings |
| Source file size | < 500 lines each | wc -l src/**/*.rs |
| pmat comply | Pass | pmat comply check --strict |
| Contract binding coverage | ≥ 95% | pv proof-status |

15.4 Measured Baselines (apr-native)

Baselines were measured with apr run and scripts/eval-pass-at-k.sh (greedy decoding, max_tokens=512 unless the Notes column says otherwise):

| Model | Quant | HumanEval | MBPP | Backend | Notes |
|---|---|---|---|---|---|
| Qwen2.5-Coder-32B-Instruct | Q4K_M | 90.85% (149/164) | | CPU (gx10) | Batch mode re-run |
| Qwen2.5-Coder-7B-Instruct (few-shot) | Q4K | 87.20% (143/164) | | CPU (gx10) | Best 7B HumanEval strategy |
| Qwen2.5-Coder-7B-Instruct | Q4K | 85.37% (140/164) | 76.20% (381/500) | CPU/GPU (gx10) | GPU/CPU parity (HE) |
| Qwen2.5-Coder-7B-Instruct (SCoT) | Q4K | 82.32% (135/164) | | CPU (gx10) | Structured CoT |
| Qwen3-4B | Q4K | 78.05% (128/164) | | CPU (gx10) | Thinking model, 4096 tokens |
| Qwen2.5-Coder-1.5B | Q4K | 59.15% (97/164) | | CPU | Baseline |
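
With greedy decoding there is one completion per problem, so pass@1 reduces to solved problems divided by total problems. The following minimal sketch recomputes two percentages from the counts in the table; the function name `pass_at_1` is made up for illustration and is not taken from scripts/eval-pass-at-k.sh.

```rust
/// pass@1 in percent when each problem gets exactly one (greedy) completion:
/// the score is simply solved / total.
fn pass_at_1(solved: u32, total: u32) -> f64 {
    assert!(total > 0, "benchmark must contain at least one problem");
    f64::from(solved) / f64::from(total) * 100.0
}

fn main() {
    // Counts copied from the Qwen2.5-Coder-7B-Instruct row.
    println!("7B HumanEval: {:.2}%", pass_at_1(140, 164)); // 85.37%
    println!("7B MBPP:      {:.2}%", pass_at_1(381, 500)); // 76.20%
}
```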

HF parity (EvalPlus leaderboard reference): HumanEval 7B gap = 0.60pp (87.20% few-shot vs 87.8%). MBPP 7B gap = 7.3pp (76.20% vs 83.5%). 32B HE gap = 1.65pp (90.85% vs 92.5%). Note: the Qwen model card reports 88.4% (7B) and 92.7% (32B) on HumanEval, measured with a different test harness.
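
The parity gaps are plain differences in percentage points between the reference score and the apr score. As a quick arithmetic check, this sketch recomputes the three gaps from the figures quoted in the paragraph; `gap_pp` is a hypothetical helper, not part of any project tool.

```rust
/// Gap in percentage points between a reference score and the apr score.
fn gap_pp(reference: f64, apr: f64) -> f64 {
    reference - apr
}

fn main() {
    // (label, EvalPlus reference %, apr-native %) as quoted in the text.
    let rows = [
        ("HumanEval 7B (few-shot)", 87.8, 87.20),
        ("MBPP 7B", 83.5, 76.20),
        ("HumanEval 32B", 92.5, 90.85),
    ];
    for (label, reference, apr) in rows {
        println!("{label}: {:.2}pp", gap_pp(reference, apr));
    }
}
```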

Oracle upper bound: HumanEval 96.34% (158/164, best-per-problem across all strategies). Only 6 problems were never solved by any strategy. See §24.19.
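
"Best-per-problem" means a problem counts as solved if at least one strategy solved it, so the oracle score is the size of the union of the per-strategy solved sets. A minimal sketch of that computation follows; it assumes results are available as one boolean vector per strategy, which is a data layout invented for this example, and the inputs in `main` are toy data rather than real results.

```rust
/// Oracle count: a problem is solved if ANY strategy solved it.
/// `results[s][p]` is true when strategy `s` passed problem `p`.
/// Rows shorter than the first row are treated as "not solved" for the
/// missing problems.
fn oracle_solved(results: &[Vec<bool>]) -> usize {
    let problems = results.first().map_or(0, Vec::len);
    (0..problems)
        .filter(|&p| {
            results
                .iter()
                .any(|strategy| strategy.get(p).copied().unwrap_or(false))
        })
        .count()
}

fn main() {
    // Toy data: 3 hypothetical strategies over 5 problems.
    let results = vec![
        vec![true, false, true, false, false],
        vec![true, true, false, false, false],
        vec![false, false, true, false, true],
    ];
    let solved = oracle_solved(&results);
    // Problem 3 is never solved, so the oracle reaches 4 of 5.
    println!("oracle: {solved}/5");
}
```

The never-solved count is the total minus this value, which for the figures above is 164 - 158 = 6.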

Perplexity baseline: 6.63 on WikiText-2 (1.5B Q4K, CPU). Cross-entropy: 1.89 nats.
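
The two figures are linked by the standard identity perplexity = exp(cross-entropy) when cross-entropy is measured in nats per token. The sketch below checks that the reported values agree up to rounding; the function names are illustrative only.

```rust
/// Perplexity from mean per-token cross-entropy in nats.
fn perplexity_from_nats(cross_entropy: f64) -> f64 {
    cross_entropy.exp()
}

/// Mean per-token cross-entropy in nats from perplexity.
fn nats_from_perplexity(perplexity: f64) -> f64 {
    perplexity.ln()
}

fn main() {
    // ln(6.63) rounds to the reported 1.89 nats.
    println!("{:.2} nats", nats_from_perplexity(6.63));
    // exp(1.89) lands slightly below 6.63 because 1.89 is itself rounded.
    println!("ppl {:.2}", perplexity_from_nats(1.89));
}
```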

Contract gate: make check-contracts — 67/68 passing. 1 failure: AC-022 MBPP gate (76.2% < 80%). See §17.6.

Acceptance criteria: 19/29 verified (66%). See §18. Critical path: PMAT-014 → PMAT-008 → PMAT-010 → PMAT-011 → AC-022.

15.5 Falsifiability

Every target above is falsifiable: it has a concrete measurement command, a numeric threshold, and a pass/fail outcome. If a metric cannot be measured, the spec has failed — not the implementation.
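
One way to picture this is to model each target as a record holding its measurement command, its numeric threshold, and a pass/fail function. The sketch below is a hypothetical illustration, not the project's contract machinery: the `Target` and `Threshold` types are invented here, and treating the AC-022 gate as "at least 80%" is an assumption inferred from the "76.2% < 80%" failure note.

```rust
/// How a measured value is compared against its threshold.
enum Threshold {
    AtLeast(f64),
    Below(f64),
}

/// A falsifiable target: a command that produces a number, and a rule
/// that turns that number into pass or fail.
struct Target {
    name: &'static str,
    command: &'static str,
    threshold: Threshold,
}

impl Target {
    fn passes(&self, measured: f64) -> bool {
        match self.threshold {
            Threshold::AtLeast(t) => measured >= t,
            Threshold::Below(t) => measured < t,
        }
    }
}

fn main() {
    // ASSUMPTION: AC-022 requires MBPP pass@1 of at least 80%.
    let mbpp_gate = Target {
        name: "AC-022 MBPP gate",
        command: "make check-contracts",
        threshold: Threshold::AtLeast(80.0),
    };
    let file_size = Target {
        name: "Source file size",
        command: "wc -l src/**/*.rs",
        threshold: Threshold::Below(500.0),
    };

    // 76.2 is the measured MBPP baseline; 420 is a made-up line count.
    for (target, measured) in [(&mbpp_gate, 76.2), (&file_size, 420.0)] {
        let verdict = if target.passes(measured) { "PASS" } else { "FAIL" };
        println!("{} [{}]: {measured} -> {verdict}", target.name, target.command);
    }
}
```

A target with no command or no threshold cannot be expressed in this shape at all, which is the sense in which an unmeasurable metric is a spec failure.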