Success Criteria
15.1 Primary Metrics
| Metric | Target | Stretch | Measurement | Notes |
|---|---|---|---|---|
| HumanEval pass@1 | ≥ apr baseline | ≥ HF reference | make eval-humaneval | Relative to Step 0 baseline |
| MBPP pass@1 | ≥ apr baseline | ≥ HF reference | make eval-mbpp | Relative to Step 0 baseline |
| BigCodeBench pass@1 | > 0 (eval works) | ≥ HF reference | make eval-bigcodebench | Stretch: competitive |
| Inference parity | <5% gap vs HF | <2% gap vs HF | apr compare-hf | Perplexity gap on WikiText-2 |
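
The inference-parity row can be read as a three-way classification of a perplexity gap. The sketch below is illustrative only: it assumes the gap is relative (absolute perplexity difference divided by the HF perplexity), which the table does not state explicitly. The function names, the `Parity` enum, and the sample perplexities are hypothetical.

```rust
/// Relative perplexity gap of a candidate runtime versus a reference, in percent.
/// Assumption: "gap" means |candidate - reference| / reference.
fn relative_gap_percent(candidate_ppl: f64, reference_ppl: f64) -> f64 {
    (candidate_ppl - reference_ppl).abs() / reference_ppl * 100.0
}

/// Outcome of the parity check against the Target (<5%) and Stretch (<2%) thresholds.
#[derive(Debug, PartialEq)]
enum Parity {
    Stretch,
    Target,
    Fail,
}

fn classify(gap_percent: f64) -> Parity {
    if gap_percent < 2.0 {
        Parity::Stretch
    } else if gap_percent < 5.0 {
        Parity::Target
    } else {
        Parity::Fail
    }
}

fn main() {
    // Hypothetical perplexities, chosen only to exercise the thresholds.
    let gap = relative_gap_percent(6.80, 6.60);
    println!("gap = {:.2}% -> {:?}", gap, classify(gap));
}
```

If the intended gap is absolute rather than relative, only `relative_gap_percent` would need to change; the threshold logic stays the same.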
15.2 Infrastructure Metrics
| Metric | Target | Stretch | Notes |
|---|---|---|---|
| Makefile targets | 58 | — | Config-driven: make pipeline RECIPE=... wraps multi-stage pipeline. Includes proof-status, status, check-contracts. |
| Total binary size (compiled, 7B INT4) | < 5GB | < 4GB | 3.5GB weights + runtime |
| Wall-clock (import → submit) | < 24h (GPU) | < 8h (GPU) | CPU-only: much longer |
| Python dependencies | 0 | 0 | External sandbox for eval only |
| CUDA toolkit | Not required | Not required | wgpu handles GPU compute (any vendor) |
| GPU hardware | Recommended (any vendor) | Optional (≤7B) | Required for distill/finetune 32B teacher; NVIDIA, AMD, Intel, or Apple Silicon |
15.3 Quality Metrics
| Metric | Target | Measurement |
|---|---|---|
| Test coverage | ≥ 95% | cargo llvm-cov (project source only — exclude path deps, see §19.7.1) |
| Clippy warnings | 0 | cargo clippy -- -D warnings |
| Source file size | < 500 lines each | wc -l src/**/*.rs |
| pmat comply | Pass | pmat comply check --strict |
| Contract binding coverage | ≥ 95% | pv proof-status |
15.4 Measured Baselines (apr-native)
Baselines were measured with apr run and scripts/eval-pass-at-k.sh (greedy decoding, max_tokens=512 unless noted in the table):
| Model | Quant | HumanEval | MBPP | Backend | Notes |
|---|---|---|---|---|---|
| Qwen2.5-Coder-32B-Instruct | Q4K_M | 90.85% (149/164) | — | CPU (gx10) | Batch mode re-run |
| Qwen2.5-Coder-7B-Instruct (few-shot) | Q4K | 87.20% (143/164) | — | CPU (gx10) | Best 7B HumanEval strategy |
| Qwen2.5-Coder-7B-Instruct | Q4K | 85.37% (140/164) | 76.20% (381/500) | CPU/GPU (gx10) | GPU/CPU parity (HE) |
| Qwen2.5-Coder-7B-Instruct (SCoT) | Q4K | 82.32% (135/164) | — | CPU (gx10) | Structured CoT |
| Qwen3-4B | Q4K | 78.05% (128/164) | — | CPU (gx10) | Thinking model, 4096 tokens |
| Qwen2.5-Coder-1.5B | Q4K | 59.15% (97/164) | — | CPU | Baseline |
HF parity (EvalPlus leaderboard reference): the 7B HumanEval gap is 0.60pp (87.20% few-shot vs 87.8%), the 7B MBPP gap is 7.3pp (76.20% vs 83.5%), and the 32B HumanEval gap is 1.65pp (90.85% vs 92.5%). Note: the Qwen model card reports 88.4% (7B) and 92.7% (32B) on HumanEval, measured with a different test harness.
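
As a sanity check on the arithmetic, the gaps can be recomputed from the raw solved/total counts. This is a minimal sketch, not the project's evaluation script: it assumes that with greedy decoding (one sample per problem) pass@1 reduces to solved divided by total. The helper names are hypothetical; the counts and reference scores are the ones quoted in this section.

```rust
/// pass@1 with a single greedy sample per problem: solved / total, in percent.
fn pass_at_1(solved: u32, total: u32) -> f64 {
    f64::from(solved) / f64::from(total) * 100.0
}

/// Gap to a reference score in percentage points (positive when the reference is higher).
fn gap_pp(reference: f64, measured: f64) -> f64 {
    reference - measured
}

fn main() {
    // (label, solved, total, reference %) taken from the baseline table and the parity line.
    let rows = [
        ("HumanEval 7B few-shot", 143, 164, 87.8),
        ("MBPP 7B", 381, 500, 83.5),
        ("HumanEval 32B", 149, 164, 92.5),
    ];
    for (label, solved, total, reference) in rows {
        let score = pass_at_1(solved, total);
        println!("{}: {:.2}% (gap {:.2}pp)", label, score, gap_pp(reference, score));
    }
}
```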
Oracle upper bound: HumanEval 96.34% (158/164, taking the best result per problem across all strategies). Only 6 problems were never solved by any strategy. See §24.19.
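
The best-per-problem oracle amounts to a union over strategies: a problem counts as solved if at least one strategy solved it. The sketch below illustrates that idea on toy data. The `oracle_solved` function and the boolean-matrix layout are assumptions for illustration, not the format used by the project's result files.

```rust
/// Oracle (best-per-problem) count: a problem is solved if any strategy solved it.
/// `results[s][p]` is true when strategy `s` solved problem `p`.
/// A strategy with no entry for a problem is treated as not having solved it.
fn oracle_solved(results: &[Vec<bool>]) -> usize {
    let problems = results.iter().map(Vec::len).max().unwrap_or(0);
    (0..problems)
        .filter(|&p| {
            results
                .iter()
                .any(|strategy| strategy.get(p).copied().unwrap_or(false))
        })
        .count()
}

fn main() {
    // Toy data: 3 hypothetical strategies over 5 problems (not the real HumanEval results).
    let results = vec![
        vec![true, false, true, false, false],
        vec![true, true, false, false, false],
        vec![false, false, true, true, false],
    ];
    let total = results[0].len();
    let solved = oracle_solved(&results);
    println!("oracle: {}/{} solved, {} never solved", solved, total, total - solved);
}
```

Because the oracle picks the best outcome after the fact, it bounds what strategy selection could achieve rather than describing any single runnable configuration.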
Perplexity baseline: 6.63 on WikiText-2 (1.5B Q4K, CPU). Cross-entropy: 1.89 nats.
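
The two figures above are related by the standard identity perplexity = exp(mean per-token cross-entropy in nats). A short sketch (helper names are hypothetical) shows the conversion in both directions, which can be used to confirm that the reported values agree up to rounding:

```rust
/// Perplexity from mean per-token cross-entropy measured in nats: ppl = exp(H).
fn perplexity_from_nats(cross_entropy: f64) -> f64 {
    cross_entropy.exp()
}

/// Inverse conversion: H = ln(ppl).
fn nats_from_perplexity(perplexity: f64) -> f64 {
    perplexity.ln()
}

fn main() {
    println!("exp(1.89) = {:.2}", perplexity_from_nats(1.89));
    println!("ln(6.63)  = {:.2}", nats_from_perplexity(6.63));
}
```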
Contract gate: make check-contracts reports 67/68 passing. The one failure is the AC-022 MBPP gate (76.2% measured, below the 80% threshold). See §17.6.
Acceptance criteria: 19/29 verified (66%). See §18. Critical path: PMAT-014 → PMAT-008 → PMAT-010 → PMAT-011 → AC-022.
15.5 Falsifiability
Every target above is falsifiable: it has a concrete measurement command, a numeric threshold, and a pass/fail outcome. If a metric cannot be measured, the spec has failed — not the implementation.
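
One way to picture this property is as a record with three parts: a measurement command, a numeric threshold, and a comparison that yields pass or fail. The sketch below is a hypothetical data model, not the project's contract format. The `Target` struct, the `Comparison` enum, and the pairing of commands with rows are assumptions. The measured values are illustrative, except the 76.2% MBPP score, which comes from §15.4.

```rust
/// How a measured value is compared against its threshold.
enum Comparison {
    AtLeast, // measured >= threshold
    AtMost,  // measured <= threshold
    Below,   // measured <  threshold
}

/// One falsifiable target: a command that produces a number, and a rule that judges it.
struct Target {
    name: &'static str,
    command: &'static str,
    comparison: Comparison,
    threshold: f64,
}

impl Target {
    /// Returns true when the measured value satisfies the target.
    fn passes(&self, measured: f64) -> bool {
        match self.comparison {
            Comparison::AtLeast => measured >= self.threshold,
            Comparison::AtMost => measured <= self.threshold,
            Comparison::Below => measured < self.threshold,
        }
    }
}

fn main() {
    // (target, measured value). Only the MBPP score is taken from this section.
    let checks = [
        (Target { name: "AC-022 MBPP gate (%)", command: "make eval-mbpp",
                  comparison: Comparison::AtLeast, threshold: 80.0 }, 76.2),
        (Target { name: "Clippy warnings", command: "cargo clippy -- -D warnings",
                  comparison: Comparison::AtMost, threshold: 0.0 }, 0.0),
        (Target { name: "Largest source file (lines)", command: "wc -l src/**/*.rs",
                  comparison: Comparison::Below, threshold: 500.0 }, 412.0),
    ];
    for (target, measured) in &checks {
        let verdict = if target.passes(*measured) { "PASS" } else { "FAIL" };
        println!("{:<28} {:<28} {:>6} -> {}", target.name, target.command, measured, verdict);
    }
}
```

A target that cannot be expressed in this shape, for example one with no command or no threshold, is exactly the case the paragraph above treats as a failure of the spec.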