# Implementation Status

Tracking table mapping spec sections to apr-leaderboard implementation. Updated as code lands.

## 19.1 Orchestration Targets (§6.2)

apr-leaderboard is a thin orchestrator — a Makefile + shell scripts — that calls apr CLI subcommands. There is no Rust source code; all ML operations are delegated to aprender.

| Make Target | Script/Command | Status | Notes |
|---|---|---|---|
| `make import` | `apr import hf://$(MODEL) -o $(CHECKPOINT)` | ✅ Working | Real HF download, GGUF and SafeTensors paths |
| `make finetune` | `apr finetune $(CHECKPOINT) --method lora ...` | ✅ Working | wgpu QLoRA (592 GFLOPS), SFT + DPO auto-detect, adapter export, 13 KAIZEN fixes |
| `make merge` | `apr merge $(MODELS) --strategy slerp ...` | ✅ Wired | SLERP/TIES/DARE/Linear |
| `make prune` | `apr prune $(CHECKPOINT) --method wanda ...` | ✅ Wired | Wanda/magnitude pruning |
| `make quantize` | `apr quantize $(CHECKPOINT) --scheme int4 ...` | ✅ Wired | INT4/INT8/Q4K/Q5K/Q6K |
| `make distill` | `apr distill $(TEACHER) --student $(STUDENT) ...` | ✅ Wired | Standard/progressive/ensemble |
| `make compile` | `apr compile $(CHECKPOINT) --release --lto` | ✅ Wired | Standalone binary compilation |
| `make eval-humaneval` | `scripts/eval-pass-at-k.sh humaneval $(CHECKPOINT)` | ✅ Working | Generate + sandbox execute + pass@k |
| `make eval-mbpp` | `scripts/eval-pass-at-k.sh mbpp $(CHECKPOINT)` | ✅ Working | Same pipeline, MBPP dataset |
| `make eval-bigcodebench` | `scripts/eval-pass-at-k.sh bigcodebench $(CHECKPOINT)` | ✅ Working | Same pipeline, BigCodeBench dataset |
| `make eval-all` | Loops over all benchmarks | ✅ Working | Runs humaneval + mbpp + bigcodebench |
| `make eval-perplexity` | `apr eval $(CHECKPOINT) --dataset wikitext-2 --json` | ✅ Working | Perplexity baseline |
| `make export` | `apr export $(CHECKPOINT) --format safetensors` | ✅ Wired | SafeTensors/GGUF/MLX/ONNX |
| `make publish` | `scripts/submit.sh $(CHECKPOINT) $(HF_REPO)` | ✅ Working | Dry-run + confirm + HF Hub upload |
| `make model-card` | `apr eval $(CHECKPOINT) --generate-card --json` | ✅ Wired | Model card generation |
| `make pipeline` | `scripts/pipeline.sh configs/recipes/$(RECIPE).yaml` | ✅ Working | Config-driven multi-stage pipeline (YAML-first) |
| `make pipeline-plan` | `scripts/pipeline.sh --plan ...` | ✅ Working | Dry-run: validate config, show commands |
| `make validate` | `bashrs config lint` + `bashrs lint` + `bashrs make lint` | ✅ Working | Sovereign stack config validation (zero Python) |
| `make check` | `apr check $(CHECKPOINT) --json` | ✅ Working | APR file integrity validation |
| `make inspect` | `apr inspect $(CHECKPOINT)` | ✅ Working | Model inspection |
| `make verify` | Smoke-tests all apr subcommands | ✅ Working | 19 subcommands verified |
| `make dogfood` | End-to-end smoke test | ✅ Working | CLI + configs validated |
| `make prove-wgpu` | `scripts/prove-wgpu.sh` | ✅ Working | wgpu training proof (§22.14) |
| `make align` | `apr finetune --method dpo/orpo` | ✅ Wired | DPO/ORPO alignment (GH-8) |
| `make book` | `mdbook build` | ✅ Working | Build specification book |
| `make docs` | `mdbook build` | ✅ Working | Alias for book |
| `make docs-serve` | `mdbook serve` | ✅ Working | Local book preview |
| `make prep-data` | `apr data prep` | 🔧 Blocked | Subcommand not wired yet (GH-12) |
| `make prep-data-audit` | `apr data audit --verbose` | ✅ Working | Detailed corpus audit |
| `make data-split` | `apr data split` | ✅ Working | Stratified train/val/test split |
| `make data-balance` | `apr data balance` | ✅ Working | Resample for class balance |
| `make finetune-instruct` | `apr finetune --task instruct` | ✅ Wired | Instruction LoRA fine-tuning |
| `make import-plan` | HF Hub check + dry-run | ✅ Working | Import plan preview |
| `make clean` | `rm -rf checkpoints/ results/` | ✅ Working | Remove build artifacts |
| `make decontaminate` | `apr data decontaminate` | 🔄 PR Open | aprender#415 + alimentar#32 (GH-11) |
| `make data-quality` | `apr data quality` | 🔧 Blocked | Subcommand not wired yet (GH-11) |
| `make qa` | `apr qa $(CHECKPOINT) --verbose` | ✅ Wired | Full model QA gate |
| `make compare-hf` | `apr compare-hf --hf $(MODEL) --json $(CHECKPOINT)` | ✅ Working | HF parity check (requires MODEL) |
| `make bench` | `apr bench $(CHECKPOINT) --json` | ✅ Working | Throughput benchmark |
| `make benchmark-download` | `scripts/download-benchmarks.sh` | ✅ Working | Download HumanEval/MBPP data |
| `make results-history` | `scripts/results-history.sh` | ✅ Working | View and compare eval results |
| `make eval-sweep` | `scripts/eval-sweep.sh` | ✅ Working | Sweep all result JSONs, tabulate pass@k |
| `make compare-results` | `scripts/compare-results.sh` | ✅ Working | Delta analysis between two result files |
| `make leaderboard` | `scripts/leaderboard-summary.sh` | ✅ Working | Generate ranked markdown leaderboard from results |
| `make check-contracts` | Inline awk + jq + python3 | ✅ Working | 15 falsification tests (pass@k, throughput, data, eval, structure) |
| `make generate-preference-pairs` | `scripts/generate-preference-pairs.sh` | ✅ Working | Generate DPO pairs from N-sampling eval (PMAT-014) |
| `make generate-training-data` | `scripts/generate-training-data.sh` | ✅ Working | Synthetic instruct pairs from teacher model (PMAT-004) |
| `make distill-generate` | `scripts/distill-generate.sh` | ✅ Working | Text-based distillation: 32B teacher completions (PMAT-007) |
| `make distill-finetune` | `apr finetune --method qlora` | ✅ Wired | QLoRA fine-tune 7B on teacher completions (PMAT-007) |
| `make distill-eval` | `scripts/eval-pass-at-k.sh` | ✅ Wired | Evaluate distilled model on HumanEval (PMAT-007) |
| `make combine-training-data` | `scripts/combine-training-data.sh` | ✅ Working | Merge distill + instruct data for QLoRA (PMAT-008) |
| `make validate-teacher` | `scripts/validate-teacher.sh` | ✅ Working | Verify teacher model quality before distillation (§12.2) |
| `make failure-analysis` | `scripts/failure-analysis.sh` | ✅ Working | Always-fail/borderline/always-pass categorization |

## 19.2 Shell Scripts

| Script | Purpose | Status |
|---|---|---|
| `scripts/eval-pass-at-k.sh` | Download benchmark → generate completions via `apr run` → strip markdown fences → sandbox execute (python3/Docker) → Chen et al. unbiased pass@k estimator → write JSON | ✅ Working |
| `scripts/pipeline.sh` | Parse recipe YAML (bash-native) → determine stages → execute sequentially with eval config (prompt_strategy, max_tokens) → `--plan` dry-run | ✅ Working |
| `scripts/submit.sh` | Pre-submission checks (§14.4) → export SafeTensors → model card → dry-run → publish to HF Hub | ✅ Working |
| `scripts/import.sh` | Wrapper around `apr import` with HF Hub reachability check + `apr check` validation | ✅ Working |
| `scripts/prove-wgpu.sh` | End-to-end wgpu training proof: import → train (QLoRA) → verify → report | ✅ Working |
| `scripts/download-benchmarks.sh` | Download HumanEval/MBPP benchmark data for eval + decontamination | ✅ Working |
| `scripts/results-history.sh` | View and compare evaluation results with filtering by benchmark/model | ✅ Working |
| `scripts/leaderboard-summary.sh` | Generate ranked markdown leaderboard from all result JSONs | ✅ Working |
| `scripts/eval-sweep.sh` | Run eval across multiple prompt strategies sequentially | ✅ Working |
| `scripts/compare-results.sh` | Per-problem delta analysis between two result files | ✅ Working |
| `scripts/distill-generate.sh` | 32B teacher batch inference → coding completions JSONL (PMAT-007) | ✅ Working |
| `scripts/generate-distill-prompts.sh` | Generate targeted distillation prompts from HumanEval failure analysis | ✅ Working |
| `scripts/combine-training-data.sh` | Merge teacher completions + instruct corpus, deduplicate, shuffle | ✅ Working |
| `scripts/validate-teacher.sh` | Validate teacher model meets minimum pass@1 threshold for distillation | ✅ Working |
| `scripts/failure-analysis.sh` | Analyze HumanEval failures: always-fail, borderline, always-pass | ✅ Working |
| `scripts/oracle-analysis.sh` | Compute oracle upper bound across all runs and strategies | ✅ Working |
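The unbiased pass@k estimator that `scripts/eval-pass-at-k.sh` applies (Chen et al., 2021) fits in a few lines. This is a minimal Python sketch of the formula, not the script's actual shell/awk/jq implementation; `n` is completions sampled per problem, `c` the number that pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): 1 - C(n-c, k) / C(n, k).

    Estimates the probability that at least one of k samples passes,
    avoiding the bias of the naive 1 - (1 - c/n)**k estimator.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Benchmark score: mean per-problem pass@k over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a problem with 3 passing completions out of 10 sampled contributes pass@1 = 1 - C(7,1)/C(10,1) = 0.3.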

## 19.3 Quality Metrics

| Metric | Current | Target | Gate / Notes |
|---|---|---|---|
| `apr` CLI version | 0.4.11 | ≥ 0.4.10 | `apr --version` |
| Subcommand smoke test | 19/19 OK | 19/19 | `make verify` |
| YAML configs | 24 | — | models (7) + recipes (11) + eval (1) + pipeline (2) + data catalog (1) + distill (1) + data governance (1) |
| Shell scripts | 22 + 4 canaries | — | 22 pipeline scripts + 4 GPU canary/falsification scripts |
| Makefile targets | 56 | — | `make verify` + `make validate` + `make dogfood` |
| Contract tests | 67/68 | 68/68 | `make check-contracts`: 18 categories + structure ×29; 1 fail: MBPP gate |
| Contract YAMLs | 28 | — | 28 provable contract YAMLs. New: binding-coverage, hf-parity, ties-sign-resolution |
| Make targets | 57 | — | All wired to real `apr` CLI |
| PMAT work items | 8 | — | PMAT-006 (done), PMAT-007 (done-pipeline, merge re-run pending matmul fix), PMAT-008 (ready), PMAT-010 (pending), PMAT-011 (pending), PMAT-014 (in progress, 28%), PMAT-017 (done), PMAT-037 (done). See §27 |
| Spec sections | 27 | — | §1-27: v2.5.1 update cycle |
| Config validity | 22/22 | 22/22 | `bashrs config lint` in `make validate` (zero Python) |
| Pipeline stages | 12 | — | import → distill → finetune → align → merge → prune → quantize → eval → submit → compile |

## 19.4 Config Templates (§4)

| Config | Location | Model | Strategy | Status |
|---|---|---|---|---|
| `qwen-coder-7b.yaml` | configs/models/ | Qwen2.5-Coder-7B | LoRA finetune → eval | ✅ Complete |
| `qwen-coder-32b.yaml` | configs/models/ | Qwen2.5-Coder-32B | Eval only (q8) | ✅ Complete |
| `qwen-coder-1.5b.yaml` | configs/models/ | Qwen2.5-Coder-1.5B | QLoRA → prune → INT4 → compile | ✅ Complete |
| `deepseek-r1-distill-7b.yaml` | configs/models/ | DeepSeek-R1-Distill-Qwen-7B | DPO align → prune → INT4 | ✅ Complete |
| `phi-4.yaml` | configs/models/ | Phi-4 | LoRA finetune → INT8 | ✅ Complete |
| `qwen3-4b.yaml` | configs/models/ | Qwen3-4B | Thinking model eval (§22.17) | ✅ Complete |
| `qwen3-8b.yaml` | configs/models/ | Qwen3-8B | QLoRA instruct + eval | ✅ Complete |
| `recipe-a-quick-lora.yaml` | configs/recipes/ | Qwen2.5-Coder-7B-Instruct | Quick LoRA (§9.1) | ✅ Complete |
| `recipe-b-merge-alchemist.yaml` | configs/recipes/ | Qwen2.5-Coder-7B-Instruct | Zero-training merge (§9.2) | ✅ Complete |
| `recipe-c-full-pipeline.yaml` | configs/recipes/ | Qwen2.5-Coder-7B | Full pipeline (§9.3) | ✅ Complete |
| `recipe-d-sovereign-binary.yaml` | configs/recipes/ | Qwen2.5-Coder-1.5B | Sovereign binary (§9.4) | ✅ Complete |
| `recipe-e-instruct-finetune.yaml` | configs/recipes/ | Qwen2.5-Coder-7B-Instruct | Instruct fine-tune (§9.5) | ✅ Complete |
| `recipe-f-qwen3-qlora.yaml` | configs/recipes/ | Qwen3-8B | QLoRA instruct pipeline (§9.6) | ✅ Complete |
| `recipe-g-wgpu-proof.yaml` | configs/recipes/ | Qwen2.5-Coder-1.5B | wgpu training proof (§22.14) | ✅ Complete |
| `recipe-h-32b-distill.yaml` | configs/recipes/ | Qwen2.5-Coder-7B-Instruct | 32B→7B reasoning distillation | ✅ Complete |
| `recipe-i-humaneval-qlora.yaml` | configs/recipes/ | Qwen2.5-Coder-7B-Instruct | QLoRA on teacher+instruct data (PMAT-008) | ✅ Complete |
| `recipe-j-merge-specialists.yaml` | configs/recipes/ | Qwen2.5-Coder-7B-Instruct | TIES merge code+reasoning specialists (PMAT-010) | ✅ Complete |
| `recipe-k-final-artifact.yaml` | configs/recipes/ | Qwen2.5-Coder-7B-Instruct | Prune+quantize+compile final submission (PMAT-011) | ✅ Complete |
| `distill-32b-7b-text.yaml` | configs/distill/ | Qwen2.5-Coder-7B-Instruct | Text-based distillation config (PMAT-007) | ✅ Complete |
| `coding-benchmarks.yaml` | configs/eval/ | — | Benchmark suite definitions + targets + baselines | ✅ Complete |
| `leaderboard.yaml` | configs/pipeline/ | — | Forjar infrastructure manifest | ✅ Complete |
| `leaderboard-playbook.yaml` | configs/pipeline/ | — | Batuta playbook DAG | ✅ Complete |
| `data_catalog.yaml` | root | — | Data governance, lineage, classification | ✅ Complete |

### 19.4.1 GPU Sharing Infrastructure (entrenar)

The GPU-SHARE specification is fully implemented in entrenar with 143 tests across all modules.

| Component | Module | Status | Tests |
|---|---|---|---|
| VRAM guard | `entrenar::gpu::guard` | ✅ Complete | 12 |
| VRAM ledger (flock + JSON) | `entrenar::gpu::ledger` | ✅ Complete | 15 |
| Wait-for-VRAM queue | `entrenar::gpu::wait` | ✅ Complete | 8 |
| GPU profiler | `entrenar::gpu::profiler` | ✅ Complete | 6 |
| MPS (experimental) | `entrenar::gpu::mps` | ✅ Complete | 11 |
| Cluster config | `entrenar::gpu::cluster` | ✅ Complete | 12 |
| Job placement | `entrenar::gpu::placement` | ✅ Complete | 10 |
| Checkpoint coordinator | `entrenar::gpu::coordinator` | ✅ Complete | 16 |
| Multi-adapter pipeline | `entrenar::finetune::multi_adapter_pipeline` | ✅ Complete | 18 |

CLI flags: `--wait-gpu`, `--vram`, `--experimental-mps`, `--gpu-share`, `--adapters`, `--adapters-config`

## 19.5 apr CLI Subcommand Availability

All ML operations are provided by `apr` CLI v0.4.11, verified via `make verify`:

| apr Subcommand | Status | Used By |
|---|---|---|
| `apr import` | ✅ OK | `make import`, `scripts/import.sh`, `scripts/pipeline.sh` |
| `apr run` | ✅ OK | `scripts/eval-pass-at-k.sh` (generate completions), `--batch-jsonl` batch mode |
| `apr serve` | ✅ OK | (HTTP API — partial: doesn't bind for `.apr` files) |
| `apr chat` | ✅ OK | (interactive — not used by pipeline) |
| `apr finetune` | ⚠️ Partial | Training loop runs on gx10 with CUDA (backward GEMM f64 fix, GH-561). Loss: 13.61 train → 12.02 val on 3-sample test. APR adapter export (§26 Phase 3) not yet implemented. |
| `apr merge` | ✅ OK | `make merge`, `scripts/pipeline.sh` |
| `apr prune` | ✅ OK | `make prune`, `scripts/pipeline.sh` |
| `apr quantize` | ✅ OK | `make quantize`, `scripts/pipeline.sh` |
| `apr distill` | ✅ OK | `make distill`, `scripts/pipeline.sh` |
| `apr eval` | ✅ OK | `make eval-perplexity`, `make model-card` |
| `apr export` | ✅ OK | `make export`, `scripts/submit.sh` |
| `apr publish` | ✅ OK | `scripts/submit.sh` |
| `apr check` | ✅ OK | `make check`, `scripts/import.sh` |
| `apr compile` | ✅ OK | `make compile`, `scripts/pipeline.sh` |
| `apr bench` | ✅ OK | (latency benchmarks — not used by pipeline) |
| `apr inspect` | ✅ OK | `make inspect` |
| `apr data` | ✅ OK | `make prep-data`, `make decontaminate`, `make prep-data-audit` |
| `apr qa` | ✅ OK | `make qa` |
| `apr compare-hf` | ✅ OK | `make compare-hf` |

## 19.6 Dogfooding Findings

End-to-end dogfooding with real model import and inference. See also §22 for detailed findings.

### 19.6.1 GGUF vs SafeTensors Import Path

SafeTensors imports produce F16/BF16 tensors that realizar cannot run inference on (fused matmul only supports Q4_0/Q8_0/Q4_K/Q5_K/Q6_K). GGUF import (pre-quantized Q4_K_M) is the working path: it produces runnable models with an embedded tokenizer.

| Import Path | `apr check` Score | Inference | Notes |
|---|---|---|---|
| SafeTensors (F16) | F (3/100) | Fails | "Fused matmul only supports Q4_0/Q8_0/Q4_K/Q5_K/Q6_K, got type 30" |
| GGUF (Q4_K_M) | B+ (85/100) | Works | 10/10 validation stages, real code generation |

### 19.6.2 GPU Inference Status

GPU inference uses wgpu (Vulkan/Metal/DX12) or CUDA (optional). GPU is mandatory for production eval.

**Status (2026-03-28): FIXED** — single-prompt and batch mode both working via wgpu.

- Single-prompt `apr run --gpu`: wgpu (Vulkan), cosine=0.999863, token-for-token parity.
- Batch `--batch-jsonl`: GH-560 FIXED (2026-03-28) — two bugs: an FFN buffer overflow in trueno (`attn_out_buf` was sized for hidden_dim=3584 but needs intermediate_dim=18944; fix: use `ffn_silu_buf`) and a KV cache pre-filled to full length in realizar (`vec![0.0; ...]` replaced by `Vec::with_capacity()` + `clear()`). Verified on gx10: identical output to CPU, 1.1-2.0 tok/s on 7B. Contract-bound: gpu-weight-residency-v1 + gpu-multi-backend-parity-v1.
- CPU batch (default): proven reliable, ~3 hours for 164 HumanEval problems, 84.76-85.98% pass@1.
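The KV-cache half of the GH-560 fix has a simple language-agnostic analogue, sketched here in Python: a buffer pre-filled with zeros already reports a nonzero length, so new entries land after stale data instead of at position 0.

```python
# Buggy pattern: pre-filling for "capacity" also sets the length,
# so the first real entry is appended after four stale zeros.
buggy_cache = [0.0] * 4
buggy_cache.append(1.5)        # -> [0.0, 0.0, 0.0, 0.0, 1.5]

# Fixed pattern: keep length 0 and let entries start at position 0
# (the Rust fix reserves capacity separately via Vec::with_capacity).
fixed_cache = []
fixed_cache.append(1.5)        # -> [1.5]
```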

The CUDA cosine=-0.005 on sm_121 (GH-559) is NOT a JIT bug — falsification proved the PTX and JIT are both correct. Individual kernels produce correct results (RMSNorm diff=5e-7, Q GEMV ~1%). The -0.005 cosine is from FP32 accumulation ordering differences (GPU parallel vs CPU sequential) compounding through 28 layers × 10+ operations. wgpu avoids this by using the same accumulation order as CPU (cosine=0.999863).

See §25 (GPU Compute Architecture) for full specification, provable contracts, and roadmap.

Diagnostic trail (2026-03-25 → 2026-03-27):

| Hypothesis | Tested | Result | Falsified by |
|---|---|---|---|
| RMSNorm kernel wrong | GPU_DEBUG=1, CPU bypass | Individual RMSNorm diff=5e-7 (correct) | Per-element comparison |
| Q4K GEMV kernel wrong | 5 PTX variants | All produce cosine=1.0 via Python ctypes | falsify-ptx-implementations.py |
| NVIDIA JIT compiler bug | Same PTX via Python | cosine=1.0 (JIT correct) | isolate-cuda-bug.py |
| Stream sync race | bar.sync per layer | Fixes no-op layers, not cosine | Per-layer sync test |
| FP32 accumulation ordering | — | Correct root cause | Not falsified |

Corrected root cause (2026-03-27): ~0.1% FP32 rounding per kernel × 280 operations → (1.001)^280 ≈ 1.32 → cosine=-0.005. Individual kernels are correct (RMSNorm diff=5e-7, Q GEMV ~1%). PyTorch avoids this via TF32/FP64 accumulators; wgpu avoids it with sequential accumulation matching the CPU.
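The ordering effect is easy to reproduce on a CPU, since IEEE floating-point addition is not associative: regrouping the same operands changes the result, and per-operation errors of roughly 0.1% compound multiplicatively over 280 operations. A minimal float64 demonstration:

```python
a, b = 1e16, 1.0

# Same operands, different association. 1.0 is below one ULP of 1e16
# (ULP = 2 at this magnitude), so it is absorbed when added first.
left = (a + b) - a    # -> 0.0: b vanished inside the large sum
right = (a - a) + b   # -> 1.0: cancellation first, then b survives

# Compounding model from the text: ~0.1% per operation, 280 operations.
growth = 1.001 ** 280  # ~1.32, i.e. a small per-kernel error blows up
```

The GPU and CPU disagree in exactly this way: parallel reduction groups the additions differently from the CPU's sequential loop, so each kernel's tiny rounding difference compounds across 28 layers.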

Active tickets:

- GH-560: CLOSED (2026-03-28) — wgpu batch fully working. Two-bug fix: trueno e24a6f6c + realizar e600bbff.
- GH-561: IN PROGRESS — FP64 accumulators in NF4 GEMM forward + backward. Forward NF4 GEMM was fixed previously (trueno 9e021c35, 81a9c16f). Backward GEMM (6 variants) is now also fixed with f64 accumulators; training verified on gx10: loss 13.61 → 12.02, no NaN. Remaining: other kernels (RMSNorm backward, softmax backward, etc.) still use f32 accumulators but are lower priority, since training converges without them.

### 19.6.3 `apr serve` for `.apr` Files

`apr serve` loads `.apr` models, but the HTTP server doesn't bind; serve may only be implemented for raw GGUF files. `apr run` works correctly for single-prompt inference.

### 19.6.4 Pipeline Ordering Validation

Recipe B (merge-alchemist) correctly emits a warning:

```
WARNING: Merge without finetune: merging untrained variants is suboptimal.
```

The §10 golden ordering enforcement works. The pipeline allows violation but warns.

### 19.6.5 Real Inference Verified

`apr run checkpoints/qwen2.5-coder-1.5b-q4k.apr "def fibonacci(n):" --max-tokens 128` generates real Python code (a Fibonacci implementation). GPU is mandatory for production eval.

### 19.6.6 GPU Sharing Spec Complete

All three phases of the GPU-SHARE specification implemented and tested:

- Phase 1: VRAM guard prevents OOM crashes. The ledger uses flock + atomic JSON writes for crash safety. The wait queue polls until the VRAM budget is available. MPS is available as an `--experimental-mps` opt-in.
- Phase 2: The multi-adapter pipeline loads the base model once and trains N LoRA adapters concurrently (3x VRAM savings for 3 adapters). Round-robin and priority scheduling. TOML config via `--adapters-config`.
- Phase 3: Cluster config (YAML), job placement (VRAM-aware scoring), SSH transport (real `std::process::Command`, not stubs), checkpoint coordination with the leaderboard, health checks via SSH.
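The Phase 1 crash-safety pattern (exclusive flock, then write-to-temp + fsync + rename) can be sketched in Python. This is an illustrative sketch of the pattern only — the ledger path, lock-file convention, and field layout here are assumptions, not entrenar's actual Rust schema:

```python
import fcntl, json, os, tempfile

LEDGER = "/tmp/vram-ledger.json"  # hypothetical path for illustration

def reserve_vram(job_id: str, vram_mb: int) -> dict:
    """Record a VRAM reservation crash-safely.

    A sidecar lock file serializes writers; the ledger itself is
    replaced atomically so readers never observe a torn JSON file.
    """
    with open(LEDGER + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # exclusive advisory lock
        try:
            with open(LEDGER) as f:
                ledger = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            ledger = {}                   # fresh or corrupt ledger
        ledger[job_id] = vram_mb
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(LEDGER) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(ledger, f)
            f.flush()
            os.fsync(f.fileno())          # durable before rename
        os.replace(tmp, LEDGER)           # atomic on POSIX
        return ledger
```

A crashed writer leaves either the old ledger or the new one, never a partial file, which is the property the spec's "flock + atomic JSON write" wording describes.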

143 GPU tests pass. Zero SATD. Examples: gpu_ledger, multi_adapter_training, cluster_training.

### 19.6.7 QA Gate (2026-03-05)

`apr qa` on Qwen2.5-Coder-1.5B-Instruct Q4K: 6 PASS (capability, tensor contract, metadata, golden output, throughput, perf regression), 1 FAIL (format parity — GH-13: `.apr`-wrapped GGUF not recognized), 5 SKIP (no CUDA).

### 19.6.8 Perplexity Baseline (2026-03-05)

`apr eval --dataset wikitext-2`: perplexity 6.63, cross-entropy 1.89. Throughput: 2.5 tok/s on CPU, 385 ms TTFT.
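The two baseline numbers are mutually consistent: perplexity is the exponential of the cross-entropy (in nats per token), so either value sanity-checks the other:

```python
import math

cross_entropy = 1.89                  # nats/token, as reported above
perplexity = math.exp(cross_entropy)  # exp(1.89) ~ 6.62, vs reported 6.63
# The small gap is rounding: the reported cross-entropy is truncated
# to two decimals, and exp() amplifies that truncation slightly.
```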

### 19.6.9 MBPP Eval (2026-03-29)

MBPP result: 74.80% pass@1 (374/500), few-shot, 7B Q4K. Duplicate MBPP eval runs on the Intel box were killed; they had been burning 32 cores for 4 days with no additional value over the completed result.

### 19.6.10 Tokenizer Preservation Fix — GH-580 (2026-04-03)

Problem: all merge/quantize pipeline outputs lost their embedded tokenizer, producing dead models that fail with `PMAT-172 ERROR: APR file missing embedded tokenizer`.

Five Whys:

  1. Why can't the distilled model run inference? Missing tokenizer.
  2. Why is it missing? `run_merge()` used `AprWriter` (v1), which creates an empty tokenizer.
  3. Why empty? `AprWriter` v1 only writes weight tensors, not metadata sections.
  4. Why not v2? The original code predated `AprV2Writer`.
  5. Why wasn't it caught? `apr check` passes (it validates weights), but `apr run` fails (it needs the tokenizer for encoding).

Fix (GH-580): read the base model with `AprV2Reader`, clone its metadata (preserving the tokenizer), and write the output with `AprV2Writer`. The fix also supports SafeTensors adapter input from the wgpu training pipeline. Contract: `tokenizer-preservation-v1.yaml`.

Impact: Unblocks PMAT-007 eval, PMAT-008 DPO merge, PMAT-010 TIES merge. All merge operations now produce runnable models.

### 19.6.11 PMAT-007 Distillation Pipeline Complete (2026-04-03)

Full text-based distillation pipeline ran on gx10:

  1. 99 teacher completions generated (32B model)
  2. Combined with instruct corpus (15,326 lines)
  3. QLoRA training: 7B on combined data, rank=32
  4. Adapter exported: 40 MB safetensors
  5. Merged into base 7B model (GH-580 fix)
  6. Quantized to Q4K (6.2 GB)

Awaiting: HumanEval + MBPP evaluation of distilled Q4K model.