What This Repo Does
1.1 Purpose
apr-leaderboard is a pipeline harness that proves the sovereign AI stack — aprender, entrenar, trueno — can compete on HuggingFace code generation leaderboards (HumanEval, MBPP, BigCodeBench) without Python, without the HuggingFace Transformers library, and without GPU vendor lock-in.
It is not a model training framework. It is not a general ML toolkit. It is a thin orchestration layer — a Makefile (57 targets), 24 shell scripts, 22 YAML configs, 29 provable contracts, a batuta playbook, and a forjar infrastructure manifest — that wires the sovereign stack's existing capabilities into a reproducible, config-driven leaderboard pipeline:
apr import → apr distill → apr finetune → apr merge → apr prune → apr quantize → apr eval → apr submit
Every command above is provided by aprender (apr CLI). This repo provides the pipeline config, benchmark metadata, result persistence, and the spec that defines the strategy.
1.2 What It Proves
This repo exists to answer one falsifiable question:
Can a single Rust binary (apr) match Python-ecosystem HumanEval/MBPP scores for Qwen2.5-Coder-7B, with zero Python dependencies?
If the answer is yes, it proves:
- aprender can import, infer, and evaluate HuggingFace models via the .apr format
- entrenar can fine-tune those models with LoRA/QLoRA using its own autograd engine
- trueno can run transformer attention at competitive throughput via SIMD (CPU) and wgpu (any GPU)
- The full distill → finetune → merge → prune → quantize pipeline works end-to-end in pure Rust — on any GPU vendor
- provable-contracts kernel verification (Kani bounded model checking) doesn't prevent competitive performance — correctness and speed coexist
If the answer is no, it identifies exactly where the sovereign stack falls short (inference parity gap, training convergence, quantization quality loss) via apr compare-hf.
1.3 How It Relates to aprender
┌──────────────────────────────────────────────────────────┐
│ apr-leaderboard │
│ │
│ Makefile YAML configs Shell scripts │
│ (dev convenience) (models/recipes/ (24 scripts) │
│ eval/pipeline) │
│ │
│ ┌──────────────── calls ─────────────────────────────┐ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────────────────────────────────────────┐ │ │
│ │ aprender (apr CLI) │ │ │
│ │ │ │ │
│ │ import distill finetune merge prune │ │ │
│ │ quantize eval bench compile chat │ │ │
│ │ compare-hf qa check publish export │ │ │
│ │ │ │ │
│ │ ┌─────────┐ ┌──────────┐ ┌─────────┐ │ │ │
│ │ │ entrenar│ │ trueno │ │provable │ │ │ │
│ │ │ LoRA │ │ SIMD │ │contracts│ │ │ │
│ │ │ QLoRA │ │ AVX2/NEON│ │ Kani │ │ │ │
│ │ │ AdamW │ │ wgpu GPU │ │ L1-L4 │ │ │ │
│ │ │ autograd│ │ Q4K/Q6K │ │ proofs │ │ │ │
│ │ └─────────┘ └──────────┘ └─────────┘ │ │ │
│ └──────────────────────────────────────────────────┘ │ │
│ │ │
└────────────────────────────────────────────────────────┘ │
│
pmat comply ◄───── quality gate ─────────────────────────┘
apr-leaderboard does NOT reimplement aprender. It calls apr subcommands via Makefile targets and shell scripts. The relationship is:
| Layer | Repo | Responsibility |
|---|---|---|
| Orchestration | apr-leaderboard | Makefile targets, shell scripts, pipeline configs, benchmark metadata, result tracking, strategy spec |
| ML Operations | aprender (apr CLI) | Model import, inference, eval, distillation, merging, pruning, quantization |
| Training | entrenar | LoRA/QLoRA, autograd, optimizers, gradient checkpointing |
| Compute | trueno | SIMD tensor ops, wgpu GPU kernels, quantized matmul |
| Correctness | provable-contracts | Kernel contracts, Kani proofs, falsification tests |
| Quality | pmat comply | Compliance checks, spec scoring, cross-crate consistency |
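To make that boundary concrete, here is a minimal sketch of what an orchestration-layer script amounts to, modeled loosely on the behavior described for scripts/import.sh in §1.4 (HF Hub reachability check, apr import, apr check). It is illustrative only: the checkpoint naming convention and the exact apr arguments are assumptions, not the real script.

```bash
#!/usr/bin/env bash
# Illustrative sketch of the orchestration layer's shape, NOT the real scripts/import.sh.
# All ML work is delegated to the apr CLI; the apr argument shapes and the
# checkpoint naming convention below are assumptions for illustration.
set -euo pipefail

MODEL="${1:?usage: import.sh <hf-org/model-name>}"
CHECKPOINT="checkpoints/$(echo "$MODEL" | tr 'A-Z/' 'a-z_').apr"

# 1. HF Hub reachability check before attempting the import
curl --silent --fail --head "https://huggingface.co/${MODEL}" > /dev/null

# 2. Delegate the import itself to aprender
apr import "$MODEL"

# 3. Validate the resulting .apr artifact
apr check "$CHECKPOINT"
```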
1.4 Current Implementation Status
All orchestration is implemented via Makefile + shell scripts. Every make target calls real apr CLI subcommands.
| Component | Status | What It Does |
|---|---|---|
| Makefile | Working | Dev convenience: import, finetune, merge, prune, quantize, distill, compile, eval-*, export, publish, pipeline, verify, validate, dogfood, prove-wgpu |
| scripts/eval-pass-at-k.sh | Working | Downloads benchmark data, generates completions via apr run, executes in sandbox, computes pass@k |
| scripts/pipeline.sh | Working | Parses recipe YAML (bash-native, zero Python), runs stages sequentially, supports --plan dry runs and an explicit stages: list (see the sketch below this table) |
| scripts/submit.sh | Working | Exports to SafeTensors, generates model card, publishes to HF Hub with dry-run confirmation |
| scripts/import.sh | Working | Wraps apr import with HF Hub reachability check and apr check validation |
| scripts/prove-wgpu.sh | Working | End-to-end wgpu training proof: import → QLoRA train → verify GPU backend |
| configs/models/ | Complete | 7 YAML model configs (Qwen-7B, Qwen-32B, Qwen-1.5B, Qwen3-4B, Qwen3-8B, DeepSeek-R1-7B, Phi-4) |
| configs/recipes/ | Complete | 11 YAML recipe configs (A-K: quick-lora, merge-alchemist, full-pipeline, sovereign-binary, instruct-finetune, qwen3-qlora, wgpu-proof, 32b-distill, humaneval-qlora, merge-specialists, final-artifact) |
| configs/eval/ | Complete | Eval suite YAML with benchmark definitions, targets, and baselines |
| configs/pipeline/ | Complete | Forjar infra manifest + batuta playbook DAG |
| data_catalog.yaml | Complete | Data governance: datasets, lineage, classification, lifecycle |
| docs/ | Complete | Strategy spec (mdbook), 27 sections covering full pipeline |
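As noted in the scripts/pipeline.sh row above, recipes are parsed with bash alone. A minimal sketch of that approach follows; the recipe field layout and the way stages are dispatched are assumptions for illustration, not the actual script:

```bash
#!/usr/bin/env bash
# Minimal sketch of bash-native recipe parsing in the spirit of scripts/pipeline.sh.
# Assumed recipe shape (illustrative only):
#   stages:
#     - import
#     - finetune
#     - quantize
set -euo pipefail

RECIPE="${1:-configs/recipes/recipe-a-quick-lora.yaml}"
MODE="${2:-}"   # pass --plan for a dry run

# Collect the items under the top-level `stages:` list without any Python/YAML library
mapfile -t STAGES < <(awk '
  /^stages:/           { in_stages = 1; next }
  in_stages && /^  - / { sub(/^  - /, ""); print; next }
  in_stages && !/^  /  { in_stages = 0 }
' "$RECIPE")

for stage in "${STAGES[@]}"; do
  if [[ "$MODE" == "--plan" ]]; then
    echo "plan: apr $stage"
  else
    apr "$stage"   # the real script forwards per-stage arguments from the recipe
  fi
done
```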
Quality:
- All 22 YAML configs valid (make validate); 24 scripts; 19/19 apr subcommands verified; 29 provable contracts with 96 proof obligations.
- Real model import and inference tested with Qwen2.5-Coder-1.5B, 7B, 32B, and Qwen3-4B.
- Zero Python scripts; zero TOML configs (migrated to YAML).
- pass@k computed with the Chen et al. unbiased estimator; 5 prompt strategies (standard, scot, few-shot, cgo, default).
- Best results: HumanEval 90.85% (32B), 87.20% (7B few-shot); MBPP 76.20% (7B + test assertions).
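For reference, the Chen et al. unbiased pass@k estimator mentioned above is the standard one: with n completions sampled per problem and c of them passing the tests, the benchmark score is

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]$$

averaged over all problems; the pass@1 figures above are this quantity with k = 1.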
GPU sharing infrastructure: 143 tests across 9 entrenar modules (VRAM guard, ledger, wait queue, profiler, MPS, cluster config, placement, coordinator, multi-adapter pipeline). See §22 for details.
1.5 How People Use It
For leaderboard competitors:
# 1. Verify the pipeline
make verify
# 2. Import a model from HuggingFace
make import MODEL=Qwen/Qwen2.5-Coder-7B-Instruct
# 3. Evaluate on benchmarks
make eval-humaneval CHECKPOINT=checkpoints/qwen_qwen2.5-coder-7b-instruct.apr
make eval-all CHECKPOINT=checkpoints/qwen_qwen2.5-coder-7b-instruct.apr
# 4. Optimize (quantize, prune, merge, etc.)
make quantize CHECKPOINT=checkpoints/base.apr SCHEME=int4
make prune CHECKPOINT=checkpoints/base.apr PRUNE_METHOD=wanda SPARSITY=0.5
# 5. Run a full recipe pipeline
make pipeline RECIPE=recipe-a-quick-lora
# 6. Submit to HuggingFace Hub
make publish CHECKPOINT=checkpoints/model.apr HF_REPO=org/model-name
For sovereign stack developers:
This repo is an integration test for the sovereign stack. If make pipeline produces competitive scores, the stack works. If it doesn't, the per-step eval results pinpoint the weak component.
# Run baseline parity check
make import MODEL=Qwen/Qwen2.5-Coder-7B-Instruct
apr run checkpoints/qwen_qwen2.5-coder-7b-instruct.apr \
--prompt "def fibonacci(n):" --max-tokens 256
apr eval checkpoints/qwen_qwen2.5-coder-7b-instruct.apr --dataset wikitext-2
apr bench checkpoints/qwen_qwen2.5-coder-7b-instruct.apr --json
For researchers:
The spec (this document) is the experimental protocol. The recipes in §9 are reproducible experiments. The acceptance criteria in §18 are the pass/fail conditions. Run them, report results, falsify or validate the thesis.