CLI Toolchain

Two layers work together: apr (upstream aprender — ML operations) and make (this repo — orchestration via Makefile + shell scripts). Every technique maps to a single shell command. Our competitors use 500-line Python scripts; we use one-liners.

6.1 The apr CLI (aprender)

The upstream apr binary provides all ML operations. The Makefile and shell scripts call these under the hood.

6.1.1 Import (HF → APR)

# Import from HuggingFace Hub — auto-detects architecture
apr import hf://Qwen/Qwen2.5-Coder-7B -o qwen-7b.apr --arch qwen2

# Import with quantization on ingest
apr import hf://Qwen/Qwen2.5-Coder-32B -o qwen-32b-q8.apr --quantize int8

# Import GGUF with provenance enforcement
apr import qwen-7b.gguf -o qwen-7b.apr --enforce-provenance

6.1.2 Batch Inference (GH-batch)

# Batch inference: load model + CUDA JIT once, process all prompts sequentially
# Eliminates ~80s per-invocation overhead on gx10 sm_121 Blackwell GPU
apr run model.apr --batch-jsonl prompts.jsonl --max-tokens 512

# GPU: auto-dispatches CUDA → wgpu (Vulkan) → CPU.
# wgpu batch WORKS (GH-560 fixed 2026-03-28): identical output to CPU, 1.1-2.0 tok/s on 7B.
# CUDA still broken (cosine=-0.005, GH-561 pending). wgpu is the production GPU path.

Input format (JSONL):

{"prompt": "def fibonacci(n):", "task_id": "HumanEval/0", "max_tokens": 512}
{"prompt": "def add(a, b):", "task_id": "HumanEval/1"}

Output format (JSONL, one line per prompt):

{"task_id": "HumanEval/0", "text": "...", "tokens_generated": 85, "tok_per_sec": 14.2, "inference_ms": 5986.0, "used_gpu": true}
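Because each line of output is a self-contained JSON record, run-level statistics fall out of a few lines of Python. A minimal sketch using only the documented fields (`tokens_generated`, `tok_per_sec`, `used_gpu`):

```python
import json

def summarize(jsonl_text):
    """Aggregate batch-inference results: totals, mean throughput, GPU usage."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return {
        "prompts": len(records),
        "total_tokens": sum(r["tokens_generated"] for r in records),
        "mean_tok_per_sec": sum(r["tok_per_sec"] for r in records) / len(records),
        "gpu_fraction": sum(r["used_gpu"] for r in records) / len(records),
    }

sample = '{"task_id": "HumanEval/0", "text": "...", "tokens_generated": 85, "tok_per_sec": 14.2, "inference_ms": 5986.0, "used_gpu": true}'
print(summarize(sample))
```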

Sampling flags (also available in batch mode):

Flag            Default    Description
--temperature   0.0        Sampling temperature (0.0 = greedy)
--top-k         1          Top-k sampling (1 = greedy)
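How the two flags interact can be sketched as follows (an illustrative sampler, not aprender's actual implementation): logits are temperature-scaled, only the top-k candidates remain eligible, and either default (temperature 0.0 or top-k 1) degenerates to greedy argmax.

```python
import math, random

def sample_token(logits, temperature=0.0, top_k=1):
    """Pick a token index from raw logits; temperature 0.0 or top_k 1 is greedy."""
    if temperature == 0.0 or top_k == 1:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy argmax
    # keep the top_k highest logits, temperature-scale, softmax, then sample
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    r = random.random() * sum(probs)
    for i, p in zip(top, probs):
        r -= p
        if r <= 0:
            return i
    return top[-1]
```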

apr run auto-detects the model format (GGUF or APR) and auto-dispatches the GPU backend: CUDA (behind a parity gate) → wgpu (Vulkan) → CPU. On Blackwell sm_121, CUDA is blocked by the parity gate (cosine=-0.005, GH-561). wgpu batch works after the GH-560 two-bug fix: an FFN buffer overflow in trueno (attn_out_buf was sized for hidden_dim but needs intermediate_dim) and a pre-filled KV cache length in realizar. Never bypass the parity gate — fix the root cause. The model stays resident across all prompts.

6.1.3 Evaluate (Baseline)

# Perplexity baseline
apr eval qwen-7b.apr --dataset wikitext-2 --threshold 20.0

# Classification eval with custom data
apr eval qwen-7b.apr --task classify --data humaneval.jsonl --json

6.1.4 Instruction Fine-tuning (GH-371)

# Instruction fine-tuning with LoRA on Q/V projections
apr finetune model.apr --task instruct --data instruct.jsonl --epochs 3 --rank 16

# QLoRA on consumer GPU (NF4 base + FP16 adapters, ~4.5 GB VRAM)
apr finetune model.apr --task instruct --method qlora --quantize-nf4 \
    --data instruct.jsonl --rank 16 --vram 8 --max-seq-len 512

# Multi-adapter concurrent training (GPU-SHARE)
apr finetune model.apr --task instruct --method qlora --quantize-nf4 \
    --adapters-config adapters.toml

# With experimental multi-process GPU sharing
apr finetune model.apr --task instruct --experimental-mps --gpu-share 50

# Plan-only mode (shows config without training)
apr finetune --task instruct --model-size 7B --plan

Corpus format (JSONL):

{"instruction": "Write a function that...", "response": "def foo():\n    ..."}
{"instruction": "...", "response": "...", "system": "You are...", "metadata": {"source": "depyler"}}

Adapters config format (TOML):

[[adapter]]
data = "data/corpus-a.jsonl"
checkpoint = "checkpoints/adapter-a"
label = "code-review"
rank = 16
learning_rate = 0.0002

Contracts:

  • F-INST-001: Non-empty instruction and response
  • F-INST-002: Cross-entropy loss computed only on response tokens
  • F-INST-003: Perplexity reported per epoch
  • F-INST-004: Qwen chat template (<|im_start|> / <|im_end|>)
  • GPU-SHARE-002: VRAM reservation via ledger before allocation
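A pre-flight check for F-INST-001 and F-INST-004 might look like the sketch below; the rendering helper is illustrative (the system-prompt default is an assumption), not aprender's internal template code:

```python
import json

def validate_pair(line):
    """F-INST-001: instruction and response must both be present and non-empty."""
    rec = json.loads(line)
    if not rec.get("instruction") or not rec.get("response"):
        raise ValueError("empty instruction or response")
    return rec

def render_qwen(rec):
    """F-INST-004: wrap the pair in Qwen's <|im_start|> / <|im_end|> chat template."""
    system = rec.get("system", "You are a helpful assistant.")
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{rec['instruction']}<|im_end|>\n"
            f"<|im_start|>assistant\n{rec['response']}<|im_end|>")

rec = validate_pair('{"instruction": "Write a function that...", "response": "def foo(): ..."}')
print(render_qwen(rec))
```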

6.1.5 Full Optimization Pipeline (preview)

# The complete leaderboard recipe in 6 commands (follows golden ordering §10):
apr import hf://Qwen/Qwen2.5-Coder-7B -o base.apr
apr distill teacher.apr --student base.apr --strategy progressive --temperature 3.0 -o distilled.apr
apr finetune distilled.apr --method qlora --rank 32 --data code-instruct.jsonl -o tuned.apr
apr merge tuned.apr variant-b.apr --strategy slerp -o merged.apr
apr prune merged.apr --method wanda --target-ratio 0.2 --calibration calib.jsonl -o pruned.apr
apr quantize pruned.apr --scheme int4 -o submit.apr

6.2 The make Orchestration Layer (this repo)

This layer drives the pipeline. Each Makefile target maps to one or more apr CLI subcommands or shell scripts.

Make Target              Calls                              Description
make import              apr import                         Download HF model → .apr format
make prep-data           apr data prep                      Extract instruction/response pairs from Python source (GH-7)
make eval-humaneval      scripts/eval-pass-at-k.sh          Generate completions → sandbox execute → pass@k
make eval-mbpp           scripts/eval-pass-at-k.sh          Same pipeline, MBPP dataset
make eval-bigcodebench   scripts/eval-pass-at-k.sh          Same pipeline, BigCodeBench dataset
make eval-all            scripts/eval-pass-at-k.sh × 3      All benchmarks sequentially
make eval-perplexity     apr eval --dataset wikitext-2      Perplexity baseline
make finetune-instruct   apr finetune --task instruct       Instruction LoRA fine-tuning (GH-371)
make finetune            apr finetune                       Classification LoRA/QLoRA fine-tuning
make align               apr finetune --method dpo/orpo     DPO/ORPO preference alignment (GH-8)
make distill             apr distill                        Knowledge distillation (teacher → student)
make merge               apr merge                          Model merging (SLERP, TIES, DARE, linear)
make prune               apr prune                          Structured/unstructured pruning
make quantize            apr quantize                       Post-training quantization
make compile             apr compile                        Compile model to standalone binary
make check               apr check                          Validate APR format and integrity
make inspect             apr inspect                        Model inspection
make export              apr export                         SafeTensors/GGUF export
make publish             scripts/submit.sh                  Export + model card + HF Hub upload
make model-card          apr eval --generate-card           Generate model card
make pipeline            scripts/pipeline.sh                Config-driven end-to-end pipeline (12 stages)
make pipeline-plan       scripts/pipeline.sh --plan         Dry-run: validate config, show commands
make verify              smoke-tests all apr subcommands    Validate apr CLI installation
make dogfood             CLI + config validation            End-to-end smoke test
make validate            bashrs config lint + bashrs lint   Lint all configs + scripts
make prove-wgpu          scripts/prove-wgpu.sh              wgpu GPU training proof
make import-plan         HF Hub check + dry-run             Import plan preview
make prep-data-audit     apr data audit --verbose           Detailed corpus audit
make decontaminate       apr data decontaminate             N-gram overlap gate (AC-016)
make data-quality        apr data quality                   Quality scoring gate (AC-025)
make qa                  apr qa --verbose                   Full model QA gate
make compare-hf          apr compare-hf --hf MODEL --json   HF parity check
make bench               apr bench --json                   Throughput benchmark (tok/s, TTFT)
make data-split          apr data split                     Stratified train/val/test split
make data-balance        apr data balance                   Resample for class balance
make benchmark-download  scripts/download-benchmarks.sh     Download HumanEval/MBPP data
make results-history     scripts/results-history.sh         View and compare eval results
make eval-sweep          scripts/eval-sweep.sh              Sweep all result JSONs, tabulate pass@k across models
make compare-results     scripts/compare-results.sh         Delta analysis between two result files
make leaderboard         scripts/leaderboard-summary.sh     Generate ranked markdown leaderboard from results
make check-contracts     inline awk + jq + python3          Run falsification tests (pass@k, throughput, structure)
make clean               rm -rf checkpoints/ results/       Remove build artifacts
make book                mdbook build                       Build specification book
make docs                mdbook build                       Alias for book
make docs-serve          mdbook serve                       Local book preview

6.2.1 Import

# Import a HuggingFace model to .apr format
make import MODEL=Qwen/Qwen2.5-Coder-7B-Instruct

# Import with custom output path
make import MODEL=Qwen/Qwen2.5-Coder-7B CHECKPOINT=checkpoints/qwen7b.apr

# Import via standalone script (with validation)
./scripts/import.sh Qwen/Qwen2.5-Coder-7B checkpoints/qwen7b.apr

6.2.2 Eval

# Run HumanEval with defaults (512 tokens, temperature 0.0, 1 sample, standard prompt)
make eval-humaneval CHECKPOINT=checkpoints/qwen-7b.apr

# Full benchmark suite
make eval-all CHECKPOINT=checkpoints/qwen-7b.apr

# Custom parameters with structured chain-of-thought prompting
make eval-humaneval CHECKPOINT=checkpoints/qwen-7b.apr \
    MAX_TOKENS=1024 TEMPERATURE=0.2 NUM_SAMPLES=10 PROMPT_STRATEGY=scot

# Perplexity baseline
make eval-perplexity CHECKPOINT=checkpoints/qwen-7b.apr

Variable          Default    Description
MAX_TOKENS        512        Max tokens per completion
TEMPERATURE       0.0        Sampling temperature
NUM_SAMPLES       1          Completions per problem (for pass@k)
PROMPT_STRATEGY   standard   Prompt strategy: standard, scot, few-shot, cgo

The eval script (scripts/eval-pass-at-k.sh) handles the full pipeline:

  1. Downloads benchmark data (HumanEval, MBPP, BigCodeBench) if not cached
  2. For each problem: generates completion via apr run with chosen prompt strategy
  3. Strips markdown fences, combines completion + test cases
  4. Executes in a python3/Docker sandbox with a 10-second timeout
  5. Computes pass@k via Chen et al. unbiased estimator and writes result JSON
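The unbiased estimator from step 5 (Chen et al., 2021) is pass@k = 1 − C(n−c, k)/C(n, k) for n generated samples of which c pass; a direct transcription:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k (Chen et al., 2021): probability that at least one of k
    samples drawn without replacement from n generations passes, given c pass."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: every size-k subset passes
    return 1.0 - comb(n - c, k) / comb(n, k)

assert pass_at_k(10, 0, 1) == 0.0   # nothing passed
assert pass_at_k(10, 10, 1) == 1.0  # everything passed
```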

6.2.3 Data Preparation

# Audit instruction corpus quality
make prep-data

# Detailed audit output
make prep-data-audit

Data preparation uses apr data prep (GH-7) to extract function/class definitions with docstrings from ground truth corpora via Rust AST parsing (tree-sitter). Sources:

  • depyler (~11.8K pairs): Python algorithms, data structures, CLI examples
  • hf-gtc (~3.5K pairs): HuggingFace production recipes
  • jax-gtc (~58 pairs): JAX numerical computing patterns
  • vllm-gtc (~81 pairs): vLLM inference optimization patterns

Total: ~15.5K instruction/response pairs in JSONL format.

6.2.4 Finetune

# Instruction fine-tuning with data from ground truth corpora (GH-371)
make prep-data                    # generate data/instruct-corpus.jsonl
make finetune-instruct            # defaults: model_size=7B, rank=16, lr=0.0002, 3 epochs

# Custom instruction fine-tuning config
make finetune-instruct MODEL_SIZE=7B RANK=32 LR=0.001 EPOCHS=5

# Classification LoRA fine-tune (original path)
make finetune CHECKPOINT=checkpoints/qwen-7b.apr DATA=data/code-instruct.jsonl

# QLoRA with custom config
make finetune CHECKPOINT=checkpoints/qwen-7b.apr DATA=data/code-instruct.jsonl \
    METHOD=qlora RANK=32 LR=0.001 EPOCHS=5

Tasks: instruct (generative, GH-371), classify (classification). Methods: lora (default), qlora (quantized LoRA), full (all parameters).

Variable     Default                      Description
METHOD       lora                         Fine-tuning method
RANK         16                           LoRA rank
LR           0.0002                       Learning rate
EPOCHS       3                            Number of epochs
DATA         data/instruct-corpus.jsonl   Training dataset
MODEL_SIZE   7B                           Model size for instruct task (tiny/0.5B/7B/9B)

6.2.5 Distill

# Progressive distillation (recommended for code models)
make distill TEACHER=checkpoints/teacher-32b.apr STUDENT=checkpoints/student-7b.apr \
    DIST_STRATEGY=progressive DIST_TEMP=3.0 DIST_ALPHA=0.7

Strategies: standard (KL divergence), progressive (curriculum learning), ensemble (multi-teacher).

Variable        Default    Description
DIST_STRATEGY   standard   Distillation strategy
DIST_TEMP       3.0        Softmax temperature
DIST_ALPHA      0.7        Mixing coefficient (0 = student, 1 = teacher)
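The standard formulation behind these knobs (Hinton-style soft-target distillation; a sketch, not aprender's exact loss) blends a temperature-softened KL term against the hard-label cross-entropy, weighted by DIST_ALPHA:

```python
import math

def softmax(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, label, T=3.0, alpha=0.7):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * T * T * kl + (1 - alpha) * ce
```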

6.2.6 Merge

# SLERP merge of two models
make merge MODELS="checkpoints/a.apr checkpoints/b.apr" STRATEGY=slerp

# TIES merge (set via recipe YAML for full control)
make pipeline RECIPE=recipe-b-merge-alchemist

Strategies: slerp, ties (TIES-Merging), dare (DARE-TIES), linear (linear average).
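SLERP interpolates along the great circle between the two flattened weight vectors rather than averaging them, which preserves vector norm better than linear blending. A sketch of the textbook formula (not apr merge internals):

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between vectors v0 and v1 at fraction t."""
    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    dot = max(-1.0, min(1.0, dot))       # clamp against float drift
    omega = math.acos(dot)               # angle between the two vectors
    if omega < 1e-8:                     # nearly parallel: fall back to linear
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```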

6.2.7 Prune

# Wanda pruning with 50% sparsity (default)
make prune CHECKPOINT=checkpoints/tuned.apr PRUNE_METHOD=wanda SPARSITY=0.5

# Magnitude pruning
make prune CHECKPOINT=checkpoints/tuned.apr PRUNE_METHOD=magnitude SPARSITY=0.3

Methods: wanda (default), magnitude, sparsegpt. Sparsity: 0.0–1.0.
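Wanda scores each weight by |W| times the L2 norm of its input activation (estimated on the calibration set), then drops the lowest-scoring fraction. An illustrative sketch of the paper's metric, not apr prune's implementation:

```python
def wanda_mask(weights, input_norms, sparsity):
    """weights[i][j] connects input j; score = |w_ij| * ||x_j||_2.
    Returns a 0/1 mask keeping the top (1 - sparsity) fraction of scores."""
    scores = [(abs(w) * input_norms[j], i, j)
              for i, row in enumerate(weights) for j, w in enumerate(row)]
    scores.sort()                               # ascending: lowest scores first
    n_prune = int(len(scores) * sparsity)
    pruned = {(i, j) for _, i, j in scores[:n_prune]}
    return [[0 if (i, j) in pruned else 1 for j in range(len(row))]
            for i, row in enumerate(weights)]
```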

6.2.8 Quantize

# INT4 quantization
make quantize CHECKPOINT=checkpoints/pruned.apr SCHEME=int4

# Q6K quantization
make quantize CHECKPOINT=checkpoints/pruned.apr SCHEME=q6k

Schemes: int4, int8, q4k, q5k, q6k.
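At its simplest, symmetric int4 maps each weight group to integers in [−7, 7] with one shared scale; the apr schemes above layer group-wise and k-quant refinements on top of this idea. A minimal round-to-nearest sketch:

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest int4: scale = max|w| / 7, q = round(w / scale)."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # avoid div-by-zero on all-zero
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

q, s = quantize_int4([0.7, -0.35, 0.1, 0.0])
restored = dequantize(q, s)  # reconstruction error bounded by scale / 2 per weight
```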

6.2.9 Pipeline (config-driven)

# Run entire pipeline from a recipe YAML config
make pipeline RECIPE=recipe-a-quick-lora

# Dry-run: show commands without executing
make pipeline-plan RECIPE=recipe-c-full-pipeline

The pipeline script (scripts/pipeline.sh) reads a recipe YAML and runs each stage in order:

import → [distill] → [finetune] → [align] → [merge] → [prune] → [quantize] → eval → [submit] → [compile]

Stages in brackets are optional — only included if the corresponding YAML section exists.
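The optional-stage dispatch can be sketched as follows (hypothetical config keys; the real YAML schema lives in scripts/pipeline.sh): run every stage in golden order, skipping optional ones whose section is absent.

```python
STAGE_ORDER = ["import", "distill", "finetune", "align", "merge",
               "prune", "quantize", "eval", "submit", "compile"]
REQUIRED = {"import", "eval"}  # the unbracketed stages in the diagram

def plan(recipe):
    """Return the stages to run, in order, skipping absent optional sections."""
    return [s for s in STAGE_ORDER if s in REQUIRED or s in recipe]

print(plan({"import": {}, "finetune": {}, "quantize": {}}))
# → ['import', 'finetune', 'quantize', 'eval']
```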

6.2.10 Submit

# Export and publish to HuggingFace Hub
make publish CHECKPOINT=checkpoints/model.apr HF_REPO=paiml/qwen-coder-7b-apr

# Export only (SafeTensors)
make export CHECKPOINT=checkpoints/model.apr EXPORT_FORMAT=safetensors

The submit script (scripts/submit.sh):

  1. Exports model to SafeTensors via apr export
  2. Generates model card with benchmark results table
  3. Dry-run preview via apr publish --dry-run
  4. Prompts for confirmation before actual upload

6.2.11 Verification

# Verify apr CLI and all subcommands
make verify

# End-to-end smoke test (CLI + configs)
make dogfood

6.3 Orchestration Surface Mapping

The full mapping between Makefile targets and apr CLI operations:

make pipeline RECIPE=recipe-c-full-pipeline
    │
    │  scripts/pipeline.sh reads YAML, runs stages:
    │
    ├── [import]    ──► apr import hf://... -o checkpoints/base.apr
    ├── [distill]   ──► apr distill teacher.apr --student base.apr -o distilled.apr
    ├── [finetune]  ──► apr finetune distilled.apr --method lora -o tuned.apr
    ├── [align]     ──► apr finetune tuned.apr --method dpo -o aligned.apr
    ├── [merge]     ──► apr merge aligned.apr variant.apr --strategy slerp -o merged.apr
    ├── [prune]     ──► apr prune merged.apr --method wanda -o pruned.apr
    ├── [quantize]  ──► apr quantize pruned.apr --scheme int4 -o quantized.apr
    ├── [eval]      ──► scripts/eval-pass-at-k.sh humaneval quantized.apr
    ├── [submit]    ──► scripts/submit.sh quantized.apr org/model
    └── [compile]   ──► apr compile quantized.apr --release --lto --strip