CLI Toolchain
Two layers work together: apr (upstream aprender — ML operations) and make (this repo — orchestration via Makefile + shell scripts). Every technique maps to a single shell command. Our competitors use 500-line Python scripts; we use one-liners.
6.1 The apr CLI (aprender)
The upstream apr binary provides all ML operations. The Makefile and shell scripts call these under the hood.
6.1.1 Import (HF → APR)
# Import from HuggingFace Hub — auto-detects architecture
apr import hf://Qwen/Qwen2.5-Coder-7B -o qwen-7b.apr --arch qwen2
# Import with quantization on ingest
apr import hf://Qwen/Qwen2.5-Coder-32B -o qwen-32b-q8.apr --quantize int8
# Import GGUF with provenance enforcement
apr import qwen-7b.gguf -o qwen-7b.apr --enforce-provenance
6.1.2 Batch Inference (GH-batch)
# Batch inference: load model + CUDA JIT once, process all prompts sequentially
# Eliminates ~80s per-invocation overhead on gx10 sm_121 Blackwell GPU
apr run model.apr --batch-jsonl prompts.jsonl --max-tokens 512
# GPU: auto-dispatches CUDA → wgpu (Vulkan) → CPU.
# wgpu batch WORKS (GH-560 fixed 2026-03-28): identical output to CPU, 1.1-2.0 tok/s on 7B.
# CUDA still broken (cosine=-0.005, GH-561 pending). wgpu is the production GPU path.
Input format (JSONL):
{"prompt": "def fibonacci(n):", "task_id": "HumanEval/0", "max_tokens": 512}
{"prompt": "def add(a, b):", "task_id": "HumanEval/1"}
Output format (JSONL, one line per prompt):
{"task_id": "HumanEval/0", "text": "...", "tokens_generated": 85, "tok_per_sec": 14.2, "inference_ms": 5986.0, "used_gpu": true}
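Because each output line is a self-contained JSON record with the fields shown above, batch results are easy to post-process. A minimal sketch (assuming only the field names in the example line; `summarize_batch` is an illustrative helper, not part of apr):

```python
import json

def summarize_batch(path: str) -> dict:
    """Aggregate apr batch-output JSONL: prompt count, total tokens, mean tok/s."""
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return {
        "prompts": len(rows),
        "total_tokens": sum(r["tokens_generated"] for r in rows),
        "mean_tok_per_sec": sum(r["tok_per_sec"] for r in rows) / len(rows),
    }
```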
Sampling flags (also available in batch mode):
| Flag | Default | Description |
|---|---|---|
| --temperature | 0.0 | Sampling temperature (0.0 = greedy) |
| --top-k | 1 | Top-k sampling (1 = greedy) |
Auto-detects the model format (GGUF or APR). GPU dispatch order: CUDA (parity gate) → wgpu (Vulkan) → CPU. On Blackwell sm_121, CUDA is blocked by the parity gate (cosine=-0.005, GH-561); wgpu batch works after the GH-560 two-bug fix: an FFN buffer overflow in trueno (attn_out_buf was sized hidden_dim but needs intermediate_dim) plus a KV cache pre-filled-length bug in realizar. Never bypass the parity gate; fix the root cause. The model stays resident across all prompts.
6.1.3 Evaluate (Baseline)
# Perplexity baseline
apr eval qwen-7b.apr --dataset wikitext-2 --threshold 20.0
# Classification eval with custom data
apr eval qwen-7b.apr --task classify --data humaneval.jsonl --json
6.1.4 Instruction Fine-tuning (GH-371)
# Instruction fine-tuning with LoRA on Q/V projections
apr finetune model.apr --task instruct --data instruct.jsonl --epochs 3 --rank 16
# QLoRA on consumer GPU (NF4 base + FP16 adapters, ~4.5 GB VRAM)
apr finetune model.apr --task instruct --method qlora --quantize-nf4 \
--data instruct.jsonl --rank 16 --vram 8 --max-seq-len 512
# Multi-adapter concurrent training (GPU-SHARE)
apr finetune model.apr --task instruct --method qlora --quantize-nf4 \
--adapters-config adapters.toml
# With experimental multi-process GPU sharing
apr finetune model.apr --task instruct --experimental-mps --gpu-share 50
# Plan-only mode (shows config without training)
apr finetune --task instruct --model-size 7B --plan
Corpus format (JSONL):
{"instruction": "Write a function that...", "response": "def foo():\n ..."}
{"instruction": "...", "response": "...", "system": "You are...", "metadata": {"source": "depyler"}}
Adapters config format (TOML):
[[adapter]]
data = "data/corpus-a.jsonl"
checkpoint = "checkpoints/adapter-a"
label = "code-review"
rank = 16
learning_rate = 0.0002
Contracts:
- F-INST-001: Non-empty instruction and response
- F-INST-002: Cross-entropy loss computed only on response tokens
- F-INST-003: Perplexity reported per epoch
- F-INST-004: Qwen chat template (<|im_start|>/<|im_end|>)
- GPU-SHARE-002: VRAM reservation via ledger before allocation
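The first contract is simple enough to check before training. A minimal F-INST-001 validator over the JSONL corpus format shown above (a sketch, not part of apr data):

```python
import json

def check_f_inst_001(path: str) -> list[str]:
    """F-INST-001: every JSONL record must have a non-empty instruction and response."""
    errors = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue
            rec = json.loads(line)
            for field in ("instruction", "response"):
                if not str(rec.get(field, "")).strip():
                    errors.append(f"line {lineno}: empty or missing {field!r}")
    return errors
```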
6.1.5 Full Optimization Pipeline (preview)
# The complete leaderboard recipe in 6 commands (follows golden ordering §10):
apr import hf://Qwen/Qwen2.5-Coder-7B -o base.apr
apr distill teacher.apr --student base.apr --strategy progressive --temperature 3.0 -o distilled.apr
apr finetune distilled.apr --method qlora --rank 32 --data code-instruct.jsonl -o tuned.apr
apr merge tuned.apr variant-b.apr --strategy slerp -o merged.apr
apr prune merged.apr --method wanda --target-ratio 0.2 --calibration calib.jsonl -o pruned.apr
apr quantize pruned.apr --scheme int4 -o submit.apr
6.2 The make Orchestration Layer (this repo)
This layer drives the pipeline. Each Makefile target maps to one or more apr CLI subcommands or shell scripts.
| Make Target | Calls | Description |
|---|---|---|
make import | apr import | Download HF model → .apr format |
make prep-data | apr data prep | Extract instruction/response pairs from Python source (GH-7) |
make eval-humaneval | scripts/eval-pass-at-k.sh | Generate completions → sandbox execute → pass@k |
make eval-mbpp | scripts/eval-pass-at-k.sh | Same pipeline, MBPP dataset |
make eval-bigcodebench | scripts/eval-pass-at-k.sh | Same pipeline, BigCodeBench dataset |
make eval-all | scripts/eval-pass-at-k.sh × 3 | All benchmarks sequentially |
make eval-perplexity | apr eval --dataset wikitext-2 | Perplexity baseline |
make finetune-instruct | apr finetune --task instruct | Instruction LoRA fine-tuning (GH-371) |
make finetune | apr finetune | Classification LoRA/QLoRA fine-tuning |
make align | apr finetune --method dpo/orpo | DPO/ORPO preference alignment (GH-8) |
make distill | apr distill | Knowledge distillation (teacher → student) |
make merge | apr merge | Model merging (SLERP, TIES, DARE, linear) |
make prune | apr prune | Structured/unstructured pruning |
make quantize | apr quantize | Post-training quantization |
make compile | apr compile | Compile model to standalone binary |
make check | apr check | Validate APR format and integrity |
make inspect | apr inspect | Model inspection |
make export | apr export | SafeTensors/GGUF export |
make publish | scripts/submit.sh | Export + model card + HF Hub upload |
make model-card | apr eval --generate-card | Generate model card |
make pipeline | scripts/pipeline.sh | Config-driven end-to-end pipeline (12 stages) |
make pipeline-plan | scripts/pipeline.sh --plan | Dry-run: validate config, show commands |
make verify | smoke-tests all apr subcommands | Validate apr CLI installation |
make dogfood | CLI + config validation | End-to-end smoke test |
make validate | bashrs config lint + bashrs lint | Lint all configs + scripts |
make prove-wgpu | scripts/prove-wgpu.sh | wgpu GPU training proof |
make import-plan | HF Hub check + dry-run | Import plan preview |
make prep-data-audit | apr data audit --verbose | Detailed corpus audit |
make decontaminate | apr data decontaminate | N-gram overlap gate (AC-016) |
make data-quality | apr data quality | Quality scoring gate (AC-025) |
make qa | apr qa --verbose | Full model QA gate |
make compare-hf | apr compare-hf --hf MODEL --json | HF parity check |
make bench | apr bench --json | Throughput benchmark (tok/s, TTFT) |
make data-split | apr data split | Stratified train/val/test split |
make data-balance | apr data balance | Resample for class balance |
make benchmark-download | scripts/download-benchmarks.sh | Download HumanEval/MBPP data |
make results-history | scripts/results-history.sh | View and compare eval results |
make eval-sweep | scripts/eval-sweep.sh | Sweep all result JSONs, tabulate pass@k across models |
make compare-results | scripts/compare-results.sh | Delta analysis between two result files |
make leaderboard | scripts/leaderboard-summary.sh | Generate ranked markdown leaderboard from results |
make check-contracts | inline awk + jq + python3 | Run falsification tests (pass@k, throughput, structure) |
make clean | rm -rf checkpoints/ results/ | Remove build artifacts |
make book | mdbook build | Build specification book |
make docs | mdbook build | Alias for book |
make docs-serve | mdbook serve | Local book preview |
6.2.1 Import
# Import a HuggingFace model to .apr format
make import MODEL=Qwen/Qwen2.5-Coder-7B-Instruct
# Import with custom output path
make import MODEL=Qwen/Qwen2.5-Coder-7B CHECKPOINT=checkpoints/qwen7b.apr
# Import via standalone script (with validation)
./scripts/import.sh Qwen/Qwen2.5-Coder-7B checkpoints/qwen7b.apr
6.2.2 Eval
# Run HumanEval with defaults (512 tokens, temperature 0.0, 1 sample, standard prompt)
make eval-humaneval CHECKPOINT=checkpoints/qwen-7b.apr
# Full benchmark suite
make eval-all CHECKPOINT=checkpoints/qwen-7b.apr
# Custom parameters with structured chain-of-thought prompting
make eval-humaneval CHECKPOINT=checkpoints/qwen-7b.apr \
MAX_TOKENS=1024 TEMPERATURE=0.2 NUM_SAMPLES=10 PROMPT_STRATEGY=scot
# Perplexity baseline
make eval-perplexity CHECKPOINT=checkpoints/qwen-7b.apr
| Variable | Default | Description |
|---|---|---|
| MAX_TOKENS | 512 | Max tokens per completion |
| TEMPERATURE | 0.0 | Sampling temperature |
| NUM_SAMPLES | 1 | Completions per problem (for pass@k) |
| PROMPT_STRATEGY | standard | Prompt strategy: standard, scot, few-shot, cgo |
The eval script (scripts/eval-pass-at-k.sh) handles the full pipeline:
- Downloads benchmark data (HumanEval, MBPP, BigCodeBench) if not cached
- For each problem: generates completion via apr run with the chosen prompt strategy
- Strips markdown fences, combines completion + test cases
- Executes in python3/Docker sandbox with timeout 10
- Computes pass@k via the Chen et al. unbiased estimator and writes result JSON
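The Chen et al. (2021) unbiased estimator mentioned above is compact enough to show in full; a Python sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generated samples (c of which passed) passes.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failing samples than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-benchmark score is the mean of this quantity over all problems.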
6.2.3 Data Preparation
# Audit instruction corpus quality
make prep-data
# Detailed audit output
make prep-data-audit
Data preparation uses apr data prep (GH-7) to extract function/class
definitions with docstrings from ground truth corpora via Rust AST parsing
(tree-sitter). Sources:
- depyler (~11.8K pairs): Python algorithms, data structures, CLI examples
- hf-gtc (~3.5K pairs): HuggingFace production recipes
- jax-gtc (~58 pairs): JAX numerical computing patterns
- vllm-gtc (~81 pairs): vLLM inference optimization patterns
Total: ~15.5K instruction/response pairs in JSONL format.
6.2.4 Finetune
# Instruction fine-tuning with data from ground truth corpora (GH-371)
make prep-data # generate data/instruct-corpus.jsonl
make finetune-instruct # defaults: model_size=7B, rank=16, lr=0.0002, 3 epochs
# Custom instruction fine-tuning config
make finetune-instruct MODEL_SIZE=7B RANK=32 LR=0.001 EPOCHS=5
# Classification LoRA fine-tune (original path)
make finetune CHECKPOINT=checkpoints/qwen-7b.apr DATA=data/code-instruct.jsonl
# QLoRA with custom config
make finetune CHECKPOINT=checkpoints/qwen-7b.apr DATA=data/code-instruct.jsonl \
METHOD=qlora RANK=32 LR=0.001 EPOCHS=5
Tasks: instruct (generative, GH-371), classify (classification).
Methods: lora (default), qlora (quantized LoRA), full (all parameters).
| Variable | Default | Description |
|---|---|---|
| METHOD | lora | Fine-tuning method |
| RANK | 16 | LoRA rank |
| LR | 0.0002 | Learning rate |
| EPOCHS | 3 | Number of epochs |
| DATA | data/instruct-corpus.jsonl | Training dataset |
| MODEL_SIZE | 7B | Model size for instruct task (tiny/0.5B/7B/9B) |
6.2.4 Distill
# Progressive distillation (recommended for code models)
make distill TEACHER=checkpoints/teacher-32b.apr STUDENT=checkpoints/student-7b.apr \
DIST_STRATEGY=progressive DIST_TEMP=3.0 DIST_ALPHA=0.7
Strategies: standard (KL divergence), progressive (curriculum learning), ensemble (multi-teacher).
| Variable | Default | Description |
|---|---|---|
| DIST_STRATEGY | standard | Distillation strategy |
| DIST_TEMP | 3.0 | Softmax temperature |
| DIST_ALPHA | 0.7 | Mixing coefficient (0=student, 1=teacher) |
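DIST_TEMP and DIST_ALPHA combine as in the standard Hinton-style distillation objective. A NumPy sketch of the assumed loss form (illustrative, not apr's exact implementation):

```python
import numpy as np

def softmax(z: np.ndarray, temp: float = 1.0) -> np.ndarray:
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, temp=3.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) at temperature T,
    plus (1 - alpha) * cross-entropy against the hard labels."""
    p_t = softmax(np.asarray(teacher_logits), temp)
    p_s = softmax(np.asarray(student_logits), temp)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1).mean()
    probs = softmax(np.asarray(student_logits))
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * temp**2 * kl + (1.0 - alpha) * ce
```

With alpha=0.7 the soft teacher signal dominates; the T² factor keeps gradient magnitudes comparable across temperatures.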
6.2.5 Merge
# SLERP merge of two models
make merge MODELS="checkpoints/a.apr checkpoints/b.apr" STRATEGY=slerp
# TIES merge (set via recipe YAML for full control)
make pipeline RECIPE=recipe-b-merge-alchemist
Strategies: slerp, ties (TIES-Merging), dare (DARE-TIES), linear (linear average).
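SLERP interpolates along the great-circle arc between two weight vectors instead of the straight chord, preserving norm better than a linear average. A per-tensor sketch on flattened weights (assumed formulation, not apr's exact code):

```python
import numpy as np

def slerp(w_a, w_b, t: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation between two flattened weight tensors."""
    a = np.asarray(w_a, dtype=np.float64)
    b = np.asarray(w_b, dtype=np.float64)
    dot = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))  # angle between the tensors
    if abs(np.sin(theta)) < eps:  # nearly parallel: linear interpolation is fine
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)
```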
6.2.6 Prune
# Wanda pruning with 50% sparsity (default)
make prune CHECKPOINT=checkpoints/tuned.apr PRUNE_METHOD=wanda SPARSITY=0.5
# Magnitude pruning
make prune CHECKPOINT=checkpoints/tuned.apr PRUNE_METHOD=magnitude SPARSITY=0.3
Methods: wanda (default), magnitude, sparsegpt. Sparsity: 0.0–1.0.
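Wanda scores each weight by |W| times the L2 norm of its input activation channel (measured on calibration data), then drops the lowest-scored weights per output row. A sketch of the per-row unstructured variant (an illustrative assumption about the exact granularity):

```python
import numpy as np

def wanda_mask(weight: np.ndarray, act_norm: np.ndarray, sparsity: float) -> np.ndarray:
    """weight: (out_features, in_features); act_norm: (in_features,) calibration
    L2 norms. Returns a boolean mask (True = keep) that zeroes the
    lowest-importance fraction of weights in each output row."""
    score = np.abs(weight) * act_norm[None, :]   # Wanda importance
    k = int(weight.shape[1] * sparsity)          # weights to drop per row
    mask = np.ones_like(weight, dtype=bool)
    if k > 0:
        idx = np.argsort(score, axis=1)[:, :k]   # lowest-score columns per row
        np.put_along_axis(mask, idx, False, axis=1)
    return mask
```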
6.2.7 Quantize
# INT4 quantization
make quantize CHECKPOINT=checkpoints/pruned.apr SCHEME=int4
# Q6K quantization
make quantize CHECKPOINT=checkpoints/pruned.apr SCHEME=q6k
Schemes: int4, int8, q4k, q5k, q6k.
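All of these schemes trade precision for size. The core idea, symmetric quantization to a signed integer grid, fits in a few lines; a simplified per-tensor sketch (apr's group-wise q4k/q5k/q6k schemes are more involved, and a nonzero weight tensor is assumed):

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int = 4):
    """Per-tensor symmetric quantization: one scale maps floats to signed ints."""
    qmax = 2 ** (bits - 1) - 1                   # 7 for int4, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```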
6.2.8 Pipeline (config-driven)
# Run entire pipeline from a recipe YAML config
make pipeline RECIPE=recipe-a-quick-lora
# Dry-run: show commands without executing
make pipeline-plan RECIPE=recipe-c-full-pipeline
The pipeline script (scripts/pipeline.sh) reads a recipe YAML and runs each stage in order:
import → [distill] → [finetune] → [align] → [merge] → [prune] → [quantize] → eval → [submit] → [compile]
Stages in brackets are optional — only included if the corresponding YAML section exists.
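The bracketed-stage rule can be expressed as a small plan function. The stage names come from the ordering above; the selection logic is an illustrative assumption about scripts/pipeline.sh, not its actual code:

```python
def plan_stages(recipe: dict) -> list[str]:
    """Return stages to run in pipeline order: required stages always,
    optional (bracketed) stages only when the recipe has a matching section."""
    order = ["import", "distill", "finetune", "align", "merge",
             "prune", "quantize", "eval", "submit", "compile"]
    required = {"import", "eval"}
    return [s for s in order if s in required or s in recipe]
```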
6.2.9 Submit
# Export and publish to HuggingFace Hub
make publish CHECKPOINT=checkpoints/model.apr HF_REPO=paiml/qwen-coder-7b-apr
# Export only (SafeTensors)
make export CHECKPOINT=checkpoints/model.apr EXPORT_FORMAT=safetensors
The submit script (scripts/submit.sh):
- Exports model to SafeTensors via apr export
- Generates model card with benchmark results table
- Dry-run preview via apr publish --dry-run
- Prompts for confirmation before actual upload
6.2.10 Verification
# Verify apr CLI and all subcommands
make verify
# End-to-end smoke test (CLI + configs)
make dogfood
6.3 Orchestration Surface Mapping
The full mapping between Makefile targets and apr CLI operations:
make pipeline RECIPE=recipe-c-full-pipeline
│
│ scripts/pipeline.sh reads YAML, runs stages:
│
├── [import] ──► apr import hf://... -o checkpoints/base.apr
├── [distill] ──► apr distill teacher.apr --student base.apr -o distilled.apr
├── [finetune] ──► apr finetune distilled.apr --method lora -o tuned.apr
├── [align] ──► apr finetune tuned.apr --method dpo -o aligned.apr
├── [merge] ──► apr merge aligned.apr variant.apr --strategy slerp -o merged.apr
├── [prune] ──► apr prune merged.apr --method wanda -o pruned.apr
├── [quantize] ──► apr quantize pruned.apr --scheme int4 -o quantized.apr
├── [eval] ──► scripts/eval-pass-at-k.sh humaneval quantized.apr
├── [submit] ──► scripts/submit.sh quantized.apr org/model
└── [compile] ──► apr compile quantized.apr --release --lto --strip