CLI Toolchain

Two layers work together: apr (upstream aprender — ML operations) and make (this repo — orchestration via Makefile + shell scripts). Every technique maps to a single shell command. Our competitors use 500-line Python scripts; we use one-liners.

6.1 The apr CLI (aprender)

The upstream apr binary provides all ML operations. The Makefile and shell scripts call these under the hood.

6.1.1 Import (HF → APR)

# Import from HuggingFace Hub — auto-detects architecture
apr import hf://Qwen/Qwen2.5-Coder-7B -o qwen-7b.apr --arch qwen2

# Import with quantization on ingest
apr import hf://Qwen/Qwen2.5-Coder-32B -o qwen-32b-q8.apr --quantize int8

# Import GGUF with provenance enforcement
apr import qwen-7b.gguf -o qwen-7b.apr --enforce-provenance

6.1.2 Batch Inference (GH-batch)

# Batch inference: load model + CUDA JIT once, process all prompts sequentially
# Eliminates ~80s per-invocation overhead on gx10 sm_121 Blackwell GPU
apr run model.apr --batch-jsonl prompts.jsonl --max-tokens 512

# GPU: auto-dispatches CUDA → wgpu (Vulkan) → CPU.
# wgpu batch WORKS (GH-560 fixed 2026-03-28): identical output to CPU, 1.1-2.0 tok/s on 7B.
# CUDA still broken (cosine=-0.005, GH-561 pending). wgpu is the production GPU path.

Input format (JSONL):

{"prompt": "def fibonacci(n):", "task_id": "HumanEval/0", "max_tokens": 512}
{"prompt": "def add(a, b):", "task_id": "HumanEval/1"}

Output format (JSONL, one line per prompt):

{"task_id": "HumanEval/0", "text": "...", "tokens_generated": 85, "tok_per_sec": 14.2, "inference_ms": 5986.0, "used_gpu": true}
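Because each line of output is a self-contained JSON record, run-level statistics fall out of a few lines of Python. A minimal sketch using only the documented fields (`tokens_generated`, `tok_per_sec`, `used_gpu`):

```python
import json

def summarize(jsonl_text):
    """Aggregate batch-inference results: totals, mean throughput, GPU usage."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return {
        "prompts": len(records),
        "total_tokens": sum(r["tokens_generated"] for r in records),
        "mean_tok_per_sec": sum(r["tok_per_sec"] for r in records) / len(records),
        "gpu_fraction": sum(r["used_gpu"] for r in records) / len(records),
    }

sample = '{"task_id": "HumanEval/0", "text": "...", "tokens_generated": 85, "tok_per_sec": 14.2, "inference_ms": 5986.0, "used_gpu": true}'
print(summarize(sample))
```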

Sampling flags (also available in batch mode):

Flag            Default    Description
--temperature   0.0        Sampling temperature (0.0 = greedy)
--top-k         1          Top-k sampling (1 = greedy)
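How the two flags interact can be sketched as follows (an illustrative sampler, not aprender's actual implementation): logits are temperature-scaled, only the top-k candidates remain eligible, and either default (temperature 0.0 or top-k 1) degenerates to greedy argmax.

```python
import math, random

def sample_token(logits, temperature=0.0, top_k=1):
    """Pick a token index from raw logits; temperature 0.0 or top_k 1 is greedy."""
    if temperature == 0.0 or top_k == 1:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy argmax
    # keep the top_k highest logits, temperature-scale, softmax, then sample
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    r = random.random() * sum(probs)
    for i, p in zip(top, probs):
        r -= p
        if r <= 0:
            return i
    return top[-1]
```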

apr run auto-detects the model format (GGUF or APR) and auto-dispatches the GPU backend: CUDA (behind a parity gate) → wgpu (Vulkan) → CPU. On Blackwell sm_121, CUDA is blocked by the parity gate (cosine=-0.005, GH-561). wgpu batch works after the GH-560 two-bug fix: an FFN buffer overflow in trueno (attn_out_buf was sized for hidden_dim but needs intermediate_dim) and a pre-filled KV cache length in realizar. Never bypass the parity gate — fix the root cause. The model stays resident across all prompts.

6.1.3 Evaluate (Baseline)

# Perplexity baseline
apr eval qwen-7b.apr --dataset wikitext-2 --threshold 20.0

# Classification eval with custom data
apr eval qwen-7b.apr --task classify --data humaneval.jsonl --json

6.1.4 Instruction Fine-tuning (GH-371)

# Instruction fine-tuning with LoRA on Q/V projections
apr finetune model.apr --task instruct --data instruct.jsonl --epochs 3 --rank 16

# QLoRA on consumer GPU (NF4 base + FP16 adapters, ~4.5 GB VRAM)
apr finetune model.apr --task instruct --method qlora --quantize-nf4 \
    --data instruct.jsonl --rank 16 --vram 8 --max-seq-len 512

# Multi-adapter concurrent training (GPU-SHARE)
apr finetune model.apr --task instruct --method qlora --quantize-nf4 \
    --adapters-config adapters.toml

# With experimental multi-process GPU sharing
apr finetune model.apr --task instruct --experimental-mps --gpu-share 50

# Plan-only mode (shows config without training)
apr finetune --task instruct --model-size 7B --plan

Corpus format (JSONL):

{"instruction": "Write a function that...", "response": "def foo():\n    ..."}
{"instruction": "...", "response": "...", "system": "You are...", "metadata": {"source": "depyler"}}

Adapters config format (TOML):

[[adapter]]
data = "data/corpus-a.jsonl"
checkpoint = "checkpoints/adapter-a"
label = "code-review"
rank = 16
learning_rate = 0.0002

Contracts:

  • F-INST-001: Non-empty instruction and response
  • F-INST-002: Cross-entropy loss computed only on response tokens
  • F-INST-003: Perplexity reported per epoch
  • F-INST-004: Qwen chat template (<|im_start|> / <|im_end|>)
  • GPU-SHARE-002: VRAM reservation via ledger before allocation
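A pre-flight check for F-INST-001 and F-INST-004 might look like the sketch below; the rendering helper is illustrative (the system-prompt default is an assumption), not aprender's internal template code:

```python
import json

def validate_pair(line):
    """F-INST-001: instruction and response must both be present and non-empty."""
    rec = json.loads(line)
    if not rec.get("instruction") or not rec.get("response"):
        raise ValueError("empty instruction or response")
    return rec

def render_qwen(rec):
    """F-INST-004: wrap the pair in Qwen's <|im_start|> / <|im_end|> chat template."""
    system = rec.get("system", "You are a helpful assistant.")
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{rec['instruction']}<|im_end|>\n"
            f"<|im_start|>assistant\n{rec['response']}<|im_end|>")

rec = validate_pair('{"instruction": "Write a function that...", "response": "def foo(): ..."}')
print(render_qwen(rec))
```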

6.1.5 Full Optimization Pipeline (preview)

# The complete leaderboard recipe in 6 commands (follows golden ordering §10):
apr import hf://Qwen/Qwen2.5-Coder-7B -o base.apr
apr distill teacher.apr --student base.apr --strategy progressive --temperature 3.0 -o distilled.apr
apr finetune distilled.apr --method qlora --rank 32 --data code-instruct.jsonl -o tuned.apr
apr merge tuned.apr variant-b.apr --strategy slerp -o merged.apr
apr prune merged.apr --method wanda --target-ratio 0.2 --calibration calib.jsonl -o pruned.apr
apr quantize pruned.apr --scheme int4 -o submit.apr

6.2 The make Orchestration Layer (this repo)

This layer drives the pipeline. Each Makefile target maps to one or more apr CLI subcommands or shell scripts.

Make Target              Calls                              Description
make import              apr import                         Download HF model → .apr format
make prep-data           apr data prep                      Extract instruction/response pairs from Python source (GH-7)
make eval-humaneval      scripts/eval-pass-at-k.sh          Generate completions → sandbox execute → pass@k
make eval-mbpp           scripts/eval-pass-at-k.sh          Same pipeline, MBPP dataset
make eval-bigcodebench   scripts/eval-pass-at-k.sh          Same pipeline, BigCodeBench dataset
make eval-all            scripts/eval-pass-at-k.sh × 3      All benchmarks sequentially
make eval-perplexity     apr eval --dataset wikitext-2      Perplexity baseline
make finetune-instruct   apr finetune --task instruct       Instruction LoRA fine-tuning (GH-371)
make finetune            apr finetune                       Classification LoRA/QLoRA fine-tuning
make align               apr finetune --method dpo/orpo     DPO/ORPO preference alignment (GH-8)
make distill             apr distill                        Knowledge distillation (teacher → student)
make merge               apr merge                          Model merging (SLERP, TIES, DARE, linear)
make prune               apr prune                          Structured/unstructured pruning
make quantize            apr quantize                       Post-training quantization
make compile             apr compile                        Compile model to standalone binary
make check               apr check                          Validate APR format and integrity
make inspect             apr inspect                        Model inspection
make export              apr export                         SafeTensors/GGUF export
make publish             scripts/submit.sh                  Export + model card + HF Hub upload
make model-card          apr eval --generate-card           Generate model card
make pipeline            scripts/pipeline.sh                Config-driven end-to-end pipeline (12 stages)
make pipeline-plan       scripts/pipeline.sh --plan         Dry-run: validate config, show commands
make verify              smoke-tests all apr subcommands    Validate apr CLI installation
make dogfood             CLI + config validation            End-to-end smoke test
make validate            bashrs config lint + bashrs lint   Lint all configs + scripts
make prove-wgpu          scripts/prove-wgpu.sh              wgpu GPU training proof
make import-plan         HF Hub check + dry-run             Import plan preview
make prep-data-audit     apr data audit --verbose           Detailed corpus audit
make decontaminate       apr data decontaminate             N-gram overlap gate (AC-016)
make data-quality        apr data quality                   Quality scoring gate (AC-025)
make qa                  apr qa --verbose                   Full model QA gate
make compare-hf          apr compare-hf --hf MODEL --json   HF parity check
make bench               apr bench --json                   Throughput benchmark (tok/s, TTFT)
make data-split          apr data split                     Stratified train/val/test split
make data-balance        apr data balance                   Resample for class balance
make benchmark-download  scripts/download-benchmarks.sh     Download HumanEval/MBPP data
make results-history     scripts/results-history.sh         View and compare eval results
make eval-sweep          scripts/eval-sweep.sh              Sweep all result JSONs, tabulate pass@k across models
make compare-results     scripts/compare-results.sh         Delta analysis between two result files
make leaderboard         scripts/leaderboard-summary.sh     Generate ranked markdown leaderboard from results
make check-contracts     inline awk + jq + python3          Run falsification tests (pass@k, throughput, structure)
make clean               rm -rf checkpoints/ results/       Remove build artifacts
make book                mdbook build                       Build specification book
make docs                mdbook build                       Alias for book
make docs-serve          mdbook serve                       Local book preview

6.2.1 Import

# Import a HuggingFace model to .apr format
make import MODEL=Qwen/Qwen2.5-Coder-7B-Instruct

# Import with custom output path
make import MODEL=Qwen/Qwen2.5-Coder-7B CHECKPOINT=checkpoints/qwen7b.apr

# Import via standalone script (with validation)
./scripts/import.sh Qwen/Qwen2.5-Coder-7B checkpoints/qwen7b.apr

6.2.2 Eval

# Run HumanEval with defaults (512 tokens, temperature 0.0, 1 sample, standard prompt)
make eval-humaneval CHECKPOINT=checkpoints/qwen-7b.apr

# Full benchmark suite
make eval-all CHECKPOINT=checkpoints/qwen-7b.apr

# Custom parameters with structured chain-of-thought prompting
make eval-humaneval CHECKPOINT=checkpoints/qwen-7b.apr \
    MAX_TOKENS=1024 TEMPERATURE=0.2 NUM_SAMPLES=10 PROMPT_STRATEGY=scot

# Perplexity baseline
make eval-perplexity CHECKPOINT=checkpoints/qwen-7b.apr

Variable          Default    Description
MAX_TOKENS        512        Max tokens per completion
TEMPERATURE       0.0        Sampling temperature
NUM_SAMPLES       1          Completions per problem (for pass@k)
PROMPT_STRATEGY   standard   Prompt strategy: standard, scot, few-shot, cgo

The eval script (scripts/eval-pass-at-k.sh) handles the full pipeline:

  1. Downloads benchmark data (HumanEval, MBPP, BigCodeBench) if not cached
  2. For each problem: generates completion via apr run with chosen prompt strategy
  3. Strips markdown fences, combines completion + test cases
  4. Executes in a python3/Docker sandbox with a 10-second timeout
  5. Computes pass@k via Chen et al. unbiased estimator and writes result JSON
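The unbiased estimator from step 5 (Chen et al., 2021) is pass@k = 1 − C(n−c, k)/C(n, k) for n generated samples of which c pass; a direct transcription:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k (Chen et al., 2021): probability that at least one of k
    samples drawn without replacement from n generations passes, given c pass."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: every size-k subset passes
    return 1.0 - comb(n - c, k) / comb(n, k)

assert pass_at_k(10, 0, 1) == 0.0   # nothing passed
assert pass_at_k(10, 10, 1) == 1.0  # everything passed
```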

6.2.3 Data Preparation

# Audit instruction corpus quality
make prep-data

# Detailed audit output
make prep-data-audit

Data preparation uses apr data prep (GH-7) to extract function/class definitions with docstrings from ground truth corpora via Rust AST parsing (tree-sitter). Sources:

  • depyler (~11.8K pairs): Python algorithms, data structures, CLI examples
  • hf-gtc (~3.5K pairs): HuggingFace production recipes
  • jax-gtc (~58 pairs): JAX numerical computing patterns
  • vllm-gtc (~81 pairs): vLLM inference optimization patterns

Total: ~15.5K instruction/response pairs in JSONL format.

6.2.4 Finetune

# Instruction fine-tuning with data from ground truth corpora (GH-371)
make prep-data                    # generate data/instruct-corpus.jsonl
make finetune-instruct            # defaults: model_size=7B, rank=16, lr=0.0002, 3 epochs

# Custom instruction fine-tuning config
make finetune-instruct MODEL_SIZE=7B RANK=32 LR=0.001 EPOCHS=5

# Classification LoRA fine-tune (original path)
make finetune CHECKPOINT=checkpoints/qwen-7b.apr DATA=data/code-instruct.jsonl

# QLoRA with custom config
make finetune CHECKPOINT=checkpoints/qwen-7b.apr DATA=data/code-instruct.jsonl \
    METHOD=qlora RANK=32 LR=0.001 EPOCHS=5

Tasks: instruct (generative, GH-371), classify (classification). Methods: lora (default), qlora (quantized LoRA), full (all parameters).

Variable     Default                      Description
METHOD       lora                         Fine-tuning method
RANK         16                           LoRA rank
LR           0.0002                       Learning rate
EPOCHS       3                            Number of epochs
DATA         data/instruct-corpus.jsonl   Training dataset
MODEL_SIZE   7B                           Model size for instruct task (tiny/0.5B/7B/9B)

6.2.5 Distill

# Progressive distillation (recommended for code models)
make distill TEACHER=checkpoints/teacher-32b.apr STUDENT=checkpoints/student-7b.apr \
    DIST_STRATEGY=progressive DIST_TEMP=3.0 DIST_ALPHA=0.7

Strategies: standard (KL divergence), progressive (curriculum learning), ensemble (multi-teacher).

Variable        Default    Description
DIST_STRATEGY   standard   Distillation strategy
DIST_TEMP       3.0        Softmax temperature
DIST_ALPHA      0.7        Mixing coefficient (0 = student, 1 = teacher)
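The standard formulation behind these knobs (Hinton-style soft-target distillation; a sketch, not aprender's exact loss) blends a temperature-softened KL term against the hard-label cross-entropy, weighted by DIST_ALPHA:

```python
import math

def softmax(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, label, T=3.0, alpha=0.7):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * T * T * kl + (1 - alpha) * ce
```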

6.2.6 Merge

# SLERP merge of two models
make merge MODELS="checkpoints/a.apr checkpoints/b.apr" STRATEGY=slerp

# TIES merge (set via recipe YAML for full control)
make pipeline RECIPE=recipe-b-merge-alchemist

Strategies: slerp, ties (TIES-Merging), dare (DARE-TIES), linear (linear average).
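SLERP interpolates along the great circle between the two flattened weight vectors rather than averaging them, which preserves vector norm better than linear blending. A sketch of the textbook formula (not apr merge internals):

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between vectors v0 and v1 at fraction t."""
    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    dot = max(-1.0, min(1.0, dot))       # clamp against float drift
    omega = math.acos(dot)               # angle between the two vectors
    if omega < 1e-8:                     # nearly parallel: fall back to linear
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```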

6.2.7 Prune

# Wanda pruning with 50% sparsity (default)
make prune CHECKPOINT=checkpoints/tuned.apr PRUNE_METHOD=wanda SPARSITY=0.5

# Magnitude pruning
make prune CHECKPOINT=checkpoints/tuned.apr PRUNE_METHOD=magnitude SPARSITY=0.3

Methods: wanda (default), magnitude, sparsegpt. Sparsity: 0.0–1.0.
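Wanda scores each weight by |W| times the L2 norm of its input activation (estimated on the calibration set), then drops the lowest-scoring fraction. An illustrative sketch of the paper's metric, not apr prune's implementation:

```python
def wanda_mask(weights, input_norms, sparsity):
    """weights[i][j] connects input j; score = |w_ij| * ||x_j||_2.
    Returns a 0/1 mask keeping the top (1 - sparsity) fraction of scores."""
    scores = [(abs(w) * input_norms[j], i, j)
              for i, row in enumerate(weights) for j, w in enumerate(row)]
    scores.sort()                               # ascending: lowest scores first
    n_prune = int(len(scores) * sparsity)
    pruned = {(i, j) for _, i, j in scores[:n_prune]}
    return [[0 if (i, j) in pruned else 1 for j in range(len(row))]
            for i, row in enumerate(weights)]
```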

6.2.8 Quantize

# INT4 quantization
make quantize CHECKPOINT=checkpoints/pruned.apr SCHEME=int4

# Q6K quantization
make quantize CHECKPOINT=checkpoints/pruned.apr SCHEME=q6k

Schemes: int4, int8, q4k, q5k, q6k.
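At its simplest, symmetric int4 maps each weight group to integers in [−7, 7] with one shared scale; the apr schemes above layer group-wise and k-quant refinements on top of this idea. A minimal round-to-nearest sketch:

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest int4: scale = max|w| / 7, q = round(w / scale)."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # avoid div-by-zero on all-zero
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

q, s = quantize_int4([0.7, -0.35, 0.1, 0.0])
restored = dequantize(q, s)  # reconstruction error bounded by scale / 2 per weight
```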

6.2.9 Pipeline (config-driven)

# Run entire pipeline from a recipe YAML config
make pipeline RECIPE=recipe-a-quick-lora

# Dry-run: show commands without executing
make pipeline-plan RECIPE=recipe-c-full-pipeline

The pipeline script (scripts/pipeline.sh) reads a recipe YAML and runs each stage in order:

import → [distill] → [finetune] → [align] → [merge] → [prune] → [quantize] → eval → [submit] → [compile]

Stages in brackets are optional — only included if the corresponding YAML section exists.
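The optional-stage dispatch can be sketched as follows (hypothetical config keys; the real YAML schema lives in scripts/pipeline.sh): run every stage in golden order, skipping optional ones whose section is absent.

```python
STAGE_ORDER = ["import", "distill", "finetune", "align", "merge",
               "prune", "quantize", "eval", "submit", "compile"]
REQUIRED = {"import", "eval"}  # the unbracketed stages in the diagram

def plan(recipe):
    """Return the stages to run, in order, skipping absent optional sections."""
    return [s for s in STAGE_ORDER if s in REQUIRED or s in recipe]

print(plan({"import": {}, "finetune": {}, "quantize": {}}))
# → ['import', 'finetune', 'quantize', 'eval']
```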

6.2.10 Submit

# Export and publish to HuggingFace Hub
make publish CHECKPOINT=checkpoints/model.apr HF_REPO=paiml/qwen-coder-7b-apr

# Export only (SafeTensors)
make export CHECKPOINT=checkpoints/model.apr EXPORT_FORMAT=safetensors

The submit script (scripts/submit.sh):

  1. Exports model to SafeTensors via apr export
  2. Generates model card with benchmark results table
  3. Dry-run preview via apr publish --dry-run
  4. Prompts for confirmation before actual upload

6.2.11 Verification

# Verify apr CLI and all subcommands
make verify

# End-to-end smoke test (CLI + configs)
make dogfood

6.3 Orchestration Surface Mapping

The full mapping between Makefile targets and apr CLI operations:

make pipeline RECIPE=recipe-c-full-pipeline
    │
    │  scripts/pipeline.sh reads YAML, runs stages:
    │
    ├── [import]    ──► apr import hf://... -o checkpoints/base.apr
    ├── [distill]   ──► apr distill teacher.apr --student base.apr -o distilled.apr
    ├── [finetune]  ──► apr finetune distilled.apr --method lora -o tuned.apr
    ├── [align]     ──► apr finetune tuned.apr --method dpo -o aligned.apr
    ├── [merge]     ──► apr merge aligned.apr variant.apr --strategy slerp -o merged.apr
    ├── [prune]     ──► apr prune merged.apr --method wanda -o pruned.apr
    ├── [quantize]  ──► apr quantize pruned.apr --scheme int4 -o quantized.apr
    ├── [eval]      ──► scripts/eval-pass-at-k.sh humaneval quantized.apr
    ├── [submit]    ──► scripts/submit.sh quantized.apr org/model
    └── [compile]   ──► apr compile quantized.apr --release --lto --strip