Technique Playbook

7.1 Knowledge Distillation

Goal: Transfer knowledge from a 32B teacher into a 7B student that scores within 5% of the teacher on pass@1.

apr command: apr distill

| Strategy    | When to Use                             | apr Flags                                      |
|-------------|-----------------------------------------|------------------------------------------------|
| Standard KL | Single teacher, simple transfer         | --strategy standard --temperature 3.0 --alpha 0.7 |
| Progressive | Curriculum learning, easy→hard examples | --strategy progressive --temperature 2.0       |
| Ensemble    | Multiple teacher variants               | --strategy ensemble --temperature 4.0          |

Leaderboard Recipe:

# Step 1: Import teacher (32B) and student (7B)
apr import hf://Qwen/Qwen2.5-Coder-32B -o teacher-32b.apr
apr import hf://Qwen/Qwen2.5-Coder-7B -o student-7b.apr

# Step 2: Distill with progressive strategy (best for code)
apr distill teacher-32b.apr \
    --student student-7b.apr \
    --strategy progressive \
    --temperature 3.0 \
    --alpha 0.7 \
    --epochs 5 \
    --data code-corpus.jsonl \
    -o distilled-7b.apr

# Step 3: Evaluate improvement
apr eval distilled-7b.apr --task classify --data humaneval.jsonl --json

Why progressive: In aprender, progressive distillation uses curriculum learning — training on progressively harder examples — not layer-by-layer MSE matching. This is critical because the 32B teacher and 7B student have different layer counts with no 1:1 correspondence. Curriculum learning lets the student first learn simple code patterns (variable assignment, basic loops) from the teacher's soft targets, then graduate to complex patterns (nested control flow, type inference). Standard KL trains on all difficulties simultaneously, overwhelming the smaller student.
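The soft/hard loss blend that --temperature and --alpha control can be sketched as follows. This is a minimal pure-Python illustration of temperature-scaled distillation, not aprender's actual implementation; the logit values are made up:

```python
import math

def distill_loss(student_logits, teacher_logits, hard_target,
                 temperature=3.0, alpha=0.7):
    """Blend the soft loss (KL to teacher) with the hard loss (CE on labels)."""
    def softmax(logits, t=1.0):
        exps = [math.exp(l / t) for l in logits]
        s = sum(exps)
        return [e / s for e in exps]

    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) at temperature T, scaled by T^2 so gradient
    # magnitudes stay comparable to the hard loss
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    soft = kl * temperature ** 2
    # Standard cross-entropy against the ground-truth token
    hard = -math.log(softmax(student_logits)[hard_target])
    return alpha * soft + (1 - alpha) * hard
```

With alpha=0.7 the student mostly imitates the teacher's soft targets; when student and teacher logits agree, the KL term vanishes and only 30% of the hard loss remains.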

Expected gain: +3-8% pass@1 over baseline student.

7.2 Model Merging

Goal: Combine fine-tuned variants to get best-of-all-worlds without additional training.

apr command: apr merge

| Strategy | Mechanism                          | Best For                             |
|----------|------------------------------------|--------------------------------------|
| average  | Arithmetic mean of weights         | Quick baseline, similar models       |
| weighted | --weights 0.7,0.3                  | Known-better model dominates         |
| slerp    | Spherical interpolation            | Smooth blending, preserves magnitude |
| ties     | Trim, Elect Sign, merge (sparse)   | Resolving conflicting task vectors   |
| dare     | Drop And REscale random weights    | Preventing catastrophic interference |

Leaderboard Recipe — The "Merge Tournament":

# Train 3 specialists on different code domains
apr finetune base.apr --method lora --data python-instruct.jsonl -o python-expert.apr
apr finetune base.apr --method lora --data rust-instruct.jsonl -o rust-expert.apr
apr finetune base.apr --method lora --data typescript-instruct.jsonl -o ts-expert.apr

# Round 1: DARE merge Python + Rust (resolve task-vector interference)
apr merge python-expert.apr rust-expert.apr \
    --strategy dare \
    --drop-rate 0.3 \
    --base-model base.apr \
    -o round1.apr

# Round 2: TIES merge with TypeScript expert (resolve sign conflicts)
apr merge round1.apr ts-expert.apr \
    --strategy ties \
    --base-model base.apr \
    --density 0.2 \
    -o semifinal.apr

# Round 3: SLERP blend with base for stability (preserve weight norms)
apr merge semifinal.apr base.apr \
    --strategy slerp \
    --weights 0.85,0.15 \
    -o merged-final.apr

Why DARE → TIES → SLERP cascade: DARE first resolves task-vector interference between the two specialists at a conservative 30% drop rate (not 90% — high drop rates destroy blended knowledge). TIES then handles sign conflicts when adding the third specialist. SLERP finally smooths the merged result against the base model with mild interpolation (85/15) to preserve weight norms without diluting specialization.
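The DARE step above can be sketched numerically. This is a toy pure-Python illustration of drop-and-rescale over flat weight lists; the function name and values are hypothetical, not aprender's implementation:

```python
import random

def dare_merge(base, expert_a, expert_b, drop_rate=0.3, seed=0):
    """Merge two fine-tuned variants via DARE: for each expert's task vector
    (expert - base), randomly drop `drop_rate` of the deltas and rescale the
    survivors by 1/(1 - drop_rate) so the expected update is unchanged."""
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_rate)
    merged = list(base)
    for expert in (expert_a, expert_b):
        for i, (w, b) in enumerate(zip(expert, base)):
            delta = w - b
            if rng.random() >= drop_rate:   # keep this delta (rescaled)
                merged[i] += delta * scale
    return merged
```

At drop_rate=0.0 the merge degenerates to adding both full task vectors onto the base; at 0.3, roughly a third of each expert's deltas are zeroed, which is what reduces interference between the two specialists.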

Expected gain: +2-5% pass@1 over best individual specialist. Free compute — no GPU needed.

7.3 Pruning

Goal: Remove 20-50% of weights with <2% quality loss, yielding faster inference for benchmarks.

apr command: apr prune

| Method     | Mechanism                                   | Quality Preservation            |
|------------|---------------------------------------------|---------------------------------|
| magnitude  | Remove smallest weights                     | Baseline, simple                |
| structured | Remove entire attention heads/FFN dims      | Fastest inference speedup       |
| depth      | Remove entire layers                        | Dramatic size reduction         |
| width      | Reduce hidden dimensions                    | Balanced size/quality           |
| wanda      | Weights AND Activations (calibration-based) | Best quality at high sparsity   |
| sparsegpt  | One-shot, column-by-column                  | Gold standard, needs calibration |

Leaderboard Recipe — Wanda Pruning:

# Step 1: Generate calibration data from code corpus
# (128 samples of representative code)

# Step 2: Analyze pruning opportunities first
apr prune model.apr --analyze --verbose

# Step 3: Wanda prune at 30% sparsity (sweet spot for code models)
apr prune model.apr \
    --method wanda \
    --target-ratio 0.3 \
    --calibration calibration-code.jsonl \
    -o pruned-30.apr

# Step 4: Verify quality didn't degrade
apr eval pruned-30.apr --dataset wikitext-2 --threshold 22.0

Why Wanda over magnitude: Magnitude pruning ranks weights by |weight| alone, ignoring the activations that flow through them. Wanda scores each weight by |weight| * ||activation||, preserving weights on high-activation paths even when they are small. For code models, the attention heads responsible for bracket-matching and indentation carry high activations, so Wanda keeps them.
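The Wanda score can be shown on a toy example. This is a pure-Python sketch over a single output row (Wanda prunes per output-row comparison group); it is illustrative, not aprender's implementation:

```python
def wanda_prune_row(weights, act_norms, target_ratio=0.3):
    """Score each weight by |w| * ||activation|| of its input feature and
    zero out the lowest-scoring `target_ratio` fraction of the row."""
    scores = [abs(w) * a for w, a in zip(weights, act_norms)]
    n_drop = int(len(weights) * target_ratio)
    # Indices of the n_drop lowest-scoring weights
    drop = set(sorted(range(len(weights)), key=lambda i: scores[i])[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```

Note how a small weight on a high-activation input can outrank a larger weight on a quiet input: pruning [0.1, -2.0, 0.5, 0.05] with activation norms [10.0, 1.0, 1.0, 1.0] at 50% sparsity keeps the 0.1 and drops the 0.5, the opposite of what magnitude pruning would do.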

Pruning budget by model size (Wanda):

| Model | Conservative | Moderate | Aggressive | Speed Gain (conservative) |
|-------|--------------|----------|------------|---------------------------|
| 1.5B  | 20%          | 30%      | 40%        | 1.2-1.3x                  |
| 7B    | 20%          | 25%      | 35%        | 1.2-1.4x                  |
| 32B   | 15%          | 20%      | 30%        | 1.1-1.3x                  |

Expected impact: Conservative ratio targets <1% pass@1 degradation. Moderate allows 1-3% degradation for meaningful speedup. Aggressive (>30% for small models) risks measurable quality loss — validate with eval before accepting. Smaller models have less redundancy; budget accordingly.

7.4 Fine-tuning (LoRA)

Goal: Adapt base model to code-specific instruction-following with minimal compute.

apr command: apr finetune

# Auto-select method based on available VRAM
apr finetune qwen-7b.apr --method auto --vram 24 --plan

# LoRA fine-tune (rank 16, good default for code)
apr finetune qwen-7b.apr \
    --method lora \
    --rank 16 \
    --data code-instruct-50k.jsonl \
    --epochs 3 \
    --learning-rate 2e-4 \
    -o qwen-7b-lora/

# Merge adapter back into base
apr finetune qwen-7b.apr \
    --adapter qwen-7b-lora/ \
    --merge \
    -o qwen-7b-finetuned.apr

Key parameters for leaderboard performance:

| Parameter      | Code Models    | General Models |
|----------------|----------------|----------------|
| Rank           | 16-32          | 8-16           |
| Alpha          | 2x rank        | 2x rank        |
| LR             | 1e-4 to 3e-4   | 1e-4 to 2e-4   |
| Epochs         | 3-5            | 2-3            |
| Target modules | q_proj, v_proj | q_proj, v_proj |

Expected gain: +5-15% pass@1 with curated instruction data.
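The low-rank update that --rank and alpha control can be sketched as a toy forward pass. This is a pure-Python illustration of the LoRA formula y = Wx + (alpha/rank) * B(Ax); matrices and values are made up, not aprender's implementation:

```python
def lora_forward(W, A, B, x, rank=1, alpha=2.0):
    """Frozen base matmul plus a trained low-rank correction.
    W is d_out x d_in (frozen); A is rank x d_in and B is d_out x rank
    (trained). B starts at zero, so training begins at the base model."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)
    low = matvec(B, matvec(A, x))       # rank-r bottleneck
    s = alpha / rank
    return [b + s * l for b, l in zip(base, low)]
```

With B at its zero init the output equals the base model's output exactly, which is why LoRA training starts from the pretrained behavior rather than perturbing it.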

7.5 Fine-tuning (QLoRA)

Goal: Same as LoRA but on consumer GPUs (8-16GB VRAM).

apr command: apr finetune --method qlora

# Plan QLoRA configuration for 16GB VRAM
apr finetune qwen-7b.apr --method qlora --vram 16 --plan

# QLoRA fine-tune (quantized base, full-precision adapters)
apr finetune qwen-7b.apr \
    --method qlora \
    --rank 32 \
    --vram 16 \
    --data code-instruct-50k.jsonl \
    --epochs 3 \
    --learning-rate 2e-4 \
    -o qwen-7b-qlora/

# Merge adapter
apr finetune qwen-7b.apr \
    --adapter qwen-7b-qlora/ \
    --merge \
    -o qwen-7b-qlora-merged.apr

QLoRA vs LoRA tradeoff (at rank 16):

| Aspect         | LoRA (rank 16) | QLoRA (rank 16) | QLoRA (rank 32) |
|----------------|----------------|-----------------|-----------------|
| VRAM (7B)      | ~28GB          | ~12GB           | ~16GB           |
| VRAM (32B)     | ~80GB          | ~24GB           | ~32GB           |
| Quality loss   | None           | Data-dependent  | Data-dependent  |
| Training speed | Fastest        | ~20% slower     | ~25% slower     |

VRAM depends on rank: Higher LoRA rank = more adapter parameters = more memory for gradients and optimizer states. The numbers above assume batch size 1 with gradient accumulation; larger batch sizes increase VRAM proportionally.
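The rank-to-parameter relationship can be checked with quick arithmetic. The sketch below is illustrative; the hidden size (3584) and layer count (28) are assumptions for a 7B-class model, not values reported by apr:

```python
def lora_param_count(d_model, rank, n_layers, targets=("q_proj", "v_proj")):
    """Trainable adapter parameters: each targeted module adds
    A (rank x d_model) + B (d_model x rank) = 2 * d_model * rank."""
    return 2 * d_model * rank * len(targets) * n_layers
```

Doubling the rank exactly doubles the adapter parameter count, and with it the gradient and optimizer-state memory, which is why the QLoRA rank-32 column above sits a few GB higher than rank 16.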

When to use QLoRA: Always for 32B models. For 7B, use LoRA if you have 32GB+ VRAM. When targeting INT4 deployment, prefer QLoRA — it provides implicit quantization awareness.

7.6 Prompt Strategy (Zero-Cost Technique)

Goal: Maximize pass@1 without any model modification. Zero training cost, immediate results.

eval command: make eval-humaneval PROMPT_STRATEGY=few-shot

| Strategy | HumanEval 7B     | HumanEval 32B    | MBPP 7B          | When to Use                |
|----------|------------------|------------------|------------------|----------------------------|
| few-shot | 87.20% (+1.83pp) | 87.20% (-3.65pp) | 74.80% (-1.40pp) | Best for 7B HumanEval only. |
| standard | 85.37% (baseline) | 90.85% (baseline) | 76.20%          | Best for 32B and MBPP.     |
| cgo      | 83.54% (-1.83pp) | n/a              | n/a              | Slight overhead.           |
| scot     | 82.32% (-3.05pp) | n/a              | n/a              | Hurts ≤7B models.          |

Key findings from dogfooding (§22.21):

  1. Benchmark-specific strategy is critical. Few-shot helps 7B HumanEval (+1.83pp) but hurts MBPP (-1.40pp) and 32B HumanEval (-3.65pp). No single strategy wins everywhere.
  2. 32B doesn't need prompting tricks. Standard prompting gives 32B its best score (90.85%). Larger models already know the format — exemplars add noise.
  3. MBPP needs test assertions, not few-shot. Including test_list assertions = +25.4pp (50.80% → 76.20%). Few-shot on top of test assertions actually hurts (-1.40pp).
  4. Simpler exemplars win when few-shot helps. A trivial add(a,b) exemplar (87.20%) beats three concrete exemplars (85.98%); the gain comes from format priming only.

Leaderboard recipe: Use few-shot for 7B HumanEval, standard for everything else. Always include test assertions for MBPP. This costs zero compute and yields the highest known apr-native scores.
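Finding 3 (prepend the test assertions for MBPP) can be sketched as a prompt builder. This is a hypothetical helper to illustrate the idea; the exact template the eval harness uses is not shown in this section:

```python
def build_mbpp_prompt(description, test_list):
    """Put the task description first, then the test assertions, so the
    model sees the exact function name and signature it must produce."""
    tests = "\n".join(test_list)
    return (
        f"{description}\n"
        f"Your code should pass these tests:\n"
        f"{tests}\n"
    )
```

The assertions are what carry the function signature; without them the model must guess the name and argument order, which is consistent with the +25.4pp swing reported above.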

7.8 Quantization (Post-Training)

Goal: Reduce model size for faster inference with minimal quality loss.

apr command: apr quantize

# Plan quantization impact
apr quantize model.apr --scheme int4 --plan

# Quantize to INT4 (best size/quality for leaderboard)
apr quantize model.apr --scheme int4 -o model-q4.apr

# Batch quantize to compare schemes
apr quantize model.apr --batch int8,int4,fp16,q4k

# Quantize with format conversion for submission
apr quantize model.apr --scheme int4 --format gguf -o model.gguf
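The INT4 size/quality tradeoff comes down to rounding floats onto 16 levels. Below is a minimal symmetric per-tensor sketch in pure Python, illustrative only and not aprender's actual scheme (which may use grouped or asymmetric quantization):

```python
def quantize_int4(values):
    """Symmetric per-tensor INT4: map floats to integers in [-8, 7] with a
    single scale, then dequantize to inspect the rounding error."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 7.0               # 7 = largest positive INT4 value
    q = [max(-8, min(7, round(v / scale))) for v in values]
    dq = [qi * scale for qi in q]       # reconstructed floats
    return q, dq, scale
```

One outlier weight stretches the scale and coarsens everything else, which is why --plan and the --batch scheme comparison are worth running before committing to a scheme.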

7.9 Hyperparameter Optimization (HPO)

Goal: Find optimal LoRA/QLoRA hyperparameters automatically.

apr command: apr tune

# Scout phase: 1-epoch trials to narrow search space
apr tune qwen-7b.apr \
    --task classify \
    --data code-instruct-50k.jsonl \
    --budget 20 \
    --strategy tpe \
    --scheduler asha \
    --scout \
    --json

# Full HPO: warm-start from scout results
apr tune qwen-7b.apr \
    --task classify \
    --data code-instruct-50k.jsonl \
    --budget 10 \
    --from-scout scout-results/ \
    --max-epochs 20 \
    --time-limit 8h
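The ASHA scheduler's core idea (spend most of the epoch budget only on the most promising configs) can be sketched as synchronous successive halving. This is a simplification, since real ASHA promotes trials asynchronously, and the `evaluate` callback stands in for an actual training run:

```python
def successive_halving(trials, evaluate, budget_per_rung=1, reduction=2, rungs=3):
    """Train every config briefly, keep the top 1/reduction at each rung,
    and give the survivors `reduction` times more epochs next rung."""
    alive = list(trials)
    epochs = budget_per_rung
    for _ in range(rungs):
        scored = sorted(alive, key=lambda cfg: evaluate(cfg, epochs), reverse=True)
        alive = scored[: max(1, len(scored) // reduction)]
        epochs *= reduction
    return alive[0]
```

This is why the scout phase (--scout, 1-epoch trials) is cheap: most configs are eliminated before they ever see a full training budget, and --from-scout warm-starts the full run from the survivors.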