Leaderboard-Winning Techniques
The techniques in §7 optimize the model. This section covers techniques that optimize inference-time behavior — how you extract the best score from a given model. These are the techniques that separate top-10 leaderboard entries from median ones.
8.1 Sampling Strategy Tuning
Why it matters: The gap between greedy decoding and tuned sampling can be 5-15% pass@1. Most leaderboards score pass@1 with greedy decoding, so greedy is the reproducible baseline. Where sampling is allowed (tuned pass@1, or pass@k with k>1), temperature and top-p strongly affect output quality.
apr command: apr run, apr chat, apr eval
# Greedy (temperature=0, deterministic — standard for leaderboard eval)
apr eval model.apr --task classify --data humaneval.jsonl \
--temperature 0.0 --json
# Tuned nucleus sampling (better for diverse code generation)
apr eval model.apr --task classify --data humaneval.jsonl \
--temperature 0.2 --top_p 0.95 --json
# High-temperature diverse sampling for pass@k (k>1)
apr eval model.apr --task classify --data humaneval.jsonl \
--temperature 0.8 --top_p 0.95 --json
Leaderboard sweet spots:
| Metric | Temperature | Top-P | Rationale |
|---|---|---|---|
| pass@1 | 0.0 (greedy) | 1.0 | Deterministic, reproducible |
| pass@1 (tuned) | 0.1-0.2 | 0.95 | Slight diversity avoids greedy traps |
| pass@10 | 0.6-0.8 | 0.95 | Diversity yields more distinct solutions |
| pass@100 | 0.8-1.0 | 0.95 | Maximum diversity |
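The temperature and top-p settings in the table can be illustrated with a minimal sketch of a single sampling step. This is not how apr implements sampling; the function name `sample_token` and its signature are hypothetical, chosen only to show what the two knobs do to the next-token distribution.

```python
import math
import random

def sample_token(logits, temperature=0.2, top_p=0.95, rng=random):
    """Pick one token index from raw logits (illustrative sketch, not apr's sampler)."""
    if temperature <= 0.0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: smallest set of top tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample inside the nucleus; choices() renormalizes the weights.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Lower temperatures concentrate mass on the top token, so the nucleus shrinks toward greedy behavior; higher temperatures widen it, which is what the pass@10 and pass@100 rows rely on.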
8.2 N-Sampling with Best-of-N Selection (pass@k Maximization)
Why it matters: Generating N completions and selecting the best one (via self-consistency, test execution, or log-probability scoring) can boost effective pass@1 by 10-30% over single-shot generation. This is the single most impactful inference-time technique [8].
apr command: apr eval --n-samples (proposed; see implementation status below)
# Generate 20 completions per problem, compute pass@1 and pass@10
apr eval model.apr --task classify --data humaneval.jsonl \
--n-samples 20 --temperature 0.8 --json
# Best-of-N with log-probability reranking
apr eval model.apr --task classify --data humaneval.jsonl \
--n-samples 10 --rerank logprob --json
# Best-of-N with self-consistency (majority voting on output)
apr eval model.apr --task classify --data humaneval.jsonl \
--n-samples 10 --rerank majority --json
Implementation status: N-sampling is implemented in
scripts/eval-pass-at-k.sh via the NUM_SAMPLES parameter. Reranking
strategies (logprob, majority) are not yet implemented. apr eval
does not have --n-samples or --rerank flags — sampling is handled
at the orchestration layer.
Expected gain: +10-30% effective pass@1 with N=10-50 over single-shot greedy.
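For reference, pass@k from N samples is commonly computed with the unbiased estimator introduced alongside HumanEval: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; whether scripts/eval-pass-at-k.sh uses exactly this formula is not stated here, and the function name is hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    Probability that at least one of k samples drawn without replacement
    from the n generated samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With 20 samples and 5 passing, `pass_at_k(20, 5, 1)` is 0.25, while `pass_at_k(20, 5, 10)` is roughly 0.98, which is why the pass@k rows favor high-diversity sampling.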
8.3 Structured Prompting (System Prompt + Few-Shot + SCoT)
Why it matters: Structured Chain-of-Thought (SCoT) prompting improves HumanEval pass@1 by up to 13.79% over vanilla prompting by asking the model to reason through sequential, branch, and loop structures before generating code [9].
apr command: apr eval --prompt-strategy (proposed), apr chat --system
# Standard prompt (baseline)
apr eval model.apr --task classify --data humaneval.jsonl \
--prompt-strategy standard --json
# Structured Chain-of-Thought prompting
apr eval model.apr --task classify --data humaneval.jsonl \
--prompt-strategy scot --json
# Few-shot with curated exemplars
apr eval model.apr --task classify --data humaneval.jsonl \
--prompt-strategy few-shot --exemplars exemplars.jsonl --json
# Custom system prompt for code generation
apr eval model.apr --task classify --data humaneval.jsonl \
--system "You are an expert Python programmer. Think step by step." --json
Prompt strategies:
| Strategy | Flag aliases | Description | Expected Impact |
|---|---|---|---|
| standard | default | Raw problem → code | Baseline |
| scot | structured-cot | Problem → structured reasoning → code | +5-14% pass@1 |
| few-shot | fewshot | N exemplars + problem → code | +3-8% pass@1 |
| cgo | code-gen-opt | Chain of Grounded Objectives: goal-oriented decomposition | +5-10% pass@1 |
| reflexion | reflect | Generate → test → reflect → regenerate (iterative self-correction) | +3-10% pass@1 |
Implementation status: --prompt-strategy is not yet implemented
(PMAT-005). The --system flag is available via upstream apr chat.
Prompt strategy engine planned for eval script integration.
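Until the prompt strategy engine lands, the standard, scot, and few-shot strategies can be approximated at the orchestration layer by rewriting the prompt before it reaches the model. The sketch below is an assumption-laden illustration: `build_prompt` is a hypothetical helper, and the SCoT instruction wording is invented for this example rather than taken from the cited paper or from apr.

```python
def build_prompt(problem: str, strategy: str = "standard", exemplars=None) -> str:
    """Wrap a benchmark problem according to a prompt strategy (hypothetical helper)."""
    if strategy == "standard":
        # Raw problem -> code.
        return problem
    if strategy == "scot":
        # Ask for structured reasoning before code (illustrative wording).
        return (
            f"{problem}\n\n"
            "Before writing code, outline the solution using sequential steps, "
            "branches (if/else), and loops (for/while). Then write the final code."
        )
    if strategy == "few-shot":
        # exemplars: list of (problem, solution) pairs placed before the problem.
        shots = "\n\n".join(f"{p}\n{s}" for p, s in (exemplars or []))
        return f"{shots}\n\n{problem}" if shots else problem
    raise ValueError(f"unknown strategy: {strategy}")
```

Keeping the strategy as a pure string transformation makes it easy to A/B strategies on the same model and sampling settings.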
8.4 Speculative Decoding (Inference Speedup)
Why it matters: Speculative decoding yields 2-3x faster inference on code models, which means more attempts within a time budget and faster evaluation iteration. Code is particularly amenable to speculation because syntax is predictable.
apr command: apr run --speculative, apr cbtop --speculative (flags not yet exposed; see implementation status below)
# Self-speculative decoding (model as its own draft)
apr run model.apr --speculative --speculation-k 4 "def fibonacci(n):"
# Draft-model speculative decoding (a smaller draft model proposes tokens; the target model verifies them)
apr run model.apr --speculative --draft-model-path draft.apr --speculation-k 6 \
"def fibonacci(n):"
# Benchmark speculative vs standard throughput
apr bench model.apr --speculative --speculation-k 4 --json
Implementation status: Speculative decoding engine exists in
aprender internals. CLI flags (--speculative, --speculation-k,
--draft-model-path) are not yet exposed (GH-10).
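Expected gain: 2-3x throughput improvement for code generation tasks. No quality change: when draft tokens are verified with the standard accept/reject rule, the output distribution is mathematically identical to decoding from the target model alone.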
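The "identical distribution" property follows from the standard speculative sampling rule: accept a draft token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized residual max(0, p - q). The sketch below computes the resulting per-token output distribution analytically for toy distributions. It illustrates the published algorithm, not the aprender engine, and the function names are hypothetical.

```python
def speculative_output_distribution(p, q):
    """Distribution of the emitted token after one speculative step.

    p: target-model probabilities, q: draft-model probabilities (same vocab).
    """
    # Mass of "draft proposes x and it is accepted": q[x] * min(1, p[x]/q[x]).
    accepted = [min(px, qx) for px, qx in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)

    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, px - qx) for px, qx in zip(p, q)]
    residual_total = sum(residual)
    if residual_total == 0.0:
        # Draft equals target: nothing is ever rejected.
        return accepted
    return [a + reject_mass * r / residual_total for a, r in zip(accepted, residual)]

def acceptance_rate(p, q):
    """Probability that a single draft token is accepted."""
    return sum(min(px, qx) for px, qx in zip(p, q))
```

For p = [0.6, 0.3, 0.1] and q = [0.2, 0.5, 0.3], the emitted distribution equals p while only 60% of draft tokens are accepted, so draft quality affects speed, not output quality.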
8.5 Preference Optimization (DPO/ORPO)
Why it matters: DPO and ORPO align models to prefer correct, well-structured code over plausible but buggy code. ORPO eliminates the need for a reference model, making it simpler than RLHF. Models trained with preference optimization consistently score 3-8% higher on code benchmarks than SFT-only models [10][11].
apr command: apr align (proposed)
# Generate preference pairs from eval results
# (correct completions = chosen, incorrect = rejected)
apr eval model.apr --task classify --data humaneval.jsonl \
--n-samples 20 --export-pairs preference-pairs.jsonl
# DPO alignment (requires reference model)
apr align model.apr \
--method dpo \
--data preference-pairs.jsonl \
--beta 0.1 \
--ref-model base.apr \
-o aligned.apr
# ORPO alignment (no reference model needed, simpler)
apr align model.apr \
--method orpo \
--data preference-pairs.jsonl \
--lambda 0.1 \
-o aligned.apr
Implementation status: DPO loss implemented in entrenar (2026-04-03). WgpuInstructPipeline::dpo_step() computes L = -log σ(β * (chosen_logprob - rejected_logprob)) using existing wgpu forward pass. Lean4 theorem: dpo_loss_nonneg proved. Contract: dpo-alignment-v1. Needs: preference pair data generation via scripts/generate-preference-pairs.sh (PMAT-014) and CLI wiring in apr align.
Expected gain: +3-8% pass@1 over SFT-only models.
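As a reference point for the loss quoted above, here is a minimal scalar sketch of the DPO objective for one preference pair. It is not the entrenar implementation. The published DPO loss measures each log-probability relative to the reference model; this sketch assumes reference log-probabilities default to 0, in which case it reduces to the formula in the status note. All names are hypothetical.

```python
import math

def dpo_loss(policy_chosen, policy_rejected,
             ref_chosen=0.0, ref_rejected=0.0, beta=0.1):
    """-log(sigmoid(beta * margin)) for one (chosen, rejected) pair.

    Arguments are sequence log-probabilities. The reference terms default
    to 0 (an assumption of this sketch), matching the simplified formula.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss is log 2 when the policy is indifferent between the pair, falls toward 0 as the chosen completion becomes more likely than the rejected one, and is never negative, consistent with the dpo_loss_nonneg theorem.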
8.6 Continued Pretraining (Domain Adaptation)
Why it matters: Continued pretraining on a large code corpus before instruction fine-tuning lets the model absorb domain-specific patterns (API usage, idioms, error handling) that instruction tuning alone can't teach. This is how CodeLlama was built from Llama 2 [12].
apr command: apr finetune --method full
# Continued pretraining on code corpus (full fine-tuning, not LoRA)
apr finetune model.apr \
--method full \
--data code-corpus-500k.jsonl \
--epochs 1 \
--learning-rate 5e-5 \
--json \
-o domain-adapted.apr
# Then LoRA instruction-tune on top
apr finetune domain-adapted.apr \
--method lora \
--rank 16 \
--data code-instruct-50k.jsonl \
--epochs 3 \
-o final-lora/
Implementation status: --method full EXISTS in aprender's finetune command. The training loop in entrenar supports full-model gradient computation.
Key consideration: Continued pretraining requires significant compute (full model gradients, not just adapter). Budget accordingly.
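To put a rough number on that budget, the arithmetic below uses the common rule of thumb for mixed-precision Adam: about 16 bytes of state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights and the two optimizer moments). The 7-billion parameter count is a round-number assumption, activations are excluded, and entrenar's actual memory layout may differ.

```python
def full_finetune_state_gb(num_params: int, bytes_per_param: int = 16) -> float:
    """Weights + gradients + Adam state in GB (1 GB = 1e9 bytes), activations excluded."""
    return num_params * bytes_per_param / 1e9

# Assumed round figure for a "7B" model; the real parameter count differs slightly.
state_gb = full_finetune_state_gb(7_000_000_000)
print(f"~{state_gb:.0f} GB of training state before activations")
```

Under these assumptions the training state alone exceeds the memory of a single commodity GPU, which is why the LoRA step that follows is so much cheaper.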
8.7 Data Decontamination
Why it matters: If training data overlaps with benchmark test cases, scores are inflated and meaningless. Leaderboards actively detect and penalize contaminated submissions. Data decontamination is a hard requirement, not optional.
apr command: apr validate --decontaminate (proposed); apr data decontaminate (implemented, see status below)
# Check training data for benchmark overlap
apr validate --data code-instruct.jsonl \
--decontaminate \
--benchmarks humaneval,mbpp,bigcodebench \
--threshold 0.8 \
--json
# Generate clean training set (remove overlapping samples)
apr validate --data code-instruct.jsonl \
--decontaminate \
--benchmarks humaneval,mbpp \
--output clean-instruct.jsonl
Implementation status: apr data decontaminate implemented and verified. Decontamination report (clean.jsonl) confirms 0% overlap: 0/164 HumanEval contaminated, 0/974 MBPP contaminated.
Falsification gate (AC-016): ✅ Verified. 0% n-gram overlap between training data and evaluation benchmarks.
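The n-gram overlap check behind this gate can be sketched as follows. This is an illustration under stated assumptions, not the apr implementation: it uses whitespace tokenization, a default 10-gram window, and treats the 0.8 threshold as "fraction of a benchmark item's n-grams found in one training sample". How apr interprets its threshold is not documented here, and all function names are hypothetical.

```python
def ngrams(text: str, n: int = 10) -> set:
    """Set of whitespace-token n-grams in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(sample: str, benchmark_item: str, n: int = 10) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the sample."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(sample, n)) / len(bench)

def decontaminate(train, benchmark_items, n: int = 10, threshold: float = 0.8):
    """Split training samples into (clean, flagged) by overlap with any benchmark item."""
    clean, flagged = [], []
    for sample in train:
        if any(overlap_ratio(sample, b, n) >= threshold for b in benchmark_items):
            flagged.append(sample)
        else:
            clean.append(sample)
    return clean, flagged
```

A production pass would normalize whitespace and comments before comparing, since trivially reformatted benchmark solutions otherwise slip under the threshold.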
8.8 Test-Time Compute Scaling
Why it matters: Recent results show that spending more compute at inference time (generating more candidates, longer chain-of-thought, iterative refinement) scales performance more efficiently than model size for code tasks. This is the "scaling at test time" paradigm.
apr command: Composition of existing commands
# Strategy: Generate many → Execute → Filter → Rerank
# Step 1: Generate 50 diverse completions per problem
apr eval model.apr --task classify --data humaneval.jsonl \
--n-samples 50 --temperature 0.8 --json > candidates.json
# Step 2: Execute all candidates in sandbox (EXTERNAL)
# → produces pass/fail per candidate
# Step 3: Among passing candidates, select by log-probability
# → highest log-prob passing candidate = submission
# Step 4: For failing problems, retry with SCoT prompting
apr eval model.apr --task classify --data failing-problems.jsonl \
--n-samples 50 --prompt-strategy scot --temperature 0.6 --json
Expected gain: Returns diminish as N grows, but with test-based filtering at N=50, effective pass@1 approaches pass@50, which is typically 15-25% higher than greedy pass@1.
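Steps 2 and 3 of the strategy above (filter by execution, then rerank by log-probability) amount to a small selection function. The sketch below assumes each candidate carries its total log-probability and that sandboxed test execution is supplied by the caller as a `passes_tests` callback. Both `Candidate` and `select_submission` are hypothetical names, not part of apr.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class Candidate:
    code: str
    logprob: float  # total log-probability assigned by the model

def select_submission(
    candidates: Iterable[Candidate],
    passes_tests: Callable[[str], bool],
) -> Optional[Candidate]:
    """Return the highest log-prob candidate that passes the tests, or None.

    passes_tests must run the code in an external sandbox; this sketch
    never executes candidate code itself.
    """
    passing = [c for c in candidates if passes_tests(c.code)]
    if not passing:
        # Nothing passed: the problem goes to the Step 4 retry list.
        return None
    return max(passing, key=lambda c: c.logprob)
```

Problems for which this returns None are exactly the ones written to failing-problems.jsonl for the retry pass.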
8.9 Technique Stacking: The Winning Formula
Leaderboard winners stack techniques multiplicatively. The winning formula, in priority order:
1. Best base model selection (Qwen2.5-Coder-7B-Instruct) — biggest impact
2. Prompt strategy optimization (§7.6) — +1-25pp (zero cost)
3. Continued pretraining on code corpus — +5-10%
4. Distillation from 32B teacher — +3-8%
5. LoRA/QLoRA instruction fine-tuning — +5-15%
6. DPO/ORPO preference alignment — +3-8%
7. Merge tournament with specialist variants — +2-5%
8. N-sampling with test-based reranking — +10-30% effective
9. Pruning + quantization for inference speed — neutral quality, faster
Not all gains stack linearly. Steps 3-5 compound well. Steps 6-7 have diminishing returns if 3-5 are strong. Step 8 is inference-time and always applies. Step 2 is zero-cost and should always be done first — our dogfooding showed few-shot prompting (+1.83pp HumanEval) and test assertion inclusion (+25.4pp MBPP) outperform some training-based techniques.
Dogfooding correction: SCoT (structured chain-of-thought) was previously listed at +5-14%. Actual measurement on 7B: -3.05pp (82.32% vs 85.37% standard). SCoT helps reasoning-heavy benchmarks (LiveCodeBench) but hurts code completion on ≤7B models where reasoning overhead consumes token budget.
The full apr recipe:
#!/bin/bash
set -euo pipefail
# === Model Optimization (one-time) ===
apr import hf://Qwen/Qwen2.5-Coder-32B -o teacher.apr
apr import hf://Qwen/Qwen2.5-Coder-7B -o base.apr
apr finetune base.apr --method full --data code-corpus-500k.jsonl --epochs 1 -o adapted.apr
apr distill teacher.apr --student adapted.apr --strategy progressive -o distilled.apr
apr finetune distilled.apr --method lora --rank 32 --data code-instruct-50k.jsonl -o lora/
apr finetune distilled.apr --adapter lora/ --merge -o finetuned.apr
# apr align finetuned.apr --method orpo --data preference-pairs.jsonl -o aligned.apr # when implemented
apr merge finetuned.apr variant-b.apr --strategy ties --base-model distilled.apr -o merged.apr
apr prune merged.apr --method wanda --target-ratio 0.2 --calibration calib.jsonl -o pruned.apr
apr quantize pruned.apr --scheme int4 -o final.apr
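# === Inference-Time Optimization (per evaluation) ===
# Note: --n-samples and --prompt-strategy are not yet apr eval flags (see §8.2, §8.3).
# SCoT measured -3.05pp on the 7B model in dogfooding; compare against standard before relying on it.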
apr eval final.apr --task classify --data humaneval.jsonl \
--n-samples 50 --temperature 0.8 --prompt-strategy scot --json