Scientific Foundation (References)

Every technique in this spec has a peer-reviewed or widely-cited basis. References are grouped by the pipeline stage they support.

20.1 Training Techniques

[1] Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models", ICLR 2022. Basis for apr finetune --method lora. Rank-16 to rank-64 adapters on Q/V projections.
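
A minimal sketch of the LoRA update (pure Python, toy shapes; not apr's implementation): the adapted layer computes y = (W + (alpha/r) · BA)x, with B zero-initialized so training starts exactly at the base weights.

```python
# LoRA sketch: merge a rank-r adapter into a base weight matrix.
# A is r x d_in, B is d_out x r; B starts at zero so the adapter
# initially leaves W unchanged.

def matmul(m, n):
    return [[sum(m[i][k] * n[k][j] for k in range(len(n)))
             for j in range(len(n[0]))] for i in range(len(m))]

def lora_weight(W, A, B, alpha, r):
    """Return W + (alpha/r) * B @ A."""
    delta = matmul(B, A)
    s = alpha / r
    return [[W[i][j] + s * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy 2x2 base weight, rank-1 adapter with zero-initialized B.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]            # r x d_in = 1 x 2
B = [[0.0], [0.0]]          # d_out x r = 2 x 1
assert lora_weight(W, A, B, alpha=16, r=1) == W
```

Only A and B (2·r·d parameters per layer, versus d² for full finetuning) receive gradients; the merged weight can be folded back into W for inference at zero runtime cost.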

[2] Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs", NeurIPS 2023. Basis for apr finetune --method qlora. NF4 base weights + FP16 adapters. 4-8 GB VRAM.
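
To illustrate the frozen-base-weight side of QLoRA, here is a blockwise absmax 4-bit quantizer. Note this uses a uniform int4 grid for simplicity; real NF4 maps to a normal-float codebook, not the uniform levels shown.

```python
# Simplified blockwise 4-bit quantization (uniform grid, NOT true NF4).

def quantize_block(block):
    """Map floats to signed 4-bit ints in [-7, 7] with a per-block scale."""
    scale = max(abs(x) for x in block) or 1.0
    q = [round(x / scale * 7) for x in block]
    return q, scale

def dequantize_block(q, scale):
    return [v / 7 * scale for v in q]

w = [0.1, -0.4, 0.25, 0.0]
q, s = quantize_block(w)
w_hat = dequantize_block(q, s)
# Reconstruction error is bounded by one quantization step.
assert all(abs(a - b) <= s / 7 for a, b in zip(w, w_hat))
```

In QLoRA the quantized base stays frozen; gradients flow only through the higher-precision LoRA adapters, which is what brings finetuning into the 4-8 GB VRAM range.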

[3] Hinton et al., "Distilling the Knowledge in a Neural Network", arXiv:1503.02531, 2015. Basis for apr distill. KL-divergence soft-target transfer from teacher to student.
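
The soft-target transfer can be sketched in a few lines: soften teacher and student logits with a temperature T, then take KL(teacher ‖ student), scaled by T² so gradient magnitudes stay comparable as T varies.

```python
import math

# Hinton-style distillation loss for one example (illustrative).

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    p = softmax(teacher_logits, T)   # softened teacher distribution
    q = softmax(student_logits, T)   # softened student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

# Identical logits -> zero loss; diverging logits -> positive loss.
assert abs(distill_loss([1.0, 0.0], [1.0, 0.0])) < 1e-12
assert distill_loss([2.0, 0.0], [0.0, 2.0]) > 0
```

In practice this term is mixed with the ordinary cross-entropy on hard labels; the sketch shows only the soft-target component.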

[4] Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", NeurIPS 2023. Basis for apr align --method dpo. Preference optimization without reward model.
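
The DPO objective for a single preference pair reduces to a logistic loss on an implicit reward margin; inputs below are summed log-probs of the chosen and rejected responses under the policy and the frozen reference model (a sketch, not apr's training loop).

```python
import math

# DPO loss for one preference pair: no explicit reward model is trained;
# the policy/reference log-ratio acts as the implicit reward.

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Zero margin gives the chance-level loss log(2); a positive margin
# (policy prefers the chosen response more than the reference does)
# pushes the loss below log(2).
assert abs(dpo_loss(-5.0, -7.0, -5.0, -7.0) - math.log(2.0)) < 1e-9
assert dpo_loss(-5.0, -9.0, -6.0, -7.0) < math.log(2.0)
```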

[5] Hong et al., "ORPO: Monolithic Preference Optimization without Reference Model", EMNLP 2024. Basis for apr align --method orpo. No reference model needed — simpler than DPO.
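
ORPO's penalty term can be sketched as a log-sigmoid on the odds ratio of the two responses, added directly to the chosen response's NLL; note there is no reference-model term anywhere (probabilities here are the policy's length-normalized response probabilities, an assumption of this sketch).

```python
import math

# ORPO odds-ratio penalty for one preference pair (illustrative).

def odds(p):
    return p / (1.0 - p)

def orpo_loss(nll_chosen, p_chosen, p_rejected, lam=0.1):
    log_or = math.log(odds(p_chosen) / odds(p_rejected))
    # -log sigmoid(log odds ratio): small when chosen odds dominate.
    penalty = -math.log(1.0 / (1.0 + math.exp(-log_or)))
    return nll_chosen + lam * penalty

# Equal odds -> penalty log(2); higher chosen odds shrink the penalty.
assert orpo_loss(1.0, 0.5, 0.5) > orpo_loss(1.0, 0.8, 0.5)
```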

20.2 Model Compression

[6] Sun et al., "A Simple and Effective Pruning Approach for Large Language Models" (Wanda), ICLR 2024. Basis for apr prune --method wanda. Activation-aware pruning in one shot.
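
Wanda's metric scores each weight by |W_ij| · ‖X_j‖₂, where ‖X_j‖ is the L2 norm of input feature j over a calibration batch, then zeros the lowest-scoring fraction within each output row — no retraining. A minimal sketch:

```python
import math

# One-shot activation-aware pruning (Wanda-style metric, illustrative).

def wanda_prune(W, X, sparsity=0.5):
    d_in = len(W[0])
    # Per-input-feature L2 norms from the calibration inputs X.
    norms = [math.sqrt(sum(x[j] ** 2 for x in X)) for j in range(d_in)]
    pruned = []
    for row in W:
        scores = [abs(w) * n for w, n in zip(row, norms)]
        k = int(d_in * sparsity)                       # weights to drop
        order = sorted(range(d_in), key=lambda j: scores[j])
        new_row = list(row)
        for j in order[:k]:                            # zero lowest scores
            new_row[j] = 0.0
        pruned.append(new_row)
    return pruned

# Uniform activations reduce the metric to plain magnitude pruning.
W = [[0.9, 0.1, -0.5, 0.05]]
X = [[1.0, 1.0, 1.0, 1.0]]
assert wanda_prune(W, X, sparsity=0.5) == [[0.9, 0.0, -0.5, 0.0]]
```

With non-uniform activations the metric diverges from magnitude pruning: a small weight on a highly active input can outrank a large weight on a dormant one.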

[7] Frantar & Alistarh, "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot", ICML 2023. Alternative pruning approach. Basis for structured pruning comparisons.

[8] Yadav et al., "TIES-Merging: Resolving Interference When Merging Models", NeurIPS 2023. Basis for apr merge --strategy ties. Trim, elect sign, disjoint merge.

[9] Yu et al., "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (DARE), arXiv:2311.03099, 2023. Basis for apr merge --strategy dare. Drop and rescale for sparse merging.
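
"Drop and rescale" is just that: drop a fraction p of each task vector's entries at random and scale the survivors by 1/(1−p), so the expected delta is unchanged before merging. A minimal sketch:

```python
import random

# DARE sketch: random drop + rescale of a task vector (delta weights).

def dare(task_vector, p, rng):
    return [0.0 if rng.random() < p else v / (1.0 - p)
            for v in task_vector]

rng = random.Random(0)
sparse = dare([1.0] * 10000, p=0.9, rng=rng)
kept = sum(1 for v in sparse if v != 0.0)
# ~10% survive, each scaled 10x, so the mean delta is preserved.
assert abs(kept / len(sparse) - 0.1) < 0.02
assert abs(sum(sparse) / len(sparse) - 1.0) < 0.25
```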

[10] Goddard et al., "Arcee's MergeKit: A Toolkit for Merging Large Language Models", arXiv:2403.13257, 2024. Reference implementation for SLERP, TIES, DARE merge strategies.
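
Of the three strategies, SLERP is the easiest to show in isolation: spherical linear interpolation between two flattened weight vectors, which preserves norm along the great-circle path where plain averaging would shrink it (a sketch, not MergeKit's code).

```python
import math

# SLERP between two weight vectors (illustrative).

def slerp(a, b, t):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    omega = math.acos(max(-1.0, min(1.0, dot / (na * nb))))
    if omega < 1e-8:                       # nearly parallel: plain lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(omega)
    return [(math.sin((1 - t) * omega) / s) * x +
            (math.sin(t * omega) / s) * y for x, y in zip(a, b)]

# Halfway between orthogonal unit vectors stays on the unit sphere,
# whereas the linear midpoint would have norm 1/sqrt(2).
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
assert abs(math.hypot(*mid) - 1.0) < 1e-12
```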

20.3 GPU Architecture

[20] NVIDIA, "Parallel Thread Execution ISA Version 8.5", 2024. PTX is NVIDIA's stable intermediate representation. trueno-gpu writes kernels as PTX string templates in Rust — no nvcc, no CUDA toolkit. JIT-compiled to SASS at runtime by the CUDA driver. This is the same fallback mechanism PyTorch uses for unsupported architectures; trueno-gpu uses it as the primary path (§5.10).
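
The kernel-as-string-template idea can be sketched as follows (shown in Python for brevity; trueno-gpu does this in Rust, and the template and function names here are illustrative, not the crate's actual API): generate PTX source at runtime and hand it to the CUDA driver for JIT compilation to SASS, with no nvcc involved.

```python
# Illustrative PTX string template; the kernel body is elided.
PTX_AXPY = """\
.version 8.5
.target sm_{sm}
.address_size 64

.visible .entry axpy(.param .u64 x, .param .u64 y, .param .f32 a)
{{
    // body elided: ld.param, fma.rn.f32, st.global
}}
"""

def render_kernel(sm_arch: int) -> str:
    """Instantiate the PTX template for a target SM architecture."""
    return PTX_AXPY.format(sm=sm_arch)

src = render_kernel(89)
assert ".target sm_89" in src and ".visible .entry axpy" in src
```

The rendered string would then be passed to the driver (e.g. cuModuleLoadData), which JIT-compiles PTX to the device's native SASS at load time.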

20.4 Inference Optimization

[11] Leviathan et al., "Fast Inference from Transformers via Speculative Decoding", ICML 2023. Basis for apr run --speculative. Draft model proposes, main model verifies.
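
The per-token acceptance rule at the heart of speculative decoding: the draft samples token t with probability q(t), and the main model accepts it with probability min(1, p(t)/q(t)); on rejection it resamples from the normalized residual max(0, p − q), not shown here. A sketch of the accept step only:

```python
import random

# Speculative-decoding acceptance test for one drafted token.

def accept_token(p_t, q_t, rng):
    """True if the target model accepts the draft model's token."""
    return rng.random() < min(1.0, p_t / q_t)

rng = random.Random(0)
# Draft overconfident (q > p): accepted only p/q = 0.5 of the time.
trials = 100000
accepted = sum(accept_token(0.3, 0.6, rng) for _ in range(trials))
assert abs(accepted / trials - 0.5) < 0.01
```

This rule makes the combined scheme sample from exactly the main model's distribution, so speedup comes free of quality loss.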

[12] Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models", ICLR 2023. Basis for N-sampling + majority voting reranking in apr eval --n-samples --rerank majority.
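
The reranking step itself is a majority vote over the N sampled final answers — a sketch:

```python
from collections import Counter

# Self-consistency reranking: pick the most frequent sampled answer.

def majority_vote(samples):
    answer, _ = Counter(samples).most_common(1)[0]
    return answer

samples = ["42", "41", "42", "42", "17"]
assert majority_vote(samples) == "42"
```

For code generation, the "answer" being voted on is typically a canonicalized output (e.g. execution results on test inputs) rather than the raw completion text.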

[13] Li et al., "Structured Chain-of-Thought Prompting for Code Generation", ACM TOSEM 2025. Basis for --prompt-strategy scot. Structure reasoning before code output. Dogfooding note: SCoT hurts ≤7B Q4K models (-3.05pp on HumanEval, §22.0). Reasoning overhead consumes token budget. Simple few-shot prompting (+1.83pp) is superior at this scale.

20.5 Benchmarks and Evaluation

[14] Hui et al., "Qwen2.5-Coder Technical Report", arXiv:2409.12186, 2024. Primary target model architecture. Baseline scores for HumanEval/MBPP.

[15] Jain et al., "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code", arXiv:2403.07974, 2024. Continuously refreshed benchmark. Contamination-resistant evaluation.

[16] Zhuo et al., "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions", arXiv:2406.15877, 2024. Practical coding tasks with library usage. Not yet saturated (GPT-4o ~61%).

[17] NVIDIA, "OpenCodeReasoning: Advancing Data Distillation for Competitive Coding", arXiv:2504.01943, 2025. OCR-Nemotron reasoning distillation results. LiveCodeBench SOTA.

20.6 Code Generation Foundations

[18] Rozière et al., "Code Llama: Open Foundation Models for Code", arXiv:2308.12950, 2023. Fill-in-the-middle (FIM) training methodology. Infilling objective for code completion.
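
The FIM data transformation splits a training document into prefix/middle/suffix and reorders it with sentinel tokens so the model learns to infill; a sketch of the PSM (prefix-suffix-middle) ordering, with illustrative sentinel spellings (actual sentinels are tokenizer-specific):

```python
# Fill-in-the-middle transformation, PSM order (illustrative sentinels).
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_psm(doc, i, j):
    """Move doc[i:j] to the end: prefix, suffix, then the middle target."""
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

src = "def add(a, b):\n    return a + b\n"
sample = fim_psm(src, 15, 32)
assert sample.startswith("<PRE>def add")
assert sample.endswith(src[15:32])     # middle span is the training target
```

At inference time the model is prompted with prefix and suffix and generates the middle, which is exactly the editor code-completion setting.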

[19] Chen et al., "Evaluating Large Language Models Trained on Code" (Codex/HumanEval), arXiv:2107.03374, 2021. Defines pass@k metric and unbiased estimator. The benchmark that started it all.
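
The unbiased estimator from this paper is worth stating directly: with n samples of which c pass, pass@k = 1 − C(n−c, k)/C(n, k), the probability that at least one of k drawn samples is correct.

```python
from math import comb

# Unbiased pass@k estimator (Chen et al., 2021).

def pass_at_k(n, c, k):
    if n - c < k:          # too few failures to fill a draw of k
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 1 correct out of 2 samples: pass@1 = 0.5, pass@2 = 1.0.
assert pass_at_k(2, 1, 1) == 0.5
assert pass_at_k(2, 1, 2) == 1.0
```

The early-return guard also avoids the numerical blow-up of the naive 1 − (1 − c/n)^k estimate, which is biased for small n.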