# LoRA Fine-Tuning
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that freezes the base model weights and injects trainable low-rank matrices into attention projections. Instead of updating all parameters (billions for modern LLMs), LoRA trains only a small number of additional parameters (typically < 1% of the base model).
## How LoRA Works
For a pretrained weight matrix \( W_0 \in \mathbb{R}^{d \times k} \), LoRA constrains the update to a low-rank decomposition:
\[ W = W_0 + \Delta W = W_0 + BA \]
where \( B \in \mathbb{R}^{d \times r} \), \( A \in \mathbb{R}^{r \times k} \), and the rank \( r \ll \min(d, k) \).
- A is initialized with random Gaussian values
- B is initialized to zero (so \( \Delta W = 0 \) at the start of training)
- Only A and B are updated during training
- The base model weights \( W_0 \) remain frozen
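To make the update rule concrete, here is a minimal, dependency-free sketch of a LoRA-adapted linear layer. `LoraLinear` and its fields are illustrative names for this sketch, not the crate's actual types, and the Gaussian init is stubbed with a constant to keep the example self-contained.

```rust
/// Illustrative LoRA-adapted linear layer: W0 is frozen, only A and B train.
struct LoraLinear {
    w0: Vec<Vec<f32>>, // frozen base weight, shape [d][k]
    a: Vec<Vec<f32>>,  // trainable, shape [r][k], Gaussian-initialized
    b: Vec<Vec<f32>>,  // trainable, shape [d][r], zero-initialized
}

impl LoraLinear {
    fn new(w0: Vec<Vec<f32>>, rank: usize) -> Self {
        let d = w0.len();
        let k = w0[0].len();
        // A ~ N(0, sigma^2); a small constant stands in for a Gaussian sampler
        // so the sketch needs no external crate.
        let a = vec![vec![0.01_f32; k]; rank];
        // B = 0, so delta_W = B*A = 0 at the start of training.
        let b = vec![vec![0.0_f32; rank]; d];
        Self { w0, a, b }
    }

    /// y = (W0 + B*A) x, computed as W0*x + B*(A*x) to avoid forming delta_W.
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        let d = self.w0.len();
        let k = self.w0[0].len();
        let r = self.a.len();
        // A*x : [r]
        let ax: Vec<f32> = (0..r)
            .map(|i| (0..k).map(|j| self.a[i][j] * x[j]).sum())
            .collect();
        // W0*x + B*(A*x) : [d]
        (0..d)
            .map(|i| {
                let base: f32 = (0..k).map(|j| self.w0[i][j] * x[j]).sum();
                let delta: f32 = (0..r).map(|j| self.b[i][j] * ax[j]).sum();
                base + delta
            })
            .collect()
    }
}

fn main() {
    // 4x3 frozen weight, rank-2 adapter, 3-dim input.
    let layer = LoraLinear::new(vec![vec![0.5_f32; 3]; 4], 2);
    // At initialization B = 0, so the output equals the frozen layer's output.
    println!("{:?}", layer.forward(&[1.0, 2.0, 3.0]));
}
```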
## Supported Base Models
| Model | CLI Flag | Hidden Size | Head Dim | Vocab | Bias |
|---|---|---|---|---|---|
| Qwen2.5-Coder-0.5B | --model-size 0.5B | 896 | 64 | 151,936 | Yes |
| Qwen3.5-9B-Instruct | --model-size 9B | 4,096 | 256 | 248,320 | No |
## CLI Quickstart
### Classification Fine-Tuning
```bash
# Plan mode (estimate VRAM, show config)
apr finetune --task classify --model-size 9B --plan

# Train with data
apr finetune --task classify \
    --model-size 9B \
    --data train.jsonl \
    --epochs 10 \
    --rank 16 \
    -o checkpoints/

# Qwen2.5 (smaller, for testing)
apr finetune --task classify \
    --model-size 0.5B \
    --data train.jsonl \
    -o checkpoints/
```
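The quickstart assumes a `train.jsonl` dataset. Its exact schema is defined by the tool and not reproduced here; as a rough illustration, a JSON-Lines classification set holds one example per line, shown below with hypothetical `text` and `label` fields:

```json
{"text": "fn main() { println!(\"hello\"); }", "label": "rust"}
{"text": "def main():\n    print('hello')", "label": "python"}
```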
### General LoRA Fine-Tuning
```bash
# Plan mode
apr finetune model.apr --method lora --model-size 9B --plan

# Train
apr finetune model.apr --method lora --data train.jsonl -o adapter.apr

# Merge adapter back into base model
apr finetune merge model.apr --adapter adapter.apr -o merged.apr
```
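Merging folds the trained low-rank update back into the frozen weight, so the merged model needs no extra matmul at inference time. A minimal sketch of the arithmetic follows; it illustrates the formula \( W = W_0 + BA \) from above and is not the `apr finetune merge` implementation.

```rust
/// Fold a trained LoRA update into the base weight: W = W0 + B * A.
/// Shapes follow the definitions above: w0 is [d][k], b is [d][r], a is [r][k].
fn merge_lora(w0: &[Vec<f32>], b: &[Vec<f32>], a: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d = w0.len();
    let k = w0[0].len();
    let r = a.len();
    let mut merged = w0.to_vec();
    for i in 0..d {
        for j in 0..k {
            // delta_W[i][j] = sum_t B[i][t] * A[t][j]
            let delta: f32 = (0..r).map(|t| b[i][t] * a[t][j]).sum();
            merged[i][j] += delta;
        }
    }
    merged
}
```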
## Qwen3.5-9B Specifics
Qwen3.5 introduces several architectural differences from Qwen2 that affect fine-tuning:
### No Attention Bias
Qwen3.5 does not use bias in Q/K/V/O projections (`use_bias=false`). This means:
- LoRA adapters target only the weight matrices, not bias vectors
- The LoRA parameter count is slightly lower than equivalent Qwen2 models
### Explicit Head Dimension
Qwen3.5-9B uses `head_dim=256` (vs Qwen2's typical 128). The attention projection shapes are:
| Projection | Shape | Notes |
|---|---|---|
| Q proj | [4096, 4096] | 16 heads x 256 head_dim |
| K proj | [1024, 4096] | 4 KV heads x 256 head_dim |
| V proj | [1024, 4096] | 4 KV heads x 256 head_dim |
| O proj | [4096, 4096] | hidden_dim x (num_heads x head_dim) |
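As a worked example, rank-16 adapters (matching the `--rank 16` setting in the quickstart) on Q and V cost, per layer, exactly what these shapes imply:

\[ \text{Q proj: } r(d + k) = 16 \times (4096 + 4096) = 131{,}072 \qquad \text{V proj: } 16 \times (1024 + 4096) = 81{,}920 \]

That is about 213K trainable parameters per layer, versus roughly 42M frozen parameters in that layer's Q/K/V/O weights, or about 0.5% trainable.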
### Hybrid Attention
Qwen3.5 uses a mix of standard softmax attention and linear attention layers. LoRA targets Q/V projections in both layer types, ensuring the adapter captures both attention mechanisms.
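In other words, adapter injection keys off the projection, not the attention variant. A hypothetical illustration of that selection logic (the enum and the `q_proj`/`v_proj` names are invented for this sketch):

```rust
/// Kind of attention used by a given transformer layer.
enum AttentionKind {
    Softmax, // standard full attention
    Linear,  // linear-attention layers in the hybrid stack
}

/// LoRA targets the same projections regardless of the attention variant,
/// so the adapter captures both mechanisms.
fn lora_targets(_kind: &AttentionKind) -> &'static [&'static str] {
    &["q_proj", "v_proj"]
}
```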
### 248K Vocabulary
The larger vocabulary (248,320 tokens vs Qwen2's ~152K) affects embedding layer dimensions but does not change the LoRA targeting strategy, which focuses on attention projections.
## Classification Pipeline
The classification fine-tuning pipeline consists of:
- Base model config -- loaded via `TransformerConfig::qwen3_5_9b()` or `qwen2_0_5b()`
- LoRA injection -- rank-16 adapters on Q and V projections
- Classification head -- mean pooling + linear layer (hidden_size -> num_classes)
- Training loop -- epoch management, validation split, early stopping, LR scheduling
- Checkpointing -- periodic saves to APR format
```
Input text
     |
     v
[Tokenize] -> [Embedding] -> [Transformer Layers (frozen + LoRA)] -> [Mean Pool]
     |
     v
[Classification Head (trainable)] -> [Softmax] -> [Cross-Entropy Loss]
```
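The mean-pool + linear head is small enough to sketch directly. This illustrates the math in the diagram under assumed names (`ClassificationHead` and its fields are invented here); softmax and cross-entropy are applied during training and are omitted.

```rust
/// Illustrative classification head: mean pooling over the sequence,
/// followed by a single linear projection to num_classes logits.
struct ClassificationHead {
    weight: Vec<Vec<f32>>, // [num_classes][hidden_size], trainable
    bias: Vec<f32>,        // [num_classes], trainable
}

impl ClassificationHead {
    /// hidden_states: [seq_len][hidden_size] from the (frozen + LoRA) transformer.
    fn forward(&self, hidden_states: &[Vec<f32>]) -> Vec<f32> {
        let seq_len = hidden_states.len() as f32;
        let hidden_size = hidden_states[0].len();
        // Mean pool across the sequence dimension.
        let mut pooled = vec![0.0_f32; hidden_size];
        for token in hidden_states {
            for (p, v) in pooled.iter_mut().zip(token) {
                *p += v / seq_len;
            }
        }
        // Linear layer: logits[c] = w[c] . pooled + b[c]
        self.weight
            .iter()
            .zip(&self.bias)
            .map(|(row, b)| {
                row.iter().zip(&pooled).map(|(w, x)| w * x).sum::<f32>() + b
            })
            .collect()
    }
}
```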
## Contract Validation
Fine-tuning configs are validated against the model family contract:
- `contracts/model-families/qwen3_5.yaml` -- source of truth for dimensions
- `contracts/classification-finetune-v1.yaml` -- classification invariants
- `src/format/model_family_contract_falsify.rs` -- Popperian falsification tests
The falsification tests (FALSIFY-FT-QWEN35-001 through 007) verify that the `TransformerConfig::qwen3_5_9b()` factory matches the YAML contract exactly. If they diverge, the test suite catches it before any training runs.
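The spirit of these tests is easy to show. The assertion below is a sketch against the dimensions quoted in this document, not the contents of `model_family_contract_falsify.rs`, and the config field names are assumed for illustration.

```rust
/// Illustrative falsification-style check: the config factory must match the
/// contract's dimensions exactly, or the suite fails before training starts.
/// (Field names are assumptions made for this sketch.)
#[test]
fn qwen3_5_9b_matches_contract() {
    let config = TransformerConfig::qwen3_5_9b();
    assert_eq!(config.hidden_size, 4096);
    assert_eq!(config.head_dim, 256);
    assert_eq!(config.vocab_size, 248_320);
    assert!(!config.attention_bias); // Qwen3.5 uses no Q/K/V/O bias
}
```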
## References
- Hu et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models
- Dettmers et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs
- `docs/specifications/qwen3.5-fine-tune.md` -- full specification