Qwen3.5 Hybrid Attention Architecture
Qwen3.5-9B-Instruct introduces a hybrid attention architecture that alternates
between standard softmax attention and Gated Delta Net (GDN) linear attention
layers. This chapter explains the architecture, the contract-driven config pipeline,
and how to use it with apr.
Architecture Overview
Qwen3.5 uses a 32-layer transformer with two types of attention:
| Layer Type | Mechanism | Complexity | State |
|---|---|---|---|
| Standard (softmax) | Q*K^T / sqrt(d) → softmax → V | O(n^2) | KV cache |
| Linear (GDN) | Recurrent state update | O(n) per step | Recurrent + Conv |
The layer_types field in config.json specifies which layers use which mechanism:
{
"layer_types": [
"attention", "linear", "attention", "linear",
...
]
}
Gated Delta Net Recurrence
Linear attention layers implement the Gated Delta Net equations:
GDN-1: S_t = exp(g_t) * S_{t-1} + k_t (x) delta_t
GDN-2: delta_t = beta_t * (v_t - S_{t-1}^T k_t)
GDN-3: o_t = S_t^T q_t
Where:
g_t = -exp(A_log) * softplus(a_t + dt_bias)is the decay factorbeta_t = sigma(b_t)is the update gate- Q, K are L2-normalized
- State
Sis a[num_v_heads, key_head_dim, value_head_dim]matrix
Key Differences from Qwen2
| Parameter | Qwen2-7B | Qwen3.5-9B |
|---|---|---|
| head_dim | 128 (computed) | 256 (explicit) |
| num_heads | 28 | 16 |
| num_kv_heads | 4 | 4 |
| attention_bias | true | false |
| layer_types | all standard | hybrid |
| vocab_size | 152064 | 248320 |
Config Pipeline
The Qwen3.5 config flows through three stages:
SafeTensors config.json
|
v (SafetensorsConfig with layer_types, head_dim, linear_* fields)
AprTransformerConfig
|
v (config_to_gpu with explicit_head_dim, layer_types)
GpuModelConfig
|
v (is_linear_layer(block_idx) dispatch)
forward_linear_block_incremental() or forward_block_incremental()
Contract-Driven Validation
The contracts/model-families/qwen3_5.yaml contract enforces:
constraints:
attention_type: gqa
has_bias: "false" # No attention bias tensors
activation: silu
mlp_type: swiglu
positional_encoding: rope
These constraints drive weight loading: has_bias=false means the loader
skips bias tensors entirely instead of loading zeros.
Weight Loading
Linear attention layers have different tensor names from standard layers:
Standard Attention Layer Tensors
model.layers.{n}.self_attn.q_proj.weight [4096, 4096]
model.layers.{n}.self_attn.k_proj.weight [1024, 4096]
model.layers.{n}.self_attn.v_proj.weight [1024, 4096]
model.layers.{n}.self_attn.o_proj.weight [4096, 4096]
Gated Delta Net Layer Tensors
model.layers.{n}.self_attn.in_proj_qkvz.weight [QKVZ_dim, 4096]
model.layers.{n}.self_attn.in_proj_ba.weight [2*num_v_heads, 4096]
model.layers.{n}.self_attn.out_proj.weight [4096, value_dim]
model.layers.{n}.self_attn.conv1d.weight [conv_dim, 1, kernel]
model.layers.{n}.self_attn.A_log [num_v_heads]
model.layers.{n}.self_attn.dt_bias [num_v_heads]
model.layers.{n}.self_attn.norm.weight [value_dim]
The in_proj_qkvz tensor is a combined projection that gets split into
Q, K, V, and Z (gate) during loading:
in_proj_qkvz = [Q | K | V | Z]
Q: [key_dim, hidden_dim] -> qkv_weight (part)
K: [key_dim, hidden_dim] -> qkv_weight (part)
V: [value_dim, hidden_dim] -> qkv_weight (part)
Z: [value_dim, hidden_dim] -> linear_attn.z_weight
CLI Usage
# Import Qwen3.5 from HuggingFace
apr import hf://Qwen/Qwen3.5-9B-Instruct -o qwen35.apr --arch qwen3_5
# Inspect hybrid attention config
apr inspect qwen35.apr | grep -E "layer_types|linear_"
# Run inference (realizar handles dispatch)
apr run qwen35.apr --prompt "What is 2+2?" --max-tokens 32
# QA validation
apr qa qwen35.apr --assert-tps 50
Falsification Tests
The contract is protected by 8 Popperian falsification tests:
| Test | What it tries to break |
|---|---|
| QWEN35-001 | Exact dimensions (4096, 256, 16, 4, 248320) |
| QWEN35-002 | has_bias must be false |
| QWEN35-003 | hidden_dim == num_heads * head_dim |
| QWEN35-004 | GQA divisibility (16 % 4 == 0) |
| QWEN35-005 | Shape template dimensions |
| QWEN35-006 | rope_theta = 1,000,000 |
| QWEN35-007 | Architecture class = Qwen3_5ForCausalLM |
| QWEN35-008 | SwiGLU MLP (silu + swiglu) |
Run: cargo test -- falsify_mf_qwen35
Fine-Tuning Support
Qwen3.5-9B is wired into the apr finetune CLI for classification fine-tuning:
# Plan mode — shows config and trainable parameter count
apr finetune --task classify --model-size 9B --plan
# Output:
# Model: 4096h x 32L
# LoRA: rank=16, alpha=16.0, 64 adapters
# Classifier: 4096->5 (20485 params)
# Total trainable: 6,836,229 params
The TransformerConfig::qwen3_5_9b() factory in entrenar mirrors the contract:
| Config Field | Value | Source |
|---|---|---|
| hidden_size | 4096 | qwen3_5.yaml hidden_dim |
| num_attention_heads | 16 | qwen3_5.yaml num_heads |
| num_kv_heads | 4 | qwen3_5.yaml num_kv_heads |
| num_hidden_layers | 32 | qwen3_5.yaml num_layers |
| vocab_size | 248320 | qwen3_5.yaml vocab_size |
| use_bias | false | qwen3_5.yaml has_bias |
| head_dim() | 256 | 4096 / 16 |
CLI aliases: --model-size 9B, --model-size qwen3.5-9b, --model-size qwen3.5
Fine-Tuning Falsification Tests
Seven additional tests (FALSIFY-FT-QWEN35-001..007) verify the config factory matches the contract:
| Test | What it tries to break |
|---|---|
| FT-QWEN35-001 | vocab_size must be 248320 (not Qwen2's 152064) |
| FT-QWEN35-002 | use_bias must be false |
| FT-QWEN35-003 | head_dim() must be 256 |
| FT-QWEN35-004 | num_hidden_layers must be 32 |
| FT-QWEN35-005 | num_kv_heads must be 4 |
| FT-QWEN35-006 | All YAML dimensions match config factory |
| FT-QWEN35-007 | CLI dispatch consistency (9B != Qwen2) |
Run: cargo test -- falsify_ft_qwen35