Qwen3.5 Hybrid Attention Architecture

Qwen3.5-9B-Instruct introduces a hybrid attention architecture that alternates between standard softmax attention and Gated Delta Net (GDN) linear attention layers. This chapter explains the architecture, the contract-driven config pipeline, and how to use it with apr.

Architecture Overview

Qwen3.5 uses a 32-layer transformer with two types of attention:

Layer Type	Mechanism	Complexity	State
Standard (softmax)	Q*K^T / sqrt(d) → softmax → V	O(n^2)	KV cache
Linear (GDN)	Recurrent state update	O(n) per step	Recurrent + Conv

The layer_types field in config.json specifies which layers use which mechanism:

{
  "layer_types": [
    "attention", "linear", "attention", "linear",
    ...
  ]
}

Gated Delta Net Recurrence

Linear attention layers implement the Gated Delta Net equations:

GDN-1: S_t = exp(g_t) * S_{t-1} + k_t (x) delta_t
GDN-2: delta_t = beta_t * (v_t - S_{t-1}^T k_t)
GDN-3: o_t = S_t^T q_t

Where:

g_t = -exp(A_log) * softplus(a_t + dt_bias) is the decay factor
beta_t = sigma(b_t) is the update gate
Q, K are L2-normalized
State S is a [num_v_heads, key_head_dim, value_head_dim] matrix

Key Differences from Qwen2

Parameter	Qwen2-7B	Qwen3.5-9B
head_dim	128 (computed)	256 (explicit)
num_heads	28	16
num_kv_heads	4	4
attention_bias	true	false
layer_types	all standard	hybrid
vocab_size	152064	248320

Config Pipeline

The Qwen3.5 config flows through three stages:

SafeTensors config.json
    |
    v  (SafetensorsConfig with layer_types, head_dim, linear_* fields)
AprTransformerConfig
    |
    v  (config_to_gpu with explicit_head_dim, layer_types)
GpuModelConfig
    |
    v  (is_linear_layer(block_idx) dispatch)
forward_linear_block_incremental() or forward_block_incremental()

Contract-Driven Validation

The contracts/model-families/qwen3_5.yaml contract enforces:

constraints:
  attention_type: gqa
  has_bias: "false"        # No attention bias tensors
  activation: silu
  mlp_type: swiglu
  positional_encoding: rope

These constraints drive weight loading: has_bias=false means the loader skips bias tensors entirely instead of loading zeros.

Weight Loading

Linear attention layers have different tensor names from standard layers:

Standard Attention Layer Tensors

model.layers.{n}.self_attn.q_proj.weight    [4096, 4096]
model.layers.{n}.self_attn.k_proj.weight    [1024, 4096]
model.layers.{n}.self_attn.v_proj.weight    [1024, 4096]
model.layers.{n}.self_attn.o_proj.weight    [4096, 4096]

Gated Delta Net Layer Tensors

model.layers.{n}.self_attn.in_proj_qkvz.weight  [QKVZ_dim, 4096]
model.layers.{n}.self_attn.in_proj_ba.weight     [2*num_v_heads, 4096]
model.layers.{n}.self_attn.out_proj.weight        [4096, value_dim]
model.layers.{n}.self_attn.conv1d.weight          [conv_dim, 1, kernel]
model.layers.{n}.self_attn.A_log                  [num_v_heads]
model.layers.{n}.self_attn.dt_bias                [num_v_heads]
model.layers.{n}.self_attn.norm.weight            [value_dim]

The in_proj_qkvz tensor is a combined projection that gets split into Q, K, V, and Z (gate) during loading:

in_proj_qkvz = [Q | K | V | Z]
  Q: [key_dim, hidden_dim]     -> qkv_weight (part)
  K: [key_dim, hidden_dim]     -> qkv_weight (part)
  V: [value_dim, hidden_dim]   -> qkv_weight (part)
  Z: [value_dim, hidden_dim]   -> linear_attn.z_weight

CLI Usage

# Import Qwen3.5 from HuggingFace
apr import hf://Qwen/Qwen3.5-9B-Instruct -o qwen35.apr --arch qwen3_5

# Inspect hybrid attention config
apr inspect qwen35.apr | grep -E "layer_types|linear_"

# Run inference (realizar handles dispatch)
apr run qwen35.apr --prompt "What is 2+2?" --max-tokens 32

# QA validation
apr qa qwen35.apr --assert-tps 50

Falsification Tests

The contract is protected by 8 Popperian falsification tests:

Test	What it tries to break
QWEN35-001	Exact dimensions (4096, 256, 16, 4, 248320)
QWEN35-002	has_bias must be false
QWEN35-003	hidden_dim == num_heads * head_dim
QWEN35-004	GQA divisibility (16 % 4 == 0)
QWEN35-005	Shape template dimensions
QWEN35-006	rope_theta = 1,000,000
QWEN35-007	Architecture class = Qwen3_5ForCausalLM
QWEN35-008	SwiGLU MLP (silu + swiglu)

Run: cargo test -- falsify_mf_qwen35

Fine-Tuning Support

Qwen3.5-9B is wired into the apr finetune CLI for classification fine-tuning:

# Plan mode — shows config and trainable parameter count
apr finetune --task classify --model-size 9B --plan

# Output:
#   Model: 4096h x 32L
#   LoRA: rank=16, alpha=16.0, 64 adapters
#   Classifier: 4096->5 (20485 params)
#   Total trainable: 6,836,229 params

The TransformerConfig::qwen3_5_9b() factory in entrenar mirrors the contract:

Config Field	Value	Source
hidden_size	4096	qwen3_5.yaml `hidden_dim`
num_attention_heads	16	qwen3_5.yaml `num_heads`
num_kv_heads	4	qwen3_5.yaml `num_kv_heads`
num_hidden_layers	32	qwen3_5.yaml `num_layers`
vocab_size	248320	qwen3_5.yaml `vocab_size`
use_bias	false	qwen3_5.yaml `has_bias`
head_dim()	256	4096 / 16

CLI aliases: --model-size 9B, --model-size qwen3.5-9b, --model-size qwen3.5

Fine-Tuning Falsification Tests

Seven additional tests (FALSIFY-FT-QWEN35-001..007) verify the config factory matches the contract:

Test	What it tries to break
FT-QWEN35-001	vocab_size must be 248320 (not Qwen2's 152064)
FT-QWEN35-002	use_bias must be false
FT-QWEN35-003	head_dim() must be 256
FT-QWEN35-004	num_hidden_layers must be 32
FT-QWEN35-005	num_kv_heads must be 4
FT-QWEN35-006	All YAML dimensions match config factory
FT-QWEN35-007	CLI dispatch consistency (9B != Qwen2)

Run: cargo test -- falsify_ft_qwen35

Aprender — Pure Rust ML Framework