Qwen3.5 Hybrid Attention Architecture

Qwen3.5-9B-Instruct introduces a hybrid attention architecture that alternates between standard softmax attention and Gated Delta Net (GDN) linear attention layers. This chapter explains the architecture, the contract-driven config pipeline, and how to use it with apr.

Architecture Overview

Qwen3.5 uses a 32-layer transformer with two types of attention:

| Layer Type | Mechanism | Complexity | State |
|---|---|---|---|
| Standard (softmax) | Q*K^T / sqrt(d) → softmax → V | O(n^2) over the sequence | KV cache |
| Linear (GDN) | Recurrent state update | O(n) over the sequence, O(1) per step | Recurrent + Conv |

The layer_types field in config.json specifies which layers use which mechanism:

{
  "layer_types": [
    "attention", "linear", "attention", "linear",
    ...
  ]
}
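
As a minimal sketch of how this field might be read, the snippet below parses a shortened, hypothetical config and lists the linear-attention blocks. The four-entry list stands in for the real per-layer list, and the two validation rules (one entry per layer, only known type names) are assumptions for illustration, not checks the source confirms.

```python
import json

# Shortened, hypothetical config: a real file lists one entry per layer.
config_text = """
{
  "num_hidden_layers": 4,
  "layer_types": ["attention", "linear", "attention", "linear"]
}
"""

KNOWN_LAYER_TYPES = {"attention", "linear"}


def parse_layer_types(text):
    """Parse layer_types and check it against the declared layer count."""
    config = json.loads(text)
    layer_types = config["layer_types"]
    if len(layer_types) != config["num_hidden_layers"]:
        raise ValueError("layer_types must have one entry per layer")
    unknown = set(layer_types) - KNOWN_LAYER_TYPES
    if unknown:
        raise ValueError(f"unknown layer types: {sorted(unknown)}")
    return layer_types


layer_types = parse_layer_types(config_text)
linear_blocks = [i for i, kind in enumerate(layer_types) if kind == "linear"]
print(linear_blocks)  # [1, 3]
```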

Gated Delta Net Recurrence

Linear attention layers implement the Gated Delta Net equations:

GDN-1: S_t = exp(g_t) * S_{t-1} + k_t (x) delta_t
GDN-2: delta_t = beta_t * (v_t - S_{t-1}^T k_t)
GDN-3: o_t = S_t^T q_t

Where:

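  • g_t = -exp(A_log) * softplus(a_t + dt_bias) is the log-decay; it is always negative, so the decay factor exp(g_t) lies in (0, 1)
  • beta_t = sigma(b_t) is the update gate, where sigma is the logistic sigmoid
  • (x) denotes the outer product, so k_t (x) delta_t has shape [key_head_dim, value_head_dim]
  • Q, K are L2-normalized
  • State S is a [num_v_heads, key_head_dim, value_head_dim] tensor, one key_head_dim x value_head_dim matrix per value head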
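
The three equations can be traced for one head with a small pure-Python sketch. It follows GDN-1 to GDN-3 exactly as written above; the function names, the `eps` term in the normalization, and the list-of-rows state layout are illustrative assumptions, and the real kernels operate on batched GPU tensors.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def l2_normalize(vec, eps=1e-6):
    # eps is an assumed stabilizer; the source does not give its value.
    norm = math.sqrt(sum(x * x for x in vec) + eps)
    return [x / norm for x in vec]


def gdn_step(S, q, k, v, a, b, A_log, dt_bias):
    """One recurrent step for a single head.

    S is a key_head_dim x value_head_dim matrix stored as a list of rows.
    Returns (new_state, output).
    """
    q, k = l2_normalize(q), l2_normalize(k)
    g = -math.exp(A_log) * softplus(a + dt_bias)  # log-decay, always < 0
    beta = sigmoid(b)                             # update gate in (0, 1)
    dk, dv = len(k), len(v)
    # GDN-2: delta_t = beta_t * (v_t - S_{t-1}^T k_t)
    pred = [sum(S[i][j] * k[i] for i in range(dk)) for j in range(dv)]
    delta = [beta * (v[j] - pred[j]) for j in range(dv)]
    # GDN-1: S_t = exp(g_t) * S_{t-1} + k_t (x) delta_t
    decay = math.exp(g)
    S_new = [[decay * S[i][j] + k[i] * delta[j] for j in range(dv)]
             for i in range(dk)]
    # GDN-3: o_t = S_t^T q_t
    out = [sum(S_new[i][j] * q[i] for i in range(dk)) for j in range(dv)]
    return S_new, out


S0 = [[0.0, 0.0], [0.0, 0.0]]
S1, o1 = gdn_step(S0, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 4.0],
                  a=0.0, b=0.0, A_log=0.0, dt_bias=0.0)
print(o1)  # close to [1.0, 2.0]: an empty state plus beta = 0.5 writes half of v
```

Starting from an empty state, the prediction S^T k is zero, so delta is simply beta * v; reading the state back with q = k then returns that half-strength value.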

Key Differences from Qwen2

| Parameter | Qwen2-7B | Qwen3.5-9B |
|---|---|---|
| head_dim | 128 (computed) | 256 (explicit) |
| num_heads | 28 | 16 |
| num_kv_heads | 4 | 4 |
| attention_bias | true | false |
| layer_types | all standard | hybrid |
| vocab_size | 152064 | 248320 |

Config Pipeline

The Qwen3.5 config passes through three representations before each block is dispatched to the matching forward function:

SafeTensors config.json
    |
    v  (SafetensorsConfig with layer_types, head_dim, linear_* fields)
AprTransformerConfig
    |
    v  (config_to_gpu with explicit_head_dim, layer_types)
GpuModelConfig
    |
    v  (is_linear_layer(block_idx) dispatch)
forward_linear_block_incremental() or forward_block_incremental()
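
The final dispatch step can be sketched as follows. These are Python stand-ins that borrow the names from the diagram; the actual functions belong to the project's own codebase, the bodies here are placeholders that only report which path ran, and treating a missing layer_types entry as a standard block is an assumption.

```python
class GpuModelConfig:
    """Illustrative stand-in holding only the field needed for dispatch."""

    def __init__(self, layer_types):
        self.layer_types = list(layer_types)

    def is_linear_layer(self, block_idx):
        # Assumption: a missing entry means a standard attention block,
        # so models without layer_types behave as all-softmax.
        if block_idx >= len(self.layer_types):
            return False
        return self.layer_types[block_idx] == "linear"


def forward_block_incremental(block_idx):
    # Placeholder for the softmax path.
    return f"block {block_idx}: softmax attention (KV cache)"


def forward_linear_block_incremental(block_idx):
    # Placeholder for the Gated Delta Net path.
    return f"block {block_idx}: GDN linear attention (recurrent state)"


def forward_token(config, num_layers):
    """Walk the blocks in order and pick a forward function per block."""
    trace = []
    for block_idx in range(num_layers):
        if config.is_linear_layer(block_idx):
            trace.append(forward_linear_block_incremental(block_idx))
        else:
            trace.append(forward_block_incremental(block_idx))
    return trace


config = GpuModelConfig(["attention", "linear"])
for line in forward_token(config, num_layers=2):
    print(line)
```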

Contract-Driven Validation

The contracts/model-families/qwen3_5.yaml contract enforces:

constraints:
  attention_type: gqa
  has_bias: "false"        # No attention bias tensors
  activation: silu
  mlp_type: swiglu
  positional_encoding: rope

These constraints drive weight loading: has_bias=false means the loader skips bias tensors entirely instead of loading zeros.
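
A rough sketch of how such a flag could shape the list of tensors a loader asks for. The helper name is hypothetical, and placing biases on q/k/v only (the Qwen2 layout) is an assumption.

```python
def expected_attention_tensors(layer, has_bias):
    """Names a loader would request for one standard attention layer."""
    prefix = f"model.layers.{layer}.self_attn"
    names = [f"{prefix}.{proj}.weight"
             for proj in ("q_proj", "k_proj", "v_proj", "o_proj")]
    if has_bias:
        # Assumption: biases exist on q/k/v only, as in Qwen2 checkpoints.
        names += [f"{prefix}.{proj}.bias"
                  for proj in ("q_proj", "k_proj", "v_proj")]
    return names


print(len(expected_attention_tensors(0, has_bias=True)))   # 7
print(len(expected_attention_tensors(0, has_bias=False)))  # 4
```

With has_bias set to false the bias names never enter the list, so no zero-filled placeholders are allocated for them.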

Weight Loading

Linear attention layers have different tensor names from standard layers:

Standard Attention Layer Tensors

model.layers.{n}.self_attn.q_proj.weight    [4096, 4096]
model.layers.{n}.self_attn.k_proj.weight    [1024, 4096]
model.layers.{n}.self_attn.v_proj.weight    [1024, 4096]
model.layers.{n}.self_attn.o_proj.weight    [4096, 4096]
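
These shapes follow from the head counts given earlier. A short check, using only the dimensions stated in this chapter (the variable names are illustrative):

```python
hidden_dim, num_heads, num_kv_heads, head_dim = 4096, 16, 4, 256

# Each projection weight is stored as [out_features, in_features].
shapes = {
    "q_proj": [num_heads * head_dim, hidden_dim],
    "k_proj": [num_kv_heads * head_dim, hidden_dim],
    "v_proj": [num_kv_heads * head_dim, hidden_dim],
    "o_proj": [hidden_dim, num_heads * head_dim],
}

for name, shape in shapes.items():
    print(name, shape)

# With grouped-query attention, each KV head serves this many query heads.
queries_per_kv_head = num_heads // num_kv_heads
print(queries_per_kv_head)  # 4
```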

Gated Delta Net Layer Tensors

model.layers.{n}.self_attn.in_proj_qkvz.weight  [QKVZ_dim, 4096]
model.layers.{n}.self_attn.in_proj_ba.weight     [2*num_v_heads, 4096]
model.layers.{n}.self_attn.out_proj.weight        [4096, value_dim]
model.layers.{n}.self_attn.conv1d.weight          [conv_dim, 1, kernel]
model.layers.{n}.self_attn.A_log                  [num_v_heads]
model.layers.{n}.self_attn.dt_bias                [num_v_heads]
model.layers.{n}.self_attn.norm.weight            [value_dim]

The in_proj_qkvz tensor is a combined projection that gets split into Q, K, V, and Z (gate) during loading:

in_proj_qkvz = [Q | K | V | Z]
  Q: [key_dim, hidden_dim]     -> qkv_weight (part)
  K: [key_dim, hidden_dim]     -> qkv_weight (part)
  V: [value_dim, hidden_dim]   -> qkv_weight (part)
  Z: [value_dim, hidden_dim]   -> linear_attn.z_weight
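
A minimal sketch of the split, assuming the four parts sit in contiguous row blocks in the order shown above, which makes QKVZ_dim equal to 2*key_dim + 2*value_dim. A checkpoint that interleaves the parts per head would need a different slicing scheme. The function name and the tiny dimensions are illustrative only.

```python
def split_qkvz(rows, key_dim, value_dim):
    """Split combined projection rows into (qkv_weight, z_weight).

    rows is a list of weight rows laid out as [Q | K | V | Z].
    """
    expected = 2 * key_dim + 2 * value_dim
    if len(rows) != expected:
        raise ValueError(f"expected {expected} rows, got {len(rows)}")
    v_end = 2 * key_dim + value_dim  # Q, K, V occupy the first rows
    qkv_weight = rows[:v_end]        # stays fused as one projection
    z_weight = rows[v_end:]          # gate projection, stored separately
    return qkv_weight, z_weight


# Toy tensor: key_dim=2, value_dim=3, hidden_dim=1; each row holds its index.
rows = [[float(i)] for i in range(10)]
qkv_weight, z_weight = split_qkvz(rows, key_dim=2, value_dim=3)
print(len(qkv_weight), len(z_weight))  # 7 3
```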

CLI Usage

# Import Qwen3.5 from HuggingFace
apr import hf://Qwen/Qwen3.5-9B-Instruct -o qwen35.apr --arch qwen3_5

# Inspect hybrid attention config
apr inspect qwen35.apr | grep -E "layer_types|linear_"

# Run inference (realizar handles dispatch)
apr run qwen35.apr --prompt "What is 2+2?" --max-tokens 32

# QA validation
apr qa qwen35.apr --assert-tps 50

Falsification Tests

The contract is protected by 8 Popperian falsification tests:

| Test | What it tries to break |
|---|---|
| QWEN35-001 | Exact dimensions (4096, 256, 16, 4, 248320) |
| QWEN35-002 | has_bias must be false |
| QWEN35-003 | hidden_dim == num_heads * head_dim |
| QWEN35-004 | GQA divisibility (16 % 4 == 0) |
| QWEN35-005 | Shape template dimensions |
| QWEN35-006 | rope_theta = 1,000,000 |
| QWEN35-007 | Architecture class = Qwen3_5ForCausalLM |
| QWEN35-008 | SwiGLU MLP (silu + swiglu) |

Run: cargo test -- falsify_mf_qwen35
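
The idea behind a falsification test is to state what the contract forbids and confirm that a deliberately wrong config is rejected. The Python sketch below illustrates that pattern for a few of the checks in the table; it is not the project's test code, and the dictionary keys and the `violations` helper are hypothetical.

```python
CONTRACT = {
    "hidden_dim": 4096,
    "head_dim": 256,
    "num_heads": 16,
    "num_kv_heads": 4,
    "vocab_size": 248320,
    "has_bias": False,
    "rope_theta": 1_000_000,
}


def violations(cfg):
    """Return a list of human-readable contract violations (empty if none)."""
    problems = []
    for key, want in CONTRACT.items():
        if cfg.get(key) != want:
            problems.append(f"{key}: expected {want}, got {cfg.get(key)}")
    if cfg["hidden_dim"] != cfg["num_heads"] * cfg["head_dim"]:
        problems.append("hidden_dim != num_heads * head_dim")
    if cfg["num_heads"] % cfg["num_kv_heads"] != 0:
        problems.append("num_heads not divisible by num_kv_heads")
    return problems


# The contract itself must pass.
print(violations(CONTRACT))  # []

# A config carrying Qwen2's vocabulary size must be caught.
mutated = dict(CONTRACT, vocab_size=152064)
print(violations(mutated))
```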

Fine-Tuning Support

Qwen3.5-9B is wired into the apr finetune CLI for classification fine-tuning:

# Plan mode — shows config and trainable parameter count
apr finetune --task classify --model-size 9B --plan

# Output:
#   Model: 4096h x 32L
#   LoRA: rank=16, alpha=16.0, 64 adapters
#   Classifier: 4096->5 (20485 params)
#   Total trainable: 6,836,229 params
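
The printed totals can be reproduced by hand. The sketch below assumes one LoRA adapter on q_proj and one on v_proj in each of the 32 layers, using the standard attention shapes from this chapter. The source does not name the adapter targets, but this decomposition matches the adapter count and both parameter figures.

```python
hidden_dim, num_heads, num_kv_heads, head_dim = 4096, 16, 4, 256
num_layers, rank, num_classes = 32, 16, 5


def lora_params(in_dim, out_dim, rank):
    # A LoRA adapter adds A (rank x in_dim) and B (out_dim x rank).
    return rank * (in_dim + out_dim)


q_out = num_heads * head_dim       # 4096
v_out = num_kv_heads * head_dim    # 1024

# Assumed targets: q_proj and v_proj in every layer.
per_layer = (lora_params(hidden_dim, q_out, rank)
             + lora_params(hidden_dim, v_out, rank))
adapters = 2 * num_layers
lora_total = per_layer * num_layers

# Classification head: weight matrix plus bias vector.
classifier = hidden_dim * num_classes + num_classes
total = lora_total + classifier

print(adapters, classifier, total)  # 64 20485 6836229
```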

The TransformerConfig::qwen3_5_9b() factory in entrenar mirrors the contract:

| Config Field | Value | Source |
|---|---|---|
| hidden_size | 4096 | qwen3_5.yaml hidden_dim |
| num_attention_heads | 16 | qwen3_5.yaml num_heads |
| num_kv_heads | 4 | qwen3_5.yaml num_kv_heads |
| num_hidden_layers | 32 | qwen3_5.yaml num_layers |
| vocab_size | 248320 | qwen3_5.yaml vocab_size |
| use_bias | false | qwen3_5.yaml has_bias |
| head_dim() | 256 | 4096 / 16 |

CLI aliases: --model-size 9B, --model-size qwen3.5-9b, --model-size qwen3.5

Fine-Tuning Falsification Tests

Seven additional tests (FALSIFY-FT-QWEN35-001..007) verify the config factory matches the contract:

| Test | What it tries to break |
|---|---|
| FT-QWEN35-001 | vocab_size must be 248320 (not Qwen2's 152064) |
| FT-QWEN35-002 | use_bias must be false |
| FT-QWEN35-003 | head_dim() must be 256 |
| FT-QWEN35-004 | num_hidden_layers must be 32 |
| FT-QWEN35-005 | num_kv_heads must be 4 |
| FT-QWEN35-006 | All YAML dimensions match config factory |
| FT-QWEN35-007 | CLI dispatch consistency (9B != Qwen2) |

Run: cargo test -- falsify_ft_qwen35