Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

apr explain

The apr explain command explains architecture, tensor roles, and kernel dispatch. For microGPT, the equivalent is cargo run --example explain_attention.

Attention mechanics

$ cargo run --example explain_attention

Architecture (cf. `apr explain --kernel gpt2 -v`):
  Attention type:     MHA (multi-head attention)
  Heads:              4
  Head dim:           4
  Total attn dim:     16 (4 × 4)
  Scale factor:       1/√4 = 0.5000

The attention mechanism computes:

Attention(Q, K, V) = softmax(Q @ K^T / √d_k) @ V

This is the same equation used in GPT-2, LLaMA, Qwen, and every other transformer. The only difference is scale: microGPT uses d_k=4, production models use d_k=128.

Attention weight visualization

For input “mar” (tokens [BOS, m, a, r]), the attention weights at random initialization show the causal structure (values vary per run, but the structural invariants are constant):

Head 0:
  pos 0 → [1.000, 0.000, 0.000, 0.000]   ← BOS attends only to itself
  pos 1 → [0.482, 0.518, 0.000, 0.000]   ← 'm' splits between BOS and self
  pos 2 → [0.302, 0.383, 0.315, 0.000]   ← 'a' attends to all prior
  pos 3 → [0.254, 0.247, 0.215, 0.284]   ← 'r' attends broadly

Head 3:
  pos 0 → [1.000, 0.000, 0.000, 0.000]
  pos 1 → [0.494, 0.506, 0.000, 0.000]
  pos 2 → [0.286, 0.339, 0.376, 0.000]   ← head 3 focuses more on 'a'
  pos 3 → [0.247, 0.223, 0.250, 0.280]   ← head 3 focuses more on 'r'

Structural invariants (hold for ANY initialization):

  • Row 0 is always [1, 0, 0, 0] — the BOS token can only see itself
  • Upper triangle is zero — causal masking prevents attending to the future
  • Each row sums to 1.0 — softmax normalizes attention weights

The specific non-trivial values (e.g., 0.482 vs 0.518) reflect random initialization, not learned patterns. After training, head specialization emerges.

Causal mask explained

Equivalent to apr explain --tensor attn_mask:

Causal Mask:
  Each position can only attend to itself and earlier positions.
  Future positions are masked with -1e9 → softmax drives them to ~0.

    pos 0: [  0  ,  -∞ ,  -∞ ,  -∞ ]
    pos 1: [  0  ,  0  ,  -∞ ,  -∞ ]
    pos 2: [  0  ,  0  ,  0  ,  -∞ ]
    pos 3: [  0  ,  0  ,  0  ,  0  ]

The mask uses -1e9 instead of true -∞ to satisfy the upstream softmax precondition contract (x.iter().all(|v| v.is_finite())). Since exp(-1e9) ≈ 0 in f32, this is numerically identical.

Comparison with production models

$ apr explain --kernel llama -v

Constraints (from family contract):
  attention_type:      gqa          ← microGPT uses mha (simpler)
  activation:          silu         ← microGPT uses relu (simpler)
  norm_type:           rmsnorm      ← same as microGPT ✓
  mlp_type:            swiglu       ← microGPT uses relu_mlp (simpler)
  positional_encoding: rope         ← microGPT uses learned (simpler)
  has_bias:            false        ← same as microGPT ✓

microGPT uses the simpler variants of each component. The contract structure is identical — only the kernel implementations differ.