apr trace

The apr trace command traces activations through each layer of the model, printing tensor shapes and summary statistics at every stage. For microGPT, the equivalent is cargo run --example trace_forward.

Full forward pass trace

$ cargo run --example trace_forward

Input: "emma" → tokens [0, 5, 13, 13, 1]

Stage 1: Embedding
  tok_emb = one_hot(tokens) @ wte          shape=[5, 16]  L2=0.72
  pos_emb = one_hot(pos) @ wpe             shape=[5, 16]  L2=0.74
  x = tok_emb + pos_emb                    shape=[5, 16]  L2=1.04
  x = RMSNorm(x)                           shape=[5, 16]  L2=8.94

Stage 2: Multi-Head Self-Attention
  x_n = RMSNorm(x) [pre-attn]              shape=[5, 16]  L2=8.94
  head_0_out @ wo[0]                       shape=[5, 16]  L2=0.50
  head_1_out @ wo[1]                       shape=[5, 16]  L2=0.46
  head_2_out @ wo[2]                       shape=[5, 16]  L2=0.31
  head_3_out @ wo[3]                       shape=[5, 16]  L2=0.38
  x = x_res + attn_out [residual]          shape=[5, 16]  L2=9.09

Stage 3: MLP (FFN)
  x_n = RMSNorm(x) [pre-MLP]               shape=[5, 16]  L2=8.94
  h = x_n @ w_fc1 [up project]             shape=[5, 64]  L2=5.75
  h = ReLU(h)                              shape=[5, 64]  L2=3.87  ← ~50% zeroed
  mlp_out = h @ w_fc2 [down project]       shape=[5, 16]  L2=1.17
  x = x_res + mlp_out [residual]           shape=[5, 16]  L2=9.10

Stage 4: Language Model Head
  logits = x @ w_lm                        shape=[5, 27]  L2=3.29

What the trace reveals

RMSNorm normalization

After RMSNorm, L2≈8.94 regardless of input magnitude. This is a structural invariant: RMSNorm normalizes RMS to 1.0 per row, so each row’s L2 = √16 = 4.0, and total L2 = √5 × 4.0 ≈ 8.94 for 5 rows of 16 dimensions. The pre-RMSNorm L2 (1.04 here) depends on random initialization.
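This invariant is easy to check directly. Below is a minimal standalone sketch (not the microGPT implementation) that RMS-normalizes a 5×16 matrix of arbitrary values and recovers the same total L2:

```rust
// Per-row RMS normalization: divide each row by its root-mean-square,
// so the normalized row has RMS = 1 and L2 = sqrt(dim).
fn rms_norm(row: &[f64]) -> Vec<f64> {
    let rms = (row.iter().map(|v| v * v).sum::<f64>() / row.len() as f64).sqrt();
    row.iter().map(|v| v / rms).collect()
}

// Total L2 norm over all rows.
fn l2(rows: &[Vec<f64>]) -> f64 {
    rows.iter().flatten().map(|v| v * v).sum::<f64>().sqrt()
}

fn main() {
    // 5 rows of 16 dims with arbitrary magnitudes.
    let x: Vec<Vec<f64>> = (0..5)
        .map(|i| (0..16).map(|j| ((i * 16 + j) as f64).sin() * 3.0).collect())
        .collect();
    let normed: Vec<Vec<f64>> = x.iter().map(|r| rms_norm(r)).collect();
    // Each row's L2 = sqrt(16) = 4.0, total L2 = sqrt(5) * 4.0 ≈ 8.944,
    // independent of the input magnitude.
    println!("L2 after RMSNorm = {:.3}", l2(&normed));
}
```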

Attention is a small perturbation

The attention output norms (L2≈0.3-0.5) are much smaller than the residual stream (L2≈9.0). After the residual connection, L2 barely changes: 8.94 → 9.09. The residual stream carries most of the signal; attention makes incremental adjustments.
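The near-unchanged norm follows from near-orthogonality: for roughly orthogonal vectors, norms add in quadrature, so a small vector barely moves a large one. A toy sketch using the magnitudes from the trace (illustrative 2-D stand-ins, not actual model tensors):

```rust
fn main() {
    // Residual stream with L2 = 9.0 and an orthogonal attention output
    // with L2 = 0.5, mimicking the magnitudes in the trace above.
    let x = vec![9.0, 0.0];
    let attn = vec![0.0, 0.5];
    let sum: Vec<f64> = x.iter().zip(&attn).map(|(a, b)| a + b).collect();
    let l2 = sum.iter().map(|v| v * v).sum::<f64>().sqrt();
    // sqrt(9.0^2 + 0.5^2) ≈ 9.014: the residual stream dominates.
    println!("L2 after residual add = {:.3}", l2);
}
```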

ReLU sparsity

Before ReLU: L2=5.75. After: L2=3.87. ReLU zeros out ~50% of the 64 FFN dimensions (the negative activations), creating a sparse intermediate representation. This is visible in the range changing from [-0.90, 1.15] to [0.00, 1.15].

Untrained predictions

Before training, the model predicts wrong characters:

Predictions:
  pos 0: target='e' predicted='r' P(target)=0.035
  pos 1: target='m' predicted='g' P(target)=0.041
  pos 2: target='m' predicted='j' P(target)=0.037
  pos 3: target='a' predicted='c' P(target)=0.046
  pos 4: target='EOS' predicted='f' P(target)=0.033

The target probability is ~1/27 ≈ 0.037 — essentially random. After 5,000 training steps, these probabilities increase significantly.
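The 1/27 baseline is just softmax over near-zero logits. A quick sketch, assuming a 27-symbol vocabulary (26 letters plus EOS) as in the trace:

```rust
fn main() {
    // An untrained model produces roughly uniform logits; all-zero logits
    // are the extreme case.
    let logits = vec![0.0_f64; 27];
    let exp: Vec<f64> = logits.iter().map(|v| v.exp()).collect();
    let z: f64 = exp.iter().sum();
    // Softmax of uniform logits gives P = 1/27 for every symbol.
    let p_target = exp[0] / z;
    println!("P(target) under uniform logits = {:.3}", p_target); // 0.037
}
```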

Comparison with apr trace

For a production model like Qwen-1.5B:

$ apr trace qwen-1.5b-q4k.apr --verbose

Layer 0:  attn_norm → q_proj → k_proj → v_proj → attn → o_proj → residual
          → ffn_norm → gate_proj → up_proj → SiLU → ffn_down → residual
Layer 1:  ... (same pattern × 28 layers)

microGPT has the same norm → attn → residual → norm → ffn → residual pattern, just with 1 layer instead of 28 and ReLU instead of SiLU/SwiGLU.
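A hypothetical sketch of that shared wiring (stub sublayers, not the real apr or microGPT code): the block body is identical in both models; only the loop depth and the activation inside the FFN change.

```rust
fn rms_norm(x: &[f64]) -> Vec<f64> {
    let rms = (x.iter().map(|v| v * v).sum::<f64>() / x.len() as f64)
        .sqrt()
        .max(1e-6);
    x.iter().map(|v| v / rms).collect()
}

fn add(a: &[f64], b: &[f64]) -> Vec<f64> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

// Stand-ins for the real sublayers: any function of the normed input
// works to illustrate the wiring.
fn attention(x: &[f64]) -> Vec<f64> { x.iter().map(|v| v * 0.05).collect() }
fn ffn(x: &[f64]) -> Vec<f64> { x.iter().map(|v| v.max(0.0) * 0.1).collect() } // ReLU here; SiLU/SwiGLU in Qwen

// norm → attn → residual → norm → ffn → residual
fn block(x: &[f64]) -> Vec<f64> {
    let x = add(x, &attention(&rms_norm(x))); // attn + residual
    add(&x, &ffn(&rms_norm(&x)))              // ffn + residual
}

fn main() {
    let mut x = vec![0.3, -0.7, 1.2, 0.1];
    for _ in 0..28 { // 28 layers for Qwen-1.5B; 1 for microGPT
        x = block(&x);
    }
    println!("final residual stream: {:?}", x);
}
```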