apr trace

The apr trace command traces activations through each layer of the model, printing tensor shapes and summary statistics at every stage. For microGPT, the equivalent is cargo run --example trace_forward.

Full forward pass trace

$ cargo run --example trace_forward

Input: "emma" → tokens [0, 5, 13, 13, 1]

Stage 1: Embedding
  tok_emb = one_hot(tokens) @ wte          shape=[5, 16]  L2=0.72
  pos_emb = one_hot(pos) @ wpe             shape=[5, 16]  L2=0.74
  x = tok_emb + pos_emb                    shape=[5, 16]  L2=1.04
  x = RMSNorm(x)                           shape=[5, 16]  L2=8.94

Stage 2: Multi-Head Self-Attention
  x_n = RMSNorm(x) [pre-attn]              shape=[5, 16]  L2=8.94
  head_0_out @ wo[0]                       shape=[5, 16]  L2=0.50
  head_1_out @ wo[1]                       shape=[5, 16]  L2=0.46
  head_2_out @ wo[2]                       shape=[5, 16]  L2=0.31
  head_3_out @ wo[3]                       shape=[5, 16]  L2=0.38
  x = x_res + attn_out [residual]          shape=[5, 16]  L2=9.09

Stage 3: MLP (FFN)
  x_n = RMSNorm(x) [pre-MLP]               shape=[5, 16]  L2=8.94
  h = x_n @ w_fc1 [up project]             shape=[5, 64]  L2=5.75
  h = ReLU(h)                              shape=[5, 64]  L2=3.87  ← ~50% zeroed
  mlp_out = h @ w_fc2 [down project]       shape=[5, 16]  L2=1.17
  x = x_res + mlp_out [residual]           shape=[5, 16]  L2=9.10

Stage 4: Language Model Head
  logits = x @ w_lm                        shape=[5, 27]  L2=3.29

What the trace reveals

RMSNorm normalization

After RMSNorm, L2≈8.94 regardless of input magnitude. This is a structural invariant: RMSNorm normalizes RMS to 1.0 per row, so each row’s L2 = √16 = 4.0, and total L2 = √5 × 4.0 ≈ 8.94 for 5 rows of 16 dimensions. The pre-RMSNorm L2 (1.04 here) depends on random initialization.
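This invariant is easy to check directly. Below is a minimal standalone sketch (not the microGPT implementation) that RMS-normalizes a 5×16 matrix of arbitrary values and recovers the same total L2:

```rust
// Per-row RMS normalization: divide each row by its root-mean-square,
// so the normalized row has RMS = 1 and L2 = sqrt(dim).
fn rms_norm(row: &[f64]) -> Vec<f64> {
    let rms = (row.iter().map(|v| v * v).sum::<f64>() / row.len() as f64).sqrt();
    row.iter().map(|v| v / rms).collect()
}

// Total L2 norm over all rows.
fn l2(rows: &[Vec<f64>]) -> f64 {
    rows.iter().flatten().map(|v| v * v).sum::<f64>().sqrt()
}

fn main() {
    // 5 rows of 16 dims with arbitrary magnitudes.
    let x: Vec<Vec<f64>> = (0..5)
        .map(|i| (0..16).map(|j| ((i * 16 + j) as f64).sin() * 3.0).collect())
        .collect();
    let normed: Vec<Vec<f64>> = x.iter().map(|r| rms_norm(r)).collect();
    // Each row's L2 = sqrt(16) = 4.0, total L2 = sqrt(5) * 4.0 ≈ 8.944,
    // independent of the input magnitude.
    println!("L2 after RMSNorm = {:.3}", l2(&normed));
}
```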

Attention is a small perturbation

The attention output norms (L2≈0.3-0.5) are much smaller than the residual stream (L2≈9.0). After the residual connection, L2 barely changes: 8.94 → 9.09. The residual stream carries most of the signal; attention makes incremental adjustments.
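The near-unchanged norm follows from near-orthogonality: for roughly orthogonal vectors, norms add in quadrature, so a small vector barely moves a large one. A toy sketch using the magnitudes from the trace (illustrative 2-D stand-ins, not actual model tensors):

```rust
fn main() {
    // Residual stream with L2 = 9.0 and an orthogonal attention output
    // with L2 = 0.5, mimicking the magnitudes in the trace above.
    let x = vec![9.0, 0.0];
    let attn = vec![0.0, 0.5];
    let sum: Vec<f64> = x.iter().zip(&attn).map(|(a, b)| a + b).collect();
    let l2 = sum.iter().map(|v| v * v).sum::<f64>().sqrt();
    // sqrt(9.0^2 + 0.5^2) ≈ 9.014: the residual stream dominates.
    println!("L2 after residual add = {:.3}", l2);
}
```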

ReLU sparsity

Before ReLU: L2=5.75. After: L2=3.87. ReLU zeros out ~50% of the 64 FFN dimensions (the negative activations), creating a sparse intermediate representation. This is visible in the range changing from [-0.90, 1.15] to [0.00, 1.15].

Untrained predictions

Before training, the model predicts wrong characters:

Predictions:
  pos 0: target='e' predicted='r' P(target)=0.035
  pos 1: target='m' predicted='g' P(target)=0.041
  pos 2: target='m' predicted='j' P(target)=0.037
  pos 3: target='a' predicted='c' P(target)=0.046
  pos 4: target='EOS' predicted='f' P(target)=0.033

The target probability is ~1/27 ≈ 0.037 — essentially random. After 5,000 training steps, these probabilities increase significantly.
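The 1/27 baseline is just softmax over near-zero logits. A quick sketch, assuming a 27-symbol vocabulary (26 letters plus EOS) as in the trace:

```rust
fn main() {
    // An untrained model produces roughly uniform logits; all-zero logits
    // are the extreme case.
    let logits = vec![0.0_f64; 27];
    let exp: Vec<f64> = logits.iter().map(|v| v.exp()).collect();
    let z: f64 = exp.iter().sum();
    // Softmax of uniform logits gives P = 1/27 for every symbol.
    let p_target = exp[0] / z;
    println!("P(target) under uniform logits = {:.3}", p_target); // 0.037
}
```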

Comparison with apr trace

For a production model like Qwen-1.5B:

$ apr trace qwen-1.5b-q4k.apr --verbose

Layer 0:  attn_norm → q_proj → k_proj → v_proj → attn → o_proj → residual
          → ffn_norm → gate_proj → up_proj → SiLU → ffn_down → residual
Layer 1:  ... (same pattern × 28 layers)

microGPT has the same norm → attn → residual → norm → ffn → residual pattern, just with 1 layer instead of 28 and ReLU instead of SiLU/SwiGLU.
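A hypothetical sketch of that shared wiring (stub sublayers, not the real apr or microGPT code): the block body is identical in both models; only the loop depth and the activation inside the FFN change.

```rust
fn rms_norm(x: &[f64]) -> Vec<f64> {
    let rms = (x.iter().map(|v| v * v).sum::<f64>() / x.len() as f64)
        .sqrt()
        .max(1e-6);
    x.iter().map(|v| v / rms).collect()
}

fn add(a: &[f64], b: &[f64]) -> Vec<f64> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

// Stand-ins for the real sublayers: any function of the normed input
// works to illustrate the wiring.
fn attention(x: &[f64]) -> Vec<f64> { x.iter().map(|v| v * 0.05).collect() }
fn ffn(x: &[f64]) -> Vec<f64> { x.iter().map(|v| v.max(0.0) * 0.1).collect() } // ReLU here; SiLU/SwiGLU in Qwen

// norm → attn → residual → norm → ffn → residual
fn block(x: &[f64]) -> Vec<f64> {
    let x = add(x, &attention(&rms_norm(x))); // attn + residual
    add(&x, &ffn(&rms_norm(&x)))              // ffn + residual
}

fn main() {
    let mut x = vec![0.3, -0.7, 1.2, 0.1];
    for _ in 0..28 { // 28 layers for Qwen-1.5B; 1 for microGPT
        x = block(&x);
    }
    println!("final residual stream: {:?}", x);
}
```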