microGPT

A port of Karpathy’s microGPT to Rust using aprender.

What is microGPT?

microGPT is a 4,192-parameter GPT that generates human-like names after training on 32K examples. It demonstrates the algorithmic essence of large language models: tokenization, embeddings, multi-head attention with causal masking, residual connections, RMSNorm, MLPs, and cross-entropy loss with backpropagation.

Everything else is just efficiency.

Why Rust + aprender?

The original Python implementation uses a custom autograd engine operating on scalar Value objects. This Rust port uses aprender-core for SIMD-accelerated tensor operations with automatic differentiation, achieving the same architecture with production-grade numerics.

Quick start

cargo run --release

Trains for 5,000 steps (~2 seconds on CPU), then samples 20 names:

step    0 | loss 3.29
step 2500 | loss 2.56
step 4999 | loss 1.95

Generated names (temperature=0.5):
  karila, maria, misha, mayla, anara

Model

Architecture

Component        Value
─────────────────────────────
Embedding dim    16
Attention heads  4 (head_dim=4)
Layers           1
Context length   16
Vocab            27 (a-z + BOS)
Parameters       4,192

Parameter breakdown

Weight                     Shape        Count
─────────────────────────────────────────────
wte (token embeddings)     [27, 16]       432
wpe (position embeddings)  [16, 16]       256
wq (4 heads)               4 x [16, 4]    256
wk (4 heads)               4 x [16, 4]    256
wv (4 heads)               4 x [16, 4]    256
wo (4 heads)               4 x [4, 16]    256
w_fc1 (MLP up)             [16, 64]     1,024
w_fc2 (MLP down)           [64, 16]     1,024
w_lm (LM head)             [16, 27]       432
─────────────────────────────────────────────
Total                                   4,192

Forward pass

tokens → one_hot → matmul(wte) + matmul(wpe) → RMSNorm
       → Attention(Q,K,V,O) + residual
       → RMSNorm → MLP(fc1 → ReLU → fc2) + residual
       → matmul(w_lm) → logits

Embedding lookup

Embeddings use a one-hot matmul instead of a dedicated Embedding layer. This ensures gradients flow through the standard matmul backward pass:

tok_emb = one_hot(tokens, 27) @ wte   // [n, 16]
pos_emb = one_hot(0..n, 16)  @ wpe   // [n, 16]
x = tok_emb + pos_emb

RMSNorm

Applied per-row with straight-through gradient estimation:

RMSNorm(x)_i = x_i / sqrt(mean_j(x_j^2) + 1e-5)

Attention

Scaled dot-product attention

Each of the 4 heads computes:

Q_h = X @ Wq_h     // [n, 4]
K_h = X @ Wk_h     // [n, 4]
V_h = X @ Wv_h     // [n, 4]

scores = Q_h @ K_h^T / sqrt(4)     // [n, n]
scores = scores + causal_mask       // mask future positions
weights = softmax(scores)           // [n, n]
head_out = weights @ V_h            // [n, 4]

Multi-head accumulation

Instead of concatenating heads and projecting through a single [16, 16] output matrix, each head projects back to the full embedding dimension and the outputs are summed:

attn_out = sum_h (head_out_h @ Wo_h)   // [n, 16]

This is mathematically equivalent to standard concat + output projection:

concat(heads) @ Wo = h0 @ Wo[0:4,:] + h1 @ Wo[4:8,:] + h2 @ Wo[8:12,:] + h3 @ Wo[12:16,:]

The per-head approach avoids tensor slicing operations in the autograd graph, since aprender’s matmul only supports 2D tensors.

Causal masking

The causal mask is lower-triangular:

mask[i][j] = 0    if j <= i  (attend)
mask[i][j] = -1e9 if j > i   (block)

After adding the mask to attention scores, softmax(scores) drives masked positions to effectively zero probability (exp(-1e9) ≈ 0).

Why per-head projections?

aprender’s autograd Tensor::matmul operates on 2D tensors only. The standard approach of computing Q = X @ Wq as [n, 16] then reshaping to [n, 4, 4] requires tensor slicing with gradient tracking, which the autograd doesn’t support. Per-head [16, 4] projections achieve identical parameter count (1,024 total for Q/K/V/O) while keeping all operations in the 2D matmul autograd path.

Training

Dataset

32,033 names from Karpathy’s makemore dataset, character-level tokenized with a 27-token vocabulary (26 lowercase letters + BOS/EOS token at index 0).

Tokenization

"maria" → [0, 13, 1, 18, 9, 1, 0]
           BOS m   a  r   i  a  BOS

The BOS token doubles as EOS. The tokenizer drops non-lowercase characters silently, and the roundtrip property holds for all [a-z]* strings:

decode(tokenize(name)[1..-1]) == name

Loss function

Cross-entropy loss from aprender-core, with teacher forcing:

loss = CrossEntropyLoss(logits[0..n], targets[1..n+1])

The initial loss is ~3.3 (random baseline for 27 classes: -ln(1/27) ≈ 3.30). After 5,000 steps, loss converges to ~2.0.
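The random-baseline figure is easy to verify:

```rust
fn main() {
    // Uniform guessing over 27 classes gives cross-entropy -ln(1/27) = ln(27).
    let baseline = (27.0_f32).ln();
    assert!((baseline - 3.296).abs() < 1e-3);
    // The observed step-0 loss (~3.29) matches this baseline.
}
```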

Optimizer

Manual Adam (Kingma & Ba, 2015) operating directly on autograd tensors:

Hyperparameter  Value
────────────────────────────────────────
Learning rate   0.01 (linear decay to 0)
beta1           0.85
beta2           0.99
epsilon         1e-8
Steps           5,000

The optimizer is manual (not aprender::nn::Adam) because aprender’s Linear::forward uses a cached weight transpose whose TensorId differs from the original weight — the built-in Adam looks up gradients by ID and misses the update. Raw weight tensors with direct matmul ensure every parameter receives its gradient.

Sampling

Autoregressive generation with temperature scaling:

for pos in 0..BLOCK_SIZE:
    logits = model.forward(tokens)
    probs  = softmax(logits[last] / temperature)
    next   = weighted_sample(probs)
    if next == BOS: break
    tokens.push(next)

Temperature 0.5 produces conservative, name-like outputs.

apr inspect

The apr inspect command reveals model structure, metadata, and tensor statistics. For microGPT, the equivalent is cargo run --example inspect_model.

Model metadata

$ cargo run --example inspect_model

microGPT Model Inspection
══════════════════════════════════════════════════

Architecture:
  Family:          microGPT (character-level, 1-layer GPT)
  Kernel Class:    Custom (MHA + RMSNorm + ReLU)
  Hidden Size:     16
  Attention Heads: 4 (head_dim=4)
  FFN Size:        64
  Vocab Size:      27
  Context Length:  16
  Parameters:      4192

Compare with a production model:

$ apr inspect qwen-1.5b-q4k.apr

  Architecture:
    Family: qwen2
    Hidden Size: 1536
    Intermediate Size: 8960
    Vocab Size: 151936
    Max Position: 32768
    RoPE Theta: 1000000

microGPT is ~360,000x smaller than Qwen-1.5B (1.5B / 4,192 ≈ 358K) but uses the same fundamental operations.

Tensor inventory

Equivalent to apr tensors <file> --stats:

Tensors (cf. `apr tensors`):
  Name                                  Shape   Numel  Stats
  ──────────────────────────────────────────────────────────────
  wte (token embeddings)         [27, 16]     432  μ=+0.010  σ=0.083
  wpe (position embeddings)      [16, 16]     256  μ=-0.010  σ=0.079
  wq[0] (query proj head 0)      [16,  4]      64  μ=-0.010  σ=0.081
  ...
  w_fc1 (MLP up projection)      [16, 64]    1024  μ=+0.000  σ=0.082
  w_fc2 (MLP down projection)    [64, 16]    1024  μ=+0.003  σ=0.079
  w_lm (LM head)                 [16, 27]     432  μ=+0.003  σ=0.078

All weights are initialized from N(0, 0.08²), matching Karpathy’s original Python implementation.

Kernel pipeline

Equivalent to apr explain --kernel:

┌─────────────────────────┬─────────────────────────────────┬─────────────────────────┐
│ Operation               │ Implementation                  │ Contract                │
├─────────────────────────┼─────────────────────────────────┼─────────────────────────┤
│ Embedding Lookup        │ one_hot @ wte (differentiable)  │ microgpt-v1 § one_hot   │
│ RMSNorm                 │ x * (1/rms) per row             │ microgpt-v1 § rms_norm  │
│ Attention (4-head MHA)  │ per-head Q/K/V/O matmul         │ attention-kernel-v1     │
│ Causal Mask             │ lower-tri 0, upper -1e9         │ microgpt-v1 § causal    │
│ Softmax                 │ aprender autograd softmax       │ softmax-kernel-v1       │
│ MLP (ReLU)              │ fc1 → ReLU → fc2                │ activation-kernel-v1    │
│ LM Head                 │ matmul to vocab logits          │ matmul-kernel-v1        │
└─────────────────────────┴─────────────────────────────────┴─────────────────────────┘

Every operation maps to a provable contract. The kernel pipeline for a production LLaMA model (via apr explain --kernel llama) shows the same structure at scale:

$ apr explain --kernel llama --proof-status

Kernel Pipeline (7 ops):
  MatVec (Q4K)    → matvec-kernel-v1         ◉ Tested
  Softmax         → softmax-kernel-v1        ◉ Tested
  Attention (GQA) → element-wise-ops-v1      ◉ Tested
  Normalization   → normalization-kernel-v1  ◉ Tested
  Activation      → element-wise-ops-v1      ◉ Tested
  MLP             → element-wise-ops-v1      ◉ Tested
  Position (RoPE) → rope-kernel-v1           ◉ Tested

Tensor roles

Equivalent to apr explain --tensor <name>:

Tensor  Role
─────────────────────────────────────────────────────────
wte     Token embedding — maps token IDs to dense vectors
wpe     Position embedding — adds positional information
wq[h]   Query projection in multi-head attention
wk[h]   Key projection in multi-head attention
wv[h]   Value projection in multi-head attention
wo[h]   Output projection — combines head outputs
w_fc1   Feed-forward up projection (expand to 4x)
w_fc2   Feed-forward down projection (compress to 1x)
w_lm    Language model head — projects to vocabulary logits

apr explain

The apr explain command explains architecture, tensor roles, and kernel dispatch. For microGPT, the equivalent is cargo run --example explain_attention.

Attention mechanics

$ cargo run --example explain_attention

Architecture (cf. `apr explain --kernel gpt2 -v`):
  Attention type:     MHA (multi-head attention)
  Heads:              4
  Head dim:           4
  Total attn dim:     16 (4 × 4)
  Scale factor:       1/√4 = 0.5000

The attention mechanism computes:

Attention(Q, K, V) = softmax(Q @ K^T / √d_k) @ V

This is the same equation used in GPT-2, LLaMA, Qwen, and every other transformer. The only difference is scale: microGPT uses d_k=4, production models use d_k=128.

Attention weight visualization

For input “mar” (tokens [BOS, m, a, r]), the attention weights at random initialization show the causal structure (values vary per run, but the structural invariants are constant):

Head 0:
  pos 0 → [1.000, 0.000, 0.000, 0.000]   ← BOS attends only to itself
  pos 1 → [0.482, 0.518, 0.000, 0.000]   ← 'm' splits between BOS and self
  pos 2 → [0.302, 0.383, 0.315, 0.000]   ← 'a' attends to all prior
  pos 3 → [0.254, 0.247, 0.215, 0.284]   ← 'r' attends broadly

Head 3:
  pos 0 → [1.000, 0.000, 0.000, 0.000]
  pos 1 → [0.494, 0.506, 0.000, 0.000]
  pos 2 → [0.286, 0.339, 0.376, 0.000]   ← head 3 focuses more on 'a'
  pos 3 → [0.247, 0.223, 0.250, 0.280]   ← head 3 focuses more on 'r'

Structural invariants (hold for ANY initialization):

  • Row 0 is always [1, 0, 0, 0] — the BOS token can only see itself
  • Upper triangle is zero — causal masking prevents attending to the future
  • Each row sums to 1.0 — softmax normalizes attention weights

The specific non-trivial values (e.g., 0.482 vs 0.518) reflect random initialization, not learned patterns. After training, head specialization emerges.

Causal mask explained

Equivalent to apr explain --tensor attn_mask:

Causal Mask:
  Each position can only attend to itself and earlier positions.
  Future positions are masked with -1e9 → softmax drives them to ~0.

    pos 0: [  0  ,  -∞ ,  -∞ ,  -∞ ]
    pos 1: [  0  ,  0  ,  -∞ ,  -∞ ]
    pos 2: [  0  ,  0  ,  0  ,  -∞ ]
    pos 3: [  0  ,  0  ,  0  ,  0  ]

The mask uses -1e9 instead of true -∞ to satisfy the upstream softmax precondition contract (x.iter().all(|v| v.is_finite())). Since exp(-1e9) ≈ 0 in f32, this is numerically identical.

Comparison with production models

$ apr explain --kernel llama -v

Constraints (from family contract):
  attention_type:      gqa          ← microGPT uses mha (simpler)
  activation:          silu         ← microGPT uses relu (simpler)
  norm_type:           rmsnorm      ← same as microGPT ✓
  mlp_type:            swiglu       ← microGPT uses relu_mlp (simpler)
  positional_encoding: rope         ← microGPT uses learned (simpler)
  has_bias:            false        ← same as microGPT ✓

microGPT uses the simpler variants of each component. The contract structure is identical — only the kernel implementations differ.

apr trace

The apr trace command traces data through each layer of the model, showing tensor shapes and statistics at every stage. For microGPT, the equivalent is cargo run --example trace_forward.

Full forward pass trace

$ cargo run --example trace_forward

Input: "emma" → tokens [0, 5, 13, 13, 1]

Stage 1: Embedding
  tok_emb = one_hot(tokens) @ wte          shape=[5, 16]  L2=0.72
  pos_emb = one_hot(pos) @ wpe             shape=[5, 16]  L2=0.74
  x = tok_emb + pos_emb                    shape=[5, 16]  L2=1.04
  x = RMSNorm(x)                           shape=[5, 16]  L2=8.94

Stage 2: Multi-Head Self-Attention
  x_n = RMSNorm(x) [pre-attn]              shape=[5, 16]  L2=8.94
  head_0_out @ wo[0]                       shape=[5, 16]  L2=0.50
  head_1_out @ wo[1]                       shape=[5, 16]  L2=0.46
  head_2_out @ wo[2]                       shape=[5, 16]  L2=0.31
  head_3_out @ wo[3]                       shape=[5, 16]  L2=0.38
  x = x_res + attn_out [residual]          shape=[5, 16]  L2=9.09

Stage 3: MLP (FFN)
  x_n = RMSNorm(x) [pre-MLP]               shape=[5, 16]  L2=8.94
  h = x_n @ w_fc1 [up project]             shape=[5, 64]  L2=5.75
  h = ReLU(h)                              shape=[5, 64]  L2=3.87  ← ~50% zeroed
  mlp_out = h @ w_fc2 [down project]       shape=[5, 16]  L2=1.17
  x = x_res + mlp_out [residual]           shape=[5, 16]  L2=9.10

Stage 4: Language Model Head
  logits = x @ w_lm                        shape=[5, 27]  L2=3.29

What the trace reveals

RMSNorm normalization

After RMSNorm, L2≈8.94 regardless of input magnitude. This is a structural invariant: RMSNorm normalizes RMS to 1.0 per row, so each row’s L2 = √16 = 4.0, and total L2 = √5 × 4.0 ≈ 8.94 for 5 rows of 16 dimensions. The pre-RMSNorm L2 (1.04 here) depends on random initialization.

Attention is a small perturbation

The attention output norms (L2≈0.3-0.5) are much smaller than the residual stream (L2≈9.0). After the residual connection, L2 barely changes: 9.0 → 9.09. The residual stream carries most of the signal; attention makes incremental adjustments.

ReLU sparsity

Before ReLU: L2=5.75. After: L2=3.87. ReLU zeros out ~50% of the 64 FFN dimensions (the negative activations), creating a sparse intermediate representation. This is visible in the range changing from [-0.90, 1.15] to [0.00, 1.15].

Untrained predictions

Before training, the model predicts wrong characters:

Predictions:
  pos 0: target='e' predicted='r' P(target)=0.035
  pos 1: target='m' predicted='g' P(target)=0.041
  pos 2: target='m' predicted='j' P(target)=0.037
  pos 3: target='a' predicted='c' P(target)=0.046
  pos 4: target='EOS' predicted='f' P(target)=0.033

The target probability is ~1/27 ≈ 0.037 — essentially random. After 5,000 training steps, these probabilities increase significantly.

Comparison with apr trace

For a production model like Qwen-1.5B:

$ apr trace qwen-1.5b-q4k.apr --verbose

Layer 0:  attn_norm → q_proj → k_proj → v_proj → attn → o_proj → residual
          → ffn_norm → gate_proj → up_proj → SiLU → ffn_down → residual
Layer 1:  ... (same pattern × 28 layers)

microGPT has the same norm → attn → residual → norm → ffn → residual pattern, just with 1 layer instead of 28 and ReLU instead of SiLU/SwiGLU.

Verification Ladder

microGPT’s contract (microgpt-v1.yaml) is verified at multiple levels:

Level  Method                                                    Status   Count
───────────────────────────────────────────────────────────────────────────────
L2     Property tests (tests/contract_traits.rs)                 7/7          7
L3     Bounded model checking (decidable in Lean native_decide)  7/7          7
L4     Lean 4 proofs (lean/MicroGPT.lean, 0 sorry)               7/7          7
L5     Lean proved + fully bound                                 pending      0

What each level means

  • L2 (Property tested): Every proof obligation has a corresponding Rust test in tests/contract_traits.rs that asserts the invariant at runtime. These tests run on every cargo test.

  • L3 (Bounded model checked): The obligations are concrete arithmetic or structural properties decidable by native_decide in Lean 4.

  • L4 (Lean proved): lean/MicroGPT.lean contains 6 theorems with 0 sorry stubs, covering param count, Adam monotonicity, second-moment non-negativity, and tokenizer structure.

  • L5 (Fully bound): Requires build.rs pipeline integration to pass binding status to compute_proof_level. Not applicable for standalone educational repos without the full provable-contracts build pipeline.
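As a sketch of the L4 style, a PARAM-COUNT-001-like statement is pure decidable arithmetic (a hypothetical restatement; the real theorems live in lean/MicroGPT.lean):

```lean
-- Hypothetical restatement; the real theorems live in lean/MicroGPT.lean.
def paramCount : Nat :=
  27*16 + 16*16            -- wte + wpe
  + 4*(16*4) + 4*(16*4)    -- wq + wk
  + 4*(16*4) + 4*(4*16)    -- wv + wo
  + 16*64 + 64*16          -- w_fc1 + w_fc2
  + 16*27                  -- w_lm

-- PARAM-COUNT-001: the model has exactly 4,192 parameters.
theorem param_count_eq : paramCount = 4192 := by native_decide
```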

README under contract

The README itself is verified by tests/readme_contract.rs (15 tests):

  • Heading hierarchy (no skips, exactly one H1)
  • Required sections (Architecture, Install, Usage, License)
  • Every table claim matches a source constant
  • Badge URLs are well-formed
  • Table column parity
  • No XSS links

Book under contract

This mdBook is verified by tests/book_contract.rs:

  • SUMMARY.md heading hierarchy
  • All chapter files referenced in SUMMARY.md exist
  • No broken internal links
  • Consistent heading structure per chapter

Contract Reference

microgpt-v1.yaml

The canonical contract for microGPT. All equations, proof obligations, and falsification tests are defined in contracts/microgpt-v1.yaml.

Equations

ID               Formula                                Tested  Proved
──────────────────────────────────────────────────────────────────────
one_hot          one_hot(i,C)[i][j] = 1 if j==i else 0  L2      L4
rms_norm         x / sqrt(mean(x^2) + eps)              L2      -
causal_mask      0 if j<=i else -1e9                    L2      L4
tokenize_decode  decode(tokenize(s)) == s               L2      L4
weighted_sample  P(return i) = probs[i]                 L2      -
adam_step        Kingma & Ba (2015) update rule         L2      L4
forward_pass     LMHead(MLP(Attn(Embed(tokens))))       L2      L4

Proof obligations

ID                      Description                         Level
─────────────────────────────────────────────────────────────────
PARAM-COUNT-001         Model has exactly 4,192 parameters  L4
ONEHOT-ROW-001          Each one-hot row sums to 1.0        L2
MASK-CAUSAL-001         Lower-triangular structure          L2
TOKENIZE-ROUNDTRIP-001  Encode/decode roundtrip identity    L2
ADAM-MONOTONIC-001      Step counter advances by 1          L4
ADAM-V-NONNEG-001       Second moment >= 0                  L4
FORWARD-SHAPE-001       Output shape [n, 27]                L2

Bindings

8 bindings in contracts/binding.yaml, all status: implemented.

Equation         Function                Signature
───────────────────────────────────────────────────────────────────────────
one_hot          one_hot                 fn(indices, num_classes) -> Tensor
rms_norm         rms_norm                fn(x: &Tensor) -> Tensor
causal_mask      causal_mask             fn(size) -> Tensor
tokenize_decode  tokenize                fn(name) -> Vec<usize>
tokenize_decode  decode                  fn(tokens) -> String
weighted_sample  weighted_sample_with_r  fn(probs, r) -> usize
adam_step        Adam::step              fn(&mut self, params, lr)
forward_pass     MicroGPT::forward       fn(&self, tokens) -> Tensor