# microGPT

A port of Karpathy's microGPT to Rust using aprender.

## What is microGPT?

microGPT is a 4,192-parameter GPT that generates human-like names after training on 32K examples. It demonstrates the algorithmic essence of large language models: tokenization, embeddings, multi-head attention with causal masking, residual connections, RMSNorm, MLPs, and cross-entropy loss with backpropagation.

Everything else is just efficiency.
## Why Rust + aprender?

The original Python implementation uses a custom autograd engine operating on scalar `Value` objects. This Rust port uses `aprender-core` for SIMD-accelerated tensor operations with automatic differentiation, achieving the same architecture with production-grade numerics.
## Quick start

```sh
cargo run --release
```

Trains for 5,000 steps (~2 seconds on CPU), then samples 20 names:

```text
step 0    | loss 3.29
step 2500 | loss 2.56
step 4999 | loss 1.95

Generated names (temperature=0.5):
karila, maria, misha, mayla, anara
```
## Model

### Architecture
| Component | Value |
|---|---|
| Embedding dim | 16 |
| Attention heads | 4 (head_dim=4) |
| Layers | 1 |
| Context length | 16 |
| Vocab | 27 (a-z + BOS) |
| Parameters | 4,192 |
### Parameter breakdown
| Weight | Shape | Count |
|---|---|---|
| wte (token embeddings) | [27, 16] | 432 |
| wpe (position embeddings) | [16, 16] | 256 |
| wq (4 heads) | 4 x [16, 4] | 256 |
| wk (4 heads) | 4 x [16, 4] | 256 |
| wv (4 heads) | 4 x [16, 4] | 256 |
| wo (4 heads) | 4 x [4, 16] | 256 |
| w_fc1 (MLP up) | [16, 64] | 1,024 |
| w_fc2 (MLP down) | [64, 16] | 1,024 |
| w_lm (LM head) | [16, 27] | 432 |
| **Total** | | **4,192** |
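As a sanity check, the total above can be recomputed from the architecture constants. This is an illustrative plain-Rust sketch, not code from the crate:

```rust
// Recompute the parameter count from the Architecture table:
// vocab=27, dim=16, heads=4, head_dim=4, context=16, ffn=64.
fn param_count() -> usize {
    let (vocab, dim, heads, head_dim, ctx, ffn) = (27, 16, 4, 4, 16, 64);
    let wte = vocab * dim;                // token embeddings: 432
    let wpe = ctx * dim;                  // position embeddings: 256
    let qkv = 3 * heads * dim * head_dim; // wq + wk + wv: 768
    let wo = heads * head_dim * dim;      // output projections: 256
    let mlp = dim * ffn + ffn * dim;      // w_fc1 + w_fc2: 2,048
    let lm = dim * vocab;                 // LM head: 432
    wte + wpe + qkv + wo + mlp + lm       // total: 4,192
}
```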
### Forward pass

```text
tokens → one_hot → matmul(wte) + matmul(wpe) → RMSNorm
       → Attention(Q,K,V,O) + residual
       → RMSNorm → MLP(fc1 → ReLU → fc2) + residual
       → matmul(w_lm) → logits
```
### Embedding lookup

Embeddings use a one-hot matmul instead of a dedicated Embedding layer. This ensures gradients flow through the standard matmul backward pass:

```text
tok_emb = one_hot(tokens, 27) @ wte   // [n, 16]
pos_emb = one_hot(0..n, 16) @ wpe     // [n, 16]
x = tok_emb + pos_emb
```
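The trick works because multiplying a one-hot row by a matrix selects exactly one row of that matrix. A minimal plain-Rust sketch (no aprender, `Vec`-based matrices for illustration only):

```rust
// one_hot(i, c) @ w selects row i of w, so the "lookup" goes through an
// ordinary matmul instead of a special gather operation.
fn one_hot(indices: &[usize], num_classes: usize) -> Vec<Vec<f32>> {
    indices
        .iter()
        .map(|&i| {
            let mut row = vec![0.0; num_classes];
            row[i] = 1.0;
            row
        })
        .collect()
}

// Naive dense matmul: a is [n, k], b is [k, m], result is [n, m].
fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    a.iter()
        .map(|row| {
            (0..b[0].len())
                .map(|j| row.iter().zip(b).map(|(x, brow)| x * brow[j]).sum::<f32>())
                .collect()
        })
        .collect()
}
```

For example, `matmul(&one_hot(&[2], 3), &w)` returns exactly row 2 of `w`.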
### RMSNorm

Applied per-row with straight-through gradient estimation:

```text
RMSNorm(x)_i = x_i / sqrt(mean_j(x_j^2) + 1e-5)
```
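A minimal per-row sketch in plain Rust (not the crate's tensor implementation):

```rust
// Divide each element of a row by the row's root-mean-square, plus eps
// for numerical stability.
fn rms_norm(row: &[f32]) -> Vec<f32> {
    let eps = 1e-5;
    let mean_sq: f32 = row.iter().map(|x| x * x).sum::<f32>() / row.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    row.iter().map(|x| x * inv_rms).collect()
}
```

After normalization each row has RMS ≈ 1, so a 16-dimensional row has L2 ≈ √16 = 4 — the invariant the trace section below relies on.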
### Attention

#### Scaled dot-product attention

Each of the 4 heads computes:

```text
Q_h = X @ Wq_h                   // [n, 4]
K_h = X @ Wk_h                   // [n, 4]
V_h = X @ Wv_h                   // [n, 4]
scores = Q_h @ K_h^T / sqrt(4)   // [n, n]
scores = scores + causal_mask    // mask future positions
weights = softmax(scores)        // [n, n]
head_out = weights @ V_h         // [n, 4]
```
#### Multi-head accumulation

Instead of concatenating heads and projecting through a single [16, 16] output matrix, each head projects back to the full embedding dimension and the outputs are summed:

```text
attn_out = sum_h (head_out_h @ Wo_h)   // [n, 16]
```

This is mathematically equivalent to the standard concat + output projection:

```text
concat(heads) @ Wo = h0 @ Wo[0:4,:] + h1 @ Wo[4:8,:] + h2 @ Wo[8:12,:] + h3 @ Wo[12:16,:]
```

The per-head approach avoids tensor slicing operations in the autograd graph, since aprender's matmul only supports 2D tensors.
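The equivalence is easy to verify numerically. A plain-Rust sketch with tiny made-up dimensions (2 heads, head_dim 2, model dim 4, one row; all values illustrative):

```rust
// Row-vector times matrix: m is [rows, cols], v has length rows.
fn matvec(m: &[Vec<f32>], v: &[f32]) -> Vec<f32> {
    (0..m[0].len())
        .map(|j| v.iter().zip(m).map(|(x, row)| x * row[j]).sum::<f32>())
        .collect()
}

// Returns (per-head sum, concat-then-project) for comparison.
fn heads_equivalence() -> (Vec<f32>, Vec<f32>) {
    // Head outputs h0, h1 (each [2]) and per-head Wo_h (each [2, 4]).
    let h0 = vec![1.0, 2.0];
    let h1 = vec![3.0, -1.0];
    let wo0 = vec![vec![1.0, 0.0, 2.0, 1.0], vec![0.5, 1.0, 0.0, -1.0]];
    let wo1 = vec![vec![-1.0, 2.0, 1.0, 0.0], vec![2.0, 0.0, 1.0, 1.0]];

    // Per-head accumulation: sum_h (h_h @ Wo_h).
    let summed: Vec<f32> = matvec(&wo0, &h0)
        .iter()
        .zip(matvec(&wo1, &h1))
        .map(|(a, b)| a + b)
        .collect();

    // Standard form: concat(h0, h1) @ vstack(Wo_0, Wo_1).
    let concat = [h0, h1].concat();
    let stacked: Vec<Vec<f32>> = wo0.into_iter().chain(wo1).collect();
    let projected = matvec(&stacked, &concat);
    (summed, projected)
}
```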
#### Causal masking

The causal mask is lower-triangular:

```text
mask[i][j] = 0     if j <= i   (attend)
mask[i][j] = -1e9  if j > i    (block)
```

After adding the mask to the attention scores, softmax drives masked positions to effectively zero probability (exp(-1e9) ≈ 0).
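The mask-then-softmax step can be sketched in plain Rust (illustrative, not aprender's kernels):

```rust
// Lower-triangular causal mask: 0 where attention is allowed, -1e9 where
// the future must be blocked.
fn causal_mask(n: usize) -> Vec<Vec<f32>> {
    (0..n)
        .map(|i| (0..n).map(|j| if j <= i { 0.0 } else { -1e9 }).collect())
        .collect()
}

// Numerically stable softmax over one row of scores.
fn softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().cloned().fold(f32::MIN, f32::max);
    let exps: Vec<f32> = row.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Applying row 0 of a 4x4 mask to any scores and softmaxing yields [1, 0, 0, 0]: the first position can only attend to itself.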
#### Why per-head projections?

aprender's autograd `Tensor::matmul` operates on 2D tensors only. The standard approach of computing Q = X @ Wq as [n, 16] and then reshaping to [n, 4, 4] requires tensor slicing with gradient tracking, which the autograd doesn't support. Per-head [16, 4] projections give an identical parameter count (1,024 total for Q/K/V/O) while keeping all operations on the 2D matmul autograd path.
## Training

### Dataset

32,033 names from Karpathy's makemore dataset, character-level tokenized with a 27-token vocabulary (26 lowercase letters + a BOS/EOS token at index 0).

### Tokenization

```text
"maria" → [0,   13, 1, 18, 9, 1, 0]
           BOS  m   a  r   i  a  BOS
```

The BOS token doubles as EOS. The tokenizer silently drops non-lowercase characters, and the roundtrip property holds for all [a-z]* strings:

```text
decode(tokenize(name)[1..-1]) == name
```
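A plain-Rust sketch of the tokenizer, assuming the layout described above (index 0 is BOS/EOS, 'a'..='z' map to 1..=26); the real implementation lives in the crate:

```rust
// Character-level tokenizer: BOS, then 1-indexed letters, then EOS.
fn tokenize(name: &str) -> Vec<usize> {
    let mut toks = vec![0]; // leading BOS
    toks.extend(
        name.chars()
            .filter(|c| c.is_ascii_lowercase()) // drop everything else
            .map(|c| c as usize - 'a' as usize + 1),
    );
    toks.push(0); // trailing EOS (same token as BOS)
    toks
}

// Inverse map: skip BOS/EOS markers, shift letters back down.
fn decode(tokens: &[usize]) -> String {
    tokens
        .iter()
        .filter(|&&t| t != 0)
        .map(|&t| (b'a' + (t as u8 - 1)) as char)
        .collect()
}
```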
### Loss function

Cross-entropy loss from aprender-core, with teacher forcing:

```text
loss = CrossEntropyLoss(logits[0..n], targets[1..n+1])
```

The initial loss is ~3.3 (the random baseline for 27 classes: -ln(1/27) ≈ 3.30). After 5,000 steps, the loss converges to ~2.0.
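The random baseline follows directly from the definition of cross-entropy: with 27 equally likely classes, the expected loss is -ln(1/27) = ln(27). A one-line check:

```rust
// Cross-entropy of a uniform predictor over `num_classes` classes.
fn random_baseline(num_classes: f64) -> f64 {
    -(1.0 / num_classes).ln() // equals num_classes.ln()
}
```

`random_baseline(27.0)` ≈ 3.296, matching the observed step-0 loss of 3.29.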
### Optimizer
Manual Adam (Kingma & Ba, 2015) operating directly on autograd tensors:
| Hyperparameter | Value |
|---|---|
| Learning rate | 0.01 (linear decay to 0) |
| beta1 | 0.85 |
| beta2 | 0.99 |
| epsilon | 1e-8 |
| Steps | 5,000 |
The optimizer is manual (not `aprender::nn::Adam`) because aprender's `Linear::forward` uses a cached weight transpose whose `TensorId` differs from the original weight's — the built-in Adam looks up gradients by ID and misses the update. Raw weight tensors with direct matmul ensure every parameter receives its gradient.
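For reference, one manual Adam update (Kingma & Ba, 2015) on plain slices, using the hyperparameters from the table above. This is a sketch of the algorithm, not the repo's `Adam::step` signature:

```rust
// Adam state: first/second moment estimates plus a step counter.
struct Adam {
    m: Vec<f32>,
    v: Vec<f32>,
    t: u32,
}

impl Adam {
    fn new(n: usize) -> Self {
        Adam { m: vec![0.0; n], v: vec![0.0; n], t: 0 }
    }

    fn step(&mut self, params: &mut [f32], grads: &[f32], lr: f32) {
        let (b1, b2, eps) = (0.85f32, 0.99f32, 1e-8f32);
        self.t += 1;
        for i in 0..params.len() {
            // Exponential moving averages of gradient and squared gradient.
            self.m[i] = b1 * self.m[i] + (1.0 - b1) * grads[i];
            self.v[i] = b2 * self.v[i] + (1.0 - b2) * grads[i] * grads[i];
            // Bias correction for the zero-initialized moments.
            let m_hat = self.m[i] / (1.0 - b1.powi(self.t as i32));
            let v_hat = self.v[i] / (1.0 - b2.powi(self.t as i32));
            params[i] -= lr * m_hat / (v_hat.sqrt() + eps);
        }
    }
}
```

Note the two properties the proof obligations below verify: the step counter advances by exactly 1 per call, and the second moment `v` is a sum of non-negative terms, so it stays >= 0.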
## Sampling

Autoregressive generation with temperature scaling:

```text
for pos in 0..BLOCK_SIZE:
    logits = model.forward(tokens)
    probs = softmax(logits[last] / temperature)
    next = weighted_sample(probs)
    if next == BOS: break
    tokens.push(next)
```

Temperature 0.5 produces conservative, name-like outputs.
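The two sampling pieces can be sketched in plain Rust. `weighted_sample_with_r` takes the uniform random draw `r` in [0, 1) as an argument (the same shape as the binding table at the end of this book), which keeps it deterministic and testable:

```rust
// Divide logits by temperature, then softmax. Lower temperature sharpens
// the distribution toward the argmax; higher flattens it.
fn softmax_with_temperature(logits: &[f32], temp: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|x| x / temp).collect();
    let max = scaled.iter().cloned().fold(f32::MIN, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

// Inverse-CDF sampling: walk the cumulative distribution until it
// exceeds r.
fn weighted_sample_with_r(probs: &[f32], r: f32) -> usize {
    let mut cum = 0.0;
    for (i, p) in probs.iter().enumerate() {
        cum += p;
        if r < cum {
            return i;
        }
    }
    probs.len() - 1 // guard against floating-point rounding
}
```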
## apr inspect

The `apr inspect` command reveals model structure, metadata, and tensor statistics. For microGPT, the equivalent is `cargo run --example inspect_model`.

### Model metadata
```text
$ cargo run --example inspect_model

microGPT Model Inspection
══════════════════════════════════════════════════
Architecture:
  Family:          microGPT (character-level, 1-layer GPT)
  Kernel Class:    Custom (MHA + RMSNorm + ReLU)
  Hidden Size:     16
  Attention Heads: 4 (head_dim=4)
  FFN Size:        64
  Vocab Size:      27
  Context Length:  16
  Parameters:      4192
```
Compare with a production model:

```text
$ apr inspect qwen-1.5b-q4k.apr

Architecture:
  Family:            qwen2
  Hidden Size:       1536
  Intermediate Size: 8960
  Vocab Size:        151936
  Max Position:      32768
  RoPE Theta:        1000000
```

microGPT is ~360,000x smaller than Qwen-1.5B (1.5B / 4,192 ≈ 358K) but uses the same fundamental operations.
### Tensor inventory

Equivalent to `apr tensors <file> --stats`:

```text
Tensors (cf. `apr tensors`):
  Name                         Shape      Numel  Stats
  ──────────────────────────────────────────────────────────────
  wte (token embeddings)       [27, 16]     432  μ=+0.010 σ=0.083
  wpe (position embeddings)    [16, 16]     256  μ=-0.010 σ=0.079
  wq[0] (query proj head 0)    [16, 4]       64  μ=-0.010 σ=0.081
  ...
  w_fc1 (MLP up projection)    [16, 64]    1024  μ=+0.000 σ=0.082
  w_fc2 (MLP down projection)  [64, 16]    1024  μ=+0.003 σ=0.079
  w_lm (LM head)               [16, 27]     432  μ=+0.003 σ=0.078
```
All weights are initialized from N(0, 0.08²), matching Karpathy’s
original Python implementation.
### Kernel pipeline

Equivalent to `apr explain --kernel`:

```text
┌─────────────────────────┬─────────────────────────────────┬─────────────────────────┐
│ Operation               │ Implementation                  │ Contract                │
├─────────────────────────┼─────────────────────────────────┼─────────────────────────┤
│ Embedding Lookup        │ one_hot @ wte (differentiable)  │ microgpt-v1 § one_hot   │
│ RMSNorm                 │ x * (1/rms) per row             │ microgpt-v1 § rms_norm  │
│ Attention (4-head MHA)  │ per-head Q/K/V/O matmul         │ attention-kernel-v1     │
│ Causal Mask             │ lower-tri 0, upper -1e9         │ microgpt-v1 § causal    │
│ Softmax                 │ aprender autograd softmax       │ softmax-kernel-v1       │
│ MLP (ReLU)              │ fc1 → ReLU → fc2                │ activation-kernel-v1    │
│ LM Head                 │ matmul to vocab logits          │ matmul-kernel-v1        │
└─────────────────────────┴─────────────────────────────────┴─────────────────────────┘
```
Every operation maps to a provable contract. The kernel pipeline for a production LLaMA model (via `apr explain --kernel llama`) shows the same structure at scale:

```text
$ apr explain --kernel llama --proof-status

Kernel Pipeline (9 ops):
  MatVec (Q4K)     → matvec-kernel-v1         ◉ Tested
  Softmax          → softmax-kernel-v1        ◉ Tested
  Attention (GQA)  → element-wise-ops-v1      ◉ Tested
  Normalization    → normalization-kernel-v1  ◉ Tested
  Activation       → element-wise-ops-v1      ◉ Tested
  MLP              → element-wise-ops-v1      ◉ Tested
  Position (RoPE)  → rope-kernel-v1           ◉ Tested
```
### Tensor roles

Equivalent to `apr explain --tensor <name>`:

| Tensor | Role |
|---|---|
| wte | Token embedding — maps token IDs to dense vectors |
| wpe | Position embedding — adds positional information |
| wq[h] | Query projection in multi-head attention |
| wk[h] | Key projection in multi-head attention |
| wv[h] | Value projection in multi-head attention |
| wo[h] | Output projection — combines head outputs |
| w_fc1 | Feed-forward up projection (expand to 4x) |
| w_fc2 | Feed-forward down projection (compress to 1x) |
| w_lm | Language model head — projects to vocabulary logits |
## apr explain

The `apr explain` command explains architecture, tensor roles, and kernel dispatch. For microGPT, the equivalent is `cargo run --example explain_attention`.

### Attention mechanics

```text
$ cargo run --example explain_attention

Architecture (cf. `apr explain --kernel gpt2 -v`):
  Attention type:  MHA (multi-head attention)
  Heads:           4
  Head dim:        4
  Total attn dim:  16 (4 × 4)
  Scale factor:    1/√4 = 0.5000
```
The attention mechanism computes:

```text
Attention(Q, K, V) = softmax(Q @ K^T / √d_k) @ V
```

This is the same equation used in GPT-2, LLaMA, Qwen, and every other transformer. The only difference is scale: microGPT uses d_k=4, while production models use d_k=128.
### Attention weight visualization

For input "mar" (tokens [BOS, m, a, r]), the attention weights at random initialization show the causal structure (values vary per run, but the structural invariants are constant):

```text
Head 0:
  pos 0 → [1.000, 0.000, 0.000, 0.000]   ← BOS attends only to itself
  pos 1 → [0.482, 0.518, 0.000, 0.000]   ← 'm' splits between BOS and self
  pos 2 → [0.302, 0.383, 0.315, 0.000]   ← 'a' attends to all prior
  pos 3 → [0.254, 0.247, 0.215, 0.284]   ← 'r' attends broadly

Head 3:
  pos 0 → [1.000, 0.000, 0.000, 0.000]
  pos 1 → [0.494, 0.506, 0.000, 0.000]
  pos 2 → [0.286, 0.339, 0.376, 0.000]   ← head 3 focuses more on 'a'
  pos 3 → [0.247, 0.223, 0.250, 0.280]   ← head 3 focuses more on 'r'
```
Structural invariants (hold for ANY initialization):

- Row 0 is always `[1, 0, 0, 0]` — the BOS token can only see itself
- Upper triangle is zero — causal masking prevents attending to the future
- Each row sums to 1.0 — softmax normalizes attention weights

The specific non-trivial values (e.g., 0.482 vs 0.518) reflect random initialization, not learned patterns. After training, head specialization emerges.
### Causal mask explained

Equivalent to `apr explain --tensor attn_mask`:

```text
Causal Mask:
  Each position can only attend to itself and earlier positions.
  Future positions are masked with -1e9 → softmax drives them to ~0.

  pos 0: [  0 , -∞ , -∞ , -∞ ]
  pos 1: [  0 ,  0 , -∞ , -∞ ]
  pos 2: [  0 ,  0 ,  0 , -∞ ]
  pos 3: [  0 ,  0 ,  0 ,  0 ]
```

The mask uses -1e9 instead of true -∞ to satisfy the upstream softmax precondition contract (`x.iter().all(|v| v.is_finite())`). Since exp(-1e9) ≈ 0 in f32, this is numerically identical.
### Comparison with production models

```text
$ apr explain --kernel llama -v

Constraints (from family contract):
  attention_type:      gqa      ← microGPT uses mha (simpler)
  activation:          silu     ← microGPT uses relu (simpler)
  norm_type:           rmsnorm  ← same as microGPT ✓
  mlp_type:            swiglu   ← microGPT uses relu_mlp (simpler)
  positional_encoding: rope     ← microGPT uses learned (simpler)
  has_bias:            false    ← same as microGPT ✓
```

microGPT uses the simpler variant of each component. The contract structure is identical — only the kernel implementations differ.
## apr trace

The `apr trace` command traces data through each layer of the model, showing tensor shapes and statistics at every stage. For microGPT, the equivalent is `cargo run --example trace_forward`.

### Full forward pass trace

```text
$ cargo run --example trace_forward

Input: "emma" → tokens [0, 5, 13, 13, 1]

Stage 1: Embedding
  tok_emb = one_hot(tokens) @ wte      shape=[5, 16]  L2=0.72
  pos_emb = one_hot(pos) @ wpe         shape=[5, 16]  L2=0.74
  x = tok_emb + pos_emb                shape=[5, 16]  L2=1.04
  x = RMSNorm(x)                       shape=[5, 16]  L2=8.94

Stage 2: Multi-Head Self-Attention
  x_n = RMSNorm(x) [pre-attn]          shape=[5, 16]  L2=8.94
  head_0_out @ wo[0]                   shape=[5, 16]  L2=0.50
  head_1_out @ wo[1]                   shape=[5, 16]  L2=0.46
  head_2_out @ wo[2]                   shape=[5, 16]  L2=0.31
  head_3_out @ wo[3]                   shape=[5, 16]  L2=0.38
  x = x_res + attn_out [residual]      shape=[5, 16]  L2=9.09

Stage 3: MLP (FFN)
  x_n = RMSNorm(x) [pre-MLP]           shape=[5, 16]  L2=8.94
  h = x_n @ w_fc1 [up project]         shape=[5, 64]  L2=5.75
  h = ReLU(h)                          shape=[5, 64]  L2=3.87  ← ~50% zeroed
  mlp_out = h @ w_fc2 [down project]   shape=[5, 16]  L2=1.17
  x = x_res + mlp_out [residual]       shape=[5, 16]  L2=9.10

Stage 4: Language Model Head
  logits = x @ w_lm                    shape=[5, 27]  L2=3.29
```
### What the trace reveals

#### RMSNorm normalization

After RMSNorm, L2 ≈ 8.94 regardless of input magnitude. This is a structural invariant: RMSNorm normalizes each row's RMS to 1.0, so each row's L2 = √16 = 4.0, and the total L2 = √5 × 4.0 ≈ 8.94 for 5 rows of 16 dimensions. The pre-RMSNorm L2 (1.04 here) depends on random initialization.
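The arithmetic behind the 8.94 figure, as a one-line check: 5 rows with RMS 1.0 over 16 dimensions give a total L2 of √(5 × 16) = √80.

```rust
// Total L2 norm of a matrix whose every row has RMS exactly 1.0:
// per-row L2 = sqrt(dims), total = sqrt(rows * dims).
fn total_l2(rows: usize, dims: usize) -> f64 {
    ((rows * dims) as f64).sqrt()
}
```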
#### Attention is a small perturbation

The attention output norms (L2 ≈ 0.3-0.5) are much smaller than the residual stream (L2 ≈ 9.0). After the residual connection, L2 barely changes: 9.0 → 9.09. The residual stream carries most of the signal; attention makes incremental adjustments.

#### ReLU sparsity

Before ReLU: L2=5.75. After: L2=3.87. ReLU zeros out ~50% of the 64 FFN dimensions (the negative activations), creating a sparse intermediate representation. This is visible in the range changing from [-0.90, 1.15] to [0.00, 1.15].
#### Untrained predictions

Before training, the model predicts wrong characters:

```text
Predictions:
  pos 0: target='e'   predicted='r'  P(target)=0.035
  pos 1: target='m'   predicted='g'  P(target)=0.041
  pos 2: target='m'   predicted='j'  P(target)=0.037
  pos 3: target='a'   predicted='c'  P(target)=0.046
  pos 4: target='EOS' predicted='f'  P(target)=0.033
```

The target probability is ~1/27 ≈ 0.037 — essentially random. After 5,000 training steps, these probabilities increase significantly.
### Comparison with `apr trace`

For a production model like Qwen-1.5B:

```text
$ apr trace qwen-1.5b-q4k.apr --verbose

Layer 0: attn_norm → q_proj → k_proj → v_proj → attn → o_proj → residual
         → ffn_norm → gate_proj → up_proj → SiLU → ffn_down → residual
Layer 1: ... (same pattern × 28 layers)
```

microGPT has the same norm → attn → residual → norm → ffn → residual pattern, just with 1 layer instead of 28 and ReLU instead of SiLU/SwiGLU.
## Verification Ladder

microGPT's contract (`microgpt-v1.yaml`) is verified at multiple levels:

| Level | Method | Status | Count |
|---|---|---|---|
| L2 | Property tests (`tests/contract_traits.rs`) | 7/7 | 7 |
| L3 | Bounded model checking (decidable in Lean `native_decide`) | 7/7 | 7 |
| L4 | Lean 4 proofs (`lean/MicroGPT.lean`, 0 `sorry`) | 7/7 | 7 |
| L5 | Lean proved + fully bound | pending | 0 |
### What each level means

- **L2 (Property tested):** Every proof obligation has a corresponding Rust test in `tests/contract_traits.rs` that asserts the invariant at runtime. These tests run on every `cargo test`.
- **L3 (Bounded model checked):** The obligations are concrete arithmetic or structural properties decidable by `native_decide` in Lean 4.
- **L4 (Lean proved):** `lean/MicroGPT.lean` contains 6 theorems with 0 `sorry` stubs, covering param count, Adam monotonicity, second-moment non-negativity, and tokenizer structure.
- **L5 (Fully bound):** Requires `build.rs` pipeline integration to pass binding status to `compute_proof_level`. Not applicable for standalone educational repos without the full provable-contracts build pipeline.
### README under contract

The README itself is verified by `tests/readme_contract.rs` (15 tests):

- Heading hierarchy (no skips, exactly one H1)
- Required sections (Architecture, Install, Usage, License)
- Every table claim matches a source constant
- Badge URLs are well-formed
- Table column parity
- No XSS links

### Book under contract

This mdBook is verified by `tests/book_contract.rs`:

- SUMMARY.md heading hierarchy
- All chapter files referenced in SUMMARY.md exist
- No broken internal links
- Consistent heading structure per chapter
## Contract Reference

### microgpt-v1.yaml

The canonical contract for microGPT. All equations, proof obligations, and falsification tests are defined in `contracts/microgpt-v1.yaml`.

### Equations

| ID | Formula | Tested | Proved |
|---|---|---|---|
| one_hot | `one_hot(i,C)[i][j] = 1 if j==i else 0` | L2 | L4 |
| rms_norm | `x / sqrt(mean(x^2) + eps)` | L2 | - |
| causal_mask | `0 if j<=i else -1e9` | L2 | L4 |
| tokenize_decode | `decode(tokenize(s)) == s` | L2 | L4 |
| weighted_sample | `P(return i) = probs[i]` | L2 | - |
| adam_step | Kingma & Ba (2015) update rule | L2 | L4 |
| forward_pass | `LMHead(MLP(Attn(Embed(tokens))))` | L2 | L4 |
### Proof obligations

| ID | Description | Level |
|---|---|---|
| PARAM-COUNT-001 | Model has exactly 4,192 parameters | L4 |
| ONEHOT-ROW-001 | Each one-hot row sums to 1.0 | L2 |
| MASK-CAUSAL-001 | Lower-triangular structure | L2 |
| TOKENIZE-ROUNDTRIP-001 | Encode/decode roundtrip identity | L2 |
| ADAM-MONOTONIC-001 | Step counter advances by 1 | L4 |
| ADAM-V-NONNEG-001 | Second moment >= 0 | L4 |
| FORWARD-SHAPE-001 | Output shape [n, 27] | L2 |
### Bindings

8 bindings in `contracts/binding.yaml`, all `status: implemented`.

| Equation | Function | Signature |
|---|---|---|
| one_hot | one_hot | `fn(indices, num_classes) -> Tensor` |
| rms_norm | rms_norm | `fn(x: &Tensor) -> Tensor` |
| causal_mask | causal_mask | `fn(size) -> Tensor` |
| tokenize_decode | tokenize | `fn(name) -> Vec<usize>` |
| tokenize_decode | decode | `fn(tokens) -> String` |
| weighted_sample | weighted_sample_with_r | `fn(probs, r) -> usize` |
| adam_step | Adam::step | `fn(&mut self, params, lr)` |
| forward_pass | MicroGPT::forward | `fn(&self, tokens) -> Tensor` |