Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Training

Dataset

32,033 names from Karpathy’s makemore dataset, character-level tokenized with a 27-token vocabulary (26 lowercase letters + BOS/EOS token at index 0).

Tokenization

"maria" → [0, 13, 1, 18, 9, 1, 0]
           BOS m   a  r   i  a  BOS

The BOS token doubles as EOS. The tokenizer drops non-lowercase characters silently, and the roundtrip property holds for all [a-z]* strings:

decode(tokenize(name)[1..-1]) == name

Loss function

Cross-entropy loss from aprender-core, with teacher forcing:

loss = CrossEntropyLoss(logits[0..n], targets[1..n+1])

The initial loss is ~3.3 (random baseline for 27 classes: -ln(1/27) ≈ 3.30). After 5,000 steps, loss converges to ~2.0.

Optimizer

Manual Adam (Kingma & Ba, 2015) operating directly on autograd tensors:

HyperparameterValue
Learning rate0.01 (linear decay to 0)
beta10.85
beta20.99
epsilon1e-8
Steps5,000

The optimizer is manual (not aprender::nn::Adam) because aprender’s Linear::forward uses a cached weight transpose whose TensorId differs from the original weight — the built-in Adam looks up gradients by ID and misses the update. Raw weight tensors with direct matmul ensure every parameter receives its gradient.

Sampling

Autoregressive generation with temperature scaling:

for pos in 0..BLOCK_SIZE:
    logits = model.forward(tokens)
    probs  = softmax(logits[last] / temperature)
    next   = weighted_sample(probs)
    if next == BOS: break
    tokens.push(next)

Temperature 0.5 produces conservative, name-like outputs.