Training

Dataset

32,033 names from Karpathy’s makemore dataset, tokenized at the character level with a 27-token vocabulary (26 lowercase letters plus a shared BOS/EOS token at index 0).

Tokenization

"maria" → [0, 13, 1, 18, 9, 1, 0]
           BOS m   a  r   i  a  BOS

The BOS token doubles as EOS. The tokenizer drops non-lowercase characters silently, and the roundtrip property holds for all [a-z]* strings:

decode(tokenize(name)[1..-1]) == name
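
The mapping above can be sketched in a few lines of Rust. This is a minimal illustration rather than the project's actual tokenizer: the signatures of `tokenize` and `decode` and the `BOS` constant are assumptions chosen to match the example, with a=1 through z=26.

```rust
/// Index of the shared BOS/EOS token (assumed constant name).
const BOS: usize = 0;

/// Encode a name as BOS, letter indices (a=1 ..= z=26), BOS.
/// Characters outside a-z are dropped silently.
fn tokenize(name: &str) -> Vec<usize> {
    let mut tokens = vec![BOS];
    tokens.extend(
        name.bytes()
            .filter(|b| b.is_ascii_lowercase())
            .map(|b| (b - b'a') as usize + 1),
    );
    tokens.push(BOS);
    tokens
}

/// Map letter indices back to characters, skipping BOS and out-of-range ids.
fn decode(tokens: &[usize]) -> String {
    tokens
        .iter()
        .filter(|&&t| (1..=26).contains(&t))
        .map(|&t| (b'a' + (t as u8 - 1)) as char)
        .collect()
}
```

Under these assumptions, stripping the first and last token corresponds to the slice `&tokens[1..tokens.len() - 1]`, and decoding that slice returns the original name for any input made only of lowercase letters.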

Loss function

Cross-entropy loss from aprender-core, with teacher forcing:

loss = CrossEntropyLoss(logits[0..n], targets[1..n+1])

The initial loss is ~3.3, matching the baseline for a uniform prediction over 27 classes: -ln(1/27) = ln 27 ≈ 3.30. After 5,000 steps, the loss converges to ~2.0.
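
To make the shifted-target setup and the ln 27 baseline concrete, here is a standalone sketch of mean cross-entropy over next-token predictions. It does not use aprender-core's CrossEntropyLoss; the function name `cross_entropy` and the plain `Vec<f64>` representation are assumptions for illustration only.

```rust
/// Mean cross-entropy of next-token predictions.
/// `logits[i]` is the score vector at position i; `targets[i]` is the token at i + 1.
fn cross_entropy(logits: &[Vec<f64>], targets: &[usize]) -> f64 {
    assert_eq!(logits.len(), targets.len());
    let mut total = 0.0;
    for (row, &target) in logits.iter().zip(targets) {
        // Log-sum-exp with the max subtracted for numerical stability.
        let max = row.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let log_sum_exp = max + row.iter().map(|x| (x - max).exp()).sum::<f64>().ln();
        // Negative log-probability of the correct next token.
        total += log_sum_exp - row[target];
    }
    total / logits.len() as f64
}
```

With all-zero logits over 27 classes, every position assigns probability 1/27 to its target, so the function returns ln 27 regardless of the name being scored.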

Optimizer

Manual Adam (Kingma & Ba, 2015) operating directly on autograd tensors:

Hyperparameter	Value
Learning rate	0.01 (linear decay to 0)
beta1	0.85
beta2	0.99
epsilon	1e-8
Steps	5,000

The optimizer is manual rather than aprender::nn::Adam because aprender’s Linear::forward uses a cached weight transpose whose TensorId differs from that of the original weight. The built-in Adam looks up gradients by ID, so it misses the update. Storing raw weight tensors and calling matmul directly ensures that every parameter receives its gradient.
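
The update rule with the hyperparameters from the table can be sketched as follows. This is an illustration on plain `f64` slices, not the project's code on aprender autograd tensors: the `Adam` struct, its flat-buffer layout, and the exact form of the linear decay (base rate at step 0, reaching 0 at step 5,000) are assumptions.

```rust
/// Adam state for one flat parameter buffer (hypothetical layout for illustration).
struct Adam {
    m: Vec<f64>, // first-moment estimate
    v: Vec<f64>, // second-moment estimate
    t: u32,      // number of updates applied so far
}

const BASE_LR: f64 = 0.01;
const BETA1: f64 = 0.85;
const BETA2: f64 = 0.99;
const EPS: f64 = 1e-8;
const TOTAL_STEPS: u32 = 5_000;

impl Adam {
    fn new(n: usize) -> Self {
        Adam { m: vec![0.0; n], v: vec![0.0; n], t: 0 }
    }

    /// Learning rate decayed linearly from BASE_LR at step 0 to 0 at TOTAL_STEPS
    /// (assumed schedule shape).
    fn lr(step: u32) -> f64 {
        BASE_LR * (1.0 - step as f64 / TOTAL_STEPS as f64)
    }

    /// Apply one bias-corrected Adam update in place.
    fn step(&mut self, params: &mut [f64], grads: &[f64]) {
        let lr = Self::lr(self.t);
        self.t += 1;
        let c1 = 1.0 - BETA1.powi(self.t as i32);
        let c2 = 1.0 - BETA2.powi(self.t as i32);
        for i in 0..params.len() {
            self.m[i] = BETA1 * self.m[i] + (1.0 - BETA1) * grads[i];
            self.v[i] = BETA2 * self.v[i] + (1.0 - BETA2) * grads[i] * grads[i];
            let m_hat = self.m[i] / c1;
            let v_hat = self.v[i] / c2;
            params[i] -= lr * m_hat / (v_hat.sqrt() + EPS);
        }
    }
}
```

Because the state is indexed by position in the buffer rather than by tensor ID, no parameter can be skipped through a failed lookup. On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly the learning rate.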

Sampling

Autoregressive generation with temperature scaling:

tokens = [BOS]                      // generation starts from the BOS token
for pos in 0..BLOCK_SIZE:
    logits = model.forward(tokens)
    probs  = softmax(logits[last] / temperature)
    next   = weighted_sample(probs)
    if next == BOS: break           // BOS doubles as EOS
    tokens.push(next)

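Dividing the logits by a temperature below 1 sharpens the distribution toward the most likely characters, so temperature 0.5 produces conservative, name-like outputs.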
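
The two numeric steps of the loop, temperature-scaled softmax and weighted sampling, can be sketched without the model. The function names follow the pseudocode, but their signatures are assumptions; in particular, `weighted_sample` takes the uniform draw `u` as an argument so the sketch needs no RNG crate and stays deterministic.

```rust
/// Softmax of `logits / temperature`, computed stably by subtracting the max.
fn softmax_with_temperature(logits: &[f64], temperature: f64) -> Vec<f64> {
    let scaled: Vec<f64> = logits.iter().map(|x| x / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|x| (x - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Inverse-CDF sampling: return the first index whose cumulative probability exceeds `u`.
/// `u` is a uniform draw in [0, 1) supplied by the caller's RNG.
fn weighted_sample(probs: &[f64], u: f64) -> usize {
    let mut cumulative = 0.0;
    for (i, p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    probs.len() - 1 // guard against rounding when u is very close to 1
}
```

At temperature 0.5 every gap between logits is doubled before the softmax, which is why the highest-scoring character takes a larger share of the probability mass than it does at temperature 1.0.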