3. Model Architecture

3.1 Architecture: LLaMA-Style Decoder-Only Transformer

entrenar’s transformer is a pre-norm LLaMA-style architecture with RMSNorm, SwiGLU FFN, Grouped-Query Attention, and RoPE. This is hardcoded in the Transformer struct — we configure it via YAML, we don’t build it from scratch.

| Hyperparameter | Value | Rationale |
|---|---|---|
| Parameters | ~350M | Fits in 4090 VRAM with optimizer state in fp16 |
| Layers | 24 | GPT-2 Medium proven at this depth |
| Hidden dim (d_model) | 1024 | Standard for this param count |
| Attention heads | 16 | d_head = 64, well-studied |
| KV heads | 4 | GQA with 4:1 ratio (memory efficient) |
| FFN dim (intermediate) | 4096 | ~4x hidden dim (SwiGLU gate + up + down) |
| Vocab size | 32,768 | BPE trained on corpus (power of 2 for GPU efficiency) |
| Context length | 2048 (spec) / 1024 (training) | 2048 OOMs at batch≥4 on 4090; training uses 1024 |
| Position encoding | RoPE | Built into entrenar’s MultiHeadAttention |
| Attention | GQA | Built into entrenar; fewer KV heads than Q heads |
| Normalization | RMSNorm | Built into entrenar; pre-norm (before attn + FFN) |
| FFN activation | SwiGLU | Built into entrenar (gate_proj, up_proj, down_proj) |
| Dropout | 0.0 | Modern practice for pre-training (regularize via data) |
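The memory benefit of the 4:1 GQA ratio can be made concrete with a KV-cache size sketch. The shapes come from the table above; an fp16 cache and a single full-context (2048-token) sequence are assumptions:

```rust
// KV-cache size at this model's shape: GQA (4 KV heads) vs. full MHA
// (16 KV heads). fp16 cached keys/values are assumed.
const LAYERS: u64 = 24;
const D_HEAD: u64 = 64;
const SEQ: u64 = 2048;

/// Bytes of K+V cache for one full-context sequence.
fn kv_cache_bytes(kv_heads: u64) -> u64 {
    2 * LAYERS * kv_heads * D_HEAD * SEQ * 2 // 2 tensors (K, V) x 2 B (fp16)
}

fn main() {
    println!("GQA (4 KV heads):  {} MiB", kv_cache_bytes(4) >> 20);  // 48
    println!("MHA (16 KV heads): {} MiB", kv_cache_bytes(16) >> 20); // 192
}
```

Cutting KV heads from 16 to 4 shrinks the cache 4x at inference time while leaving the 16 query heads untouched.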

3.2 Progressive Model Sizing

To validate the pipeline quickly, we train progressively larger models. Each gets its own YAML config file (see §6.2 for full config format).

| Model | Config | Params | Layers | Hidden | Heads | Purpose |
|---|---|---|---|---|---|---|
| albor-50M | pretrain-50m.yaml | ~50M | 12 | 512 | 8 | Pipeline validation (hours) |
| albor-125M | pretrain-125m.yaml | ~125M | 16 | 768 | 12 | Intermediate, first benchmarks (1-2 days) |
| albor-350M | pretrain-350m.yaml | ~350M | 24 | 1024 | 16 | Final base model (3-7 days) |

The 50M model proves the entire stack works end-to-end before committing days of GPU time to the 350M run.
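As a sketch of what one of these files might contain, assembled from the hyperparameter table above: the field names below are illustrative assumptions only; the actual schema is defined in §6.2.

```yaml
# pretrain-350m.yaml -- illustrative sketch; real field names follow
# entrenar's config schema (see §6.2).
model:
  n_layers: 24
  d_model: 1024
  n_heads: 16
  n_kv_heads: 4       # GQA, 4:1 Q:KV ratio
  d_ffn: 4096         # SwiGLU intermediate dim
  vocab_size: 32768
  max_seq_len: 2048   # spec; training runs at 1024
training:
  seq_len: 1024
  batch_size: 4
  dropout: 0.0
```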

3.3 VRAM Budget (fp16 mixed precision, RTX 4090)

Speculative estimates (pre-dogfooding):

| Component | Size |
|---|---|
| Model weights (fp16) | ~700 MB |
| Adam optimizer states (fp32 m, v) | ~2.8 GB |
| Gradients (fp16) | ~700 MB |
| Activations (grad checkpoint, batch=8, seq=2048) | ~8-12 GB |
| Total estimated | ~13-16 GB |
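The static rows above follow directly from the parameter count. A minimal arithmetic check, assuming exactly 350M parameters:

```rust
// Arithmetic check of the static-memory rows above, assuming exactly
// 350M parameters. Adam keeps two fp32 moments (m, v) per parameter;
// fp16 gradients mirror the fp16 weights.
const PARAMS: u64 = 350_000_000;

fn weights_fp16_bytes(params: u64) -> u64 { params * 2 }  // 2 B/param
fn adam_fp32_bytes(params: u64) -> u64 { params * 2 * 4 } // m + v, 4 B each
fn grads_fp16_bytes(params: u64) -> u64 { params * 2 }    // 2 B/param

fn main() {
    println!("weights: {} MB", weights_fp16_bytes(PARAMS) / 1_000_000);    // 700
    println!("adam:    {:.1} GB", adam_fp32_bytes(PARAMS) as f64 / 1e9);   // 2.8
    println!("grads:   {} MB", grads_fp16_bytes(PARAMS) / 1_000_000);      // 700
}
```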

Actual measurements (from ALB-040 dogfooding with CudaTransformerTrainer):

| Config | VRAM used | Status |
|---|---|---|
| seq=512, batch=4 | ~18 GB | PASS |
| seq=1024, batch=4 | ~19.5 GB | PASS (production config) |
| seq=2048, batch=4 | OOM | FAIL (logits [4,2048,32768] = 1 GB exceeds budget) |
| seq=2048, batch=8 | OOM | FAIL (OOM at block 21 upload) |

The GPU-resident CudaTransformerTrainer keeps all 24 blocks in VRAM (weights + AdamW states ≈ 5 GB) plus a shared workspace for activations (~10-12 GB). This is tighter than the speculative estimate because the shared workspace includes attention score matrices that scale as O(heads × seq² × batch). Batch size is fixed at 4. Note: gradient_accumulation is set to 1 for the v2 config, though per-block CPU gradient accumulation is now fully implemented via PerBlockGradientAccumulator (D2H download, CPU averaging, H2D upload). See §6.4 for detailed breakdown.