3. Model Architecture

3.1 Architecture: LLaMA-Style Decoder-Only Transformer

entrenar’s transformer is a pre-norm, LLaMA-style decoder with RMSNorm, a SwiGLU FFN, Grouped-Query Attention (GQA), and RoPE. The architecture is hardcoded in the Transformer struct: we configure its dimensions via YAML rather than building it from scratch.

Hyperparameter	Value	Rationale
Parameters	~350M	Fits in RTX 4090 VRAM (24 GB) with fp16 weights and fp32 Adam states (see §3.3)
Layers	24	Same depth as GPT-2 Medium, a proven configuration
Hidden dim (d_model)	1024	Standard for this param count
Attention heads	16	d_head = 64, well-studied
KV heads	4	GQA with a 4:1 query-to-KV head ratio (smaller K/V projections, memory efficient)
FFN dim (intermediate)	4096	4x hidden dim (SwiGLU gate + up + down)
Vocab size	32,768	BPE trained on corpus (power of 2 for GPU efficiency)
Context length	2048 (spec) / 1024 (training)	2048 OOMs at batch≥4 on 4090; training uses 1024
Position encoding	RoPE	Built into entrenar’s `MultiHeadAttention`
Attention	GQA	Built into entrenar, fewer KV heads than Q heads
Normalization	RMSNorm	Built into entrenar, pre-norm (before attn + FFN)
FFN activation	SwiGLU	Built into entrenar (gate_proj, up_proj, down_proj)
Dropout	0.0	Modern practice for pre-training (regularize via data)
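
The hyperparameters above determine the parameter count, and the arithmetic can be sketched in a few lines. This is a rough estimate, not entrenar's own accounting: it assumes no bias terms, one RMSNorm weight vector per norm, and (by default) tied input/output embeddings, none of which this section states. `ModelShape` and `count_params` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    # Values copied from the hyperparameter table in this section.
    layers: int = 24
    d_model: int = 1024
    n_heads: int = 16
    n_kv_heads: int = 4
    ffn_dim: int = 4096
    vocab: int = 32_768


def count_params(s: ModelShape, tied_embeddings: bool = True) -> int:
    """Estimate trainable parameters (assumes no biases; see lead-in)."""
    d_head = s.d_model // s.n_heads              # 64
    kv_dim = s.n_kv_heads * d_head               # 256 with GQA
    # Q and O projections are d_model x d_model; K and V shrink to kv_dim.
    attn = 2 * s.d_model * s.d_model + 2 * s.d_model * kv_dim
    # SwiGLU uses three matrices: gate_proj, up_proj, down_proj.
    ffn = 3 * s.d_model * s.ffn_dim
    norms = 2 * s.d_model                        # pre-attn and pre-FFN RMSNorm
    per_layer = attn + ffn + norms
    embed = s.vocab * s.d_model
    lm_head = 0 if tied_embeddings else embed
    final_norm = s.d_model
    return s.layers * per_layer + embed + lm_head + final_norm


if __name__ == "__main__":
    shape = ModelShape()
    print(f"tied:   {count_params(shape) / 1e6:.1f}M")
    print(f"untied: {count_params(shape, tied_embeddings=False) / 1e6:.1f}M")
```

Under these assumptions the total comes out nearer 400M than 350M, with the FFN matrices dominating each layer. Which of the assumptions differ from entrenar's actual layout is not stated here, so the ~350M figure is best read as a nominal size.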

3.2 Progressive Model Sizing

To validate the pipeline quickly, we train progressively larger models. Each gets its own YAML config file (see §6.2 for full config format).

Model	Config	Params	Layers	Hidden	Heads	Purpose
albor-50M	`pretrain-50m.yaml`	~50M	12	512	8	Pipeline validation (hours)
albor-125M	`pretrain-125m.yaml`	~125M	16	768	12	Intermediate, first benchmarks (1-2 days)
albor-350M	`pretrain-350m.yaml`	~350M	24	1024	16	Final base model (3-7 days)

The 50M model proves the entire stack works end-to-end before committing days of GPU time to the 350M run.
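
One property the three sizes share is a head dimension of 64 (hidden / heads). The sketch below checks this before any GPU time is spent. The dictionary layout is hypothetical and is not the YAML schema from §6.2; only the numbers come from the table above.

```python
# Layer/hidden/head counts copied from the sizing table; the dict layout
# itself is illustrative only.
SIZES = {
    "albor-50M": {"layers": 12, "hidden": 512, "heads": 8},
    "albor-125M": {"layers": 16, "hidden": 768, "heads": 12},
    "albor-350M": {"layers": 24, "hidden": 1024, "heads": 16},
}


def head_dim(hidden: int, heads: int) -> int:
    """Return the per-head dimension, rejecting shapes that do not divide evenly."""
    if hidden % heads != 0:
        raise ValueError(f"hidden={hidden} is not divisible by heads={heads}")
    return hidden // heads


if __name__ == "__main__":
    for name, cfg in SIZES.items():
        print(name, "d_head =", head_dim(cfg["hidden"], cfg["heads"]))
```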

3.3 VRAM Budget (fp16 mixed precision, RTX 4090)

Speculative estimates (pre-dogfooding):

Component	Size
Model weights (fp16)	~700 MB
Adam optimizer states (fp32 m, v)	~2.8 GB
Gradients (fp16)	~700 MB
Activations (gradient checkpointing, batch=8, seq=2048)	~8-12 GB
Total estimated	~12-16 GB
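
The static rows of this estimate follow directly from the parameter count and the bytes per element. A minimal sketch of that arithmetic, assuming exactly 350M parameters and decimal gigabytes (the function name is hypothetical):

```python
GB = 1e9  # decimal gigabytes, matching the rounding used in the table


def static_memory_gb(n_params: float) -> dict:
    """Memory that does not depend on batch size or sequence length."""
    return {
        "weights_fp16": n_params * 2 / GB,        # 2 bytes per parameter
        "adam_m_v_fp32": n_params * 2 * 4 / GB,   # two fp32 moments per parameter
        "grads_fp16": n_params * 2 / GB,          # 2 bytes per parameter
    }


static = static_memory_gb(350e6)
static_total = sum(static.values())               # about 4.2 GB

# Add the 8-12 GB activation estimate from the table.
low, high = static_total + 8.0, static_total + 12.0

if __name__ == "__main__":
    for name, gb in static.items():
        print(f"{name:>14}: {gb:.1f} GB")
    print(f"total estimate: {low:.1f}-{high:.1f} GB")
```

The static portion is only about 4 GB, so the estimate is dominated by activations, which is also the term the measurements below show to be least predictable.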

Actual measurements (from ALB-040 dogfooding with CudaTransformerTrainer):

Config	VRAM Used	Status
seq=512, batch=4	~18 GB	PASS
seq=1024, batch=4	~19.5 GB	PASS (production config)
seq=2048, batch=4	OOM	FAIL: logits tensor [4, 2048, 32768] alone is 1 GiB at 4 bytes per element, which exceeds the remaining budget
seq=2048, batch=8	OOM	FAIL: OOM at block 21 upload
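
The size of the logits tensor in the failing row can be reproduced from its shape. The sketch assumes a dense tensor with no padding; the 1 GB figure corresponds to 4 bytes per element (fp32), while an fp16 tensor of the same shape would be half that. `tensor_bytes` is a hypothetical helper.

```python
import math

GIB = 1024 ** 3


def tensor_bytes(shape, bytes_per_element: int) -> int:
    """Size in bytes of a dense tensor with the given shape."""
    return math.prod(shape) * bytes_per_element


logits_2048 = (4, 2048, 32_768)   # [batch, seq, vocab] from the failing config
logits_1024 = (4, 1024, 32_768)   # production config

fp32_2048_gib = tensor_bytes(logits_2048, 4) / GIB   # 1.0
fp16_2048_gib = tensor_bytes(logits_2048, 2) / GIB   # 0.5
fp32_1024_gib = tensor_bytes(logits_1024, 4) / GIB   # 0.5

if __name__ == "__main__":
    print(f"seq=2048 fp32 logits: {fp32_2048_gib} GiB")
    print(f"seq=1024 fp32 logits: {fp32_1024_gib} GiB")
```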

The GPU-resident CudaTransformerTrainer keeps all 24 blocks in VRAM (weights + AdamW states ≈ 5 GB) plus a shared workspace for activations (~10-12 GB). Measured usage exceeds the speculative estimate, even at a smaller batch size and sequence length, because the shared workspace includes attention score matrices that scale as O(heads × seq² × batch). Batch size is fixed at 4.

Note: gradient_accumulation is set to 1 in the v2 config, although per-block CPU gradient accumulation is fully implemented via PerBlockGradientAccumulator (D2H download, CPU averaging, H2D upload). See §6.4 for a detailed breakdown.
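
To make the O(heads × seq² × batch) scaling concrete, the sketch below sizes one layer's attention score matrix. It assumes fp16 elements and that the scores are fully materialized; how entrenar actually lays out its shared workspace is not described here, and `attention_score_bytes` is a hypothetical helper.

```python
MIB = 1024 ** 2


def attention_score_bytes(heads: int, seq: int, batch: int,
                          bytes_per_element: int = 2) -> int:
    """Bytes for one layer's [batch, heads, seq, seq] score matrix (fp16 assumed)."""
    return batch * heads * seq * seq * bytes_per_element


at_1024 = attention_score_bytes(heads=16, seq=1024, batch=4) / MIB   # 128.0
at_2048 = attention_score_bytes(heads=16, seq=2048, batch=4) / MIB   # 512.0

if __name__ == "__main__":
    print(f"seq=1024: {at_1024} MiB per layer")
    print(f"seq=2048: {at_2048} MiB per layer ({at_2048 / at_1024:.0f}x)")
```

Doubling the sequence length quadruples this term, while doubling the batch size only doubles it, which is consistent with seq=2048 failing at a batch size that seq=1024 handles.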
