6. Training Configuration
6.1 Optimizer & Schedule
| Parameter | Value | Rationale |
|---|---|---|
| Optimizer | AdamW | Standard; in aprender/entrenar |
| Learning rate | 3e-4 | Peak LR used for GPT-3 Medium (350M) |
| Weight decay | 0.1 | Standard AdamW regularization |
| Beta1, Beta2 | 0.9, 0.95 | LLaMA/GPT-3 standard |
| Epsilon | 1e-8 | Standard |
| LR schedule | Cosine annealing with warmup | CosineAnnealingLR in aprender |
| Warmup steps | 2000 (v1) / 500 (v2) | ALB-060: 2000/5000 = 40%, not 0.2%. v2 config uses 500 (10%) per C-TRAINCFG-001 |
| Min LR | 3e-5 | 10% of peak (standard) |
| Gradient clipping | 1.0 (global norm) | Stability; currently disabled in the v2 config (ALB-067: CPU-side L2 norm bottleneck) |
| Batch size (global) | 512K tokens | ~512 sequences x 1024 tokens |
| Micro-batch (4090) | 4 | GPU-resident (batch=8 OOM at seq≥1024) |
| Gradient accumulation | 1 (ALB-066) | Per-block CPU accumulation now works (PerBlockGradientAccumulator); kept at 1 for v2 config |
| Total training tokens | Target 10B; current 139M (v2 dataset) | ~5000 steps × 4 seqs × 1024 tokens = 20M tokens/run (v2: 68K seqs) |
| Mixed precision | fp16 (CUDA) | Hardware-appropriate |
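The schedule in the table (linear warmup to the peak LR, then cosine decay to the minimum LR) can be sketched as follows. This is a minimal illustration and not aprender's CosineAnnealingLR. The function name `lr_at`, the warmup starting from zero, and the decay ending exactly at `max_steps` are assumptions.

```python
import math


def lr_at(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=500, max_steps=5000):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_lr (assumed form)."""
    if step < warmup_steps:
        # Warmup: LR grows linearly with the step index.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0 past max_steps.
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 250, 500, 2750, 5000):
        print(step, f"{lr_at(step):.3e}")
```

Under these assumptions, a run that stops at step 43 of a 2000-step warmup never gets past `3e-4 * 43 / 2000 = 6.45e-6`.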
6.2 Training Config: configs/train/pretrain-350m-v2.yaml
A single YAML file defines everything — model architecture and training
hyperparameters. This is the industry standard (Axolotl, torchtune, HuggingFace
Trainer). One file, one truth. apr train validate lints it before GPU time.
Current config (v2 — expanded dataset, ALB-066 gradient_accumulation=1):
# configs/train/pretrain-350m-v2.yaml — Albor 350M with expanded dataset
# C-TRAINCFG-001: steps_per_epoch=16994 >= max_steps=5000
model:
  path: "." # From scratch (random init)
  mode: transformer
  architecture:
    hidden_size: 1024 # d_model
    num_hidden_layers: 24
    num_attention_heads: 16 # d_head = 64
    num_key_value_heads: 4 # GQA 4:1 ratio
    intermediate_size: 4096 # SwiGLU FFN (gate + up + down)
    vocab_size: 32768 # ByteLevel BPE (v2 tokenizer)
    max_position_embeddings: 1024 # Context length (2048 OOM'd on 4090)
    rms_norm_eps: 1.0e-5
data:
  train: "data/pretokenized-2048-v2/train/" # Expanded v2 dataset (68K sequences)
  val: "data/pretokenized-2048/val/"
  batch_size: 4 # Micro-batch (batch=8 OOM'd)
  seq_len: 1024
  tokenizer: "models/albor-tokenizer-v2/tokenizer.json"
  input_column: "input_ids" # Pre-tokenized: List<u32> column
optimizer:
  name: "adamw"
  lr: 3.0e-4
  beta1: 0.9
  beta2: 0.95
  weight_decay: 0.1
training:
  mode: "causal_lm"
  epochs: 1 # C-TRAINCFG-001: steps_per_epoch=16994 >= 5000
  # grad_clip: 1.0 # ALB-067: disabled (CPU-side L2 norm bottleneck)
  lr_scheduler: "cosine"
  warmup_steps: 500 # 10% of max_steps (C-TRAINCFG-001)
  gradient_accumulation: 1 # ALB-066: per-sequence optimizer (no true accum in CUDA)
  mixed_precision: "fp16"
  output_dir: "./checkpoints/albor-base-350m-v2"
  save_interval: 25
  max_steps: 5000
Legacy v1 config (pretrain-350m.yaml) used 22K sequences with
gradient_accumulation: 128 and epochs: 117 — see ALB-060 for why
epochs: 1 was fatal with the original data size.
Note on YAML numeric formatting: YAML 1.1 allows underscores in integers
(32_768, 1_000_000) for human-readable large numbers; the YAML 1.2 core schema
does not, so support depends on the parser. All albor configs use this
convention. For shorthand like 10B or 512K, see gap ALB-021.
6.3 Training Workflow (Plan/Apply)
# Step 1: Plan — validate config, estimate VRAM, show execution plan (no GPU)
apr train plan configs/train/pretrain-350m.yaml
# Step 2: Apply — execute the training run
apr train apply configs/train/pretrain-350m.yaml --seed 42
# Step 3: Resume if interrupted (apply with --resume)
apr train apply configs/train/pretrain-350m.yaml \
--resume checkpoints/albor-base-350m/checkpoint-step-5000.json \
--seed 42
Plan phase (apr train plan):
- Schema validation: required keys, correct types, valid enum values
- Architecture sanity: hidden_size divisible by num_attention_heads, num_kv_heads divides num_attention_heads
- VRAM budget: computes model size + optimizer + activations, warns if > GPU capacity
- Data paths: confirms train: and val: directories exist with Parquet/tokenized shards
- Tokenizer: loads tokenizer, checks vocab size matches model.vocab_size
- Time estimate: estimated wall time based on model size and hardware
- Prints structured plan summary (see §1.5.2 for output format)
- No GPU, no writes, no network. Runs on CPU in seconds.
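The architecture sanity step in the list above comes down to two divisibility checks. The sketch below shows them in isolation. The function name `architecture_errors` and the error wording are hypothetical, and this is not the code `apr train plan` runs.

```python
def architecture_errors(arch):
    """Return human-readable problems with an architecture dict (empty list = OK)."""
    errors = []
    hidden = arch["hidden_size"]
    heads = arch["num_attention_heads"]
    kv_heads = arch["num_key_value_heads"]
    # Each attention head needs an integer-sized slice of the hidden dimension.
    if hidden % heads != 0:
        errors.append(f"hidden_size {hidden} not divisible by num_attention_heads {heads}")
    # GQA: every KV head must serve the same whole number of query heads.
    if heads % kv_heads != 0:
        errors.append(f"num_key_value_heads {kv_heads} does not divide num_attention_heads {heads}")
    return errors


if __name__ == "__main__":
    albor = {"hidden_size": 1024, "num_attention_heads": 16, "num_key_value_heads": 4}
    print(architecture_errors(albor))  # empty list: d_head = 64, GQA ratio 4:1
```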
Apply phase (apr train apply):
- Reads the same YAML, builds a random-initialized Transformer with the model: section architecture, runs the causal LM training loop via entrenar
- Checkpoints every save_interval steps, so a crashed run is resumable
- No Rust code needed, just one config file
apr train validate is an alias for apr train plan --strict — schema-only
checking without resource estimation. Fast enough for CI.
6.4 GPU-Resident Training (CudaTransformerTrainer)
The CudaTransformerTrainer (ALB-040) keeps all 24 transformer blocks
GPU-resident, reducing PCIe transfers from ~16K/step to exactly 3:
Transfer 1 (H2D): embedding hidden states ~S×H×4 bytes
Transfer 2 (D2H): logits for cross-entropy ~S×V×4 bytes
Transfer 3 (H2D): grad_logits to GPU ~S×V×4 bytes
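To give a feel for the sizes involved, the three transfers can be computed from the formulas above. This sketch assumes S is the number of tokens in flight, H is hidden_size, V is vocab_size, and every element is an f32 (4 bytes). Whether S includes the batch dimension is not stated here, so the helper takes it as a parameter. `transfer_bytes` is an illustrative name.

```python
def transfer_bytes(tokens, hidden_size, vocab_size, bytes_per_elem=4):
    """Per-step PCIe transfer sizes in bytes for the three transfers listed above."""
    return {
        "h2d_hidden": tokens * hidden_size * bytes_per_elem,      # Transfer 1
        "d2h_logits": tokens * vocab_size * bytes_per_elem,       # Transfer 2
        "h2d_grad_logits": tokens * vocab_size * bytes_per_elem,  # Transfer 3
    }


if __name__ == "__main__":
    sizes = transfer_bytes(tokens=1024, hidden_size=1024, vocab_size=32768)
    for name, n in sizes.items():
        print(f"{name}: {n / 2**20:.0f} MiB")
```

With S=1024, H=1024, V=32768 this gives 4 MiB for the hidden states and 128 MiB for each logits-sized transfer, so the vocabulary dimension dominates.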
Each CudaTransformerBlock holds its own weights, AdamW optimizer states
(m + v), and shares a CudaGradWorkspace for forward/backward activation
buffers. The per-block interleaved backward+optimizer pattern overwrites
the shared workspace each layer — memory cost is O(1 block), not O(24 blocks)
for activations.
VRAM budget (actual, RTX 4090 24GB):
| Component | Memory |
|---|---|
| 24 blocks (weights + AdamW m + v) | ~5 GB |
| Shared workspace (activation/gradient buffers) | ~10-12 GB (depends on seq_len) |
| LM head (weights + AdamW + logits buffer) | ~1-2.5 GB |
| System (Xorg/desktop) | ~1 GB |
- At seq_len=512, batch=4: fits comfortably (~18 GB used).
- At seq_len=1024, batch=4: fits (~19.5 GB used).
- At seq_len=2048, batch=4: OOM at LM head alloc (logits [4,2048,32768] too large).
- At seq_len=2048, batch=8: OOM at block 21 upload.
Dogfooding results:
| Config | Steps | Loss | Time | Status |
|---|---|---|---|---|
| 50M quick (seq=512, batch=4) | 5 | 10.42→9.45 | ~10s | PASS (post ALB-059 fix) |
| 350M test (seq=512, batch=4) | 50 | 10.39→5.92 (best 5.53) | ~400s | PASS (post ALB-059 fix) |
| 350M full v1 (seq=1024, batch=4, accum=128) | 43/5000 | 10.39 flat | ~12s | FAIL (ALB-060): epochs=1 exhausted data |
| 350M full v2 (seq=1024, batch=4, accum=1) | 1183/5000 | 10.4→6.85 | ~1.4h | CRASHED: ALB-073 (PTX selp) + ALB-074 (stale binary). Step 1000 ckpt saved. |
| 350M v3 (seq=1024, batch=4, codeparrot) | 28K/250K | 10.40→6.43 | ~1.9 days | STOPPED (plateau): val_ppl=1018 at step 28K. 6.7K tok/s, 19.3% MFU. Plateau since step 12K — ALB-079 (no cosine decay) + ALB-080 (batch too small). |
| 350M v4 (seq=1024, batch=4, ga=32) | 500 | 10.40→5.76 | ~4.7h | Killed by system reboot at step 553. val_ppl=1032.7 at step 500 (matched v3 at 57% token budget). Checkpoint saved. |
| 350M v4-resume (from step 500) | 56+ | 10.40→6.31 | est ~2.7 days | RUNNING: Warm-start 8x faster convergence. loss=6.31 at step 37. |
ALB-060: Training Configuration Epoch/Step Mismatch (Critical)
The first 350M full training run (2026-03-02) ran only 43 of 5000 steps because
epochs: 1 caps total steps to floor(num_sequences / batch_size / grad_accum).
With 22,079 sequences, batch=4, accum=128: steps_per_epoch = 43. Warmup (2000
steps) never completed — LR peaked at 6.45e-6 vs target 3e-4. Loss stayed flat
at ~10.39 for all 43 steps (never exited warmup). Root cause: no pre-flight
algebraic validation of epoch/step consistency.
Fix: C-TRAINCFG-001 contract (contracts/training-config-kernel-v1.yaml) +
epochs: 117 for v1 data, or v2 config (pretrain-350m-v2.yaml) with expanded
dataset (67,977 sequences, epochs: 38, warmup_steps: 500).
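The epoch/step arithmetic behind ALB-060 can be reproduced in a few lines. This is a sketch of the kind of pre-flight check described, not the C-TRAINCFG-001 contract itself. The function names and error messages are hypothetical.

```python
import math


def steps_per_epoch(num_sequences, batch_size, grad_accum):
    """Optimizer steps available in one pass over the data."""
    return num_sequences // batch_size // grad_accum


def min_epochs(max_steps, num_sequences, batch_size, grad_accum):
    """Smallest epoch count that makes max_steps reachable."""
    spe = steps_per_epoch(num_sequences, batch_size, grad_accum)
    if spe == 0:
        raise ValueError("dataset too small for one optimizer step")
    return math.ceil(max_steps / spe)


def check_config(epochs, max_steps, warmup_steps, num_sequences, batch_size, grad_accum):
    """Return a list of consistency errors (empty list = config is consistent)."""
    reachable = epochs * steps_per_epoch(num_sequences, batch_size, grad_accum)
    errors = []
    if reachable < max_steps:
        errors.append(f"only {reachable} of {max_steps} steps reachable")
    if warmup_steps >= reachable:
        errors.append(f"warmup ({warmup_steps}) never completes in {reachable} steps")
    return errors


if __name__ == "__main__":
    # v1 failure case: 22,079 sequences, batch=4, accum=128, epochs=1
    print(steps_per_epoch(22079, 4, 128), min_epochs(5000, 22079, 4, 128))
    print(check_config(1, 5000, 2000, 22079, 4, 128))
    # v2 config: 67,977 sequences, batch=4, accum=1, epochs=1
    print(check_config(1, 5000, 500, 67977, 4, 1))
```

The v1 inputs give 43 steps per epoch and a minimum of 117 epochs, matching the figures above.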
Training stability contracts verified (ALB-044, ALB-059, ALB-060):
- C-EMBED-GRAD-001: Activation gradient clipped at GPU→CPU boundary
- C-HYPERPARAMS-001: All optimizer params flow from YAML config
- C-BUFSIZE-001: Buffer sizes algebraically verified (ALB-043 fix)
- C-GRADFLOW-001: All trainable parameters receive gradients (ALB-038 fix)
- C-GEMMARGS-001: GEMM backward constructor args match documented order (ALB-059 fix)
- C-GPUINIT-001: All optimizer m/v states zero-initialized, not cuMemAlloc garbage (ALB-059 fix)
- C-STREAMSYNC-001: stream.synchronize() before any D2H transfer reading kernel output (ALB-065 fix)
- C-LOSSSCALE-001: fp16 loss scaling excluded from GPU backward (all backward uses f32; scaling causes overflow) (ALB-072 fix)
- C-SELP-001: PTX selp_f32 argument order verified in all kernels (ALB-069, ALB-073 fixes)
- C-EVALBUF-001: eval_single_sequence truncates to max_seq_len before GPU forward (ALB-074 fix)
- C-CUBLAS-NOTENCORE-001: cuBLAS uses CUBLAS_DEFAULT_MATH (no tensor cores) — tensor core algorithms produce NaN for transposed backward GEMMs at ~1e5 gradient magnitude (ALB-077 fix)
6.5 Checkpointing Strategy
| Aspect | Design |
|---|---|
| Format | SafeTensors (primary) + JSON metadata |
| Frequency | Every 1,000 steps (~1.2h at 4.2s/step, ~4M tokens) |
| Content | Model weights (~1.5 GB), optimizer state (~1.3 GB), config.json |
| Pruning | Automatic — keeps latest + best only, old checkpoints deleted |
| Disk usage | ~8.4 GB peak (3 checkpoints: current + best + in-flight) |
| Storage | Local NVMe RAID-0, checkpoints directory in repo |
| Resume | From latest checkpoint on crash (weights + optimizer state) |
| Export | apr publish --format safetensors for HuggingFace |
Checkpoint interval rationale (v3): save_interval: 1000 balances crash
recovery (~8.7min max lost work at 525ms/step) against I/O overhead (~3s per
checkpoint write vs ~525s between checkpoints = 0.6% overhead). With automatic
pruning, disk usage stays constant regardless of training length. For the
250K-step v3 run (~1.5 days at 7,579 tok/s), this yields 250 checkpoint events
with ~8.4 GB steady-state disk.
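The interval trade-off can be checked numerically. The sketch below plugs in the figures quoted in this section (525 ms/step, ~3 s per checkpoint write, 1.5 GB weights + 1.3 GB optimizer state, 3 retained checkpoints). `checkpoint_budget` is an illustrative helper, not part of entrenar.

```python
def checkpoint_budget(save_interval, step_time_s, write_time_s,
                      total_steps, ckpt_gb, retained):
    """Crash-recovery window, I/O overhead, and disk use for a checkpoint interval."""
    interval_s = save_interval * step_time_s
    return {
        "max_lost_minutes": interval_s / 60.0,           # worst-case work lost on crash
        "io_overhead": write_time_s / interval_s,        # fraction of time spent writing
        "checkpoint_events": total_steps // save_interval,
        "disk_gb": ckpt_gb * retained,                   # constant thanks to pruning
    }


if __name__ == "__main__":
    budget = checkpoint_budget(
        save_interval=1000, step_time_s=0.525, write_time_s=3.0,
        total_steps=250_000, ckpt_gb=1.5 + 1.3, retained=3,
    )
    for key, value in budget.items():
        print(key, round(value, 4))
```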
6.6 Experiment Tracking & Training Monitoring
entrenar has a full monitoring stack built in, and presentar provides rich terminal visualization. Albor uses both — no external tools (no W&B, no MLflow, no TensorBoard). Sovereign monitoring, sovereign visualization.
6.6.1 Monitoring Config: configs/train/pretrain-350m.yaml (monitoring section)
monitoring:
  terminal:
    enabled: true
    refresh_rate: 1000 # TUI refresh in ms
    metrics: ["loss", "learning_rate", "gradient_norm"]
    charts:
      - type: "loss_curve"
        metric: "loss"
        window: 100 # Smoothing window
        show_eta: true
  tracking:
    enabled: true
    backend: "sqlite" # .entrenar/experiments.db (WAL mode)
    experiment: "albor-pretrain-350m"
    tags:
      model: "albor-350m"
      stage: "pretrain"
      data: "python-code-v2" # 139M tokens (v2 dataset)
  system:
    enabled: true
    interval: 5000 # System metrics every 5s
    metrics: ["gpu_utilization", "memory", "temperature"]
  alerts:
    - condition: "loss > 10"
      action: "stop"
      message: "Loss exploded — Andon stop"
    - condition: "gradient_norm > 100"
      action: "stop"
      message: "Gradient explosion — Andon stop"
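To make the alert rules concrete, here is one way condition strings of the form "metric op threshold" could be evaluated against a metrics snapshot. This is a hypothetical sketch: the parser, the supported operators, and the function names are assumptions, and entrenar's real alert engine may work differently.

```python
import operator

# Comparison operators supported by this illustrative parser.
_OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def alert_fires(condition, metrics):
    """Evaluate a 'name op threshold' condition string against a metrics dict."""
    name, op, threshold = condition.split()
    return _OPS[op](metrics[name], float(threshold))


def first_alert(alerts, metrics):
    """Return the first alert whose condition matches, or None."""
    for alert in alerts:
        if alert_fires(alert["condition"], metrics):
            return alert
    return None


if __name__ == "__main__":
    alerts = [
        {"condition": "loss > 10", "action": "stop"},
        {"condition": "gradient_norm > 100", "action": "stop"},
    ]
    print(first_alert(alerts, {"loss": 6.85, "gradient_norm": 0.42}))
    print(first_alert(alerts, {"loss": 10.4, "gradient_norm": 0.42}))
```

The dogfooding runs above start near loss 10.4, so a literal loss > 10 rule would match on the first step. How entrenar schedules alert evaluation (for example, a grace period during warmup) is not specified here.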
6.6.2 What Entrenar Monitors Automatically
| Component | What It Does | Already Built? |
|---|---|---|
| MetricsCollector | Records loss, LR, gradient norms per step (SIMD-accelerated) | Yes (entrenar) |
| ExperimentTracker | Tracks run_id, params, metrics, artifacts, status | Yes (entrenar) |
| SqliteBackend | Durable experiment store: runs, params, metrics, artifacts in .entrenar/experiments.db (WAL mode) | Yes (entrenar) |
| ProgressCallback | Kalman-filtered ETA, Unicode progress bars | Yes (entrenar) |
| MonitorCallback | Integrates metrics into training, detects NaN/Inf → Andon alert | Yes (entrenar) |
| CheckpointCallback | Saves best model + metadata (epoch, is_best, timestamp) | Yes (entrenar) |
| EarlyStopping | Patience-based stopping on loss plateau | Yes (entrenar) |
| Andon alerts | Toyota Way: Critical/Error/Warning/Info severity levels | Yes (entrenar) |
| TuiMonitor | Detached terminal dashboard composing presentar widgets (ALB-057) | Yes (entrenar + presentar) |
| DriftDetector | PSI, KS, Wasserstein distribution shift detection | Yes (entrenar) |
| JsonFileStore | Real-time metrics to training_state.json (atomic writes) | Yes (entrenar) |
| LossCurve widget | Training loss over epochs with EMA smoothing | Yes (presentar) |
| ConfusionMatrix widget | Multi-class classification evaluation | Yes (presentar) |
| Braille/Sparkline | High-resolution terminal charts (2x4 dots/cell, 8-level sparklines) | Yes (presentar) |
| Heatmap widget | 2D matrix with CIELAB perceptual color gradients | Yes (presentar) |
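The EMA smoothing listed for the LossCurve widget can be illustrated with a short sketch. The smoothing factor `alpha` and the choice to seed the average with the first sample are assumptions; presentar's actual constants are not given here.

```python
def ema(values, alpha=0.1):
    """Exponential moving average; the first sample seeds the running value."""
    smoothed = []
    current = None
    for value in values:
        # Blend the new sample with the running average.
        current = value if current is None else alpha * value + (1.0 - alpha) * current
        smoothed.append(current)
    return smoothed


if __name__ == "__main__":
    noisy_loss = [10.4, 9.8, 10.1, 9.2, 9.5, 8.7]
    print([round(x, 3) for x in ema(noisy_loss, alpha=0.3)])
```

A smaller `alpha` gives a smoother curve that reacts more slowly to real changes in the loss.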
6.6.3 Live Monitoring During Training
# Terminal 1 (lambda): Run training
apr train apply --task pretrain --config configs/train/pretrain-350m.yaml
# Terminal 2 (lambda or ssh): Attach live monitor (presentar TUI)
apr monitor ./checkpoints/albor-base-350m/
# Terminal 2 (alternative): JSON output for LLM agents / CI
apr monitor --json ./checkpoints/albor-base-350m/
# Discover all active training runs (reads global SQLite registry)
apr monitor
# List past experiments from SQLite registry
apr runs ls --global
# Show detailed metrics for a specific run
apr runs show <run-id> --global --json
# Browse past experiments from SQLite
apr experiment view --db .entrenar/experiments.db
# Compare loss curves across runs
apr experiment view --db .entrenar/experiments.db \
--runs albor-pretrain-50m,albor-pretrain-350m \
--metric loss --chart loss_curve
# One-shot profiler (GPU utilization, per-layer timing)
apr cbtop ./checkpoints/albor-base-350m/latest.safetensors
# Inference latency profiling
apr profile ./checkpoints/albor-base-350m/ --prompt "def fibonacci(n):"
# Stack-level health (from batuta)
batuta stack status
6.6.4 Experiment Lifecycle
Each training run creates two data streams:
Real-time (JSON file IPC) — for live TUI monitoring:
checkpoints/albor-base-350m/
├── training_state.json # Live metrics (loss, lr, grad_norm, GPU telemetry)
├── checkpoint-step-1000.safetensors
├── checkpoint-step-1000.json # Checkpoint metadata (epoch, is_best)
├── checkpoint-step-2000.safetensors
├── checkpoint-step-2000.json
├── checkpoint-best.safetensors
└── checkpoint-best.json
Durable (dual SQLite experiment stores) — for post-hoc analysis and comparison:
checkpoints/albor-base-350m/.entrenar/
└── experiments.db # Local per-experiment store (WAL mode)
    ├── experiments # Experiment metadata (name, description, config)
    ├── runs # Training runs (status, timestamps)
    ├── params # Hyperparameters (key/value/type)
    ├── metrics # Per-step metrics (loss, lr, tok/s, timestamp)
    ├── artifacts # Model artifacts (path, size, SHA-256)
    └── span_ids # Distributed trace integration
~/.entrenar/
└── experiments.db # Global cross-machine registry (WAL mode)
    └── (same schema) # All runs across all experiments
PretrainTracker (ALB-055/056) writes to both stores on every log interval.
All operations are best-effort — storage failures never block training.
Three consumers, zero contention:
1. apr monitor reads training_state.json (atomic write-then-rename) for live dashboards. Multiple monitors attach simultaneously.
2. apr runs ls reads ~/.entrenar/experiments.db (global registry) for cross-experiment history. Supports --json for LLM agent consumption.
3. apr experiment reads local .entrenar/experiments.db (WAL mode) for per-run metric queries and artifact tracking. Read-only during training, so there is no lock contention with the writer.
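The atomic write-then-rename pattern that lets monitors read training_state.json without locks can be sketched as follows. This is a generic illustration using the Python standard library, not entrenar's JsonFileStore. `write_state_atomically` is a hypothetical name.

```python
import json
import os
import tempfile


def write_state_atomically(path, state):
    """Write JSON to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the bytes hit disk before the rename
        # os.replace is atomic on the same filesystem: readers see old or new, never partial.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        target = os.path.join(run_dir, "training_state.json")
        write_state_atomically(target, {"step": 37, "loss": 6.31})
        with open(target) as handle:
            print(json.load(handle))
```

Because the temp file lives in the same directory as the target, the rename stays on one filesystem, which is what makes the swap atomic for concurrent readers.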
6.6.5 Presentar Visualization: Rich Terminal Dashboards
presentar (presentar-terminal) provides ML-specific visualization widgets
that entrenar’s TrainingDashboard now composes directly (ALB-057). The
dashboard builds a widget tree from Layout::rows() of Border-wrapped
section panels, each containing Meter, GpuPanel, Sparkline, or Text
widgets. The connection point for historical data is entrenar’s SQLite
experiment store (.entrenar/experiments.db).
Live training dashboard (apr monitor — reads training_state.json):
╭─ Albor Pre-Train: albor-base-350m ─── Step 12,847 / 19,073 ──── 67.4% ─╮
│ │
│ Loss GPU (RTX 4090) │
│ 3.2 ⣀⣀ ████████████░░░ 82% │
│ ⠈⠉⠉⠑⠒⠒⠤⣀ VRAM: 14.2 / 24.0 GB │
│ ⠈⠉⠑⠒⠤⣀⣀ Temp: 72°C │
│ 1.8 ⠈⠉⠒⠒⣀⣀⣀⣀ Power: 312W │
│ ⠉⠉⠉ Tokens/s: 18,432 │
│ 0 ──────────────────────────────── 12K │
│ │
│ Learning Rate Gradient Norm ETA: 1d 14h 22m │
│ ⣿⣿⣿⣷⣶⣶⣤⣤⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ▁▁▂▁▁▃▁▂▁▁▁▂▁▁ Throughput: 5.2B / 10B │
│ 3e-4 → 2.1e-4 0.42 (norm) Checkpoint: step-12000 │
╰──────────────────────────────────────────────────────────────────────────╯
Post-hoc experiment comparison (apr experiment view — reads SQLite):
# Compare loss curves across all pre-training runs
apr experiment view --db .entrenar/experiments.db \
--runs albor-pretrain-50m,albor-pretrain-350m \
--metric loss --chart loss_curve
# Hyperparameter comparison table
apr experiment view --db .entrenar/experiments.db \
--experiment albor-pretrain-350m --params
# Export metrics for external analysis (Parquet for alimentar)
apr experiment export --db .entrenar/experiments.db \
--run albor-pretrain-350m --format parquet --output ./eval/metrics.parquet
Presentar widgets used by albor:
| Widget | Use Case | Data Source |
|---|---|---|
| LossCurve | Training loss over steps with EMA smoothing | training_state.json (live) or SQLite metrics table (post-hoc) |
| Sparkline | Compact LR schedule, gradient norm history | training_state.json lr_history, grad_norm |
| Heatmap | Attention pattern visualization, weight distribution | Model checkpoint tensors |
| Gauge | GPU utilization, VRAM usage, training progress | training_state.json gpu telemetry |
| BrailleGraph | High-resolution loss/metric curves over SSH | training_state.json loss_history |
| Histogram | Weight distribution per layer (pre/post distillation) | Model checkpoint tensors |
| BarChart | Benchmark scores across model stages | eval/*.json results |
Two rendering targets, same widgets, same data:
presentar compiles the same widget tree to two targets — terminal and
WASM. The dashboard YAML is written once. presentar-terminal renders it
via crossterm (works over SSH). presentar renders it via WebGPU in the
browser (60fps, GPU-accelerated). Both read from the same data sources.
| Mode | Command | Renderer | Data Source | Use Case |
|---|---|---|---|---|
| Live TUI | apr monitor ./checkpoints/ | presentar-terminal (crossterm) | training_state.json (polling) | Watch training over SSH |
| Experiment TUI | apr experiment view | presentar-terminal (crossterm) | SQLite .entrenar/experiments.db | Compare runs in terminal |
| Web dashboard | presentar serve --config albor-dashboard.yaml | presentar (WebGPU/WASM) | SQLite + checkpoints | Rich browser dashboard |
Both TUI and WASM are first-class deliverables, not stretch goals. The terminal TUI is the primary interface (SSH to lambda/intel). The WASM dashboard is the shareable artifact for model cards and teaching.
6.6.6 No External Dependencies
| What Others Use | What Albor Uses Instead | Why |
|---|---|---|
| Weights & Biases | entrenar SqliteBackend + presentar dashboards | Sovereign — no cloud, no API keys, all data local |
| TensorBoard | presentar LossCurve + BrailleGraph over SSH | No Python, no browser required, works over SSH |
| MLflow | entrenar ExperimentTracker + SQLite + apr experiment | Self-hosted SQLite, no server process, query via CLI |
| nvidia-smi polling | entrenar system metrics + apr cbtop | Integrated into training loop, not bolted on |
| Streamlit dashboards | presentar WASM dashboard (10x faster rendering) | GPU-accelerated, 60fps, zero Python |