Albor LLM Specification
Version: 0.6.0 Date: 2026-03-03 Status: Phase 3 — 350M Base Model Retraining (ALB-060 fix, v2 data) Author: Noah Gift / Pragmatic AI Labs
Albor (Spanish: “dawn”) — A sovereign Python code completion model trained from first principles using only the Sovereign AI stack. Python-only following the phi-1 playbook: maximum concentration on one language, distilled from Qwen3-Coder-Next (80B), then optimized through fine-tuning, merging, pruning, and quantization into a fast, local, zero-dependency code completion engine. The goal is twofold: produce a usable Python code assist model that runs anywhere Rust compiles, and identify + fix every gap in the stack that blocks end-to-end LLM development.
Latest milestone: 350M CUDA test training verified — 50 steps, loss 10.39→5.92 (best 5.53), checkpoint loads in realizar, all training stability contracts pass. First full training run failed (ALB-060: epochs=1 only ran 43/5000 steps). Fixed with C-TRAINCFG-001 contract + v2 config (67,977 sequences, 139M tokens, epochs=38). Qwen2.5-Coder-3B interim teacher validated for distillation. 24+ upstream gaps fixed across 8 sovereign stack components.
1. Objectives
1.1 Primary Goal
Train, distill, and optimize a 350M-parameter decoder-only transformer using exclusively the Sovereign AI stack:
- `apr` for training, distillation, merging, pruning, quantization, eval, export
- `alimentar` for data loading and preprocessing
- `forjar` for pipeline orchestration (DAG engine, multi-machine, state tracking)
- `bashrs` (Rash) for shell fragment validation in pipeline task resources
- `repartir` for distributed compute
- `entrenar` for the training engine (autograd, optimizers, checkpointing)
- `trueno` for SIMD/GPU tensor operations
- `realizar` for inference (teacher model, eval, serving)
- `presentar` for training visualization (TUI dashboards, experiment browser, WASM)
- `batuta` for orchestration, stack coordination, and falsification
- `pv` (provable-contracts) for design-by-contract verification of every kernel
- `pmat` for TDG scoring, compliance, fault pattern analysis, and coverage gaps
- `certeza` for three-tier test effectiveness (unit → property → formal)
1.2 Secondary Goal (Stack Validation)
Identify every implementation gap that blocks the primary goal. Fix each gap in the correct upstream component. The model is the proof; the stack improvements are the lasting value.
1.3 Multi-Stage Improvement Ladder
The model is not a single training run — it is iteratively improved through every
post-training technique available in apr. Each stage exercises a different
part of the stack, produces a benchmarked checkpoint, and may reveal new gaps.
Stage 1: Pre-train base model → albor-base
Stage 2: Distill from Qwen3-Coder-Next → albor-distill
Stage 3: Instruction fine-tune (LoRA) → albor-instruct
Stage 4: Merge with complementary model → albor-merged
Stage 5: Prune for efficiency → albor-pruned
Stage 6: Quantize for deployment → albor-q4
1.4 Target Use Cases
Primary: Sovereign Code Assist
A tiny, fast, zero-dependency code completion model that runs entirely locally. No API calls, no Python runtime, no telemetry, no cloud. Distillation from Qwen3-Coder-Next gives it coding capability far above what 350M parameters normally achieve.
| Capability | Description |
|---|---|
| Python code completion | Left-to-right next-token prediction in .py files |
| Fill-in-the-middle (FIM) | Insert Python code between existing prefix and suffix (PSM/SPM) |
| Single-line infill | Complete the current line given surrounding context |
| Multi-line body generation | Generate function bodies, loop contents, comprehensions, decorators |
| On-device inference | Runs on laptops, Raspberry Pi, browsers (WASM via trueno) |
| Latency target | <50ms per token on CPU (Q4), <10ms on GPU |
Language: Python only. Following the phi-1 playbook — maximum concentration on a single language produces dramatically better results at small param counts than spreading tokens across many languages. A 350M model that completes Python well is more useful than a 350M model that completes 10 languages poorly.
What Albor is NOT: It is not a chat model, not an instruction follower, not a reasoning engine, not a polyglot code model. It is a fast, local Python code completion kernel — the kind of model that lives inside an editor extension and fires on every keystroke.
Secondary: Stack Demonstration & Teaching Artifact
The model exists equally to prove the Sovereign AI stack can train, distill,
optimize, and serve an LLM end-to-end in pure Rust. The HuggingFace model card
is a tour of the stack. The reproducibility protocol means anyone can retrain
from scratch using only apr commands.
| Audience | What They Get |
|---|---|
| Developers | A code completion model they can self-host with zero dependencies |
| Researchers | A fully reproducible training recipe with provable quality contracts |
| Stack users | Proof that aprender/entrenar/trueno/realizar handle real LLM workloads |
| Educators | A case study in first-principles LLM training (data → deploy in Rust) |
1.5 What Albor Builds
Albor is a project repo, not a library. It contains no production Rust code. All Rust changes happen upstream in the sovereign stack components. Albor drives the upstream work, validates it end-to-end, and produces the model.
1.5.1 What Lives in Albor (This Repo)
albor/
├── docs/
│ ├── specifications/albor-llm-spec.md # This spec
│ ├── model-card.md # HuggingFace model card
│ └── falsification-report.md # batuta falsify output
├── configs/
│ ├── train/
│ │ ├── pretrain-50m.yaml # 50M: model arch + training (pipeline validation)
│ │ ├── pretrain-125m.yaml # 125M: model arch + training (intermediate)
│ │ ├── pretrain-350m.yaml # 350M: model arch + training (final)
│ │ ├── distill.yaml # Distillation config
│ │ └── finetune-lora.yaml # LoRA fine-tuning config
│ ├── pipeline/
│ │ └── albor.yaml # THE manifest: infra + data + train + eval + publish
│ ├── dashboard/
│ │ └── albor-dashboard.yaml # presentar dashboard (TUI + WASM)
│ └── data-mix.yaml # Data source weights + upsampling
├── contracts/
│ ├── knowledge-distillation-kernel-v1.yaml # ALB-013
│ ├── bpe-tokenizer-kernel-v1.yaml # ALB-014
│ ├── model-merging-kernel-v1.yaml # ALB-015
│ ├── pruning-kernel-v1.yaml # ALB-016
│ └── gradient-accumulation-kernel-v1.yaml # ALB-017
├── tests/
│ ├── falsify/ # FALSIFY-ALBOR-001 through 009
│ ├── integration/ # End-to-end pipeline tests
│ └── smoke/ # Quick sanity checks (50M model)
├── state/ # (gitignored) forjar state + locks
│ ├── lambda/state.lock.yaml # Per-machine resource state
│ ├── intel/state.lock.yaml
│ └── forjar.lock.yaml # Global pipeline state
├── data/ # (gitignored) Training data
├── checkpoints/ # (gitignored) Model checkpoints
└── eval/ # (gitignored) Evaluation results
1.5.2 apr as Unified Entry Point
apr is the single CLI for all model operations. It delegates to
sibling projects (entrenar, alimentar, realizar, etc.) under the hood. If a
subcommand doesn’t exist yet, we file a GitHub issue, implement it in the
correct upstream repo, wire it into apr, dogfood it in albor, and close
the issue.
Design Principle: Plan/Apply Everywhere
Every apr subcommand that touches data, compute, or infrastructure follows
a plan/apply contract inspired by Terraform and forjar:
plan → Validate inputs, estimate cost, show what WILL happen. No side effects.
apply → Execute the plan. Mutates state (files, models, infrastructure).
This is not optional. It is the unifying design principle of the CLI. Every expensive operation gets a free dry-run. Every destructive operation shows you what it will do before it does it. Users never commit GPU hours, disk space, or network bandwidth without seeing the plan first.
The contract:
- `apr <cmd> plan <config>` — Parse config, validate paths, estimate resources (VRAM, disk, time, tokens), print a human-readable execution plan. Exit 0 if valid, exit 1 with diagnostics if not. No GPU, no writes, no network.
- `apr <cmd> apply <config>` — Execute. Reads the same config, does the work. Can be interrupted and resumed.
- `apr <cmd> validate <config>` — Alias for `plan` with `--strict` schema-only checking (no resource estimation). Fast enough for CI.
Why this matters for albor: Training a 350M model for 7 days on a 4090
is not something you retry casually. A config typo caught at plan time
saves days. A VRAM overestimate caught at plan time prevents OOM crashes
at step 15,000. Plan/apply turns “hope it works” into “prove it will work,
then run it.”
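The contract can be sketched as a small Rust trait. The names below (`PlanApply`, `Tokenize`, the fields of `Plan`) are illustrative assumptions for this sketch, not aprender's actual API:

```rust
// Illustrative sketch of the plan/apply contract. plan() is side-effect
// free and validates up front; apply() re-validates, then does the work.
#[derive(Debug)]
struct Plan {
    steps: Vec<String>,
    est_disk_mb: u64,
}

trait PlanApply {
    fn plan(&self) -> Result<Plan, String>;
    fn apply(&self) -> Result<(), String>;
}

// Toy subcommand: catches a bad vocab size at plan time, before any work.
struct Tokenize {
    vocab_size: u32,
}

impl PlanApply for Tokenize {
    fn plan(&self) -> Result<Plan, String> {
        if !self.vocab_size.is_power_of_two() {
            return Err(format!("vocab_size {} is not a power of two", self.vocab_size));
        }
        Ok(Plan {
            steps: vec!["train BPE tokenizer".to_string()],
            est_disk_mb: 12,
        })
    }

    fn apply(&self) -> Result<(), String> {
        let _plan = self.plan()?; // apply never runs on an invalid config
        // ... mutate state here ...
        Ok(())
    }
}

fn main() {
    assert!(Tokenize { vocab_size: 32_768 }.plan().is_ok());
    assert!(Tokenize { vocab_size: 50_000 }.apply().is_err());
}
```

The key property is that `apply` calls `plan` first, so an invalid config can never reach the mutation path.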
Dispatch Table
apr <subcommand>
├── pipeline plan/apply → forjar DAG engine (THE entry point — runs everything)
├── tokenize plan/apply → aprender BPE tokenizer
├── train plan/apply → entrenar TransformerTrainer
├── distill plan/apply → entrenar + realizar (precompute + student training)
├── finetune plan/apply → entrenar LoRA/QLoRA
├── eval plan/apply → aprender eval harness
├── merge plan/apply → entrenar SLERP/TIES/DARE
├── prune plan/apply → entrenar WANDA/magnitude
├── quantize plan/apply → entrenar Q4/Q8
├── export plan/apply → entrenar SafeTensors/GGUF
├── publish plan/apply → entrenar HuggingFace Hub
├── bench plan/apply → realizar latency benchmarks
├── provision plan/apply → forjar infrastructure convergence
├── experiment view/export → presentar TUI + entrenar SQLite
└── monitor → presentar live TUI (reads training_state.json)
apr pipeline is the top-level command. It reads a single YAML manifest
that describes infrastructure resources AND training tasks in one DAG. Forjar’s
engine resolves dependencies (Kahn’s toposort), tracks state (BLAKE3 hashes),
and dispatches each step — calling back into apr subcommands for ML tasks.
Individual subcommands (apr train, apr eval, etc.) still work standalone
for development and debugging.
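Kahn's algorithm, which forjar's engine uses for dependency resolution, fits in a few lines. This is an illustrative CPU sketch, not forjar's implementation:

```rust
use std::collections::{HashMap, VecDeque};

// Kahn's toposort over a task DAG. `deps` maps each task to the tasks
// that depend on it (its successors). Returns None if the graph has a cycle.
fn toposort(deps: &HashMap<&str, Vec<&str>>) -> Option<Vec<String>> {
    // Count incoming edges for every node.
    let mut indegree: HashMap<&str, usize> = deps.keys().map(|&k| (k, 0)).collect();
    for succs in deps.values() {
        for &s in succs {
            *indegree.entry(s).or_insert(0) += 1;
        }
    }
    // Seed the queue with nodes that have no unmet dependencies.
    let mut ready: VecDeque<&str> = indegree
        .iter()
        .filter(|&(_, &d)| d == 0)
        .map(|(&k, _)| k)
        .collect();
    let mut order = Vec::new();
    while let Some(n) = ready.pop_front() {
        order.push(n.to_string());
        if let Some(succs) = deps.get(n) {
            for &s in succs {
                let d = indegree.get_mut(s).unwrap();
                *d -= 1;
                if *d == 0 {
                    ready.push_back(s);
                }
            }
        }
    }
    // A cycle leaves some nodes with nonzero indegree, never processed.
    if order.len() == indegree.len() { Some(order) } else { None }
}

fn main() {
    let mut deps: HashMap<&str, Vec<&str>> = HashMap::new();
    deps.insert("tokenize", vec!["train"]);
    deps.insert("train", vec!["eval", "export"]);
    deps.insert("eval", vec![]);
    deps.insert("export", vec![]);
    let order = toposort(&deps).expect("pipeline DAG has no cycle");
    assert_eq!(order[0], "tokenize");
    assert_eq!(order[1], "train");
}
```

A cycle in the manifest (task A needs B, B needs A) is detected before anything runs, which is exactly the failure mode a plan phase should catch.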
Plan Output Format
Every plan subcommand prints a structured summary:
$ apr train plan configs/train/pretrain-350m.yaml
Albor Train Plan
─────────────────────────────────────────────
Model: llama (24L, 1024H, 16A, 4KV)
Parameters: 354,267,136 (~354M)
Precision: fp16 mixed
─────────────────────────────────────────────
VRAM Budget:
Weights 700 MB
Optimizer 2,800 MB (AdamW fp32 m+v)
Gradients 700 MB
Activations 9,200 MB (grad ckpt, batch=8, seq=2048)
Total 13,400 MB (55.8% of 24,576 MB)
Headroom 11,176 MB ✓
─────────────────────────────────────────────
Data:
Train shards data/tokenized/train/ (47 files, 8.2 GB)
Val shards data/tokenized/val/ (3 files, 410 MB)
Tokenizer models/albor-tokenizer/tokenizer.json ✓
Vocab match 32,768 = model.vocab_size ✓
─────────────────────────────────────────────
Training:
Global batch 524,288 tokens (8 × 32 × 2048)
Total tokens 10,000,000,000 (~10B)
Total steps 19,073
Warmup 2,000 steps (10.5%)
Checkpoints 19 (every 1,000 steps)
Disk est. ~13.3 GB (19 × 700 MB)
─────────────────────────────────────────────
Estimated wall time: 5.2 days on RTX 4090
─────────────────────────────────────────────
✓ Plan valid. Run `apr train apply configs/train/pretrain-350m.yaml` to start.
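The Training section of the plan is checkable by hand. This sketch reproduces the arithmetic from the example output above:

```rust
// Sanity arithmetic behind the plan's Training section (values copied
// from the example plan output; illustrative only).
fn main() {
    let micro_batch: u64 = 8;
    let grad_accum: u64 = 32;
    let seq_len: u64 = 2048;
    let global_batch = micro_batch * grad_accum * seq_len;
    assert_eq!(global_batch, 524_288); // tokens per optimizer step

    let total_tokens: u64 = 10_000_000_000;
    let total_steps = total_tokens / global_batch;
    assert_eq!(total_steps, 19_073); // matches the plan's Total steps
}
```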
Forjar already does this (forjar plan -f albor.yaml). Entrenar has the
TrainingPlan module (training_plan.rs) that mirrors forjar’s architecture.
Albor’s job is to close the loop: every apr subcommand gets plan/apply,
and every gap (ALB-XXX) that adds a new subcommand must implement both phases.
What Plan Validates Per Subcommand
| Subcommand | Plan Checks |
|---|---|
| tokenize | Input Parquet exists, vocab size valid, output dir writable, estimated time |
| train | YAML schema, model arch sanity (divisibility, KV ratio), VRAM budget, data paths, tokenizer vocab match, checkpoint disk estimate |
| distill | Teacher model loadable (RAM check), student checkpoint exists, logit output dir writable, temperature/alpha valid |
| finetune | Base model exists, LoRA rank/alpha valid, dataset format, VRAM with adapters |
| eval | Model checkpoint exists, benchmark tasks recognized, output dir writable |
| merge | All input models exist and have compatible architectures, merge method valid |
| prune | Model exists, sparsity ratio in [0,1], method recognized, output size estimate |
| quantize | Model exists, target format valid (Q4/Q8), output size estimate |
| export | Model exists, format valid (SafeTensors/GGUF), output path writable |
| publish | Model + model card exist, HF token valid, repo name available |
| provision | forjar plan: SSH reachable, packages installable, GPU drivers, disk space |
1.5.3 Development Workflow: Issue-Driven Dogfooding
When albor hits a wall — a missing subcommand, a broken feature, a gap in a sibling project — the workflow is:
1. Hit wall → apr <subcommand> doesn't exist or fails
2. File issue → GitHub issue on correct repo (aprender, entrenar, alimentar, etc.)
3. Implement → Fix upstream in the correct component
4. Wire into apr → Add/update apr subcommand if needed
5. Dogfood → Run the blocked albor pipeline step
6. Prove → Tests pass, FALSIFY test passes, pmat comply check
7. Close issue → Link to albor gap ID (ALB-XXX)
Every ALB-XXX gap in the gap register (§11) maps to a GitHub issue. The gap
is not “closed” until the apr subcommand works end-to-end in the albor
pipeline.
1.5.4 What Lives Upstream (Other Repos)
| Upstream Repo | What Albor Adds to It | Gaps |
|---|---|---|
| aprender (apr) | pipeline plan/apply, tokenize plan/apply, distill plan/apply, eval plan/apply, train plan/apply, plan/apply contract enforcement | ALB-001, 006, 009, 011, 023, 028 |
| alimentar | import local, mix with upsampling, FIM transforms, streaming to entrenar | ALB-007, 018, 019, 020 |
| realizar | Qwen3-Coder-Next / DeltaNet / MoE architecture support | ALB-010 |
| entrenar | Training engine, model merging, pruning, quantization, LoRA, custom YAML model arch, human-readable config values | ALB-003, 004, 021, 022 |
| forjar | task resource type for ML pipeline orchestration, DAG engine for apr pipeline | ALB-027 |
| presentar | SQLite experiment viewer, live training TUI, WASM dashboard, apr experiment CLI | ALB-024, 025, 026 |
| bashrs | Shell fragment validation for all task resource command: fields | (used by ALB-027) |
| trueno | wgpu backward pass (stretch) | ALB-005 |
| repartir | Ring all-reduce (stretch), heterogeneous balancing | ALB-002, 008 |
| provable-contracts | 5 new kernel contracts (KD, BPE, merging, pruning, grad accum) | ALB-013–017 |
1.5.5 Where Quality Constraints Apply
| Constraint | Applies To | NOT To |
|---|---|---|
| 95% test coverage | Upstream Rust code we modify (aprender, entrenar, alimentar, etc.) | Albor’s shell scripts and YAML configs |
| 85% mutation score | Upstream Rust code we modify | Albor configs |
| 500-line file limit | ALL files: upstream Rust, albor scripts, YAML configs, contracts | Generated output (eval results, logs) |
| TDG grade A | Upstream Rust code via pmat | Albor shell scripts |
| Zero clippy warnings | Upstream Rust code | N/A |
| pmat comply check | Each upstream repo after modification | Albor repo itself |
| Contract verification | Upstream kernel implementations | Albor orchestration |
| FALSIFY-ALBOR tests | The albor pipeline end-to-end | Individual upstream unit tests |
The albor repo has no Rust code to cover. Its quality is measured by:
- Do the configs work? (integration tests)
- Do the FALSIFY tests pass? (end-to-end validation)
- Are the contracts complete? (`pv status`)
- Does the pipeline reproduce? (deterministic re-run)
1.6 Constraints
- Zero Python dependencies — Pure Rust from data to deployment
- Scientifically reproducible — Fixed seeds, versioned data, deterministic training
- Publicly auditable — All data, code, hyperparameters, and training logs published
- `apr` only — Every model operation uses an `apr <subcommand>`. Missing commands are gaps to implement.
- Plan/apply everywhere — Every `apr` subcommand implements `plan` (dry-run, no side effects) and `apply` (execute). No GPU time without a passing plan.
- One manifest, one DAG — `apr pipeline plan/apply configs/pipeline/albor.yaml` orchestrates the entire pipeline. No Makefiles, no shell scripts. Forjar’s DAG engine handles dependency resolution, state tracking, multi-machine dispatch, and resumability.
- bashrs linted — All shell fragments in forjar task resources are validated by bashrs (Rash). No unvalidated shell.
- No file over 500 lines — Applies to all code, scripts, configs, and contracts (not docs/specs)
- Provably correct — Every kernel has a YAML contract with falsification tests and Kani proofs
- pmat compliant — Upstream changes: TDG grade A, 95% coverage, 85% mutation score, zero SATD
- Falsifiable — Every claim in this spec has a concrete test that could disprove it
1.7 Sovereign Stack vs. Standard ML Stack
Most LLM training stacks depend on a deep tower of NVIDIA and Python libraries:
Standard ML Stack Sovereign Stack (albor)
───────────────── ──────────────────────
Python Rust (no Python runtime)
PyTorch / JAX entrenar (training engine)
cuDNN trueno PTX kernels + cuBLAS FFI
NCCL (not needed — single GPU)
torch.distributed repartir (stretch goal)
Weights & Biases presentar + renacer tracing
HuggingFace Transformers realizar (inference)
What each replaced component does — and why we don’t use it:
| Component | What It Does | Why Albor Doesn’t Use It |
|---|---|---|
| PyTorch | Autograd, tensor ops, training loop | entrenar implements autograd, AdamW, checkpointing in Rust. No Python GIL, no dynamic graph overhead. |
| cuDNN | Optimized GPU kernels for conv, norm, attention | trueno provides hand-written PTX kernels (RMSNorm, SiLU, softmax, cross-entropy) and cuBLAS FFI for GEMM. Every kernel has a provable contract. |
| NCCL | Multi-GPU collective communication (all-reduce, broadcast, scatter) | Albor trains on a single RTX 4090. No multi-GPU communication needed. For future multi-GPU work, repartir would implement ring all-reduce directly. |
| torch.distributed | Distributed training orchestration (DDP, FSDP) | Single-GPU training. The model (~354M params, ~1.4 GB fp32) fits entirely in 24 GB VRAM with optimizer states. |
| Weights & Biases | Experiment tracking, dashboards | renacer provides structured tracing with BrickTracer spans. presentar provides TUI dashboards and WASM visualization. |
The GPU interface: The sovereign stack talks to NVIDIA hardware through two interfaces only:
- CUDA Driver API (`libcuda.so`) — Memory allocation, kernel launch, stream management, device queries. This is the lowest stable NVIDIA API. trueno binds it directly via Rust FFI — no CUDA Runtime API (`libcudart`) dependency.
- cuBLAS (`libcublas.so`) — Matrix multiplication (GEMM). The only NVIDIA library used for compute. trueno wraps it with a safe Rust API (`CublasHandle`, `CublasGemm`) that enforces correct argument order at the type level. cuBLAS replaced hand-written PTX GEMMs in ALB-075, improving throughput from 890 tok/s to 6,700 tok/s (7.5x).
What this means in practice: The entire training binary is a single
statically-linked Rust executable (~15 MB). It has no Python interpreter, no
pip packages, no conda environment, no Docker container, no version conflicts
between PyTorch and CUDA toolkit. cargo build --release produces a binary
that runs training. The only runtime dependencies are libcuda.so (NVIDIA
driver) and libcublas.so (ships with the driver).
2. Hardware Inventory
2.1 Machine: lambda (Threadripper)
| Property | Value |
|---|---|
| CPU | AMD Threadripper (high core count) |
| GPU | NVIDIA RTX 4090 (24 GB GDDR6X) |
| GPU Backend | CUDA 12.x |
| FP32 TFLOPS | 82.6 |
| FP16 TFLOPS | 165 (with tensor cores) |
| Role | Primary trainer, student model |
| Measured MFU | 21.9% (350M, seq=1024, cuBLAS SIMD, no tensor cores) |
| Measured tok/s | 7,579 (350M, seq=1024, batch=4) |
2.2 Machine: intel (Mac Pro 2019 chassis, Linux)
| Property | Value |
|---|---|
| CPU | Intel Xeon W-3245 @ 3.20 GHz (16C/32T) |
| RAM | ~300 GB |
| GPU | 2x AMD Radeon Pro W5700X (8 GB GDDR6 each) |
| GPU Backend | wgpu/Vulkan (ROCm unsupported for RDNA 1 / gfx1010) |
| FP32 TFLOPS | ~9 per card (~18 total) |
| Role | Teacher inference (Qwen3-Coder-Next in CPU RAM), data pipeline, eval |
2.3 Network
- SSH connectivity (`ssh intel`) with ControlMaster multiplexing (forjar FJ-252)
- LAN bandwidth assumed ≥1 Gbps
2.4 Key Insight: 300 GB RAM Enables CPU-Based Teacher Inference
The intel box’s 300 GB RAM fundamentally changes the distillation architecture. Qwen3-Coder-Next (80B params) fits entirely in CPU RAM:
| Model Format | Size in RAM | Fits in 300 GB? | Headroom |
|---|---|---|---|
| fp16 | ~160 GB | Yes | ~140 GB for KV cache + buffers |
| Q8 | ~80 GB | Easily | ~220 GB |
| Q4 | ~40 GB | Trivially | ~260 GB |
No quantization-induced quality loss needed. The teacher runs at full fp16 precision, producing the highest-quality soft targets for distillation.
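The table rows follow from bytes-per-parameter arithmetic. A quick check, assuming the 80B parameter count above:

```rust
// Back-of-envelope check of the RAM table: RAM = params x bytes/param.
fn model_ram_gb(params: u64, bytes_per_param: f64) -> f64 {
    params as f64 * bytes_per_param / 1e9
}

fn main() {
    let p: u64 = 80_000_000_000; // 80B-parameter teacher
    assert_eq!(model_ram_gb(p, 2.0), 160.0); // fp16: 2 bytes/param
    assert_eq!(model_ram_gb(p, 1.0), 80.0);  // Q8:   1 byte/param
    assert_eq!(model_ram_gb(p, 0.5), 40.0);  // Q4:   0.5 bytes/param
}
```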
3. Model Architecture
3.1 Architecture: LLaMA-Style Decoder-Only Transformer
entrenar’s transformer is a pre-norm LLaMA-style architecture with RMSNorm,
SwiGLU FFN, Grouped-Query Attention, and RoPE. This is hardcoded in the
Transformer struct — we configure it via YAML, we don’t build it from scratch.
| Hyperparameter | Value | Rationale |
|---|---|---|
| Parameters | ~350M | Fits in 4090 VRAM with optimizer state in fp16 |
| Layers | 24 | GPT-2 Medium proven at this depth |
| Hidden dim (d_model) | 1024 | Standard for this param count |
| Attention heads | 16 | d_head = 64, well-studied |
| KV heads | 4 | GQA with 4:1 ratio (memory efficient) |
| FFN dim (intermediate) | 4096 | ~4x hidden dim (SwiGLU gate + up + down) |
| Vocab size | 32,768 | BPE trained on corpus (power of 2 for GPU efficiency) |
| Context length | 2048 (spec) / 1024 (training) | 2048 OOMs at batch≥4 on 4090; training uses 1024 |
| Position encoding | RoPE | Built into entrenar’s MultiHeadAttention |
| Attention | GQA | Built into entrenar, fewer KV heads than Q heads |
| Normalization | RMSNorm | Built into entrenar, pre-norm (before attn + FFN) |
| FFN activation | SwiGLU | Built into entrenar (gate_proj, up_proj, down_proj) |
| Dropout | 0.0 | Modern practice for pre-training (regularize via data) |
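One payoff of the GQA row is KV-cache size. This sketch, using the hyperparameters from the table above, shows the 4x saving versus a hypothetical full-MHA variant (the byte counts are this sketch's arithmetic, not figures from the spec):

```rust
// fp16 KV-cache bytes: K and V planes, 2 bytes per element.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, seq: u64) -> u64 {
    2 * layers * kv_heads * head_dim * seq * 2
}

fn main() {
    let gqa = kv_cache_bytes(24, 4, 64, 2048);  // albor-350M: 4 KV heads
    let mha = kv_cache_bytes(24, 16, 64, 2048); // full MHA: 16 KV heads
    assert_eq!(mha / gqa, 4);            // GQA shrinks the cache 4x
    assert_eq!(gqa, 50_331_648);         // ~48 MB at seq=2048
}
```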
3.2 Progressive Model Sizing
To validate the pipeline quickly, we train progressively larger models. Each gets its own YAML config file (see §6.2 for full config format).
| Model | Config | Params | Layers | Hidden | Heads | Purpose |
|---|---|---|---|---|---|---|
| albor-50M | pretrain-50m.yaml | ~50M | 12 | 512 | 8 | Pipeline validation (hours) |
| albor-125M | pretrain-125m.yaml | ~125M | 16 | 768 | 12 | Intermediate, first benchmarks (1-2 days) |
| albor-350M | pretrain-350m.yaml | ~350M | 24 | 1024 | 16 | Final base model (3-7 days) |
The 50M model proves the entire stack works end-to-end before committing days of GPU time to the 350M run.
3.3 VRAM Budget (fp16 mixed precision, RTX 4090)
Speculative estimates (pre-dogfooding):
| Component | Size |
|---|---|
| Model weights (fp16) | ~700 MB |
| Adam optimizer states (fp32 m, v) | ~2.8 GB |
| Gradients (fp16) | ~700 MB |
| Activations (grad checkpoint, batch=8, seq=2048) | ~8-12 GB |
| Total estimated | ~13-16 GB |
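The static rows of the speculative table follow directly from the parameter count. A quick check using the 354,267,136 figure from the plan output in §1.5.2:

```rust
// Static VRAM rows as bytes-per-parameter arithmetic (decimal MB).
fn main() {
    let params: u64 = 354_267_136;
    let weights_fp16 = params * 2 / 1_000_000; // 2 bytes/param
    let adamw_fp32 = params * 8 / 1_000_000;   // fp32 m + v: 4 + 4 bytes/param
    let grads_fp16 = params * 2 / 1_000_000;

    assert_eq!(weights_fp16, 708); // table's "~700 MB"
    assert_eq!(adamw_fp32, 2834);  // table's "~2.8 GB"
    assert_eq!(grads_fp16, 708);
}
```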
Actual measurements (from ALB-040 dogfooding with CudaTransformerTrainer):
| Config | VRAM Used | Status |
|---|---|---|
| seq=512, batch=4 | ~18 GB | PASS |
| seq=1024, batch=4 | ~19.5 GB | PASS (production config) |
| seq=2048, batch=4 | OOM | FAIL — logits [4,2048,32768] = 1 GB exceeds budget |
| seq=2048, batch=8 | OOM | FAIL — OOM at block 21 upload |
The GPU-resident CudaTransformerTrainer keeps all 24 blocks in VRAM (weights +
AdamW states ≈ 5 GB) plus a shared workspace for activations (~10-12 GB). This
is tighter than the speculative estimate because the shared workspace includes
attention score matrices that scale as O(heads × seq² × batch). Batch size is
fixed at 4. Note: gradient_accumulation is set to 1 for the v2 config, though
per-block CPU gradient accumulation is now fully implemented via
PerBlockGradientAccumulator (D2H download, CPU averaging, H2D upload).
See §6.4 for detailed breakdown.
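The per-block CPU accumulation path can be sketched as follows. The struct name mirrors `PerBlockGradientAccumulator`, but the code is illustrative, not entrenar's implementation; the D2H/H2D copies are stood in for by plain slices:

```rust
// Accumulate gradients across micro-batches on the CPU, then emit the mean.
struct PerBlockAccumulator {
    sum: Vec<f32>,
    count: u32,
}

impl PerBlockAccumulator {
    fn new(len: usize) -> Self {
        Self { sum: vec![0.0; len], count: 0 }
    }

    // Add one micro-batch's gradients (stand-in for the D2H download).
    fn add(&mut self, grads: &[f32]) {
        for (s, g) in self.sum.iter_mut().zip(grads) {
            *s += g;
        }
        self.count += 1;
    }

    // Average over micro-batches and reset (stand-in for the H2D upload).
    fn take_mean(&mut self) -> Vec<f32> {
        let n = self.count.max(1) as f32;
        let mean: Vec<f32> = self.sum.iter().map(|s| s / n).collect();
        self.sum.iter_mut().for_each(|s| *s = 0.0);
        self.count = 0;
        mean
    }
}

fn main() {
    let mut acc = PerBlockAccumulator::new(2);
    acc.add(&[1.0, 2.0]);
    acc.add(&[3.0, 4.0]);
    assert_eq!(acc.take_mean(), vec![2.0, 3.0]);
}
```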
4. Distillation Teacher: Qwen3.5-35B-A3B
4.1 Teacher Model Profile
| Property | Value |
|---|---|
| Model | Qwen3.5-35B-A3B |
| Parameters | 35B total, 3B active per token (MoE) |
| Architecture | Hybrid: 30 Gated DeltaNet + 10 full GQA layers, MoE FFN (256 experts, top-8 + 1 shared) |
| Hidden dim | 2048, head_dim=256, 16 Q heads, 2 KV heads |
| Layers | 40 (pattern: 3 linear + 1 full attention, repeating) |
| Expert FFN | SwiGLU, intermediate_size=512 per expert |
| Context | 262K tokens (extensible to ~1M via YaRN) |
| License | Apache 2.0 |
| Specialization | Code generation, agentic reasoning |
4.2 Why This Teacher
- Apache 2.0: Legally clean for distillation, no license contamination
- 35B knowledge at 3B cost: MoE activates only 8+1 experts per token. Inference FLOP budget matches a dense 1.8B model, but the 256 experts collectively encode 35B parameters of knowledge. Soft targets are far richer than a dense 3B teacher.
- Fits on a single 4090: At Q4 quantization, weights occupy ~17.5 GB. With activations and KV cache (only 10 full-attention layers need KV cache), total VRAM is ~18.3 GB — leaving 5.7 GB headroom on 24 GB.
- Coding focus: Distilled student inherits strong code capabilities, making it competitive on HumanEval/MBPP — benchmarks where tiny models normally fail.
- realizar already supports most of the architecture: Gated DeltaNet linear attention (GH-278), SwiGLU FFN, GQA, hybrid `layer_types` config, and MoE routing (`CapacityFactorRouter`, `PowerOfTwoChoicesRouter`) all exist. The missing pieces are expert weight loading and dispatch integration.
- Novel architecture (DeltaNet + MoE): Exercising realizar’s model loading on a non-standard architecture is exactly the kind of gap-finding that validates the stack.
4.2.1 VRAM Budget (Q4, batch=1, seq=2048)
| Component | Size | Notes |
|---|---|---|
| Weights (Q4) | 17.5 GB | 35B params × 0.5 bytes/param |
| KV cache (10 layers) | 0.08 GB | Only full-attention layers (every 4th) |
| Activations (40 layers) | 0.67 GB | hidden=2048, single-token inference |
| Router logits | 0.08 GB | 2048 × 256 experts × f32 |
| Total | 18.3 GB | 5.7 GB headroom on RTX 4090 |
4.2.2 Realizar MoE Readiness Assessment
| Component | Status | Location |
|---|---|---|
| MoE routing (2 strategies) | Exists | src/moe/mod.rs |
| Gated DeltaNet linear attention | Exists (GH-278) | src/gpu/scheduler/types.rs |
| SwiGLU FFN | Exists | src/gpu/scheduler/forward_block.rs |
| GQA attention | Exists | src/gpu/scheduler/forward_block.rs |
| Hybrid layer_types config | Exists | types.rs is_linear_layer() |
| Safetensors loading | Exists | src/safetensors/ |
| Expert weight struct | Missing | Add MoeExpertWeights to BlockWeights |
| Router gate loading | Missing | Load mlp.gate.weight [256, 2048] |
| Expert dispatch | Missing | softmax → top-8 → SwiGLU × 8 → weighted sum |
| Shared expert | Missing | Always-on SwiGLU, separate gate/up/down |
| Fused gate_up_proj | Missing | Unfuse [256, 1024, 2048] tensor |
Estimated new code: ~300-400 lines in realizar for full MoE inference.
4.3 Distillation Architecture
Primary path: GPU-resident teacher inference on lambda (RTX 4090). The 35B model at Q4 fits in 18.3 GB VRAM — teacher inference and logit caching run on the same machine as student training.
┌─────────────────────────────────────────────────────────────────────────┐
│ lambda (RTX 4090, 24 GB) │
│ │
│ Phase 1: Pre-compute teacher logits (GPU, ~18.3 GB) │
│ ┌──────────────────────────┐ Parquet shards ┌──────────────┐ │
│ │ Qwen3.5-35B-A3B (Q4) │ ──────────────────────► │ teacher_logits│ │
│ │ realizar MoE inference │ top-k=128 logits │ ~50-100 GB │ │
│ │ 18.3 GB VRAM │ └──────────────┘ │
│ └──────────────────────────┘ │
│ │
│ Phase 2: Train student (GPU, ~5 GB) │
│ ┌──────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ Student: albor-350M │ ◄── │ Pre-computed logits + train data │ │
│ │ KD loss + CE loss │ │ (loaded from disk at GPU speed) │ │
│ │ entrenar distill │ └─────────────────────────────────┘ │
│ └──────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Fallback path: If GPU VRAM is tight (teacher + student simultaneously), pre-compute logits on CPU. Intel box (300 GB RAM) can run the 35B model at Q4 (~18 GB RAM) or Q8 (~35 GB) with ~5-15 tok/s throughput.
4.4 Pre-Computed Logits Strategy
Teacher and student do NOT run simultaneously. We pre-compute teacher logits offline, then train the student from cached logits at full GPU speed:
- Lambda runs Qwen3.5-35B-A3B inference (Q4, GPU) on all training data
- Teacher top-k logits (k=128) saved as sharded Parquet via `alimentar`
- Student training loads pre-computed logits from disk — no teacher in VRAM
- Sequential phases = no VRAM contention
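Top-k logit caching itself is simple: keep only the k largest teacher logits per position as (index, value) pairs. A minimal sketch, illustrative rather than the actual alimentar/realizar code path:

```rust
// Keep the k largest logits as (token_index, logit) pairs, descending.
fn topk_logits(logits: &[f32], k: usize) -> Vec<(u32, f32)> {
    let mut pairs: Vec<(u32, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &v)| (i as u32, v))
        .collect();
    pairs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    pairs.truncate(k);
    pairs
}

fn main() {
    // A 32,768-entry vocab compressed to k=128 pairs is 256x fewer entries.
    let logits: Vec<f32> = (0..32_768).map(|i| (i % 997) as f32).collect();
    let cached = topk_logits(&logits, 128);
    assert_eq!(cached.len(), 128);
    assert!(cached[0].1 >= cached[127].1); // descending order
}
```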
# Step 0: Plan — check teacher fits, estimate logit disk usage
apr distill plan configs/train/distill.yaml
# Step 1: Pre-compute teacher logits on lambda GPU (Q4, ~18.3 GB)
apr distill apply configs/train/distill.yaml --stage precompute
# Step 2: Train student on lambda using pre-computed logits (~5 GB)
apr distill apply configs/train/distill.yaml --stage train --seed 42
Estimated teacher throughput (Qwen3.5-35B-A3B):
| Device | Quantization | VRAM/RAM | Throughput | 500M tokens |
|---|---|---|---|---|
| RTX 4090 (GPU) | Q4 | 18.3 GB | ~50-100 tok/s | ~1.5-3 days |
| Xeon 48T (CPU) | Q4 | ~18 GB | ~5-15 tok/s | ~10-30 days |
| Xeon 48T (CPU) | Q8 | ~35 GB | ~3-8 tok/s | ~18-48 days |
4.5 Distillation Data Budget
| Approach | Teacher Tokens | Time (est.) | Quality |
|---|---|---|---|
| Full corpus (10B tokens) | 10B | ~30-60 days | Best |
| Representative subset (2B) | 2B | ~6-12 days | Good — focus on diverse/hard examples |
| Curated hard examples (500M) | 500M | ~2-3 days | Targeted — highest knowledge density |
Recommended: Start with the local ground truth corpora (~50-100M raw tokens) plus curated hard examples from StarCoder Python (~400M tokens) for ~500M total. The ground truth corpora should be distilled first — they are our highest quality data and benefit most from teacher knowledge. Scale to 2B with broader StarCoder data if benchmarks justify the compute. Python-only focus means all teacher compute goes toward the language we care about.
4.6 Fallback Teacher: Qwen2.5-Coder-3B
If ALB-010 (MoE inference in realizar) proves harder than estimated, we fall back to Qwen2.5-Coder-3B as a dense teacher:
| Property | Value |
|---|---|
| Model | Qwen2.5-Coder-3B |
| Parameters | 3B (dense) |
| Architecture | Qwen2 (standard transformer — already supported by realizar) |
| Compression ratio | 8.6x (3B → 350M) — within recommended 5-20x range |
| CPU inference | ~12 GB RAM, ~2 tok/s on 48 cores |
| License | Apache 2.0 |
Why this is the fallback, not the primary:
- Dense 3B has ~10x less knowledge capacity than 35B MoE
- Weaker code capabilities → lower distillation quality ceiling
- Soft targets less informative for the student
Why it’s still viable:
- Already supported by realizar’s Qwen2 architecture loader (no MoE/DeltaNet)
- `apr distill --stage precompute` verified working with 3B teacher (2026-03-03)
- CPU precompute feasible on lambda box (~12 GB RAM)
- 8.6x compression ratio is in the sweet spot for KD
Config: configs/train/distill-qwen3b.yaml — teacher: Qwen2.5-Coder-3B,
student: albor-base-350m, temperature=4.0, alpha=0.5, LoRA rank 16.
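The temperature and alpha knobs enter the loss as follows. This is a minimal sketch of the standard KD objective (Hinton-style soft targets); function names are illustrative, not entrenar's API:

```rust
// loss = alpha * T^2 * KL(softmax(t/T) || softmax(s/T)) + (1 - alpha) * CE
fn softmax(logits: &[f32], t: f32) -> Vec<f32> {
    // Max-subtracted for numerical stability; t is the temperature.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exp: Vec<f32> = logits.iter().map(|&x| ((x - max) / t).exp()).collect();
    let sum: f32 = exp.iter().sum();
    exp.iter().map(|&e| e / sum).collect()
}

fn kd_loss(student: &[f32], teacher: &[f32], target: usize, t: f32, alpha: f32) -> f32 {
    let p_t = softmax(teacher, t); // softened teacher distribution
    let p_s = softmax(student, t);
    // KL(teacher || student) over the softened distributions.
    let kl: f32 = p_t
        .iter()
        .zip(&p_s)
        .map(|(&pt, &ps)| if pt > 0.0 { pt * (pt / ps).ln() } else { 0.0 })
        .sum();
    // Hard-label cross-entropy at temperature 1.
    let ce = -softmax(student, 1.0)[target].ln();
    alpha * t * t * kl + (1.0 - alpha) * ce
}

fn main() {
    let teacher = vec![4.0_f32, 1.0, 0.5, -2.0];
    let student = vec![2.0_f32, 1.5, 0.0, -1.0];
    let loss = kd_loss(&student, &teacher, 0, 4.0, 0.5);
    assert!(loss.is_finite() && loss > 0.0);
}
```

The `T^2` factor keeps the KD gradient magnitude comparable to the CE term as the temperature rises; with alpha=0.5 the two terms contribute equally.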
4.7 ALB-010 Implementation Status: MoE Inference in Realizar
Status: MERGED — Steps 1-5b merged to main (PR #133, squash-merged).
Step 1: Expert weight types + loading — DONE
- `MoeExpertWeights` struct in `gpu/scheduler/types.rs` (45 files updated)
- Fields: `gate_weight`, `expert_gate_up`, `expert_down`, `shared_{gate,up,down}`
- `GpuModelConfig` extended with `num_experts`, `num_experts_per_tok`, `expert_intermediate_size`
Step 2: Router forward — DONE (moe_dispatch.rs)
- `moe_route()`: softmax (max-subtracted) → top-k selection → renormalize
- 3 contract-derived tests pass: stability, uniform routing, order preservation
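The routing steps can be sketched on CPU in a few lines; this is illustrative, not realizar's `moe_route`:

```rust
// softmax (max-subtracted) -> top-k -> renormalize the selected weights.
fn moe_route(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stable softmax over all experts.
    let max = router_logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exp: Vec<f32> = router_logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exp.iter().sum();
    let probs: Vec<f32> = exp.iter().map(|&e| e / sum).collect();

    // Top-k expert indices by probability, descending.
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    let topk = &idx[..k];

    // Renormalize so the selected gate weights sum to 1.
    let topk_sum: f32 = topk.iter().map(|&i| probs[i]).sum();
    topk.iter().map(|&i| (i, probs[i] / topk_sum)).collect()
}

fn main() {
    let logits = vec![2.0, 0.5, 1.0, 3.0, -1.0, 0.0, 1.5, 2.5];
    let routed = moe_route(&logits, 3);
    assert_eq!(routed[0].0, 3); // highest-logit expert selected first
    let total: f32 = routed.iter().map(|(_, w)| w).sum();
    assert!((total - 1.0).abs() < 1e-6); // renormalized weights sum to 1
}
```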
Step 3: Expert dispatch — DONE (moe_dispatch.rs)
- `expert_swiglu()`: per-expert `down(SiLU(gate(x)) * up(x))`
- `moe_forward_token()`: routes to k experts + shared expert, weighted sum
- 2 contract-derived tests pass: shared expert always active, uniform routing averages
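A toy version of the dispatch math, with illustrative shapes and names rather than realizar's code:

```rust
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

// Naive matrix-vector product: w is [rows][cols], x is [cols].
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter().map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum()).collect()
}

struct Expert {
    gate: Vec<Vec<f32>>, // [ffn, hidden]
    up: Vec<Vec<f32>>,   // [ffn, hidden]
    down: Vec<Vec<f32>>, // [hidden, ffn]
}

// Per-expert SwiGLU: down(SiLU(gate(x)) * up(x)).
fn expert_swiglu(e: &Expert, x: &[f32]) -> Vec<f32> {
    let g: Vec<f32> = matvec(&e.gate, x).into_iter().map(silu).collect();
    let u = matvec(&e.up, x);
    let h: Vec<f32> = g.iter().zip(&u).map(|(a, b)| a * b).collect();
    matvec(&e.down, &h)
}

// Gate-weighted sum over routed experts plus the always-on shared expert.
fn moe_forward_token(
    x: &[f32],
    routed: &[(usize, f32)],
    experts: &[Expert],
    shared: &Expert,
) -> Vec<f32> {
    let mut out = expert_swiglu(shared, x); // shared expert always active
    for &(idx, w) in routed {
        for (o, v) in out.iter_mut().zip(expert_swiglu(&experts[idx], x)) {
            *o += w * v;
        }
    }
    out
}
```

Real experts are GEMMs over batched tokens; the per-token structure (route, run k experts, weighted sum, add shared expert) is the same.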
Step 4: Integration into forward pass — DONE
- All 5 forward block variants integrated: `forward_block_refcell`, `forward_block_single`, `forward_block_incremental`, `forward_block_incremental_optimized`, `forward_block_idx`
- MoE path activates when `block.moe_experts.is_some()`
- Multi-token `forward_block_idx` loops per token (MoE routes independently per token)
- 15,053 total tests pass (0 failures)
Remaining: Safetensors weight loading
- Map HuggingFace tensor names (`model.layers.{N}.mlp.experts.*`) to `MoeExpertWeights`
- Fuse individual expert gate/up projections into the `expert_gate_up` tensor
- Blocked on: model download (Qwen3.5-35B-A3B, ~70 GB)
4.8 Provable Contracts for MoE Inference
Two design-by-contract YAMLs were written and validated (pv validate PASS) before
implementation began, per engineering discipline Rule #6:
contracts/moe-router-v1.yaml (Router forward):
- 4 equations: router_logits, softmax_normalization, topk_selection, weight_renormalization
- 6 invariants: softmax_valid, topk_ordered, renorm_sum_one, expert_count, index_bounds, deterministic
- 5 falsification tests: softmax stability with large logits, top-8 correctness, renorm ordering, zero gate weight, shape mismatch rejection
- 1 Kani harness (stub_float strategy for symbolic f32)
contracts/moe-expert-dispatch-v1.yaml (Expert dispatch):
- 5 equations: expert_swiglu, routed_output, shared_expert, moe_output, fused_gate_up_unfuse
- 6 invariants: expert_output_shape, weighted_sum_preserves_shape, shared_expert_always_active, expert_independence, unfuse_covers_all, numerical_stability
- 7 falsification tests: single-expert routing, uniform routing, unfuse round-trip, shared expert unconditional, bounds check, finite outputs, dense FFN equivalence
- 2 Kani harnesses (bounded_int strategy)
Performance characteristics (from docs/specifications/training-performance.md §6.19):
- 28 GEMMs per token per MoE layer (vs 3 for dense FFN)
- Expert GEMMs are tiny ([2048, 512]) — memory-bandwidth bound at batch=1
- Router overhead negligible vs expert computation
- Estimated teacher throughput: 50-100 tok/s on RTX 4090 at Q4
4.9 Qwen3.5-35B-A3B Tensor Name Mapping
Architecture class: Qwen3_5MoeForConditionalGeneration (model_type: qwen3_5_moe).
All layer tensors use model.language_model.layers.{L} prefix (multimodal wrapper).
MoE Expert Tensors (packed per-layer, not per-expert):
| Tensor Name | Shape | Description |
|---|---|---|
| ...layers.{L}.mlp.gate.weight | [256, 2048] | Router: nn.Parameter (not nn.Linear) |
| ...layers.{L}.mlp.experts.gate_up_proj | [256, 1024, 2048] | All 256 experts' fused gate+up |
| ...layers.{L}.mlp.experts.down_proj | [256, 2048, 512] | All 256 experts' down projection |
| ...layers.{L}.mlp.shared_expert.gate_proj.weight | [512, 2048] | Shared expert gate (SwiGLU) |
| ...layers.{L}.mlp.shared_expert.up_proj.weight | [512, 2048] | Shared expert up |
| ...layers.{L}.mlp.shared_expert.down_proj.weight | [2048, 512] | Shared expert down |
| ...layers.{L}.mlp.shared_expert_gate.weight | [1, 2048] | Sigmoid gate scaling shared expert |
Key architectural detail: The shared expert output is scaled by
sigmoid(shared_expert_gate(x)) before adding to the routed expert sum.
This was discovered from the HuggingFace source (Qwen3_5MoeSparseMoeBlock)
and added to MoeExpertWeights.shared_expert_gate_weight in realizar.
Expert weights are packed: Unlike per-expert indexing (experts.{E}.gate_proj),
the main model stores all 256 experts in bulk tensors (experts.gate_up_proj).
The MTP (multi-token prediction) head uses per-expert indexing. Realizar handles
the packed format directly in MoeExpertWeights.expert_gate_up.
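To make the combination step concrete, here is a scalar sketch of how the weighted routed-expert sum and the sigmoid-gated shared expert combine. All weights here are illustrative scalars, not the packed [256, 1024, 2048] tensors:

```python
import math

def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def expert_swiglu(x, gate_w, up_w, down_w):
    """One expert's SwiGLU: down(SiLU(gate(x)) * up(x)), scalar stand-in
    for the [2048 -> 512 -> 2048] expert projections."""
    return down_w * (silu(gate_w * x) * (up_w * x))

def moe_output(x, routed, shared, shared_gate_w):
    """Weighted routed-expert sum plus sigmoid-gated shared expert.

    Mirrors the combination step in Qwen3_5MoeSparseMoeBlock: the shared
    expert output is scaled by sigmoid(shared_expert_gate(x)) before the add.
    routed is a list of (routing_weight, (gate_w, up_w, down_w)) pairs.
    """
    routed_sum = sum(w * expert_swiglu(x, g, u, d) for w, (g, u, d) in routed)
    shared_out = expert_swiglu(x, *shared)
    return routed_sum + sigmoid(shared_gate_w * x) * shared_out
```

The sigmoid gate lets the model attenuate the always-active shared expert per token, which is the detail that had to be recovered from the HuggingFace source.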
5. Training Data
5.1 Data Philosophy
- All datasets either locally owned (MIT/Apache 2.0) or publicly available with permissive licenses
- Local-first: Sovereign ground truth corpora are our highest-quality data — curated, tested, type-annotated, and owned. They are upsampled to punch above their token weight.
- Exact download URLs, versions, and SHA-256 hashes recorded for all external data
- Preprocessing pipeline is deterministic (fixed seed, recorded transforms)
- Quality validated by `alimentar quality check`
5.2 Data Mix (Target: ~10B tokens)
Current status (2026-03-05): v3 dataset in preparation — 2M Python files from codeparrot-clean (~4.4B tokens raw, ~5.3B pretokenized at seq_len=1024). v2 dataset had only 139M tokens (67,977 sequences × 2048), roughly 2% of the ~7B-token Chinchilla minimum for 350M params. v3 provides sufficient data for 1B+ token training runs. See §5.4.2 for the v3 pipeline.
Following the phi-1 playbook: maximum concentration on Python. phi-1 proved that a small model (1.3B) with focused data and distillation can hit 50% HumanEval — outperforming models 10x its size trained on diluted multi-language corpora.
Key insight from phi-1: Data quality matters more than quantity at small param counts. A 350M model trained on 1B tokens of textbook-quality code can outperform a 350M model trained on 100B tokens of raw GitHub scrapes. We have ~71K curated Python files locally — this is our unfair advantage.
| Source | Tokens (est.) | Weight | License | Rationale |
|---|---|---|---|---|
| StarCoder Python subset (HF) | ~4B | 40% | Apache 2.0 | Bulk Python code diversity; aligns with Qwen3-Coder teacher |
| Local ground truth corpora (upsampled 10x) | ~50-100M raw → ~500M-1B effective | 10% | MIT | Highest-quality anchor — see §5.2.1 |
| Local ML framework code | ~200-400M | 10% | MIT / Apache 2.0 | ML/AI Python patterns — see §5.2.2 |
| FineWeb-Edu (subset) | ~2B | 20% | ODC-BY | Educational web text for docstring understanding |
| Python textbooks + tutorials (HF) | ~1B | 10% | Apache 2.0 / CC | “Textbooks Are All You Need” — public educational code |
| Python docs + PEPs + Stack Overflow | ~1B | 10% | CC BY-SA | API knowledge, idiomatic patterns |
Total: ~10B tokens. Chinchilla-optimal for 350M params is ~7B; we slightly overtrain for benchmark performance (common practice in SmolLM, Phi-1.5).
Python concentration: 80% of training data is Python or Python-adjacent (code, textbooks, docs). The remaining 20% (FineWeb-Edu) provides general language understanding for docstrings, comments, and natural language prompts.
5.2.1 Local Ground Truth Corpora (Tier 1 — Upsampled)
These are our “textbook-quality” data — the phi-1 equivalent. Every file has been curated, tested to 98%+ coverage, and validated by CI. They are upsampled 10x during training because their per-token teaching signal is 10-100x higher than raw GitHub code.
| Corpus | Path | Files | Lines (est.) | Quality Signal |
|---|---|---|---|---|
| depyler examples + tdd-book | ../depyler/examples/, ../depyler/tdd-book/ | 1,845 | ~219K | Type-annotated, transpiler-validated, 27 stdlib modules, property-tested |
| hf-ground-truth-corpus | ../hf-ground-truth-corpus/ | 11,928 | ~500K+ | 98%+ test coverage, zero lint violations, production HF recipes |
| jax-ground-truth-corpus | ../jax-ground-truth-corpus/ | 2,697 | ~200K+ | 100% test coverage, full type checking, numerical computing |
| vllm-ground-truth-corpus | ../vllm-ground-truth-corpus/ | 1,118 | ~100K+ | Production inference optimization code |
| Total | | 17,588 | ~1M+ | All MIT licensed, all CI-validated |
Why upsampling works: phi-1’s “textbook” data was <10% of total tokens but had outsized impact on HumanEval. Our ground truth corpora share the same properties: clean types, complete docstrings, tested correctness, educational structure. The model sees these examples multiple times, reinforcing correct patterns over noisy GitHub code.
depyler corpus is uniquely valuable: Every Python function in the depyler corpus was validated by a transpiler — it has clear types, clean control flow, and provably correct semantics. The tdd-book covers 27 stdlib modules (json, datetime, collections, itertools, os, pathlib, re, etc.) with property-based tests. This teaches the model Python’s standard library idioms at a depth no scraped dataset matches.
5.2.2 Local ML Framework Code (Tier 2)
Large, high-quality Python codebases from our local repos. Not upsampled — used at natural frequency for pattern diversity.
| Corpus | Path | Files | Notes |
|---|---|---|---|
| huggingface-fine-tuning | ../huggingface-fine-tuning/ | 12,274 | Fine-tuning recipes and examples |
| llms-with-huggingface | ../llms-with-huggingface/ | 13,869 | LLM integration patterns |
| HF-Hub-Ecosystem | ../HF-Hub-Ecosystem/ | 16,978 | Comprehensive HF Hub code |
| pytorch | ../pytorch/ | 4,217 | ML framework fundamentals |
| vllm | ../vllm/ | 2,400 | Inference serving |
| databricks-data-engineering | ../databricks-data-engineering/ | 3,038 | Data engineering patterns |
| algorithm-competition-corpus | ../algorithm-competition-corpus/ | 201 | Algorithms + data structures |
| coursera-stats | ../coursera-stats/ | 430 | Statistical modeling |
| cuda-python | ../cuda-python/ | 161 | GPU computing |
| Total | | 53,568 | All MIT / Apache 2.0 |
5.2.3 Pre-Built Local Datasets
| File | Path | Format | Size |
|---|---|---|---|
| hf_gtc_corpus.parquet | ../hf-ground-truth-corpus/hf_gtc_corpus.parquet | Parquet | 2 MB |
| corpus_manifest_v1.json | ../depyler/corpus_manifest_v1.json | JSON | Tier metadata |
| corpus_tiers.json | ../depyler/corpus_tiers.json | JSON | Complexity metrics |
5.2.4 Data Sourcing Summary
Local owned data (~71K files, ~1-2M lines):
├── Tier 1: Ground truth corpora (17,588 files) → upsampled 10x
├── Tier 2: ML framework code (53,568 files) → natural frequency
└── Pre-built: Parquet + JSON manifests
External data (HuggingFace, ~8B tokens):
├── StarCoder Python subset (~4B tokens) → bulk diversity
├── FineWeb-Edu (~2B tokens) → general language
├── Python textbooks/tutorials (~1B tokens) → educational code
└── Python docs + PEPs + SO (~1B tokens) → API knowledge
Sovereign data advantage: 20% of training tokens come from data we own, curate, and can improve. Unlike scraped web data, we know the provenance, license, and quality of every file. If benchmarks reveal weaknesses in specific Python patterns, we can add targeted examples to our ground truth corpora and retrain — a feedback loop no public-dataset-only approach can match.
5.3 Fill-in-the-Middle (FIM) Training
Code completion requires fill-in-the-middle capability, not just left-to-right generation. During training, a fraction of code sequences are transformed using the PSM (Prefix-Suffix-Middle) format:
<fim_prefix>def fibonacci(n):<fim_suffix> return fib_sequence<fim_middle>
fib_sequence = [0, 1]
for i in range(2, n):
fib_sequence.append(fib_sequence[-1] + fib_sequence[-2])
| Parameter | Value | Rationale |
|---|---|---|
| FIM rate | 50% of code sequences | SantaCoder/StarCoder standard |
| FIM format | PSM (Prefix-Suffix-Middle) | Most common, best tooling support |
| Special tokens | <fim_prefix>, <fim_suffix>, <fim_middle> | Added to BPE vocabulary |
| Context split | Random split point per sequence | Uniform distribution over valid positions |
Gap ALB-018: FIXED — alimentar fim supports PSM/SPM transforms.
Verified: alimentar fim mixed.parquet -o out.parquet --rate 0.5 --format psm --seed 42
produces correct FIM-encoded sequences. Used in v2 data pipeline.
This is critical — without FIM, the model is a text generator, not a code completion engine.
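The PSM transform can be sketched as follows. This is an illustrative approximation of what `alimentar fim --format psm` does; split points here are uniform over character positions for simplicity, while the real tool operates on token sequences:

```python
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_psm(code: str, rng: random.Random) -> str:
    """Reorder a document into PSM (Prefix-Suffix-Middle) format.

    Two random split points carve the document into prefix/middle/suffix;
    the model learns to generate the middle given prefix and suffix.
    """
    if len(code) < 2:
        return code
    a, b = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

def apply_fim(docs, rate=0.5, seed=42):
    """Apply PSM to ~rate of the documents, deterministically by seed."""
    rng = random.Random(seed)
    return [to_psm(d, rng) if rng.random() < rate else d for d in docs]
```

Because prefix + middle + suffix partition the original document, the transform is lossless: decoding the three spans and reassembling them in document order recovers the input exactly.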
5.4 Data Pipeline
# ── Step 1: Ingest local ground truth corpora (Tier 1 — highest quality) ──
alimentar import local ../depyler/examples/ ../depyler/tdd-book/tests/ \
--lang python --output ./data/local/depyler.parquet
alimentar import local ../hf-ground-truth-corpus/ \
--lang python --output ./data/local/hf-gtc.parquet
alimentar import local ../jax-ground-truth-corpus/ \
--lang python --output ./data/local/jax-gtc.parquet
alimentar import local ../vllm-ground-truth-corpus/ \
--lang python --output ./data/local/vllm-gtc.parquet
# ── Step 2: Ingest local ML framework code (Tier 2) ──
alimentar import local \
../huggingface-fine-tuning/ ../llms-with-huggingface/ ../HF-Hub-Ecosystem/ \
../pytorch/ ../vllm/ ../databricks-data-engineering/ \
../algorithm-competition-corpus/ ../coursera-stats/ ../cuda-python/ \
--lang python --output ./data/local/ml-frameworks.parquet
# ── Step 3: Download external data (on intel — 300GB RAM) ──
alimentar import hf bigcode/starcoderdata --lang python --output ./data/starcoder-python/
alimentar import hf HuggingFaceFW/fineweb-edu --output ./data/fineweb-edu/
# ── Step 4: Quality validation ──
alimentar quality check ./data/local/ --profile ml-training
alimentar quality check ./data/starcoder-python/ --profile ml-training
alimentar quality check ./data/fineweb-edu/ --profile ml-training
# ── Step 5: Filter, dedup, shard ──
alimentar filter ./data/starcoder-python/ --lang python --min-tokens 32 --max-tokens 8192 \
--dedup --output ./data/processed/starcoder-python.parquet
alimentar convert ./data/fineweb-edu/ ./data/processed/fineweb-edu.parquet
# ── Step 6: Build training mix with upsampling ──
alimentar mix \
--input ./data/processed/starcoder-python.parquet --weight 0.40 \
--input ./data/local/depyler.parquet --weight 0.025 --upsample 10 \
--input ./data/local/hf-gtc.parquet --weight 0.025 --upsample 10 \
--input ./data/local/jax-gtc.parquet --weight 0.025 --upsample 10 \
--input ./data/local/vllm-gtc.parquet --weight 0.025 --upsample 10 \
--input ./data/local/ml-frameworks.parquet --weight 0.10 \
--input ./data/processed/fineweb-edu.parquet --weight 0.20 \
--input ./data/processed/textbooks.parquet --weight 0.10 \
--input ./data/processed/python-docs.parquet --weight 0.10 \
--output ./data/mixed/ \
--seed 42 --shuffle
# ── Step 7: Record provenance ──
alimentar provenance ./data/mixed/ --output ./data/provenance.json
Gap ALB-019: FIXED — alimentar import local expects data files
(CSV/JSON/Parquet), not source code directories. Workaround:
scripts/source-to-parquet.py converts Python source repos to Parquet with the
Tier 1 schema (file, source, text columns). Used for all Tier 2 imports.
Gap ALB-020: FIXED — alimentar mix supports weighted proportional
sampling. Syntax: alimentar mix file1.parquet:10.0 file2.parquet:1.0 -o out.parquet.
5.4.1 Actual Pipeline (v2 Dataset — 2026-03-03)
The pipeline below produced the v2 dataset (139M tokens, 67,977 sequences):
# ── Step 1: Convert Tier 2 repos to Parquet (alimentar can't read source dirs) ──
for repo in pytorch hf-repos mlflow vllm-full tgi algo-corpus cuda-python llms-with-hf; do
python3 scripts/source-to-parquet.py ~/src/$repo $repo data/parquet/tier2/$repo.parquet
done
# Result: 28,553 Python files across 8 repos
# ── Step 2: Mix Tier 1 (10x) + Tier 2 (1x) ──
alimentar mix \
data/parquet/depyler/shard_0000.parquet:10.0 \
data/parquet/hf-ground-truth/shard_0000.parquet:10.0 \
data/parquet/jax/shard_0000.parquet:10.0 \
data/parquet/vllm/shard_0000.parquet:10.0 \
data/parquet/tier2/pytorch.parquet:1.0 \
data/parquet/tier2/hf-repos.parquet:1.0 \
data/parquet/tier2/mlflow.parquet:1.0 \
data/parquet/tier2/vllm-full.parquet:1.0 \
data/parquet/tier2/tgi.parquet:1.0 \
data/parquet/tier2/algo-corpus.parquet:1.0 \
data/parquet/tier2/cuda-python.parquet:1.0 \
data/parquet/tier2/llms-with-hf.parquet:1.0 \
-o data/staging/mixed-expanded.parquet --seed 42
# Result: 45,420 mixed rows
# ── Step 3: Apply FIM (50% PSM) ──
alimentar fim data/staging/mixed-expanded.parquet \
-o data/staging/mixed-expanded-fim.parquet --rate 0.5 --format psm --seed 42
# Result: 45,420 rows with ~50% FIM-encoded
# ── Step 4: Pretokenize into 2048-length sequences ──
python3 scripts/pretokenize.py \
--input data/staging/mixed-expanded-fim.parquet \
--tokenizer models/albor-tokenizer-v2/tokenizer.json \
--seq-len 2048 \
--output data/pretokenized-2048-v2/train/train.parquet
# Result: 67,977 sequences × 2048 = 139,216,896 tokens (191 MiB)
# Validation set: reuse v1
cp data/pretokenized-2048/val/val.parquet data/pretokenized-2048-v2/val/val.parquet
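Step 4's pretokenization packs documents into exact-length rows. A minimal sketch of the packing logic, assuming EOS-separated concatenation (scripts/pretokenize.py's actual document handling may differ):

```python
def pack_sequences(token_stream, seq_len=2048, eos_id=0):
    """Pack tokenized documents into fixed-length training sequences.

    Documents are concatenated with an EOS separator and chunked into
    exact seq_len rows; a trailing remainder that cannot fill a full
    row is dropped. Sketch only; the real script may pad or carry over.
    """
    flat = []
    for doc in token_stream:
        flat.extend(doc)
        flat.append(eos_id)          # document boundary marker
    n_full = len(flat) // seq_len
    return [flat[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```

Fixed-length rows are what make the `67,977 sequences × 2048` arithmetic exact: every Parquet row is one ready-to-batch training example with no runtime tokenization.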
5.4.2 v3 Dataset Pipeline — codeparrot-clean (2026-03-05)
The v3 dataset scales from 139M to ~5.3B tokens using codeparrot/codeparrot-clean (5M Python files on HuggingFace, no gating). Quality filtered and pretokenized at seq_len=1024 for the 350M model’s max_position_embeddings.
# Step 1: Stream and filter from HuggingFace (2M files, ~8 min)
python3 scripts/download-codeparrot.py \
--output /mnt/nvme-raid0/albor-data/codeparrot-clean/ \
--max-rows 2000000
# Filters: skip autogenerated, alpha_frac < 0.25, files > 100KB, < 50 chars
# Result: 2,000,000 files in 20 shards (6.1 GB), ~4.4B raw tokens est.
# Step 2: Pretokenize at seq_len=1024 (streaming shard-by-shard)
python3 scripts/pretokenize.py \
--input /mnt/nvme-raid0/albor-data/codeparrot-clean/ \
--tokenizer models/albor-tokenizer-v2/tokenizer.json \
--seq-len 1024 \
--output data/pretokenized-1024-v3/train/ \
--text-column text --shard-output
# Result: ~5.2M sequences × 1024 = ~5.3B tokens in 20 output shards
# Validation set: reuse v1 (814 sequences)
5.5 Tokenizer
Existing capability: aprender::text::tokenize::BpeTokenizer with full
train() / encode() / decode() support. entrenar::tokenizer::BPETokenizer
provides the training-pipeline integration.
# Plan: validate inputs, estimate vocab training time
apr tokenize plan \
--input ./data/processed/*.parquet \
--vocab-size 32768 \
--algorithm bpe \
--output ./models/albor-tokenizer/
# Apply: train the tokenizer
apr tokenize apply \
--input ./data/processed/*.parquet \
--vocab-size 32768 \
--algorithm bpe \
--output ./models/albor-tokenizer/ \
--seed 42
Gap ALB-001: Verify apr tokenize plan/apply exists as a CLI subcommand.
If not, wire aprender::text::tokenize::BpeTokenizer::train() into apr with
the plan/apply contract (see §1.5.2).
6. Training Configuration
6.1 Optimizer & Schedule
| Parameter | Value | Rationale |
|---|---|---|
| Optimizer | AdamW | Standard; in aprender/entrenar |
| Learning rate | 3e-4 | Chinchilla-recommended for 350M |
| Weight decay | 0.1 | Standard AdamW regularization |
| Beta1, Beta2 | 0.9, 0.95 | LLaMA/GPT-3 standard |
| Epsilon | 1e-8 | Standard |
| LR schedule | Cosine annealing with warmup | CosineAnnealingLR in aprender |
| Warmup steps | 2000 (v1) / 500 (v2) | ALB-060: 2000/5000 = 40%, not 0.2%. v2 config uses 500 (10%) per C-TRAINCFG-001 |
| Min LR | 3e-5 | 10% of peak (standard) |
| Gradient clipping | 1.0 (global norm) | Stability |
| Batch size (global) | 512K tokens | ~512 sequences x 1024 tokens |
| Micro-batch (4090) | 4 | GPU-resident (batch=8 OOM at seq≥1024) |
| Gradient accumulation | 1 (ALB-066) | Per-block CPU accumulation now works (PerBlockGradientAccumulator); kept at 1 for v2 config |
| Total training tokens | Target 10B; current 139M (v2 dataset) | ~5000 steps × 4 seqs × 1024 tokens = 20M tokens/run (v2: 68K seqs) |
| Mixed precision | fp16 (CUDA) | Hardware-appropriate |
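The warmup + cosine schedule in the table can be written down directly. A sketch assuming linear warmup to peak followed by cosine decay to the min LR (aprender's CosineAnnealingLR boundary behavior may differ slightly):

```python
import math

def lr_at(step, max_steps=5000, warmup_steps=500, peak=3e-4, min_lr=3e-5):
    """Cosine annealing with linear warmup, per the §6.1 hyperparameters.

    Linear ramp to peak over warmup_steps, then cosine decay from peak
    to min_lr at max_steps. Illustrative sketch, not aprender's code.
    """
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))
```

This formula also reproduces the ALB-060 failure mode: with the v1 setting `warmup_steps=2000`, step 42 (the 43rd and final step of that run) yields 6.45e-6, matching the observed peak LR.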
6.2 Training Config: configs/train/pretrain-350m-v2.yaml
A single YAML file defines everything — model architecture and training
hyperparameters. This is the industry standard (Axolotl, torchtune, HuggingFace
Trainer). One file, one truth. apr train validate lints it before GPU time.
Current config (v2 — expanded dataset, ALB-066 gradient_accumulation=1):
# configs/train/pretrain-350m-v2.yaml — Albor 350M with expanded dataset
# C-TRAINCFG-001: steps_per_epoch=16994 >= max_steps=5000
model:
path: "." # From scratch (random init)
mode: transformer
architecture:
hidden_size: 1024 # d_model
num_hidden_layers: 24
num_attention_heads: 16 # d_head = 64
num_key_value_heads: 4 # GQA 4:1 ratio
intermediate_size: 4096 # SwiGLU FFN (gate + up + down)
vocab_size: 32768 # ByteLevel BPE (v2 tokenizer)
max_position_embeddings: 1024 # Context length (2048 OOM'd on 4090)
rms_norm_eps: 1.0e-5
data:
train: "data/pretokenized-2048-v2/train/" # Expanded v2 dataset (68K sequences)
val: "data/pretokenized-2048/val/"
batch_size: 4 # Micro-batch (batch=8 OOM'd)
seq_len: 1024
tokenizer: "models/albor-tokenizer-v2/tokenizer.json"
input_column: "input_ids" # Pre-tokenized: List<u32> column
optimizer:
name: "adamw"
lr: 3.0e-4
beta1: 0.9
beta2: 0.95
weight_decay: 0.1
training:
mode: "causal_lm"
epochs: 1 # C-TRAINCFG-001: steps_per_epoch=16994 >= 5000
# grad_clip: 1.0 # ALB-067: disabled (CPU-side L2 norm bottleneck)
lr_scheduler: "cosine"
warmup_steps: 500 # 10% of max_steps (C-TRAINCFG-001)
gradient_accumulation: 1 # ALB-066: per-sequence optimizer (no true accum in CUDA)
mixed_precision: "fp16"
output_dir: "./checkpoints/albor-base-350m-v2"
save_interval: 25
max_steps: 5000
Legacy v1 config (pretrain-350m.yaml) used 22K sequences with
gradient_accumulation: 128 and epochs: 117 — see ALB-060 for why
epochs: 1 was fatal with the original data size.
Note on YAML numeric formatting: YAML 1.1 supports underscore digit separators
(32_768, 1_000_000) for human-readable large numbers (YAML 1.2 dropped them,
but common parsers still accept the 1.1 form). All albor configs use
this convention. For shorthand like 10B or 512K, see gap ALB-021.
6.3 Training Workflow (Plan/Apply)
# Step 1: Plan — validate config, estimate VRAM, show execution plan (no GPU)
apr train plan configs/train/pretrain-350m.yaml
# Step 2: Apply — execute the training run
apr train apply configs/train/pretrain-350m.yaml --seed 42
# Step 3: Resume if interrupted (apply with --resume)
apr train apply configs/train/pretrain-350m.yaml \
--resume checkpoints/albor-base-350m/checkpoint-step-5000.json \
--seed 42
Plan phase (apr train plan):
- Schema validation: required keys, correct types, valid enum values
- Architecture sanity: `hidden_size` divisible by `num_attention_heads`; `num_kv_heads` divides `num_attention_heads`
- VRAM budget: computes model size + optimizer + activations, warns if > GPU capacity
- Data paths: confirms `train:` and `val:` directories exist with Parquet/tokenized shards
- Tokenizer: loads tokenizer, checks vocab size matches `model.vocab_size`
- Time estimate: estimated wall time based on model size and hardware
- Prints structured plan summary (see §1.5.2 for output format)
- No GPU, no writes, no network. Runs on CPU in seconds.
Apply phase (apr train apply):
- Reads the same YAML, builds a random-initialized `Transformer` with the `model:` section architecture, runs the causal LM training loop via entrenar
- Checkpoints every `save_interval` steps — resumable on crash
- No Rust code needed — just one config file
apr train validate is an alias for apr train plan --strict — schema-only
checking without resource estimation. Fast enough for CI.
6.4 GPU-Resident Training (CudaTransformerTrainer)
The CudaTransformerTrainer (ALB-040) keeps all 24 transformer blocks
GPU-resident, reducing PCIe transfers from ~16K/step to exactly 3:
Transfer 1 (H2D): embedding hidden states ~S×H×4 bytes
Transfer 2 (D2H): logits for cross-entropy ~S×V×4 bytes
Transfer 3 (H2D): grad_logits to GPU ~S×V×4 bytes
Each CudaTransformerBlock holds its own weights, AdamW optimizer states
(m + v), and shares a CudaGradWorkspace for forward/backward activation
buffers. The per-block interleaved backward+optimizer pattern overwrites
the shared workspace each layer — memory cost is O(1 block), not O(24 blocks)
for activations.
VRAM budget (actual, RTX 4090 24GB):
| Component | Memory |
|---|---|
| 24 blocks (weights + AdamW m + v) | ~5 GB |
| Shared workspace (activation/gradient buffers) | ~10-12 GB (depends on seq_len) |
| LM head (weights + AdamW + logits buffer) | ~1-2.5 GB |
| System (Xorg/desktop) | ~1 GB |
At seq_len=512, batch=4: fits comfortably (~18 GB used).
At seq_len=1024, batch=4: fits (~19.5 GB used).
At seq_len=2048, batch=4: OOM at LM head alloc (logits [4,2048,32768] too large).
At seq_len=2048, batch=8: OOM at block 21 upload.
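The seq_len=2048 OOM boundary is driven largely by the LM-head logits buffer. A back-of-envelope helper (ignores cuBLAS workspace and fragmentation, so real usage is somewhat higher):

```python
def logits_bytes(batch, seq_len, vocab=32768, dtype_bytes=4):
    """f32 logits buffer for the LM head: [batch, seq_len, vocab].

    The gradient buffer (grad_logits) is the same size again, so the
    LM-head working set is roughly twice this number.
    """
    return batch * seq_len * vocab * dtype_bytes

GIB = 1024 ** 3
# batch=4, seq_len=1024 -> 0.5 GiB logits (+0.5 GiB grad_logits): fits
# batch=4, seq_len=2048 -> 1.0 GiB logits (+1.0 GiB grad_logits): tips the
# ~19.5 GB working set past the 24 GB card at the LM-head allocation
```

Doubling seq_len doubles both logits buffers and every activation row in the shared workspace, which is why the budget that fits at 1024 fails at 2048.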
Dogfooding results:
| Config | Steps | Loss | Time | Status |
|---|---|---|---|---|
| 50M quick (seq=512, batch=4) | 5 | 10.42→9.45 | ~10s | PASS (post ALB-059 fix) |
| 350M test (seq=512, batch=4) | 50 | 10.39→5.92 (best 5.53) | ~400s | PASS (post ALB-059 fix) |
| 350M full v1 (seq=1024, batch=4, accum=128) | 43/5000 | 10.39 flat | ~12s | FAIL (ALB-060): epochs=1 exhausted data |
| 350M full v2 (seq=1024, batch=4, accum=1) | 1183/5000 | 10.4→6.85 | ~1.4h | CRASHED: ALB-073 (PTX selp) + ALB-074 (stale binary). Step 1000 ckpt saved. |
| 350M v3 (seq=1024, batch=4, codeparrot) | 28K/250K | 10.40→6.43 | ~1.9 days | STOPPED (plateau): val_ppl=1018 at step 28K. 6.7K tok/s, 19.3% MFU. Plateau since step 12K — ALB-079 (no cosine decay) + ALB-080 (batch too small). |
| 350M v4 (seq=1024, batch=4, ga=32) | 500 | 10.40→5.76 | ~4.7h | Killed by system reboot at step 553. val_ppl=1032.7 at step 500 (matched v3 at 57% token budget). Checkpoint saved. |
| 350M v4-resume (from step 500) | 56+ | 10.40→6.31 | est ~2.7 days | RUNNING: Warm-start 8x faster convergence. loss=6.31 at step 37. |
ALB-060: Training Configuration Epoch/Step Mismatch (Critical)
The first 350M full training run (2026-03-02) ran only 43 of 5000 steps because
epochs: 1 caps total steps to floor(num_sequences / batch_size / grad_accum).
With 22,079 sequences, batch=4, accum=128: steps_per_epoch = 43. Warmup (2000
steps) never completed — LR peaked at 6.45e-6 vs target 3e-4. Loss stayed flat
at ~10.39 for all 43 steps (never exited warmup). Root cause: no pre-flight
algebraic validation of epoch/step consistency.
Fix: C-TRAINCFG-001 contract (contracts/training-config-kernel-v1.yaml) +
epochs: 117 for v1 data, or v2 config (pretrain-350m-v2.yaml) with expanded
dataset (67,977 sequences, epochs: 38, warmup_steps: 500).
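The C-TRAINCFG-001 check amounts to simple pre-flight algebra. A sketch of the invariants (the authoritative contract is contracts/training-config-kernel-v1.yaml, enforced via apr train plan; the threshold names here are illustrative):

```python
def validate_train_config(num_sequences, batch_size, grad_accum, epochs,
                          max_steps, warmup_steps):
    """Pre-flight epoch/step consistency check in the spirit of C-TRAINCFG-001.

    Returns a list of human-readable violations; empty means the config
    can actually reach max_steps and exit warmup.
    """
    steps_per_epoch = num_sequences // (batch_size * grad_accum)
    total_steps = steps_per_epoch * epochs
    errors = []
    if total_steps < max_steps:
        errors.append(
            f"epochs={epochs} yields {total_steps} steps < max_steps={max_steps} "
            f"(steps_per_epoch={steps_per_epoch}); training ends early (ALB-060)")
    if warmup_steps > max_steps // 5:
        errors.append(
            f"warmup_steps={warmup_steps} exceeds 20% of max_steps={max_steps}; "
            f"LR may never reach peak")
    return errors
```

Running it on the v1 numbers (22,079 sequences, batch=4, accum=128, epochs=1, warmup=2000) reports both violations that caused the failed run; the v2 numbers pass cleanly.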
Training stability contracts verified (ALB-044, ALB-059, ALB-060):
- C-EMBED-GRAD-001: Activation gradient clipped at GPU→CPU boundary
- C-HYPERPARAMS-001: All optimizer params flow from YAML config
- C-BUFSIZE-001: Buffer sizes algebraically verified (ALB-043 fix)
- C-GRADFLOW-001: All trainable parameters receive gradients (ALB-038 fix)
- C-GEMMARGS-001: GEMM backward constructor args match documented order (ALB-059 fix)
- C-GPUINIT-001: Optimizer states zero-initialized, not cuMemAlloc garbage (ALB-059 fix)
- C-STREAMSYNC-001: `stream.synchronize()` before any D2H transfer reading kernel output (ALB-065 fix)
- C-SELP-001: PTX `selp_f32` argument order verified in all kernels (ALB-069, ALB-073 fixes)
- C-EVALBUF-001: `eval_single_sequence` truncates to max_seq_len before GPU forward (ALB-074 fix)
- C-LOSSSCALE-001: fp16 loss scaling excluded from the GPU backward path — all backward uses f32; scaling causes overflow (ALB-072 fix)
- C-CUBLAS-NOTENCORE-001: cuBLAS uses CUBLAS_DEFAULT_MATH (no tensor cores) — tensor core algorithms produce NaN for transposed backward GEMMs at ~1e5 gradient magnitude (ALB-077 fix)
6.5 Checkpointing Strategy
| Aspect | Design |
|---|---|
| Format | SafeTensors (primary) + JSON metadata |
| Frequency | Every 1,000 steps (~1.2h at 4.2s/step, ~4M tokens) |
| Content | Model weights (~1.5 GB), optimizer state (~1.3 GB), config.json |
| Pruning | Automatic — keeps latest + best only, old checkpoints deleted |
| Disk usage | ~8.4 GB peak (3 checkpoints: current + best + in-flight) |
| Storage | Local NVMe RAID-0, checkpoints directory in repo |
| Resume | From latest checkpoint on crash (weights + optimizer state) |
| Export | apr publish --format safetensors for HuggingFace |
Checkpoint interval rationale (v3): save_interval: 1000 balances crash
recovery (~8.7min max lost work at 525ms/step) against I/O overhead (~3s per
checkpoint write vs ~525s between checkpoints = 0.6% overhead). With automatic
pruning, disk usage stays constant regardless of training length. For the
250K-step v3 run (~1.5 days at 7,579 tok/s), this yields 250 checkpoint events
with ~8.4 GB steady-state disk.
6.6 Experiment Tracking & Training Monitoring
entrenar has a full monitoring stack built in, and presentar provides rich terminal visualization. Albor uses both — no external tools (no W&B, no MLflow, no TensorBoard). Sovereign monitoring, sovereign visualization.
6.6.1 Monitoring Config: configs/train/pretrain-350m.yaml (monitoring section)
monitoring:
terminal:
enabled: true
refresh_rate: 1000 # TUI refresh in ms
metrics: ["loss", "learning_rate", "gradient_norm"]
charts:
- type: "loss_curve"
metric: "loss"
window: 100 # Smoothing window
show_eta: true
tracking:
enabled: true
backend: "sqlite" # .entrenar/experiments.db (WAL mode)
experiment: "albor-pretrain-350m"
tags:
model: "albor-350m"
stage: "pretrain"
data: "python-code-v2" # 139M tokens (v2 dataset)
system:
enabled: true
interval: 5000 # System metrics every 5s
metrics: ["gpu_utilization", "memory", "temperature"]
alerts:
- condition: "loss > 10"
action: "stop"
message: "Loss exploded — Andon stop"
- condition: "gradient_norm > 100"
action: "stop"
message: "Gradient explosion — Andon stop"
6.6.2 What Entrenar Monitors Automatically
| Component | What It Does | Already Built? |
|---|---|---|
| MetricsCollector | Records loss, LR, gradient norms per step (SIMD-accelerated) | Yes (entrenar) |
| ExperimentTracker | Tracks run_id, params, metrics, artifacts, status | Yes (entrenar) |
| SqliteBackend | Durable experiment store: runs, params, metrics, artifacts in .entrenar/experiments.db (WAL mode) | Yes (entrenar) |
| ProgressCallback | Kalman-filtered ETA, Unicode progress bars | Yes (entrenar) |
| MonitorCallback | Integrates metrics into training, detects NaN/Inf → Andon alert | Yes (entrenar) |
| CheckpointCallback | Saves best model + metadata (epoch, is_best, timestamp) | Yes (entrenar) |
| EarlyStopping | Patience-based stopping on loss plateau | Yes (entrenar) |
| Andon alerts | Toyota Way: Critical/Error/Warning/Info severity levels | Yes (entrenar) |
| TuiMonitor | Detached terminal dashboard composing presentar widgets (ALB-057) | Yes (entrenar + presentar) |
| DriftDetector | PSI, KS, Wasserstein distribution shift detection | Yes (entrenar) |
| JsonFileStore | Real-time metrics to training_state.json (atomic writes) | Yes (entrenar) |
| LossCurve widget | Training loss over epochs with EMA smoothing | Yes (presentar) |
| ConfusionMatrix widget | Multi-class classification evaluation | Yes (presentar) |
| Braille/Sparkline | High-resolution terminal charts (2x4 dots/cell, 8-level sparklines) | Yes (presentar) |
| Heatmap widget | 2D matrix with CIELAB perceptual color gradients | Yes (presentar) |
6.6.3 Live Monitoring During Training
# Terminal 1 (lambda): Run training
apr train apply --task pretrain --config configs/train/pretrain-350m.yaml
# Terminal 2 (lambda or ssh): Attach live monitor (presentar TUI)
apr monitor ./checkpoints/albor-base-350m/
# Terminal 2 (alternative): JSON output for LLM agents / CI
apr monitor --json ./checkpoints/albor-base-350m/
# Discover all active training runs (reads global SQLite registry)
apr monitor
# List past experiments from SQLite registry
apr runs ls --global
# Show detailed metrics for a specific run
apr runs show <run-id> --global --json
# Browse past experiments from SQLite
apr experiment view --db .entrenar/experiments.db
# Compare loss curves across runs
apr experiment view --db .entrenar/experiments.db \
--runs albor-pretrain-50m,albor-pretrain-350m \
--metric loss --chart loss_curve
# One-shot profiler (GPU utilization, per-layer timing)
apr cbtop ./checkpoints/albor-base-350m/latest.safetensors
# Inference latency profiling
apr profile ./checkpoints/albor-base-350m/ --prompt "def fibonacci(n):"
# Stack-level health (from batuta)
batuta stack status
6.6.4 Experiment Lifecycle
Each training run creates two data streams:
Real-time (JSON file IPC) — for live TUI monitoring:
checkpoints/albor-base-350m/
├── training_state.json # Live metrics (loss, lr, grad_norm, GPU telemetry)
├── checkpoint-step-1000.safetensors
├── checkpoint-step-1000.json # Checkpoint metadata (epoch, is_best)
├── checkpoint-step-2000.safetensors
├── checkpoint-step-2000.json
├── checkpoint-best.safetensors
└── checkpoint-best.json
Durable (dual SQLite experiment stores) — for post-hoc analysis and comparison:
checkpoints/albor-base-350m/.entrenar/
└── experiments.db # Local per-experiment store (WAL mode)
├── experiments # Experiment metadata (name, description, config)
├── runs # Training runs (status, timestamps)
├── params # Hyperparameters (key/value/type)
├── metrics # Per-step metrics (loss, lr, tok/s, timestamp)
├── artifacts # Model artifacts (path, size, SHA-256)
└── span_ids # Distributed trace integration
~/.entrenar/
└── experiments.db # Global cross-machine registry (WAL mode)
└── (same schema) # All runs across all experiments
PretrainTracker (ALB-055/056) writes to both stores on every log interval.
All operations are best-effort — storage failures never block training.
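The dual-store, best-effort contract can be sketched in a few lines. This is an illustrative Python sketch, not the actual PretrainTracker (which is Rust); `open_store` and `log_metric` are hypothetical names. The two ideas it demonstrates are WAL mode (readers never block the writer) and a swallow-all error policy (storage failures never propagate into the training loop):

```python
import sqlite3

def open_store(path):
    """Open an experiment store in WAL mode so readers never block the writer."""
    conn = sqlite3.connect(path, timeout=0.1)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        " run_id TEXT, step INTEGER, key TEXT, value REAL, ts REAL)"
    )
    return conn

def log_metric(conns, run_id, step, key, value, ts):
    """Best-effort write to every store (local + global).
    A failed insert is swallowed, never raised into the training loop."""
    for conn in conns:
        try:
            conn.execute(
                "INSERT INTO metrics VALUES (?, ?, ?, ?, ?)",
                (run_id, step, key, value, ts),
            )
            conn.commit()
        except sqlite3.Error:
            pass  # storage failure never blocks training
```

The same `log_metric` call fans out to the per-experiment store and the global `~/.entrenar/experiments.db` registry.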
Three consumers, zero contention:
- apr monitor reads training_state.json (atomic write-then-rename) for live dashboards. Multiple monitors attach simultaneously.
- apr runs ls reads ~/.entrenar/experiments.db (global registry) for cross-experiment history. Supports --json for LLM agent consumption.
- apr experiment reads the local .entrenar/experiments.db (WAL mode) for per-run metric queries and artifact tracking. Read-only during training — no lock contention with the writer.
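The write-then-rename protocol that makes concurrent monitoring safe is standard POSIX practice. A minimal Python sketch (`write_state_atomic` is a hypothetical helper name) of what the training-side writer must do so that `apr monitor` never observes a half-written file:

```python
import json
import os
import tempfile

def write_state_atomic(path, state):
    """Write JSON to a temp file in the same directory, then os.replace().
    os.replace is atomic on POSIX, so a concurrent reader sees either the
    old file or the new one, never a partial write."""
    d = os.path.dirname(os.path.abspath(path)) or "."
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # durable before the swap
        os.replace(tmp, path)     # atomic swap into place
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live in the same directory as the target: `os.replace` is only atomic within a single filesystem.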
6.6.5 Presentar Visualization: Rich Terminal Dashboards
presentar (presentar-terminal) provides ML-specific visualization widgets
that entrenar’s TrainingDashboard now composes directly (ALB-057). The
dashboard builds a widget tree from Layout::rows() of Border-wrapped
section panels, each containing Meter, GpuPanel, Sparkline, or Text
widgets. The connection point for historical data is entrenar’s SQLite
experiment store (.entrenar/experiments.db).
Live training dashboard (apr monitor — reads training_state.json):
╭─ Albor Pre-Train: albor-base-350m ─── Step 12,847 / 19,073 ──── 67.4% ─╮
│ │
│ Loss GPU (RTX 4090) │
│ 3.2 ⣀⣀ ████████████░░░ 82% │
│ ⠈⠉⠉⠑⠒⠒⠤⣀ VRAM: 14.2 / 24.0 GB │
│ ⠈⠉⠑⠒⠤⣀⣀ Temp: 72°C │
│ 1.8 ⠈⠉⠒⠒⣀⣀⣀⣀ Power: 312W │
│ ⠉⠉⠉ Tokens/s: 18,432 │
│ 0 ──────────────────────────────── 12K │
│ │
│ Learning Rate Gradient Norm ETA: 1d 14h 22m │
│ ⣿⣿⣿⣷⣶⣶⣤⣤⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ▁▁▂▁▁▃▁▂▁▁▁▂▁▁ Throughput: 5.2B / 10B │
│ 3e-4 → 2.1e-4 0.42 (norm) Checkpoint: step-12000 │
╰──────────────────────────────────────────────────────────────────────────╯
Post-hoc experiment comparison (apr experiment view — reads SQLite):
# Compare loss curves across all pre-training runs
apr experiment view --db .entrenar/experiments.db \
--runs albor-pretrain-50m,albor-pretrain-350m \
--metric loss --chart loss_curve
# Hyperparameter comparison table
apr experiment view --db .entrenar/experiments.db \
--experiment albor-pretrain-350m --params
# Export metrics for external analysis (Parquet for alimentar)
apr experiment export --db .entrenar/experiments.db \
--run albor-pretrain-350m --format parquet --output ./eval/metrics.parquet
Presentar widgets used by albor:
| Widget | Use Case | Data Source |
|---|---|---|
| LossCurve | Training loss over steps with EMA smoothing | training_state.json (live) or SQLite metrics table (post-hoc) |
| Sparkline | Compact LR schedule, gradient norm history | training_state.json lr_history, grad_norm |
| Heatmap | Attention pattern visualization, weight distribution | Model checkpoint tensors |
| Gauge | GPU utilization, VRAM usage, training progress | training_state.json gpu telemetry |
| BrailleGraph | High-resolution loss/metric curves over SSH | training_state.json loss_history |
| Histogram | Weight distribution per layer (pre/post distillation) | Model checkpoint tensors |
| BarChart | Benchmark scores across model stages | eval/*.json results |
Two rendering targets, same widgets, same data:
presentar compiles the same widget tree to two targets — terminal and
WASM. The dashboard YAML is written once. presentar-terminal renders it
via crossterm (works over SSH). presentar renders it via WebGPU in the
browser (60fps, GPU-accelerated). Both read from the same data sources.
| Mode | Command | Renderer | Data Source | Use Case |
|---|---|---|---|---|
| Live TUI | apr monitor ./checkpoints/ | presentar-terminal (crossterm) | training_state.json (polling) | Watch training over SSH |
| Experiment TUI | apr experiment view | presentar-terminal (crossterm) | SQLite .entrenar/experiments.db | Compare runs in terminal |
| Web dashboard | presentar serve --config albor-dashboard.yaml | presentar (WebGPU/WASM) | SQLite + checkpoints | Rich browser dashboard |
Both TUI and WASM are first-class deliverables, not stretch goals. The terminal TUI is the primary interface (SSH to lambda/intel). The WASM dashboard is the shareable artifact for model cards and teaching.
6.6.6 No External Dependencies
| What Others Use | What Albor Uses Instead | Why |
|---|---|---|
| Weights & Biases | entrenar SqliteBackend + presentar dashboards | Sovereign — no cloud, no API keys, all data local |
| TensorBoard | presentar LossCurve + BrailleGraph over SSH | No Python, no browser required, works over SSH |
| MLflow | entrenar ExperimentTracker + SQLite + apr experiment | Self-hosted SQLite, no server process, query via CLI |
| nvidia-smi polling | entrenar system metrics + apr cbtop | Integrated into training loop, not bolted on |
| Streamlit dashboards | presentar WASM dashboard (10x faster rendering) | GPU-accelerated, 60fps, zero Python |
7. Post-Training Improvement Ladder
Each stage improves the model and exercises a different entrenar / apr
capability. Every stage produces a benchmarked checkpoint.
7.1 Stage 1: Pre-Train Base Model
apr train plan configs/train/pretrain-350m.yaml # Validate + VRAM estimate
apr train apply configs/train/pretrain-350m.yaml --seed 42
Produces: albor-base-350m — raw pre-trained model
Exercises: entrenar, trueno (CUDA), alimentar (data streaming)
Expected: OPT-350M class on general benchmarks (~48% avg). On HumanEval,
target >8% (above random, below CodeGen-350M’s 12.8% due to less training data)
7.2 Stage 2: Knowledge Distillation from Qwen3-Coder-Next
# Plan: check teacher fits in RAM, estimate logit disk usage
apr distill plan configs/train/distill.yaml
# Apply phase 1: Pre-compute teacher logits on intel (300GB RAM, CPU inference)
apr distill apply configs/train/distill.yaml --stage precompute
# Apply phase 2: Distill into student on lambda (4090)
apr distill apply configs/train/distill.yaml --stage train
Produces: albor-distill-350m — distilled model with teacher knowledge
Exercises: realizar (teacher inference), apr distill, alimentar (logit storage)
Expected: Moderate improvement — absorbs coding patterns from 80B teacher.
Estimated +2-7 points on HumanEval via logit-level KD. Note: MoE→dense
distillation is uncharted at this scale; the architecture mismatch (DeltaNet+MoE
teacher → LLaMA-style dense student) may limit transfer compared to dense→dense
distillation (e.g., GPT-3.5→phi-1).
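Logit-level KD combines a soft KL term against temperature-scaled teacher logits with the usual cross-entropy on ground-truth tokens. A minimal numpy sketch of the objective; names and defaults (`alpha`, `T`) are illustrative, not albor's actual hyperparameters:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax, numerically stabilized."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, targets, alpha=0.5, T=2.0):
    """Combined distillation objective (sketch):
    soft term = T^2 * KL(teacher || student) on temperature-scaled logits
    hard term = cross-entropy against the ground-truth next token
    alpha blends the two. The T^2 factor keeps gradient magnitudes
    comparable across temperatures (Hinton et al. convention)."""
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T) + 1e-12)
    soft = (T * T) * np.mean(
        np.sum(p_t * (np.log(p_t + 1e-12) - log_p_s), axis=-1)
    )
    probs = softmax(student_logits)
    hard = -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12))
    return alpha * soft + (1 - alpha) * hard
```

With pre-computed logits (Section 9.2), `teacher_logits` is read from the sharded Parquet store rather than produced by a live forward pass.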
7.3 Stage 3: Instruction Fine-Tuning (LoRA/QLoRA)
apr finetune plan configs/train/finetune-lora.yaml # Validate LoRA config + VRAM
apr finetune apply configs/train/finetune-lora.yaml
Produces: albor-instruct-350m — instruction-following model
Exercises: apr finetune, entrenar LoRA, alimentar (JSONL instruction data)
Expected: Better IFEval scores, improved structured output, chat capability.
7.4 Stage 4: Model Merging
apr merge plan \
--models albor-distill-350m,albor-instruct-350m \
--method slerp --weight 0.6 \
--output ./checkpoints/albor-merged/
# Plan checks: architectures compatible, method valid, output size estimate
apr merge apply \
--models albor-distill-350m,albor-instruct-350m \
--method slerp --weight 0.6 \
--output ./checkpoints/albor-merged/
Produces: albor-merged-350m — best-of-all-worlds model
Exercises: apr merge (SLERP, TIES, DARE algorithms)
Expected: Cherry-picks strengths from each variant. Potentially better
than any single model on diverse benchmarks.
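SLERP interpolates along the great circle between the two weight vectors rather than the straight chord, which preserves weight norm better than plain averaging. A numpy sketch of the per-tensor math, assuming flattened tensors and a LERP fallback for near-colinear inputs; apr merge's actual per-tensor handling is not shown:

```python
import numpy as np

def slerp(w_a, w_b, t=0.6, eps=1e-8):
    """Spherical linear interpolation between two weight tensors of the
    same shape. t=0 returns w_a, t=1 returns w_b. Falls back to linear
    interpolation when the vectors are nearly colinear (sin(omega) ~ 0)."""
    a, b = w_a.ravel(), w_b.ravel()
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    cos = np.clip(np.dot(a / (na + eps), b / (nb + eps)), -1.0, 1.0)
    omega = np.arccos(cos)  # angle between the two weight vectors
    if np.sin(omega) < eps:
        out = (1 - t) * a + t * b  # degenerate case: plain LERP
    else:
        out = (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
    return out.reshape(w_a.shape)
```

`--weight 0.6` in the command above corresponds to `t=0.6`, i.e. the merge leans toward the second model.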
7.5 Stage 5: Pruning
apr prune plan \
--model ./checkpoints/albor-merged-350m/ \
--method wanda --sparsity 0.5 \
--output ./checkpoints/albor-pruned/
# Plan checks: model exists, sparsity in [0,1], output size estimate
apr prune apply \
--model ./checkpoints/albor-merged-350m/ \
--method wanda --sparsity 0.5 \
--output ./checkpoints/albor-pruned/
Produces: albor-pruned-175m — half the parameters, similar performance
Exercises: apr prune (WANDA, SparseGPT, magnitude, depth pruning)
Expected: ~2-5% benchmark degradation at 50% sparsity. WANDA is well-studied
at larger scales (7B+) but less validated at 350M where there is less redundancy.
Depth pruning to ~18 layers yields ~260M params.
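WANDA scores each weight by |W_ij| times the L2 norm of its input activation, so weights that multiply low-magnitude features are pruned first. A numpy sketch of the per-row variant, assuming the calibration pass that produces the activation norms has already run (`wanda_prune` is an illustrative name, not apr prune's implementation):

```python
import numpy as np

def wanda_prune(W, act_norm, sparsity=0.5):
    """WANDA sketch for one linear layer.
    W:        weight matrix [out_features, in_features]
    act_norm: per-input-feature L2 norms from a calibration set [in_features]
    Zeros the lowest-scoring `sparsity` fraction of weights in each row."""
    score = np.abs(W) * act_norm[np.newaxis, :]     # |W_ij| * ||X_j||_2
    k = int(W.shape[1] * sparsity)
    if k == 0:
        return W.copy()
    # indices of the k lowest scores per output row (unordered)
    idx = np.argpartition(score, k, axis=1)[:, :k]
    pruned = W.copy()
    np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned
```

No retraining is involved: surviving weights keep their original values, which is why WANDA is cheap but degrades gracefully only while enough redundancy exists.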
7.6 Stage 6: Quantization
apr quantize plan \
--model ./checkpoints/albor-merged-350m/ \
--method q4_k \
--output ./checkpoints/albor-q4/
# Plan checks: model exists, format valid, output size estimate (~90MB)
apr quantize apply \
--model ./checkpoints/albor-merged-350m/ \
--method q4_k \
--output ./checkpoints/albor-q4/
# Export for broad compatibility
apr export plan --model ./checkpoints/albor-q4/ --format gguf
apr export apply \
--model ./checkpoints/albor-q4/ \
--format gguf \
--output ./release/albor-350m-q4_k.gguf
Produces: albor-q4-350m — 4-bit quantized, ~90MB on disk
Exercises: apr quantize, apr export (GGUF, SafeTensors)
Expected: <1% benchmark loss from Q4_K quantization. Model runs on any
device — phones, Raspberry Pi, browsers (WASM via trueno).
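Block-wise 4-bit quantization stores one scale per small block of weights plus 4-bit signed integers. The toy sketch below uses a single-level absmax scheme; real Q4_K adds a second super-block level of scales, but the round-trip idea is the same:

```python
import numpy as np

def quantize_q4_blocks(w, block=32):
    """Toy 4-bit block quantization: per-block absmax scale, values mapped
    to signed integers in [-8, 7]. Per-block worst-case error is scale/2."""
    w = np.asarray(w, dtype=np.float32).ravel()
    pad = (-len(w)) % block
    w = np.concatenate([w, np.zeros(pad, dtype=np.float32)])
    blocks = w.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                       # avoid div-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct float weights: integer code times per-block scale."""
    return (q.astype(np.float32) * scale).ravel()
```

In a real format the int8 codes would be packed two-per-byte; the sketch keeps them unpacked for clarity.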
7.7 Benchmark Trajectory
Every stage is benchmarked. The trajectory itself is a key result. Code completion metrics (HumanEval, FIM) are primary; general benchmarks are secondary.
| Stage | Model | Params | Size | HumanEval | MBPP | CPU tok/s |
|---|---|---|---|---|---|---|
| 1 | albor-base | 350M | ~700MB | ~8% | ~8% | — |
| 2 | albor-distill | 350M | ~700MB | ~13-15% | ~10-12% | — |
| 3 | albor-instruct | 350M | ~700MB | ~14-16% | ~11-13% | — |
| 4 | albor-merged | 350M | ~700MB | ~15-17% | ~12-14% | — |
| 5 | albor-pruned | ~175M | ~350MB | ~12-14% | ~10-12% | — |
| 6 | albor-q4 | 350M | ~90MB | ~14-16% | ~11-13% | >50 |
Numbers are estimates. The distillation gain (+2-7 points over base) assumes 500M-2B tokens of teacher logits. This is conservative — published distillation results show larger gains with dense teachers (phi-1 used GPT-3.5, a dense model). Our MoE→dense distillation path is uncharted at 350M scale. The FIM column is removed because there is no standardized FIM benchmark — we will define our own eval and report absolute numbers, not targets. CPU tok/s measured on Xeon at Q4.
8. Evaluation & Benchmarks
8.1 Evaluation Strategy
Leaderboard target: Big Code Models Leaderboard — the standard HuggingFace leaderboard for code generation models. Uses HumanEval (pass@1) and MultiPL-E (18 languages). Currently tracks ~60 models. No sub-1B model has ever appeared on this leaderboard. The smallest entries are 1.0B (DeciCoder-1B at 19.3%, phi-1 at 50.6%, SantaCoder at 18.1%). Albor would be the first sub-1B entry — and the only model trained in Rust.
Secondary: Classic lm-evaluation-harness benchmarks (zero-shot) for
general capability comparison against Pythia, OPT, GPT-2.
NOT targeting: Open LLM Leaderboard v2 (IFEval, BBH, MATH Level 5, GPQA, MuSR, MMLU-PRO). These benchmarks were designed for large models — a 350M model scores near random on MATH Level 5 (~0%), GPQA (~25%), and MMLU-PRO (~10%).
Also NOT targeting: EvalPlus Leaderboard (HumanEval+, MBPP+). A secondary submission target if results are strong, but the Big Code leaderboard is the primary scoreboard.
8.2 Benchmark Suite
Python Code Completion Benchmarks (Primary — matches use case)
| Benchmark | Type | Metric | What It Tests | Leaderboard? |
|---|---|---|---|---|
| HumanEval | Function generation | pass@1, pass@10 | Complete a Python function given docstring | Big Code LB |
| MultiPL-E | Multilingual code gen | pass@1 | HumanEval translated to 18 languages (Python-only for us) | Big Code LB |
| MBPP | Basic programming | pass@1 | Solve simple Python programming tasks (3-shot) | — |
| DS-1000 | Data science | pass@1 | Pandas/NumPy/sklearn code generation | — |
| FIM (custom) | Fill-in-the-middle | exact match | Infill Python code between prefix and suffix | — |
| Latency | Inference speed | tok/s | Tokens per second on CPU (Q4) and GPU (fp16) | Big Code LB |
General Capability Benchmarks (Secondary — validates base model quality)
| Benchmark | Type | Shots | Random | What It Tests |
|---|---|---|---|---|
| ARC-Easy | Science reasoning | 0 | 25% | Elementary science knowledge |
| HellaSwag | Commonsense completion | 0 | 25% | Sentence completion with physical intuition |
| PIQA | Physical intuition | 0 | 50% | Physical interaction Q&A |
| LAMBADA | Next-word prediction | 0 | 0% | Long-range dependency in text |
8.3 Understanding Perplexity
Perplexity is the primary metric for monitoring pre-training progress. It measures how well the model predicts held-out text:
perplexity = e^(cross_entropy_loss)
Intuition: Perplexity is the effective number of tokens the model considers equally likely at each position. A model with perplexity 100 is, on average, choosing between 100 equally probable next tokens. Lower is better — it means the model has learned to concentrate probability mass on the correct tokens.
Scale for albor (vocab_size = 32,768):
| Perplexity | Meaning | Training Stage |
|---|---|---|
| 32,768 | Random baseline (uniform over vocab) | Untrained / step 0 |
| ~1,000 | Basic token frequency learned | v3 plateau (step 12K-28K) |
| ~100 | Syntactic patterns and common idioms captured | Target for v4 at ~1B tokens |
| ~30 | Strong code model — predicts Python structure | Good 350M model |
| ~10 | Excellent — narrows predictions to a few candidates | State-of-the-art at this scale |
Why perplexity, not loss: Cross-entropy loss (ln(perplexity)) compresses the scale. Loss 6.93 vs 6.83 sounds small but corresponds to perplexity ≈1022 vs ≈925 — a 10% improvement in prediction quality. Perplexity makes the magnitude of improvements human-readable.
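The relationship above in executable form (pure stdlib):

```python
import math

def perplexity(loss):
    """Perplexity is the exponential of mean cross-entropy (in nats)."""
    return math.exp(loss)

# Random baseline over a 32,768-token vocab: uniform predictions give
# loss = ln(32768) ~= 10.40 nats, i.e. perplexity 32,768.
random_loss = math.log(32768)

# Because ppl = e^loss, a loss drop of 0.10 nats is a *multiplicative*
# perplexity improvement: e^0.10 ~= 1.105, ~10% fewer effective choices
# per token regardless of the absolute loss level.
improvement = math.exp(0.10)
```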
Validation perplexity (val_ppl) is computed on held-out data not seen
during training. It detects overfitting: if train loss keeps falling but
val_ppl plateaus or rises, the model is memorizing rather than generalizing.
The v3 training plateau (val_ppl stuck at ~1000 from step 12K to 28K) was
diagnosed via validation perplexity — train loss was still falling slightly,
but the model had stopped learning generalizable patterns. Root cause: constant
learning rate (ALB-079) and insufficient batch size (ALB-080).
8.4 Competitive Baselines
Python Code Completion Baselines (Primary Competition)
| Model | Params | HumanEval pass@1 | MBPP pass@1 | FIM | Data | Notes |
|---|---|---|---|---|---|---|
| phi-1 | 1.3B | 50.6% | 55.5% | No | 7B (textbooks) | Our direct inspiration — same playbook |
| phi-1-small | 350M | 45%† | — | No | 7B (textbooks) | Same param count as Albor (†never released — see note) |
| SantaCoder | 1.1B | 18% | 35% | Yes | 236B (The Stack) | FIM-trained, multi-language |
| StarCoderBase-1B | 1B | 15.2% | — | Yes | 1T (The Stack v2) | Multi-language code model |
| CodeGen-350M-mono | 350M | 12.8% | — | No | 577B (mixed) | Same param count, no distillation |
| albor-base (target) | 350M | >8% | >8% | Yes | 10B | Pre-distillation baseline |
| albor-distill (target) | 350M | >15% | >12% | Yes | 10B + distill | Post-distillation from 80B teacher |
†phi-1-small caveat: phi-1-small was never publicly released — it exists only as an ablation study in “Textbooks Are All You Need” (Gunasekar et al., 2023). The 45% HumanEval claim is self-reported and has never been independently reproduced. We treat it as an aspirational ceiling, not a verified baseline.
The benchmark to beat is CodeGen-350M-mono (same param count, no distillation, no FIM, 12.8% HumanEval). The realistic target for distillation is +2-5 points over the base model. Albor uses a stronger teacher (80B MoE) but faces a significant architecture mismatch (MoE teacher → dense student) and uses a first-generation Rust training stack instead of PyTorch.
Big Code Models Leaderboard — where Albor would land
CodeGen-350M-mono is not on the leaderboard (never submitted). The smallest models currently on the board are 1B-class. If albor-distill hits >15% HumanEval, it would sit just below the 1B tier — at 1/3 the parameter count:
| Model | Params | HumanEval | On Leaderboard? |
|---|---|---|---|
| phi-1 | 1.3B | 50.6% | Yes |
| DeciCoder-1B | 1.0B | 19.3% | Yes (smallest entry) |
| SantaCoder | 1.1B | 18.1% | Yes |
| StarCoderBase-1B | 1.0B | 15.2% | Yes |
| albor-distill (target) | 350M | >15% | Submission target |
| CodeGen-350M-mono | 350M | 12.8% | No (never submitted) |
Submission protocol: Run bigcode-evaluation-harness with standard params
(top-p=0.95, temperature=0.2, n_samples=50), submit PR to the leaderboard’s
community_results/ folder. Results marked as “non-verified” (community).
General Capability Baselines (Secondary)
| Model | Params | ARC-E | HellaSwag | PIQA | Avg |
|---|---|---|---|---|---|
| Pythia-410M | 410M | 47.1 | 40.1 | 67.2 | 51.5 |
| OPT-350M | 350M | 41.9 | 36.2 | 64.8 | 47.6 |
| GPT-2 Medium | 345M | ~43 | ~34 | ~66 | ~48 |
| albor-distill (target) | 350M | >42 | >36 | >65 | >48 |
Note: General capability targets are conservative. Albor is 80% Python code data with a coding teacher — distillation from Qwen3-Coder-Next will not improve general reasoning (ARC-E, HellaSwag). The target is OPT-350M parity, not Pythia-410M. Code benchmarks are the real scoreboard.
8.5 Evaluation Protocol
# Plan: validate model exists, tasks recognized, output writable
apr eval plan \
--model ./checkpoints/albor-distill-350m/ \
--tasks humaneval,humaneval_fim,mbpp,ds1000
# Python code completion benchmarks (primary — run after every stage)
apr eval apply \
--model ./checkpoints/albor-distill-350m/ \
--tasks humaneval,humaneval_fim,mbpp,ds1000 \
--output ./eval/python-code-results.json \
--seed 42
# General capability benchmarks (secondary)
apr eval apply \
--model ./checkpoints/albor-350m-final/ \
--tasks arc_easy,hellaswag,piqa,lambada \
--batch-size 32 \
--output ./eval/general-results.json \
--seed 42
# Latency benchmark (critical for code completion use case)
apr bench plan --model ./checkpoints/albor-q4/
apr bench apply \
--model ./checkpoints/albor-q4/ \
--prompt "def fibonacci(n):" \
--max-tokens 128 \
--device cpu --device cuda \
--output ./eval/latency-results.json
# Perplexity on held-out Python code
apr eval apply \
--model ./checkpoints/albor-350m-final/ \
--perplexity \
--data ./data/eval/held-out-python.parquet
# ── Big Code Leaderboard submission eval ──
# Must use bigcode-evaluation-harness with standard params for comparability
# This runs OUTSIDE the sovereign stack (Python, not Rust) — it is the
# leaderboard's own eval tool, not ours. Our apr eval results are the
# primary record; this is for leaderboard submission only.
#
# bigcode-evaluation-harness \
# --model ./release/albor-350m.safetensors \
# --tasks humaneval,multiple-py \
# --temperature 0.2 --top_p 0.95 \
# --n_samples 50 --max_length_generation 512 \
# --output ./eval/bigcode-leaderboard/
8.6 Continuous Evaluation During Training
The intel box runs eval on the latest checkpoint concurrently with training:
# On intel (300GB RAM), polling for new checkpoints
apr eval apply \
--model ./checkpoints/latest/ \
--tasks arc_easy,hellaswag \
--batch-size 16 \
--output ./eval/step-$(cat ./checkpoints/latest/step.txt).json
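The polling loop around this command can be sketched as a small wrapper. `poll_and_eval` is an illustrative name, not a shipped script; the `step.txt` path and the apr eval flags mirror the command above:

```python
import pathlib
import subprocess
import time

def poll_and_eval(ckpt_dir, interval=300, once=False):
    """Watch step.txt in the checkpoint dir and launch eval once per new
    checkpoint step. `once=True` does a single check (useful for testing)."""
    last = None
    step_file = pathlib.Path(ckpt_dir) / "step.txt"
    while True:
        step = step_file.read_text().strip() if step_file.exists() else None
        if step and step != last:
            # Mirrors the apr eval apply invocation above
            subprocess.run([
                "apr", "eval", "apply",
                "--model", str(ckpt_dir),
                "--tasks", "arc_easy,hellaswag",
                "--batch-size", "16",
                "--output", f"./eval/step-{step}.json",
            ], check=False)  # eval failure must not kill the poller
            last = step
        if once:
            return last
        time.sleep(interval)
```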
Gap ALB-006 (VERIFY FIXED): apr eval plan/apply supports these benchmark tasks natively. apr eval supports perplexity and classification eval.
Gap ALB-037 (FIXED): apr eval previously ignored loaded weights during
inference. Now fixed — realizar run loads trained SafeTensors checkpoints and
generates from learned weights. Verified end-to-end with 350M test checkpoint
(218 tensors loaded, tokens generated). scripts/eval-perplexity.py provides
independent pure-Python perplexity evaluation.
Gap ALB-038 (FIXED): entrenar previously saved initialization weights
instead of trained weights due to broken autograd gradient flow. Root cause:
RMSNorm::forward_batched() created tensors with no backward op, and
MultiHeadAttention::forward() broke Q/K/V gradient chain. Fixed in
entrenar@91ba9da (RMSNorm backward) and entrenar@1ede409 (attention
backward). All 20 model parameters now receive gradients during training.
See GitHub #36.
Gap ALB-059 (FIXED): GEMM backward constructor args n/k swapped in
entrenar — baked wrong compile-time stride constants into PTX. Output rows
overflowed into optimizer state buffers, causing NaN in AdamW. The 50-step
test model trained with this bug had loss 10.39→6.07; after the fix, loss
improved to 10.39→5.92. All evaluation results should use the post-fix
checkpoint (entrenar@846ae0c). Additionally, all optimizer m/v buffers
are now zero-initialized (cuMemAlloc returns uninitialized VRAM).
Gap ALB-060 (CONFIG FIXED): The original “full” 350M training run
completed only 43/5000 steps because epochs: 1 with grad_accum: 128
exhausted the 22K-sequence dataset. Fix: C-TRAINCFG-001 contract + v2 config
(pretrain-350m-v2.yaml) with expanded 68K-sequence dataset, epochs: 1
(steps_per_epoch = 16994 >= 5000), gradient_accumulation: 1 (ALB-066).
The v2 training run (ALB-063) reached step ~1183/5000, loss 10.4→6.9 (clear
convergence), then stopped. The checkpoints/albor-base-350m-v2/ checkpoint
has partially trained weights. Full evaluation awaits training completion.
8.7 Local Evaluation Infrastructure
The following scripts provide model evaluation independently of apr eval:
# Validate checkpoint integrity (fast, detects ALB-038)
python scripts/eval-perplexity.py checkpoints/albor-base-350m/ --validate-checkpoint
# Validate all canonical solutions (no model needed)
python scripts/eval-code.py configs/eval/python-intermediate.jsonl --validate-only
python scripts/eval-code.py configs/eval/humaneval-subset.jsonl --validate-only
# Full evaluation suite (orchestrates all steps)
bash scripts/run-eval-suite.sh checkpoints/albor-base-350m/
# Perplexity on pre-tokenized validation data
python scripts/eval-perplexity.py checkpoints/albor-base-350m/ \
--data data/pretokenized-2048/val/val.parquet \
--max-sequences 100 --seq-len 2048 --threshold 30
# Evaluate via apr serve API (ALB-037 FIXED — realizar loads trained checkpoints)
python scripts/eval-code.py configs/eval/humaneval-subset.jsonl \
--api http://localhost:8080 --samples 10
# Training convergence validation (FALSIFY-ALBOR-001)
python scripts/validate-training-convergence.py \
checkpoints/albor-base-350m/training.log
# Convert entrenar checkpoint format for realizar
python scripts/convert-checkpoint.py checkpoints/albor-base-350m/ \
--config configs/train/pretrain-350m.yaml
Benchmark datasets:
- configs/eval/python-intermediate.jsonl — 15 intermediate Python problems
- configs/eval/humaneval-subset.jsonl — 20 HumanEval-format problems
8.8 Weight Convention & Checkpoint Format
entrenar stores linear layer weights as [in_features, out_features] in
row-major (C) order, and computes forward pass as x @ W (no transpose).
This differs from the HuggingFace convention of [out_features, in_features]
with x @ W.T.
| Component | Convention | Forward | Example: gate_proj |
|---|---|---|---|
| entrenar (training) | [in, out] | x @ W | [512, 2048] |
| HuggingFace (standard) | [out, in] | x @ W.T | [2048, 512] |
| realizar (inference) | [out, in] | x @ W.T | [2048, 512] |
The convert-checkpoint.py script handles the conversion:
- Reads 1D flat tensors from entrenar SafeTensors
- Reshapes as [in, out] (entrenar convention)
- Transposes to [out, in] (HuggingFace/realizar convention)
- Writes new SafeTensors with proper 2D shapes
Embeddings (model.embed_tokens.weight) are stored as [vocab, hidden] in
both conventions (indexed by token ID for row lookup).
9. Distributed Training Architecture
9.1 Machine Roles (Revised)
With 300 GB RAM on the intel box, the architecture is asymmetric:
| Machine | Primary Role | Secondary Role |
|---|---|---|
| lambda (4090) | Student training (GPU) | — |
| intel (300GB RAM) | Teacher inference (CPU), logit pre-computation | Eval runner, data pipeline, checkpoint backup |
9.2 Distillation Split (Primary Distributed Architecture)
The natural multi-machine split is teacher on intel, student on lambda:
┌───────────────────────────────┐ ┌───────────────────────────┐
│ intel (300 GB RAM) │ pre-computed logits │ lambda (RTX 4090) │
│ │ as sharded Parquet │ │
│ Qwen3-Coder-Next 80B fp16 │ ────────────────────────► │ albor-350M student │
│ Full model in CPU RAM │ (rsync / NFS) │ KD loss + CE loss │
│ realizar CPU inference │ │ Full GPU speed training │
│ ~5-15 tok/s │ │ │
│ │ ◄──── checkpoints ───── │ apr distill apply │
│ Concurrent eval runner │ (rsync / NFS) │ │
└───────────────────────────────┘ └───────────────────────────┘
This requires no gradient sync, no ring all-reduce, no distributed training framework for the distillation stage. The teacher pre-computes logits offline; the student trains at full GPU speed against stored logits. Simple and effective.
9.3 Entrenar Native DDP (Complete)
entrenar has full distributed data parallelism infrastructure (entrenar#133), superseding the repartir approach:
Implemented (all wired end-to-end):
- Wire protocol v2: TCP-based message framing with BlockGradientPayload, AveragedBlockGradient, NonBlockGradientPayload, AveragedNonBlockGradient
- GradientServer: Coordinator that collects gradients from N workers, averages them (per-block AllReduce), and broadcasts averaged gradients back
- WorkerClient: Worker-side TCP client that sends/receives gradient payloads
- PerBlockGradientAccumulator: CPU-side gradient accumulator for AllReduce (same one used by ALB-066 single-GPU gradient accumulation)
- RingAllReduce: Ring-based averaging for N workers
- DistributedCudaTrainer: train_batch() → forward + backward → per-block AllReduce → optimizer step. Wraps CudaTransformerTrainer with distributed comm
- train_loop_cuda_distributed(): Full training loop with data sharding by rank, coordinator thread auto-spawn (rank 0), worker connection, epoch iteration
- spawn_coordinator_thread(): Background thread running GradientServer for the rank-0 process
- CLI flags: --distributed --world-size N --rank R inject distributed config into YAML at runtime
- 11 integration tests: C-DDP-001 weight consistency via BLAKE3, 4-worker ring AllReduce, per-block reverse-order AllReduce
Architecture:
Process 0 (rank=0): Process 1 (rank=1):
GradientServer (bg thread)
DistributedCudaTrainer DistributedCudaTrainer
└─ CudaTransformerTrainer (GPU 0) └─ CudaTransformerTrainer (GPU 1)
└─ WorkerClient → TCP ─────────────────── WorkerClient → TCP
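The ring AllReduce at the core of this design can be illustrated with a pure-numpy simulation in which the "network" is just list indexing: a reduce-scatter pass leaves each rank owning one fully summed chunk, then an all-gather pass circulates the finished chunks. Illustrative only; entrenar's RingAllReduce moves real gradient payloads over TCP:

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring AllReduce over N workers' gradient vectors.
    Reduce-scatter: over n-1 steps each rank forwards one chunk to its
    right neighbour, so rank r ends up owning the full sum of chunk (r+1)%n.
    All-gather: the finished chunks travel around the ring once more.
    Returns the element-wise mean at every rank."""
    n = len(grads)
    chunks = [list(np.array_split(np.asarray(g, dtype=float), n)) for g in grads]
    for step in range(n - 1):  # reduce-scatter
        # snapshot all sends first to emulate simultaneous exchange
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy())
                 for r in range(n)]
        for r, ci, data in sends:
            chunks[(r + 1) % n][ci] = chunks[(r + 1) % n][ci] + data
    for step in range(n - 1):  # all-gather
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy())
                 for r in range(n)]
        for r, ci, data in sends:
            chunks[(r + 1) % n][ci] = data
    return [np.concatenate([c / n for c in chunks[r]]) for r in range(n)]
```

Each rank transmits roughly 2 × (n−1)/n of its gradient volume in total, independent of n, which is why the ring topology scales.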
9.4 Original Repartir Gaps (Stretch)
The original plan for distributed training via a standalone repartir crate
is now partially superseded by entrenar’s native DDP, but some gaps remain
relevant for cross-vendor GPU support:
Gap ALB-002: Ring all-reduce (now partially implemented in entrenar itself).
Gap ALB-004: Unified CUDA + wgpu backend dispatch in entrenar.
Gap ALB-005: trueno wgpu backward pass (gradient WGSL shaders).
The distillation architecture (Section 9.2) achieves multi-machine utilization without any of these.
9.5 W5700X Role
The W5700X GPUs (2x 8GB each) can assist with:
- Eval inference: Run benchmarks on latest checkpoint via wgpu/Vulkan
- Partial KV cache offload: Assist CPU-based teacher inference
- Future: Participate in gradient-parallel training once ALB-005 is resolved
10. Pipeline Orchestration (apr pipeline + forjar DAG)
10.1 Architecture: One Manifest, One DAG
The entire albor pipeline — from bare metal to published model — lives in a
single YAML manifest: configs/pipeline/albor.yaml. Forjar’s DAG engine
resolves dependencies, tracks state, and dispatches steps across machines.
apr pipeline wraps forjar, so the user never calls forjar directly.
apr pipeline plan configs/pipeline/albor.yaml # Show full DAG, estimate everything
apr pipeline apply configs/pipeline/albor.yaml # Execute (resumable)
apr pipeline status # Show what's converged/pending/failed
apr pipeline drift # Detect unauthorized state changes
How it works:
configs/pipeline/albor.yaml
│
apr pipeline plan/apply
│
forjar DAG engine
(Kahn's toposort)
│
┌────────────┬───────┴───────┬────────────┐
│ │ │ │
infra resources │ task resources │
(package, gpu, │ (run apr cmds, │
file, mount, │ track output) │
model) │ │ │
│ │ │ │
forjar native │ apr train apply │
convergence │ apr distill apply │
│ apr eval apply │
│ apr publish apply │
│ │ │
state/lambda/ state/intel/
state.lock.yaml state.lock.yaml
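The toposort at the heart of the DAG engine is Kahn's algorithm. A self-contained Python sketch over a dependency map shaped like the manifest's depends_on fields (the resource names in the test are from the manifest; the function itself is illustrative):

```python
from collections import deque

def kahn_toposort(deps):
    """Kahn's algorithm over {resource: [depends_on, ...]}.
    Returns a valid execution order, or raises on a dependency cycle,
    which is what a DAG engine must check before dispatching anything.
    Assumes every listed dependency is itself a key of `deps`."""
    indeg = {n: 0 for n in deps}
    children = {n: [] for n in deps}
    for node, parents in deps.items():
        for p in parents:
            indeg[node] += 1
            children[p].append(node)
    # sorted() makes the order deterministic among ready nodes
    ready = deque(sorted(n for n, d in indeg.items() if d == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for c in children[n]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

Converged resources are then skipped at execution time via their BLAKE3 state hashes; the toposort only fixes the dispatch order.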
Key properties:
- Resumable: BLAKE3 hashes per resource. Re-run skips converged steps.
- Multi-machine: Infra + tasks dispatch to lambda or intel via SSH.
- Plan/apply: apr pipeline plan shows the full DAG with estimates before committing any resources. Exit 0 if valid, exit 1 with diagnostics.
- Idempotent: Same manifest, same state → zero changes (all NoOp).
- bashrs linted: All shell fragments in task command: fields are validated by bashrs (Rash v6.65) at plan time. No unvalidated shell reaches execution. bashrs is KING of linting: bashrs make lint validates Makefiles, bashrs lint validates shell scripts, bashrs classify classifies safety.
Dual orchestration:
- forjar manifest (configs/pipeline/albor.yaml): Infrastructure provisioning (GPU drivers, packages, directories, mounts, teacher model download). Blocked on type: task (ALB-027) for ML steps.
- batuta playbook (configs/pipeline/albor-playbook.yaml): ML pipeline orchestration (data prep, train, distill, finetune, merge, prune, quantize, eval, publish). 19-stage deterministic DAG with BLAKE3 caching. Validates successfully.
10.2 Pipeline Manifest: configs/pipeline/albor.yaml
version: "1.0"
name: albor-training-pipeline
description: "Sovereign Python code completion model — full pipeline"
machines:
lambda:
hostname: lambda
addr: 127.0.0.1
user: noah
arch: x86_64
roles: [gpu-train, student]
intel:
hostname: intel
addr: intel
user: noah
ssh_key: ~/.ssh/id_ed25519
arch: x86_64
roles: [teacher-inference, data-pipeline, eval, checkpoint-backup]
resources:
# ═══════════════════════════════════════════════════════════
# INFRASTRUCTURE (forjar native resources)
# ═══════════════════════════════════════════════════════════
cuda-driver:
type: gpu
machine: lambda
gpu_backend: nvidia
driver_version: "550"
cuda_version: "12.4"
persistence_mode: true
compute_mode: exclusive_process
vulkan-driver:
type: package
    machine: intel
    provider: apt
    state: present
    packages: [mesa-vulkan-drivers, vulkan-tools, libvulkan-dev]

  data-dir:
    type: file
    machine: [lambda, intel]
    path: /data/albor
    state: directory
    mode: "0755"

  teacher-model:
    type: model
    machine: intel
    name: Qwen/Qwen3-Coder-Next
    state: present
    cache_dir: /data/albor/models/teacher
    depends_on: [data-dir]

  checkpoint-share:
    type: mount
    machine: intel
    source: "lambda:/data/albor/checkpoints"
    path: /data/albor/checkpoints
    fstype: nfs
    options: "rw,sync,no_subtree_check"
    depends_on: [data-dir]

  logit-share:
    type: mount
    machine: lambda
    source: "intel:/data/albor/teacher-logits"
    path: /data/albor/teacher-logits
    fstype: nfs
    options: "ro,sync"
    depends_on: [data-dir]

  # ═══════════════════════════════════════════════════════════
  # DATA PIPELINE (task resources — call apr subcommands)
  # ═══════════════════════════════════════════════════════════
  ingest-local:
    type: task
    machine: lambda
    command: >
      alimentar import local ../depyler/examples/ ../depyler/tdd-book/tests/
      --lang python --output ./data/local/depyler.parquet &&
      alimentar import local ../hf-ground-truth-corpus/
      --lang python --output ./data/local/hf-gtc.parquet &&
      alimentar import local ../jax-ground-truth-corpus/
      --lang python --output ./data/local/jax-gtc.parquet &&
      alimentar import local ../vllm-ground-truth-corpus/
      --lang python --output ./data/local/vllm-gtc.parquet
    output_artifacts: ["./data/local/*.parquet"]
    depends_on: [data-dir]

  ingest-external:
    type: task
    machine: lambda
    command: >
      alimentar import hf bigcode/starcoderdata --lang python
      --output ./data/starcoder-python/ &&
      alimentar import hf HuggingFaceFW/fineweb-edu
      --output ./data/fineweb-edu/
    output_artifacts: ["./data/starcoder-python/", "./data/fineweb-edu/"]
    depends_on: [data-dir]

  data-mix:
    type: task
    machine: lambda
    command: >
      alimentar quality check ./data/ --profile ml-training &&
      alimentar mix
      --input ./data/local/depyler.parquet --weight 0.025 --upsample 10
      --input ./data/local/hf-gtc.parquet --weight 0.025 --upsample 10
      --input ./data/local/jax-gtc.parquet --weight 0.025 --upsample 10
      --input ./data/local/vllm-gtc.parquet --weight 0.025 --upsample 10
      --input ./data/starcoder-python/ --weight 0.40
      --input ./data/fineweb-edu/ --weight 0.20
      --input ./data/processed/python-docs.parquet --weight 0.10
      --output ./data/mixed/ --seed 42 --shuffle
    output_artifacts: ["./data/mixed/"]
    depends_on: [ingest-local, ingest-external]

  tokenize:
    type: task
    machine: lambda
    command: >
      apr tokenize plan --input ./data/mixed/*.parquet --vocab-size 32768
      --output ./models/albor-tokenizer/ &&
      apr tokenize apply --input ./data/mixed/*.parquet --vocab-size 32768
      --output ./models/albor-tokenizer/ --seed 42 &&
      apr tokenize apply --tokenizer ./models/albor-tokenizer/
      --input ./data/mixed/*.parquet --output ./data/tokenized/
      --max-seq-len 2048
    output_artifacts: ["./models/albor-tokenizer/", "./data/tokenized/"]
    depends_on: [data-mix]

  # ═══════════════════════════════════════════════════════════
  # TRAINING (task resources — long-running, checkpoint-aware)
  # ═══════════════════════════════════════════════════════════
  train-50m:
    type: task
    machine: lambda
    command: >
      apr train plan configs/train/pretrain-50m.yaml &&
      apr train apply configs/train/pretrain-50m.yaml --seed 42
    output_artifacts: ["./checkpoints/albor-base-50m/"]
    completion_check: "test -f ./checkpoints/albor-base-50m/checkpoint-best.safetensors"
    depends_on: [tokenize, cuda-driver]

  train-350m:
    type: task
    machine: lambda
    command: >
      apr train plan configs/train/pretrain-350m.yaml &&
      apr train apply configs/train/pretrain-350m.yaml --seed 42
    output_artifacts: ["./checkpoints/albor-base-350m/"]
    completion_check: "test -f ./checkpoints/albor-base-350m/checkpoint-best.safetensors"
    depends_on: [train-50m]

  # ═══════════════════════════════════════════════════════════
  # DISTILLATION (cross-machine: intel produces logits, lambda trains)
  # ═══════════════════════════════════════════════════════════
  distill-logits:
    type: task
    machine: intel
    command: >
      apr distill plan configs/train/distill.yaml &&
      apr distill apply configs/train/distill.yaml --stage precompute
    output_artifacts: ["./data/teacher-logits/"]
    completion_check: "test -d ./data/teacher-logits/ && ls ./data/teacher-logits/*.parquet"
    depends_on: [train-350m, teacher-model, logit-share]

  distill:
    type: task
    machine: lambda
    command: >
      apr distill apply configs/train/distill.yaml --stage train --seed 42
    output_artifacts: ["./checkpoints/albor-distill/"]
    completion_check: "test -f ./checkpoints/albor-distill/checkpoint-best.safetensors"
    depends_on: [distill-logits]

  # ═══════════════════════════════════════════════════════════
  # POST-TRAINING LADDER (sequential, each depends on previous)
  # ═══════════════════════════════════════════════════════════
  finetune:
    type: task
    machine: lambda
    command: >
      apr finetune plan configs/train/finetune-lora.yaml &&
      apr finetune apply configs/train/finetune-lora.yaml
    output_artifacts: ["./checkpoints/albor-instruct/"]
    depends_on: [distill]

  merge:
    type: task
    machine: lambda
    command: >
      apr merge plan --models albor-distill-350m,albor-instruct-350m
      --method slerp --weight 0.6 --output ./checkpoints/albor-merged/ &&
      apr merge apply --models albor-distill-350m,albor-instruct-350m
      --method slerp --weight 0.6 --output ./checkpoints/albor-merged/
    output_artifacts: ["./checkpoints/albor-merged/"]
    depends_on: [finetune]

  prune:
    type: task
    machine: lambda
    command: >
      apr prune plan --model ./checkpoints/albor-merged-350m/
      --method wanda --sparsity 0.5 --output ./checkpoints/albor-pruned/ &&
      apr prune apply --model ./checkpoints/albor-merged-350m/
      --method wanda --sparsity 0.5 --output ./checkpoints/albor-pruned/
    output_artifacts: ["./checkpoints/albor-pruned/"]
    depends_on: [merge]

  quantize:
    type: task
    machine: lambda
    command: >
      apr quantize plan --model ./checkpoints/albor-merged-350m/
      --method q4_k --output ./checkpoints/albor-q4/ &&
      apr quantize apply --model ./checkpoints/albor-merged-350m/
      --method q4_k --output ./checkpoints/albor-q4/
    output_artifacts: ["./checkpoints/albor-q4/"]
    depends_on: [merge]

  # ═══════════════════════════════════════════════════════════
  # EVALUATION (can run on intel concurrently with training)
  # ═══════════════════════════════════════════════════════════
  eval-code:
    type: task
    machine: lambda
    command: >
      apr eval plan --model ./checkpoints/albor-merged-350m/
      --tasks humaneval,humaneval_fim,mbpp,ds1000 &&
      apr eval apply --model ./checkpoints/albor-merged-350m/
      --tasks humaneval,humaneval_fim,mbpp,ds1000
      --output ./eval/python-code-results.json --seed 42
    output_artifacts: ["./eval/python-code-results.json"]
    depends_on: [merge]

  eval-general:
    type: task
    machine: intel
    command: >
      apr eval apply --model ./checkpoints/albor-merged-350m/
      --tasks arc_easy,hellaswag,piqa,lambada
      --output ./eval/general-results.json --seed 42
    output_artifacts: ["./eval/general-results.json"]
    depends_on: [merge, checkpoint-share]

  # ═══════════════════════════════════════════════════════════
  # RELEASE
  # ═══════════════════════════════════════════════════════════
  export:
    type: task
    machine: lambda
    command: >
      apr export plan --model ./checkpoints/albor-q4/ --format gguf &&
      apr export apply --model ./checkpoints/albor-q4/ --format gguf
      --output ./release/albor-350m-q4_k.gguf &&
      apr export apply --model ./checkpoints/albor-merged-350m/
      --format safetensors
      --output ./release/albor-350m.safetensors
    output_artifacts: ["./release/"]
    depends_on: [quantize, eval-code]

  publish:
    type: task
    machine: lambda
    command: >
      apr publish plan --model ./release/ --hub paiml/albor-350m &&
      apr publish apply --model ./release/ --hub paiml/albor-350m
    depends_on: [export, eval-general]

policy:
  failure: stop_on_first
  parallel_machines: true
  retry: 2
  bashrs_lint: true  # Validate all task command: fields via bashrs
10.3 Pipeline Workflow
# Show full DAG with time/resource estimates (no side effects)
apr pipeline plan configs/pipeline/albor.yaml
# Execute everything (resumable — skips converged steps)
apr pipeline apply configs/pipeline/albor.yaml
# Check what's done, what's pending, what failed
apr pipeline status
# Detect unauthorized changes to converged resources
apr pipeline drift
# Re-run only failed steps (everything else is NoOp)
apr pipeline apply configs/pipeline/albor.yaml
# Force re-run a specific resource and its dependents
apr pipeline apply configs/pipeline/albor.yaml --target train-350m --force
10.4 The task Resource Type (ALB-027)
The task resource is what makes forjar a pipeline orchestrator, not just an
infrastructure tool. It runs an arbitrary command, tracks completion, and
hashes output artifacts for idempotency.
| Field | Type | Description |
|---|---|---|
| `command` | string | Shell command to execute (bashrs-validated at plan time) |
| `output_artifacts` | list[string] | Paths to hash for idempotency (glob-supported) |
| `completion_check` | string | Optional shell expression to verify completion (e.g., checkpoint exists) |
| `timeout` | duration | Max wall time before Andon stop (default: none) |
| `resume_command` | string | Optional command for resuming interrupted long-running tasks |
Idempotency for ML tasks: A task resource is considered converged when:
- The `command` exited 0 on a previous run, AND
- The BLAKE3 hash of `output_artifacts` matches the lock file, AND
- The `completion_check` (if set) passes
If any of these fail, the task is re-run. For training jobs that crashed
mid-run, the command itself includes --resume logic (e.g., apr train apply auto-detects and resumes from the latest checkpoint).
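The three convergence conditions can be sketched as a single predicate. This is a minimal illustration, not forjar's actual API: the type and function names are hypothetical, and std's `DefaultHasher` over in-memory `(path, bytes)` pairs stands in for BLAKE3 over artifact files.

```rust
use std::hash::{Hash, Hasher};

// Illustrative stand-in for forjar's task convergence state.
struct TaskState {
    last_exit_code: Option<i32>,       // exit code of the previous `command` run
    locked_artifact_hash: Option<u64>, // hash recorded in the lock file
}

// Hash (path, contents) pairs in order; a stand-in for b3sum over artifacts.
fn hash_artifacts(artifacts: &[(&str, &[u8])]) -> u64 {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    for (path, bytes) in artifacts {
        path.hash(&mut h);
        bytes.hash(&mut h);
    }
    h.finish()
}

// Converged = exit 0 AND unchanged artifact hash AND completion_check passes.
// `None` for the check means no completion_check was configured.
fn is_converged(
    state: &TaskState,
    artifacts: &[(&str, &[u8])],
    completion_check_passed: Option<bool>,
) -> bool {
    state.last_exit_code == Some(0)
        && state.locked_artifact_hash == Some(hash_artifacts(artifacts))
        && completion_check_passed.unwrap_or(true)
}
```

Any failed condition makes the task non-converged, so `apr pipeline apply` re-runs it; a changed artifact hash alone is enough to trigger the re-run.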
10.5 Why Not Makefile / Shell Scripts
| Approach | DAG | State | Resume | Multi-Machine | Lint |
|---|---|---|---|---|---|
| `apr pipeline` (forjar) | Kahn’s toposort | BLAKE3 lock files | Automatic (skip converged) | Native SSH dispatch | bashrs at plan time |
| Makefile | File timestamps only | None | Manual | None (SSH in recipes) | None |
| Shell scripts | Sequential only | None | Manual | Manual SSH | ShellCheck (external) |
The Makefile and shell scripts are eliminated. One manifest. One DAG. One tool.
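The "Kahn's toposort" entry in the table can be made concrete. The following is an illustrative sketch only (not forjar's code): a DAG keyed by resource name with `depends_on` edges, ordered by Kahn's algorithm, returning `None` when the manifest has a dependency cycle.

```rust
use std::collections::{BTreeMap, VecDeque};

// Order task resources so every resource runs after its depends_on entries.
// Sketch of Kahn's algorithm; forjar's engine differs in the details.
fn topo_order(deps: &BTreeMap<&str, Vec<&str>>) -> Option<Vec<String>> {
    let mut indegree: BTreeMap<&str, usize> = BTreeMap::new();
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (&node, ds) in deps {
        indegree.insert(node, ds.len());
        for &d in ds {
            dependents.entry(d).or_default().push(node);
        }
    }
    // Seed the queue with resources that have no dependencies.
    let mut ready: VecDeque<&str> = indegree
        .iter()
        .filter(|&(_, &deg)| deg == 0)
        .map(|(&n, _)| n)
        .collect();
    let mut order = Vec::new();
    while let Some(node) = ready.pop_front() {
        order.push(node.to_string());
        for &dep in dependents.get(node).into_iter().flatten() {
            let deg = indegree.get_mut(dep)?;
            *deg -= 1;
            if *deg == 0 {
                ready.push_back(dep);
            }
        }
    }
    // A cycle leaves nodes with nonzero in-degree, so they are never emitted.
    (order.len() == deps.len()).then_some(order)
}
```

This is exactly what a Makefile cannot express for long-running ML jobs: the order is derived from declared edges, so adding `depends_on: [train-50m]` to `train-350m` reorders the whole plan without touching any recipe.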
11. Gap Register
Every gap discovered during development is tracked here. Each gap maps to a specific upstream component, a GitHub issue, and a clear acceptance criterion.
Lifecycle: Gap discovered → GitHub issue filed → implemented upstream →
wired into apr → dogfooded in albor pipeline → FALSIFY/pmat verified → closed.
| Status | Meaning |
|---|---|
| OPEN | Gap identified, not yet implemented |
| IN PROGRESS | GitHub issue filed, work underway |
| DOGFOODING | Implemented, being validated in albor pipeline |
| CLOSED | Verified working end-to-end, issue closed |
11.1 Critical Path Gaps (Block the Improvement Ladder)
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-001 | #6 | apr (aprender) | apr tokenize plan/apply subcommand | Medium | FIXED | apr tokenize plan validates inputs + estimates time; apr tokenize apply trains BPE/WordPiece/Unigram tokenizer (aprender@90427205). Writes vocab.json + merges.txt. |
| ALB-006 | #7 | apr (aprender) | apr eval plan/apply benchmark harness | High | FIXED | apr eval --task code --data benchmark.jsonl evaluates code completion with pass@1 scoring. apr eval --task plan validates model + data exist. JSONL format with prompt/test/canonical_solution. Phase 1: structural validation. Phase 2: full inference (ALB-009 prerequisite). (aprender@4e61297e) |
| ALB-007 | #8 | entrenar | Parquet→LMBatch bridge via alimentar | Medium | FIXED | load_lm_batches_from_parquet() reads text or pre-tokenized Parquet (single file or directory of shards) via alimentar. Text columns tokenized with HfTokenizer. Column auto-detection (input_ids/token_ids for pre-tokenized, text/content/code for text). Gated behind parquet feature. (entrenar@a5a2fb7) |
| ALB-009 | #1 | apr (entrenar) | apr train plan/apply for pre-training from scratch | Critical | FIXED | apr train plan --task pretrain --config <yaml> validates config via entrenar, shows model architecture and training params. apr train apply --task pretrain --config <yaml> runs full pre-training via train_from_yaml() (TransformerTrainer + CausalLMLoss). Config updated to match entrenar TrainSpec schema. (aprender@d79ed943) |
| ALB-010 | #2 | realizar | Qwen3.5-35B-A3B MoE inference (teacher for distillation) | Critical | DOGFOODING | Steps 1-5b MERGED (PR #133): types, router, expert dispatch, forward integration, shared expert gate, architecture registration, config fields. Step 6 (PR #135): SafeTensors MoE weight loading — detect_model_prefix (ConditionalGeneration wrapper), extract_layer_generic_with_prefix, load_moe_weights (router, packed experts, shared expert), GPU adapter wiring. 15,054 tests pass. Remaining: end-to-end dogfood with Qwen3.5-35B-A3B model files. |
| ALB-011 | #3 | apr (entrenar + realizar) | apr distill plan/apply (precompute + train stages) | Critical | FIXED | apr distill --config <yaml> --plan validates config, shows teacher/student/training params. apr distill --config <yaml> --stage precompute inspects teacher, writes manifest. apr distill --config <yaml> --stage train validates precompute manifest, sets up KD training. Local DistillYamlConfig matches entrenar schema. (aprender@81dd4432) |
| ALB-018 | #19 | entrenar/alimentar | Fill-in-the-Middle (FIM) data transform (PSM/SPM) | High | FIXED | alimentar fim transform with PSM/SPM formats, configurable rate/seed (alimentar@290582d). Fim struct implements Transform trait for pipeline integration. |
| ALB-019 | #20 | alimentar | alimentar import local for local Python files | Medium | FIXED | alimentar import local subcommand now available (alimentar@265541b). Supports CSV/JSON/JSONL/Parquet format conversion. |
| ALB-020 | #21 | alimentar | alimentar mix with weighted upsampling | Medium | FIXED | alimentar mix with weighted sampling and upsampling now available (alimentar@64b1e92). Syntax: alimentar mix a.parquet:0.8 b.parquet:0.2 -o out.parquet. |
| ALB-021 | #22 | entrenar | Custom model architecture params in YAML | High | FIXED | ArchitectureOverrides struct carries YAML manifest architecture: params through bridge converter to TransformerConfig. Supports all fields: hidden_size, num_layers, num_heads, num_kv_heads, intermediate_size, vocab_size, max_seq_length, rms_norm_eps, rope_theta, use_bias. (entrenar@a414861) |
| ALB-022 | #23 | entrenar | Human-readable value shorthand in YAML configs | Low | FIXED | parse_human_usize() and deserialize_human_usize_opt support SI suffixes (32K, 1M, 10B, 1T), scientific notation (1e6), and fractional suffixes (1.5K). Applied to ArchitectureConfig and DataConfig fields. (entrenar@1cb0950) |
| ALB-023 | #24 | apr (aprender) | Plan/apply contract for all subcommands | High | FIXED | Every apr <cmd> action command now exposes plan mode: merge --plan, export --plan, publish --plan added to join existing train plan/apply, tokenize plan/apply, quantize --plan, finetune --plan, prune --plan, distill --plan, eval --task plan. Pre-dispatch contract validation skipped in plan mode. (aprender@526a1e4b) |
| ALB-024 | #25 | apr (aprender) | apr experiment view — interactive SQLite experiment browser | Medium | FIXED | apr experiment view --global opens ratatui TUI with run table, sparkline, and braille loss chart. --json mode for CI. Reads local or global ~/.entrenar/experiments.db. (aprender@1196d244) |
| ALB-025 | #26 | presentar + apr | apr monitor upgrade — presentar widgets for live training TUI | Medium | FIXED | TrainingDashboard composes presentar-terminal Meter, GpuPanel, Sparkline, Text, Border, Layout (ALB-057). TuiApp handles resize/Ctrl+C/diffing (ALB-047/048). WASM compilation deferred to ALB-026. (entrenar@0ad416e) |
| ALB-026 | #27 | presentar | WASM training dashboard — albor-dashboard.yaml | Medium | OPEN | Declarative YAML dashboard config that renders training metrics, experiment comparison, and model card via presentar serve. Embeddable in HuggingFace model card as static WASM artifact. |
| ALB-027 | #4 | forjar | task resource type for pipeline orchestration | Critical | FIXED | New forjar resource type: runs arbitrary command, tracks exit code, hashes output_artifacts for idempotency via b3sum, supports completion_check and timeout. Handlers: check_script (completion_check or artifact existence), apply_script (set -euo pipefail, working_dir, timeout), state_query_script (b3sum artifacts). Validation: command required, timeout > 0. (forjar@d14e633) |
| ALB-028 | #5 | apr (aprender) | apr pipeline plan/apply wrapping forjar DAG engine | Critical | FIXED | apr pipeline plan shows full DAG with 23 resources across 2 machines. apr pipeline apply converges via forjar engine. apr pipeline status shows state. apr pipeline validate checks manifest. Shells out to forjar binary (decoupled). (aprender@e653d5ca) |
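The ALB-022 human-readable shorthand is easy to sketch. This is an illustrative reimplementation, not entrenar's `parse_human_usize` (which handles more cases); note the suffixes are decimal SI multipliers, so `32K` is 32,000, not 32,768.

```rust
// Parse values like "32K", "1.5M", "10B", "1e6" into usize.
// Sketch of the ALB-022 shorthand; decimal SI suffixes, scientific
// notation and fractional prefixes via f64 parsing.
fn parse_human_usize(s: &str) -> Option<usize> {
    let s = s.trim();
    let (num, mult): (&str, f64) = match s.chars().last()? {
        'k' | 'K' => (&s[..s.len() - 1], 1e3),
        'm' | 'M' => (&s[..s.len() - 1], 1e6),
        'b' | 'B' => (&s[..s.len() - 1], 1e9),
        't' | 'T' => (&s[..s.len() - 1], 1e12),
        _ => (s, 1.0), // no suffix: plain integer or scientific notation
    };
    let value: f64 = num.parse().ok()?; // handles "1.5" and "1e6"
    if value < 0.0 {
        return None;
    }
    Some((value * mult) as usize)
}
```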
11.2 Distributed Training Gaps (Stretch / Future)
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-002 | #9 | repartir | Ring all-reduce implementation | High | OPEN | Gradient tensors synchronized across 2+ workers with <5% overhead |
| ALB-003 | #10 | entrenar | repartir integration for distributed training | High | OPEN | Training loop calls repartir::GradientSync for multi-worker training |
| ALB-004 | #11 | entrenar | Unified CUDA + wgpu backend dispatch | Medium | OPEN | Same training config runs on CUDA (4090) and wgpu (W5700X) |
| ALB-005 | #12 | trueno | wgpu backward pass (gradient WGSL shaders) | High | OPEN | Compute shaders for matmul_backward, gelu_backward, rmsnorm_backward, attention_backward |
| ALB-008 | #13 | repartir | Heterogeneous worker throughput balancing | Medium | OPEN | Workers with different GPU speeds get proportional workload |
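The ALB-002 algorithm can be illustrated without a network. Below is a simulation over in-memory "workers" (a sketch only; repartir's implementation moves real buffers over transport): a reduce-scatter phase where, after N−1 ring steps, worker i owns the fully summed chunk (i+1) mod N, then an all-gather phase that circulates the owned chunks. Each worker transfers O(len) data total, independent of worker count, which is why the ring schedule scales.

```rust
// Simulated ring all-reduce: after the call, every worker's gradient
// vector holds the elementwise sum across all workers.
fn ring_all_reduce(workers: &mut [Vec<f32>]) {
    let n = workers.len();
    let len = workers[0].len();
    assert!(n > 1 && len % n == 0, "sketch assumes chunk-aligned gradients");
    let chunk = len / n;
    let slice = |c: usize| c * chunk..(c + 1) * chunk;
    // Reduce-scatter: at step s, worker i sends chunk (i - s) mod n to the
    // next worker in the ring, which adds it in place. Sends are snapshotted
    // first to mimic simultaneous exchange.
    for s in 0..n - 1 {
        let sends: Vec<Vec<f32>> = (0..n)
            .map(|i| workers[i][slice((i + n - s) % n)].to_vec())
            .collect();
        for i in 0..n {
            let from = (i + n - 1) % n;
            let c = (from + n - s) % n;
            for (j, v) in sends[from].iter().enumerate() {
                workers[i][c * chunk + j] += v;
            }
        }
    }
    // All-gather: at step s, worker i forwards its completed chunk
    // (i + 1 - s) mod n; the receiver overwrites its stale copy.
    for s in 0..n - 1 {
        let sends: Vec<Vec<f32>> = (0..n)
            .map(|i| workers[i][slice((i + 1 + n - s) % n)].to_vec())
            .collect();
        for i in 0..n {
            let from = (i + n - 1) % n;
            let c = (from + 1 + n - s) % n;
            workers[i][slice(c)].copy_from_slice(&sends[from]);
        }
    }
}
```

The ALB-002 acceptance criterion ("<5% overhead") is about hiding these 2(N−1) transfer steps behind backward compute, which the in-memory sketch cannot show; it only pins down the data movement pattern.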
11.3 Quality & Verification Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-013 | #14 | provable-contracts | Knowledge distillation contract | High | DOGFOODING | knowledge-distillation-kernel-v1.yaml — committed and passes pv validate. 3 equations, 6 obligations, 5 falsification tests, 2 Kani harnesses. Needs binding to entrenar implementation. |
| ALB-014 | #15 | provable-contracts | BPE tokenizer contract | Medium | DOGFOODING | bpe-tokenizer-kernel-v1.yaml — committed and passes pv validate. Roundtrip invariant, FIM sentinel tests. Needs binding to aprender BPE. |
| ALB-015 | #16 | provable-contracts | Model merging contract (SLERP, TIES, DARE) | Medium | DOGFOODING | model-merging-kernel-v1.yaml — committed and passes pv validate. SLERP bound, DARE unbiased estimator. Needs binding. |
| ALB-016 | #17 | provable-contracts | Pruning contract (WANDA, magnitude) | Medium | DOGFOODING | pruning-kernel-v1.yaml — committed and passes pv validate. Sparsity invariant, score ordering. Needs binding. |
| ALB-017 | #18 | provable-contracts | Gradient accumulation contract | High | DOGFOODING | gradient-accumulation-kernel-v1.yaml — committed and passes pv validate. Numerical equivalence, gradient zeroing. Needs binding. |
Contract coverage report (pv coverage contracts): 8 contracts, 31 equations, 51 obligations, 34 falsification tests, 10 Kani harnesses, 100% obligation coverage. All contracts at impl=0/N — waiting for upstream bindings.
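The gradient-accumulation contract's numerical-equivalence obligation (ALB-017) is the easiest to state concretely: summing per-microbatch mean gradients, each scaled by its share of the full batch, must reproduce the full-batch mean gradient. A toy 1-D least-squares sketch (illustrative only; the real contract binds to entrenar's tensor gradients):

```rust
// Full-batch mean gradient of sum-of-squares loss for a 1-D linear model:
// d/dw mean_i (w*x_i - y_i)^2 = mean_i 2*(w*x_i - y_i)*x_i
fn mean_grad(w: f32, xs: &[f32], ys: &[f32]) -> f32 {
    let n = xs.len() as f32;
    xs.iter()
        .zip(ys)
        .map(|(&x, &y)| 2.0 * (w * x - y) * x)
        .sum::<f32>()
        / n
}

// Accumulated gradient: each microbatch contributes its own mean gradient
// weighted by (microbatch size / full batch size).
fn accumulated_grad(w: f32, xs: &[f32], ys: &[f32], micro: usize) -> f32 {
    let n = xs.len() as f32;
    xs.chunks(micro)
        .zip(ys.chunks(micro))
        .map(|(mx, my)| mean_grad(w, mx, my) * mx.len() as f32 / n)
        .sum()
}
```

The equivalence must hold for every microbatch size, including ragged final chunks; that invariance is what the contract's falsification tests probe.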
11.4 Dogfooding-Discovered Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-029 | #28 | batuta | batuta falsify false positives on project repos | Medium | FIXED | Fixed upstream in batuta@905a862: AI-01 searches configs/, AI-04 excludes book-output/, AI-05 detects pv/forjar validation. Score: 72.2% → 73.1%. |
| ALB-030 | #29 | batuta | batuta stack status fails without Cargo.toml | Low | FIXED | Fixed upstream in batuta@371557a: Falls back to binary detection, discovers 11 installed PAIML tools with versions. |
| ALB-031 | #30 | batuta | batuta hf search returns mock/placeholder data | Low | OPEN | batuta hf search model "code completion" returns live HuggingFace Hub results instead of placeholder models. |
| ALB-033 | #31 | apr (aprender) | apr tokenize → entrenar tokenizer.json format gap | Medium | DOGFOODING | apr tokenize apply produces vocab.json + merges.txt but entrenar expects HuggingFace tokenizer.json. Workaround: Python tokenizers lib. |
| ALB-034 | #32 | entrenar | max_steps config not respected in training loop | Medium | FIXED | max_steps wired through YAML manifest → bridge → TrainingParams → TransformerTrainConfig → trainer loop. Training stops when optimizer step count reaches limit (entrenar@07db101). |
| ALB-035 | #33 | entrenar | Does not write training_state.json during training | Medium | FIXED | Added train_epoch_with_callback() and per-step logging (~100 lines/epoch) in entrenar@5d41a96. |
| ALB-036 | #34 | apr (aprender) | BPE tokenizer normalizes whitespace | Medium | DOGFOODING | split_whitespace() pre-tokenizer destroys Python indentation. Workaround: ByteLevel BPE v2. |
| ALB-037 | #35 | realizar | SafeTensors inference ignores loaded weights | High | FIXED | Root cause chain: ALB-038 (no gradient flow) → ALB-043 (backward_ffn buffer overflow + wrong SwiGLU gradients). Secondary: entrenar didn’t save config.json (entrenar@6097780). Verified e2e: realizar run loads 350M trained checkpoint (218 tensors), generates tokens from learned weights. |
| ALB-038 | #36 | entrenar | Saves initialization weights, not trained weights | Critical | FIXED | Root cause: RMSNorm::forward_batched() created tensors with no backward op, blocking all gradient flow. Attention forward() also broke Q/K/V gradients. Fixed in entrenar@91ba9da (norm backward) and entrenar@1ede409 (attention backward). All 20 model parameters now receive gradients. |
| ALB-040 | #38 | entrenar | GPU-resident pretraining — wire CudaTransformerBlock into TransformerTrainer | Critical | VERIFIED | CudaTransformerTrainer in cuda_trainer.rs follows classify_pipeline.rs pattern. 3 PCIe transfers/step vs 16K. Auto-detect CUDA with graceful CPU fallback. Contract: training-gpu-kernel-v1.yaml. 350M verified: 50-step test loss 10.39→6.07, checkpoint valid, realizar loads + generates. Full training running (seq=1024, batch=4, accum=128). |
| ALB-041 | #39 | entrenar | D2D buffer size mismatch in CudaTransformerBlock backward_attention | High | FIXED | backward_attention() used gate_out (intermediate_size) as temp buffer for grad_hidden accumulation, but D2D copy requires exact size match. Fixed: use o_proj_out (hidden_size). Also added seq_len truncation and error logging in CudaTransformerTrainer. (entrenar@a48e3d2) |
| ALB-042 | #40 | entrenar | CudaTransformerTrainer runtime errors → silent loss=0.0 instead of CPU fallback | Medium | OPEN | When CUDA operations fail during training (e.g., VRAM contention), trainer should detect N consecutive failures and gracefully fall back to CPU mode. Currently reports loss=0.0 and saves garbage checkpoint. Workaround: CUDA_VISIBLE_DEVICES="". |
| ALB-043 | #41 | entrenar | backward_ffn buffer overflow + missing SwiGLU gradients | Critical | FIXED | Two bugs: (1) silu_backward wrote [S,I] output into [S,H] buffer (4× overflow → CUDA_ERROR_ILLEGAL_ADDRESS). (2) SwiGLU backward missing ×up factor in gate gradient; grad_up/grad_w_up completely absent (w_up never trained). Fixed with correct 10-step decomposition using elementwise_mul_forward, silu_forward, silu_backward. (entrenar@f7805f1) |
| ALB-044 | #42 | entrenar | Unclipped activation gradients + CPU optimizer hyperparameter mismatch cause 350M NaN | Critical | FIXED | Two bugs: (1) Activation gradient from block[0] backward (~1e35) unclipped — per-block clipping only applies to weight gradients in CudaGradWorkspace. (2) CPU AdamW used default_params(lr) (β₂=0.999, wd=0.01) instead of YAML config (β₂=0.95, wd=0.1) — 50× bias correction amplification overflows f32. Fixed: C-EMBED-GRAD-001 clips activation gradient before scatter-add; CPU optimizer matches YAML hyperparams. 350M now trains without NaN. |
| ALB-045 | — | entrenar | train_loop_cuda does not write training_state.json — apr monitor blind to pretraining | Critical | FIXED | write_training_snapshot() helper in src/config/train/loader.rs writes TrainingSnapshot to training_state.json on every log interval. Both train_loop_cuda and train_loop_cpu now emit Initializing→Running→Completed snapshots. Verified: apr monitor checkpoints/albor-base-350m/ shows live TUI with loss curve, GPU name, tok/s, progress during CUDA 350M pretraining. (entrenar@2ddc11c) |
| ALB-046 | — | entrenar | GPU telemetry all zeros in training_state.json — no live NVML/nvidia-smi data | High | FIXED | query_gpu_telemetry() shells out to nvidia-smi --query-gpu with CSV output, populates all GpuTelemetry fields. Wired into write_training_snapshot(). Verified: util=5%, VRAM=12.0G/24.0G, temp=41°C, power=94W/480W during 350M training (entrenar@9b53c13). |
| ALB-047 | — | entrenar | TUI monitor hardcodes width=80, no terminal resize handling | Medium | FIXED | Replaced hand-rolled renderer with presentar-terminal TuiApp. Gets terminal resize detection for free from crossterm backend + presentar’s smart diffing. TuiMonitorConfig.width/height retained for headless mode only (entrenar@9b53c13). |
| ALB-048 | — | entrenar | No signal handling in TUI monitor — Ctrl+C leaves cursor hidden | Medium | FIXED | presentar-terminal TuiApp::run() handles Ctrl+C/q with clean cursor restore, screen cleanup, and status message. No raw signal handlers needed — crossterm event loop + Drop impl (entrenar@9b53c13). |
| ALB-049 | — | entrenar | No keyboard input in TUI monitor — can’t scroll/pause/interact | Low | FIXED | presentar-terminal TuiApp provides crossterm event loop with q quit and Ctrl+C. Scroll/pause deferred to presentar widget-level interaction (GpuPanel, LossCurve already support focus). |
| ALB-050 | — | apr (aprender) | No apr runs ls — can’t list past training experiments | High | FIXED | apr runs ls reads local/global SQLite registry, shows table of runs with status, final loss, tok/s, duration. apr runs show <id> shows detailed metrics + hyperparameters. Supports --global, --json, --status filter. (aprender@91641f2e) |
| ALB-051 | — | apr (aprender) | No run comparison — can’t overlay loss curves from two runs | Medium | FIXED | apr runs diff <a> <b> shows side-by-side comparison: inline sparklines, loss trajectory overlay, config diff (only changed params), final metric comparison with verdict (winner by final loss). Supports --json for LLM agents. (aprender@9f9e9f63) |
| ALB-052 | — | entrenar | SQLite experiment tracking exists but not wired to pretraining | Medium | FIXED | PretrainTracker in config/train/loader.rs writes to both local and global SQLite stores. Uses existing SqliteBackend with ExperimentStorage trait. Logs experiment metadata, hyperparameters, and per-step metrics (loss, lr, tok/s). Best-effort — storage failures never block training. (entrenar@daa0afc) |
| ALB-053 | — | entrenar | HeadlessOutput JSON missing fields present in TUI | High | FIXED | HeadlessOutput now has full field parity with TUI: global_step, progress_percent, loss_history, lr_history, elapsed_seconds, optimizer_name, batch_size, model_path, checkpoint_path, executable_path, accuracy, samples_per_second, HeadlessSample. From<&TrainingSnapshot> populates all fields. All 6 headless tests pass. (entrenar@9b53c13) |
| ALB-054 | — | entrenar + apr | No multi-job monitoring — can’t watch multiple concurrent training runs | High | FIXED | apr monitor (no args) discovers active training runs from global SQLite registry (~/.entrenar/experiments.db). Checks for live training_state.json in registered output dirs. Lists active runs with experiment name, directory, run ID, start time. apr monitor <dir> attaches to specific run. Supports --json output for LLM agents. (aprender@91641f2e) |
| ALB-055 | — | entrenar | No local SQLite experiment DB per training run | High | FIXED | PretrainTracker opens <output_dir>/.entrenar/experiments.db for local per-experiment metrics history. Logs experiment metadata, hyperparameters (task, model, optimizer, lr, epochs, batch_size, seq_len, max_steps, device), and per-step metrics (loss, lr, tok/s). All best-effort via SqliteBackend. (entrenar@daa0afc) |
| ALB-056 | — | entrenar | No global SQLite experiment registry | High | FIXED | PretrainTracker opens ~/.entrenar/experiments.db for global cross-machine experiment registry. Same schema as local: experiment + run + hyperparams + per-step metrics. apr runs ls --global reads it. apr monitor (no args) discovers active runs from it. (entrenar@daa0afc) |
| ALB-057 | — | entrenar | Dashboard paints raw text instead of composing presentar widgets | Medium | FIXED | TrainingDashboard composes presentar-terminal widgets via Layout::rows(): Border for section panels, Meter for progress bar, GpuPanel for GPU telemetry (with GpuDevice/GpuProcess conversion from entrenar types), Sparkline for loss history, Text for info lines. Widget tree rebuilt each frame from snapshot. Panel verification wired into Brick::verify() via layout_can_render(). (entrenar@0ad416e) |
| ALB-058 | — | apr (aprender) | apr monitor --json flag missing | Medium | FIXED | apr monitor --json <dir> streams headless JSON output with full TUI parity (ALB-053). apr monitor --format text <dir> for human-readable log lines. --json flag overrides --format. Routes to HeadlessMonitor for JSON/text, TuiMonitor for TUI. (aprender@91641f2e) |
| ALB-059 | — | entrenar | GEMM backward constructor args n/k swapped — buffer overflow into optimizer states | Critical | FIXED | GemmBackwardAKernel::tiled_unrolled(m, k, n, tile) called with k and n swapped vs trueno constructor (m, n, k, tile_size). Bakes wrong stride constants into PTX: output stride = vocab_size (32768) instead of hidden_size (512) for LM head backward. Rows overflow 64× into adjacent VRAM (m_w_k, v_w_k of block 0). Negative values in v_w_k → sqrt(negative) = NaN in AdamW. Same bug in backward_b. Also zero-initialized all optimizer m/v buffers (cuMemAlloc returns uninitialized VRAM). (entrenar@846ae0c) |
| ALB-060 | — | entrenar / albor config | epochs: 1 exhausts data before max_steps reached — 350M trains only 43/5000 steps | Critical | CONFIG FIXED | Root cause: 22K seqs, batch=4, accum=128 → 43 steps/epoch, max_steps=5000 unreachable. Fix: C-TRAINCFG-001 contract + v2 config (pretrain-350m-v2.yaml) with 68K seqs, accum=1, steps_per_epoch=16994 >= 5000. v1 config also fixed with epochs=117. V2 training partially completed (ALB-063). |
| ALB-061 | #43 | albor docs | Monolithic spec stale — diverges from mdBook chapters | Medium | FIXED | scripts/generate-spec.sh regenerates docs/specifications/albor-llm-spec.md from mdBook chapters. make spec target added. |
| ALB-062 | #44 | albor docs | Stale spec chapters — §3 VRAM, §15/18 blockers, §16 repro, model card, intro | Medium | FIXED | All chapters updated to match reality: VRAM budget, ALB-025/037 no longer blockers, v2 pipeline in §16, ALB-060 context in model card and introduction. |
| ALB-063 | #45 | albor training | Retrain 350M with v2 config (corrected epochs + expanded data) | Critical | IN PROGRESS | ALB-069→072 all fixed. Training running: PID 1775202, ~4.4s/step (934 tok/s), save_interval=250, 5000 steps, ~11.8 GB VRAM. Loss 10.40→7.13 (step 169)→6.77 (step 338). Step 250 eval: val_loss=6.92, val_ppl=1008. Step 500 checkpoint verified OK (1520 MB). gnorm stable 2-9 range. |
| ALB-064 | #46 | albor / entrenar | Training process dies silently — no crash detection, no watchdog, no recovery | Critical | FIXED | scripts/train-guard.sh: crash-resilient supervisor with exit code classification, GPU state capture, structured JSON crash reports, exponential backoff restart, heartbeat monitoring, pre-flight GPU health checks. Auto-diagnostic mode: detects async CUDA crash pattern, enables CUDA_LAUNCH_BLOCKING=1 on restart. Five Whys: CUDA driver crash → SIGABRT/SIGSEGV → bypasses Rust panic handler → no stderr output → no diagnosis. Root cause: ALB-065. |
| ALB-065 | #47 | entrenar / trueno | Missing stream.synchronize() before D2H gradient transfers — async CUDA crash | Critical | FIXED | compute_workspace_clip_scale() and compute_clip_scale() call cuMemcpyDtoH without synchronizing the non-blocking CUDA stream. cuMemcpyDtoH only synchronizes with the default stream, but trueno creates streams with CU_STREAM_NON_BLOCKING. Result: backward kernels not finished when gradient buffers are read → garbage clip scale → NaN/crash. Fix: stream.synchronize() at 3 locations before D2H transfers (entrenar@d3a3d26). |
| ALB-066 | #48 | albor config | gradient_accumulation: 128 makes training take 68.8 days on single GPU | Critical | FIXED | CudaTransformerTrainer does per-sequence optimizer updates (per-block interleaved backward+optimize). gradient_accumulation just increases sequences per “step” without changing update granularity. Fix: reduced 128→16→1, epochs from 38→5→1. New estimate: ~11.7h at 480 tok/s. |
| ALB-067 | #49 | entrenar / trueno | Per-block weight gradient clipping CPU bottleneck — 864 D2H transfers/step | High | FIXED (via ALB-078) | compute_workspace_clip_scale downloaded 9 buffers × 24 blocks × 4 seqs = 864 D2H transfers/step. Workaround: disabled per-block clipping (entrenar@eaadbc6). Proper fix: ALB-078 fused GPU clip pipeline (zero D2H, zero sync). grad_clip: 1.0 re-enabled in v3 config. |
| ALB-068 | #50 | entrenar | save_interval dead code — no intermediate checkpoint saving during CUDA training | Critical | FIXED | save_interval read from config, validated, but never used in train_loop_cuda(). Checkpoints only saved at training completion. 24h crash = total loss. Fix: manual batch loop with trainer.save() at save_interval boundaries (entrenar@d8dfab7). |
| ALB-069 | #51 | trueno | PTX selp_f32 argument order bug in fused cross-entropy kernels — training produces loss=0.0 | Critical | FIXED | selp_f32(pred, true_val, false_val) called as selp_f32(grad_target, grad_nontarget, is_target) — f32 values in pred slot, predicate in false_val slot. PTX JIT fails: “Arguments mismatch for instruction ‘selp’”. Same class as ALB-059 (constructor arg ordering). Fix: selp_f32(is_target, grad_target, grad_nontarget) at both call sites (trueno@10bec89, trueno#156). |
| ALB-070 | #52 | entrenar / albor config | save_interval YAML field ignored — bridge reads checkpoint.save_every, default=1 causes eval every step | Critical | FIXED | YAML bridge reads training.checkpoint.save_every, not training.save_interval. Default=1 → validation eval runs every step → eval_batch() crashes on long sequences (missing max_seq_len truncation). Two fixes: (1) YAML config moved to checkpoint.save_every: 25 (2) eval_batch() now truncates to max_seq_len (entrenar@5c4c2d8). Same class as ALB-060 (config field mismatch). |
| ALB-071 | #53 | entrenar | Embed gradient clipping disabled when grad_clip=None — NaN weights, loss=0.0 by step ~100 | Critical | FIXED | C-EMBED-GRAD-001 was gated behind if let Some(max_norm) = max_grad_norm. ALB-067 disabled grad_clip → embed activation gradients unclipped → CPU AdamW overflow → 304K NaN in embeddings, block weights ALL NaN. Fix: always clip with unwrap_or(1.0) + always compute LM head grad norm for observability (entrenar@d07d67d). Same class as ALB-044. |
| ALB-072 | #54 | entrenar | fp16 loss scaling causes NaN in early layers — gradient overflow in f32 backward | Critical | FIXED | fp16 GradScaler (scale=65536) multiplied into fused CE kernel’s loss_scale. All backward uses f32 GpuBuffers — no fp16 underflow risk, but 65536x scaling caused activation gradient overflow by layers 0-1. Five Whys: loss=0.0 → NaN blocks 0-1 → first optimizer step NaN → FP32 works/FP16 doesn’t → unnecessary 65536x scaling. Fix: exclude grad_scaler.scale() from loss_scale (entrenar@44d3e74). gnorm now matches FP32 baseline (2.29). |
| ALB-073 | #55 | trueno | fused_cross_entropy PTX selp argument mismatch — JIT compilation failure | High | FIXED | Same class as ALB-069. selp_f32(true_val, false_val, pred) instead of (pred, true_val, false_val) in fused cross-entropy kernel. Training fell back to non-fused path. Fix: trueno@10bec89. |
| ALB-074 | #56 | entrenar | Buffer overflow — 2048-token seq hits 1024-sized GPU buffer during eval | Critical | FIXED | Stale binary missed ALB-070 eval truncation fix. 2048-token pretokenized sequence passed to eval_single_sequence without max_seq_len truncation → slice overflow at cuda_trainer.rs:711 (2096128 > 1048576). Crashed at step 1183. Fix: binary rebuild with entrenar@5c4c2d8. |
11.5 Performance Optimization Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-075 | #57 | trueno / entrenar | cuBLAS tensor core GEMM integration — replaced PTX GEMMs with TF32 tensor cores | Critical | FIXED | trueno-gpu 0.4.24 (cuBLAS FFI, PR #165 merged), entrenar PR #233 merged. Measured: 1,485 tok/s (4.3% MFU), 1,379ms/step, 3.19x end-to-end speedup. Kernel-level: 74-142 TFLOP/s vs 4.8-6.1 PTX (12-27x). Contract: cublas-gemm-v1.yaml. |
| ALB-076 | #58 | entrenar | Forward RMSNorm per-row kernel launch — 97.1% of GPU time | Critical | FIXED | rms_norm_forward() launched one 32-thread kernel per row (2048 launches/norm × 49 norms = 100,352 launches/step). nsys profiling: 46.6s/50 steps, avg 9.3μs each. Fix: switched to BatchedVectorizedRmsNormKernel (single launch, 256 threads, blockIdx.y batch dispatch). entrenar PR #238 merged. Measured: forward 347ms→14ms (24.8×), step 1357ms→339ms (4×), MFU 4.4%→17.5% (4×). |
| ALB-077 | trueno #170, entrenar #239 | trueno / entrenar | cuBLAS tensor core GEMM produces NaN for transposed backward GEMMs | Critical | FIXED | CUBLAS_GEMM_DEFAULT_TENSOR_OP outputs ALL NaN for Trans/NoTrans and NoTrans/Trans operations when gradient magnitudes reach ~1e5 (block 18 of 24-layer backward). Forward NoTrans/NoTrans unaffected. Five Whys: gradient magnification through 24 layers triggers undocumented tensor core numerical fault. Fix: CUBLAS_DEFAULT_MATH + CUBLAS_COMPUTE_32F + CUBLAS_GEMM_DEFAULT (no tensor cores, SIMD path). Phase 5a (TF32) reverted. Measured: 5,216 tok/s (15.1% MFU), 5.9× over PTX baseline, 0 NaN. |
| ALB-078 | trueno #171, entrenar #240 | trueno / entrenar | Fused GPU gradient clipping — eliminate 26 stream syncs/step | High | IMPLEMENTED | Per-block clip calls stream.synchronize() + D2H 24×/step. New kernels: ClipScaleReduceKernel (single-CTA norm+clip_scale on GPU), GradientClipGpuScaleKernel (element-wise clip reading scale from GPU memory). Pipeline: 9× squared_sum_launch_into → 1× clip_scale_reduce → 9× gradient_clip_gpu_scale. Zero sync, zero D2H. IEEE 754 handles zero-norm (div→+inf, min→1.0). Compiles, awaiting dogfood. Expected: ~20% step time reduction. |
11.6 Training Quality Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-079 | entrenar #241 | entrenar | CUDA trainer ignores lr_scheduler — constant lr after warmup | Critical | FIXED | CudaTransformerTrainer::current_lr() only had linear warmup; returned constant base_lr after warmup. YAML lr_scheduler: "cosine" parsed but never applied. Five Whys: val_loss plateau at 6.92 + gnorm collapse 3.0→0.13 at constant lr. Fix: cosine decay using max_steps + set_lr() for CPU embed optimizer (entrenar@297308d, PR #241). v4 training launched with cosine decay active. |
| ALB-080 | albor #61 | albor config | Effective batch size 48-128x too small for 350M training | Critical | FIXED | 4,096 tokens/step vs comparable runs: CodeParrot-small 196K, GPT-2 524K. Root cause: gradient_accumulation: 1 in v3 config. Fix: v4 config with gradient_accumulation: 32 → 131K tokens/step. Same wall-clock, 32x better gradient quality. Target: val_ppl < 100 by 1B tokens. v3 stopped at step 28K (val_ppl=1018, plateau); v4 launched with both fixes. |
11.7 Data Pipeline Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-081 | aprender#418, realizar#136 | aprender | Streaming APR import + mmap reader — eliminate OOM on large models | Critical | FIXED | apr import loaded entire 67GB model into RAM (134GB as F32) → swap storm. apr tensors loaded entire .apr into Vec<u8> → 89GB RSS. Five Whys: no streaming write path, no mmap read path. Fix: AprV2StreamingWriter (temp file, peak RAM ~5GB), MappedFile + AprV2ReaderRef for reading (10.9MB RSS on 67GB file). Contract: streaming-reader-v1.yaml, FALSIFY-MMAP-001 verified. |
Gaps are added as they are discovered during implementation and dogfooding.
12. Provable Quality & Design by Contract
Every computational kernel used in Albor must have a provable-contracts YAML specification with Popperian falsification tests, property-based probar tests, and Kani bounded model checking harnesses. This is not optional — it is a first-class deliverable alongside the model.
12.1 Verification Ladder
Five levels of assurance, from Level 0 (cheapest) to Level 4 (most rigorous):
Level 4: Kani bounded model check ─── PROOF (exhaustive for inputs ≤ N)
Level 3: probar property tests ─── HIGH CONFIDENCE (10,000+ random inputs)
Level 2: Falsification tests ─── TARGETED (specific edge cases)
Level 1: Type system ─── BY CONSTRUCTION (Rust compiler)
Level 0: Code review ─── HUMAN (necessary but insufficient)
Requirement: Every kernel reaches at least Level 3. Critical kernels (softmax, attention, cross-entropy, KD loss) reach Level 4.
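What a Level 3 obligation looks like in practice, using softmax as the example: a self-contained Python sketch of property tests over 10,000 random inputs. This is a stand-in for the probar harness, not its actual API; the obligations checked (normalization, positivity, translation invariance) are taken from the softmax contract in §12.2.

```python
import random
import math

def softmax(z):
    """Numerically stable reference softmax: subtract max before exponentiating."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

random.seed(42)
for _ in range(10_000):
    z = [random.uniform(-50.0, 50.0) for _ in range(random.randint(2, 64))]
    p = softmax(z)
    # Normalization: probabilities sum to 1 within float tolerance
    assert abs(sum(p) - 1.0) < 1e-9
    # Positivity: every output strictly positive
    assert all(x > 0.0 for x in p)
    # Translation invariance: softmax(z + c) == softmax(z)
    q = softmax([x + 3.0 for x in z])
    assert all(abs(a - b) < 1e-9 for a, b in zip(p, q))
```

Level 4 then replaces the random sampling with a Kani harness that is exhaustive for bounded input sizes.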
12.2 Contract Registry for Albor
Albor requires contracts for every kernel in the training + post-training pipeline. Many already exist in provable-contracts; new ones must be written.
Existing Contracts (bind to aprender implementations)
| Contract | Equations | Obligations | Status |
|---|---|---|---|
| softmax-kernel-v1.yaml | softmax | 6 (normalization, positivity, monotonicity, SIMD parity, translation invariance, bound) | Exists, 289 bindings |
| rmsnorm-kernel-v1.yaml | RMSNorm | 5 (finiteness, scale invariance, SIMD parity, idempotency) | Exists |
| attention-kernel-v1.yaml | scaled dot-product attention | Multiple (causal mask, score bounds, gradient flow) | Exists |
| rope-kernel-v1.yaml | Rotary Position Embedding | Multiple (rotation invariant, frequency spectrum) | Exists |
| gelu-kernel-v1.yaml | GELU activation | Bound, monotonicity, SIMD parity | Exists |
| matmul-kernel-v1.yaml | matrix multiplication | Associativity, SIMD parity, bound | Exists |
| cross-entropy-kernel-v1.yaml | cross-entropy loss | Non-negativity, gradient correctness | Exists |
| adamw-kernel-v1.yaml | AdamW optimizer | Bias correction, weight decay decoupling | Exists |
| gqa-kernel-v1.yaml | Grouped Query Attention | Equivalence to MHA when groups=heads | Exists |
| swiglu-kernel-v1.yaml | SwiGLU FFN | Gating invariants | Exists |
New Contracts Required for Albor (ALB-013 through ALB-017)
| Contract (NEW) | Key Equations | Key Obligations | Priority |
|---|---|---|---|
knowledge-distillation-kernel-v1.yaml | KD_loss = α·KL(σ(z_t/T) ∥ σ(z_s/T))·T² + (1-α)·CE(y, z_s) | KL non-negativity, temperature scaling invariant, gradient correctness, α interpolation bound | Critical |
bpe-tokenizer-kernel-v1.yaml | BPE merge rules, byte-pair encoding | Roundtrip invariant: decode(encode(x)) = x, vocab coverage, merge ordering | High |
model-merging-kernel-v1.yaml | SLERP: interp(θ, w₁, w₂) on unit sphere; TIES: trim + elect + disjoint merge | SLERP interpolation bound (‖result‖ ≈ 1), TIES sparsity guarantee | Medium |
| pruning-kernel-v1.yaml | WANDA: score = \|w\|·‖x‖₂; magnitude: score = \|w\| | Sparsity guarantee (FALSIFY-ALBOR-008) | Medium |
gradient-accumulation-kernel-v1.yaml | G_accum = (1/N)·Σ g_i ≈ g_full | Numerical equivalence within tolerance, loss scaling correctness | High |
training-config-kernel-v1.yaml | steps_per_epoch, total_achievable_steps, LR warmup coverage, Chinchilla tokens | Epoch sufficiency for max_steps, warmup completion, peak LR reached, data sufficiency | Critical |
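The KD loss equation in the table above is small enough to sanity-check outside the stack. A minimal Python sketch of the reference math only (not the aprender kernel):

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(z)
    lse = m + math.log(sum(math.exp(x - m) for x in z))
    return [x - lse for x in z]

def kd_loss(z_t, z_s, y, alpha=0.5, T=2.0):
    """KD_loss = alpha * KL(softmax(z_t/T) || softmax(z_s/T)) * T^2
               + (1 - alpha) * CE(y, z_s)."""
    log_p_t = log_softmax([z / T for z in z_t])  # temperature-softened teacher
    log_p_s = log_softmax([z / T for z in z_s])  # temperature-softened student
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    ce = -log_softmax(z_s)[y]                    # hard-label cross-entropy
    return alpha * kl * T * T + (1 - alpha) * ce

# KL and CE are both non-negative, so the blended loss is too
assert kd_loss([2.0, 1.0, 0.1], [0.5, 1.5, 0.2], y=0) >= 0.0
# With alpha=1 and identical logits the loss collapses to KL(p || p) = 0
assert abs(kd_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1], y=0, alpha=1.0)) < 1e-12
```

The T² factor keeps the KL gradient magnitude comparable to the CE term as temperature rises, which is the scaling invariant the contract must verify.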
12.3 Contract Workflow for Each Kernel
# 1. Write or validate YAML contract
pv validate contracts/knowledge-distillation-kernel-v1.yaml
# 2. Generate trait stubs + failing tests
pv scaffold contracts/knowledge-distillation-kernel-v1.yaml
# 3. Generate property-based tests (wired to actual aprender code)
pv probar contracts/knowledge-distillation-kernel-v1.yaml \
--binding contracts/aprender/binding.yaml
# 4. Generate Kani bounded model checking harnesses
pv kani contracts/knowledge-distillation-kernel-v1.yaml
# 5. Run falsification sweep
pv audit contracts/knowledge-distillation-kernel-v1.yaml \
--binding contracts/aprender/binding.yaml
# 6. Verify full contract status
pv status contracts/knowledge-distillation-kernel-v1.yaml
12.4 Falsification Tests: Albor-Specific
Every claim in this specification must be falsifiable. Below are the concrete falsification tests for Albor’s key properties.
Training Correctness
# FALSIFY-ALBOR-001: Loss decreases monotonically (smoothed)
- id: FALSIFY-ALBOR-001
rule: "Training convergence"
prediction: "EMA(loss, window=100) is monotonically decreasing after warmup"
test: "Load training log, compute EMA, assert no sustained increase >5% over 500 steps"
if_fails: "Learning rate too high, data corruption, or gradient computation bug"
# FALSIFY-ALBOR-002: Gradient norms are bounded
- id: FALSIFY-ALBOR-002
rule: "Training stability"
prediction: "Global gradient norm < 10.0 after clipping for all steps"
test: "Parse training log, assert max gradient norm across all steps"
if_fails: "Gradient clipping not applied, loss spike, or NaN propagation"
# FALSIFY-ALBOR-003: Checkpoint determinism
- id: FALSIFY-ALBOR-003
rule: "Reproducibility"
prediction: "Two runs with seed=42 produce identical checkpoints at step 1000"
test: "Train twice, BLAKE3 hash both checkpoints, assert equality"
if_fails: "Non-deterministic operation (async GPU, HashMap ordering, etc.)"
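FALSIFY-ALBOR-001's smoothing can be prototyped directly. A Python sketch of the EMA and the "no sustained >5% increase over 500 steps" check, run here on synthetic losses; the window-to-alpha mapping (alpha = 2/(window+1)) is an assumption, not a value from the training code:

```python
import random

def ema(values, window=100):
    """Exponential moving average; alpha = 2/(window+1) is an assumed mapping."""
    alpha = 2.0 / (window + 1)
    out, acc = [], values[0]
    for v in values:
        acc = alpha * v + (1 - alpha) * acc
        out.append(acc)
    return out

def sustained_increase(smoothed, warmup=200, horizon=500, tol=0.05):
    """True if the smoothed loss rises by more than tol over any horizon-long span."""
    post = smoothed[warmup:]
    return any(post[i + horizon] > post[i] * (1 + tol)
               for i in range(len(post) - horizon))

# A noisy but decaying synthetic loss curve passes the check
random.seed(0)
losses = [10.0 * (0.999 ** t) + random.uniform(-0.05, 0.05) for t in range(2000)]
assert not sustained_increase(ema(losses))
# A diverging run is flagged
assert sustained_increase(ema([1.0 + 0.01 * t for t in range(2000)]))
```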
Distillation Correctness
# FALSIFY-ALBOR-004: KL divergence is non-negative
- id: FALSIFY-ALBOR-004
rule: "KD loss validity"
prediction: "KL(teacher || student) >= 0 for all batches"
test: "proptest with 10000 random logit pairs, assert KL >= -1e-7"
if_fails: "Log-domain computation error or softmax numerical instability"
# FALSIFY-ALBOR-005: Distillation improves over base
- id: FALSIFY-ALBOR-005
rule: "Distillation value"
prediction: "albor-distill avg benchmark > albor-base avg benchmark"
test: "Run full eval suite on both, paired t-test with p < 0.05"
if_fails: "Teacher logits corrupted, temperature too high/low, or alpha miscalibrated"
# FALSIFY-ALBOR-006: Teacher logit integrity
- id: FALSIFY-ALBOR-006
rule: "Data pipeline integrity"
prediction: "Pre-computed teacher logits match live teacher inference within 1e-4"
test: "Sample 100 batches, run live teacher inference, compare against stored logits"
if_fails: "Serialization precision loss, wrong batch ordering, or teacher model mismatch"
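FALSIFY-ALBOR-004 as a standalone sketch: log-domain KL over 10,000 random logit pairs with the -1e-7 tolerance. This is a Python stand-in for the proptest harness, not the harness itself:

```python
import math
import random

def log_softmax(z):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(z)
    lse = m + math.log(sum(math.exp(x - m) for x in z))
    return [x - lse for x in z]

def kl(log_p, log_q):
    """KL(p || q) computed in the log domain to avoid underflow."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

random.seed(4)
for _ in range(10_000):
    n = random.randint(2, 32)
    t = log_softmax([random.uniform(-20.0, 20.0) for _ in range(n)])
    s = log_softmax([random.uniform(-20.0, 20.0) for _ in range(n)])
    # Gibbs' inequality: KL >= 0; allow -1e-7 for float round-off
    assert kl(t, s) >= -1e-7
```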
Post-Training Invariants
# FALSIFY-ALBOR-007: Merge interpolation bound
- id: FALSIFY-ALBOR-007
rule: "SLERP correctness"
prediction: "‖SLERP(w1, w2, t)‖ ≈ ‖w1‖ for t ∈ [0,1] (unit sphere)"
test: "proptest with 10000 random weight pairs and t values"
if_fails: "SLERP implementation uses LERP instead, or normalization missing"
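FALSIFY-ALBOR-007 in executable form: a reference SLERP over unit-normalized weight vectors, property-tested for the norm bound. A Python sketch, not the aprender merge implementation:

```python
import math
import random

def slerp(w1, w2, t):
    """Spherical interpolation between two unit vectors."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(w1, w2))))
    omega = math.acos(dot)
    if omega < 1e-8:                       # nearly parallel: LERP is safe here
        return [(1 - t) * a + t * b for a, b in zip(w1, w2)]
    s = math.sin(omega)
    return [(math.sin((1 - t) * omega) * a + math.sin(t * omega) * b) / s
            for a, b in zip(w1, w2)]

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

random.seed(7)
for _ in range(10_000):
    n = random.randint(2, 16)
    w1 = unit([random.gauss(0, 1) for _ in range(n)])
    w2 = unit([random.gauss(0, 1) for _ in range(n)])
    t = random.random()
    r = slerp(w1, w2, t)
    # Norm stays ~1 on the unit sphere; plain LERP would fail this
    assert abs(math.sqrt(sum(x * x for x in r)) - 1.0) < 1e-5
```

A LERP-instead-of-SLERP bug is exactly what this property falsifies: linear interpolation pulls the result inside the sphere, so the norm check fails for interior t.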
# FALSIFY-ALBOR-008: Pruning sparsity guarantee
- id: FALSIFY-ALBOR-008
rule: "WANDA correctness"
prediction: "Exactly 50% of weights are zero after prune --sparsity 0.5"
test: "Count zero weights, assert within ±0.1% of target sparsity"
if_fails: "Pruning threshold computation error or layer exclusion bug"
# FALSIFY-ALBOR-009: Quantization round-trip
- id: FALSIFY-ALBOR-009
rule: "Q4 fidelity"
prediction: "Perplexity(Q4 model) < 1.05 × Perplexity(fp16 model)"
test: "Evaluate both on held-out set, assert ratio < 1.05"
if_fails: "Quantization calibration data insufficient or block size wrong"
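FALSIFY-ALBOR-008 can be exercised against a reference implementation. A minimal Python sketch of magnitude pruning (score = |w|) with the ±0.1% sparsity assertion; the WANDA variant only changes the scoring function, not the thresholding:

```python
import random

def magnitude_prune(weights, sparsity):
    """Zero the smallest-|w| fraction of weights (magnitude pruning: score = |w|)."""
    k = int(round(len(weights) * sparsity))   # number of weights to zero
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

random.seed(1)
w = [random.gauss(0, 1) for _ in range(10_000)]
p = magnitude_prune(w, 0.5)
zeros = sum(1 for x in p if x == 0.0)
# Within ±0.1% of the 50% target sparsity
assert abs(zeros / len(p) - 0.5) <= 0.001
```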
12.5 Brick Profiling Architecture
Training a 350M model on a single 4090 is a systems engineering problem, not a scaling problem. Every watt of GPU silicon must be accounted for. The architecture achieves this by treating each component as a brick — a self-contained unit with measurable inputs, outputs, and a provable contract.
12.5.1 Three Granularities of Profiling
Per-kernel. Every CUDA kernel (gemm_forward, silu_backward,
rms_norm_forward, batched_transpose_forward, etc.) is individually
measurable via compute-sanitizer, nsys, or nvprof. When a kernel
misbehaves, the brick boundary isolates the failure to a single function with
known input/output shapes. The contract for each kernel specifies buffer size
invariants that can be checked statically.
Per-block. CudaTransformerBlock encapsulates one transformer layer’s
forward, backward, and optimizer step as a single GPU-resident unit. Diagnostic
sampling after backward (downloading 1K elements from each gradient buffer)
immediately distinguishes “math is wrong” (NaN in gradients) from “math is
right but magnitudes are wrong” (gradient explosion). The brick boundary
separates kernel correctness from training dynamics.
Per-transfer. The 3-transfer-per-step contract (C-GPUTRAIN-002) fixes
the PCIe budget:
Transfer 1 (H2D): embedding hidden states ~S×H×4 bytes
Transfer 2 (D2H): logits for cross-entropy ~S×V×4 bytes
Transfer 3 (H2D): grad_logits to GPU ~S×V×4 bytes
Any deviation from 3 transfers is a bug, not a tuning knob. For 350M at seq=2048: total ~544 MB/step, overhead ~17 ms on PCIe 4.0 x16 — under 5% of compute time.
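The budget above can be checked with a few lines of arithmetic. The dimensions (S=2048, H=1024, V=32768) and the ~32 GB/s effective PCIe 4.0 x16 bandwidth are assumptions chosen to match the quoted figures, not values read from the config:

```python
# Assumed shapes for the 350M config: S=2048, H=1024, V=32768
S, H, V = 2048, 1024, 32768
f32 = 4                                      # bytes per element

h2d_hidden = S * H * f32                     # Transfer 1: embedding hidden states
d2h_logits = S * V * f32                     # Transfer 2: logits for cross-entropy
h2d_grads  = S * V * f32                     # Transfer 3: grad_logits back to GPU
total = h2d_hidden + d2h_logits + h2d_grads

pcie4_x16 = 32e9                             # ~32 GB/s effective PCIe 4.0 x16
print(f"{total / 1e6:.0f} MB/step, {total / pcie4_x16 * 1e3:.1f} ms")
# → 545 MB/step, 17.0 ms
```

The two vocab-sized transfers dominate: at V=32768 each is 256 MiB, while the hidden-state transfer is only 8 MiB.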
12.5.2 Chain of Thought: How Brick Boundaries Diagnose Bugs
When a training run fails, the brick architecture converts “something is broken” into a structured diagnosis:
- Which granularity? Check per-transfer (D2D size mismatch?), per-block (which layer’s backward fails?), per-kernel (which GEMM overflows?).
- Local or global? If one block fails and others succeed, the bug is in that block’s kernels. If all blocks succeed but loss diverges, the bug is in training dynamics (LR, grad clipping, optimizer config).
- Static or dynamic? Buffer overflow is a static invariant violation (detectable by algebraic dimension checking). Gradient explosion is a dynamic stability issue (detectable by runtime sampling).
12.5.3 Five Whys: From Symptom to Root Cause
The brick architecture enforces a disciplined root-cause chain. Concrete example from dogfooding:
| Why | Finding | Brick boundary |
|---|---|---|
| Why does 350M training produce NaN at step 2? | Gradients reach 1e35, AdamW produces NaN weights | Per-block sampling: grad_gate max=3.28e35 |
| Why are gradients 1e35? | 24-layer backward amplifies without clipping | Per-transfer: config has grad_clip: 1.0 but CUDA path ignores it |
| Why no gradient clipping in CUDA path? | CudaTransformerTrainer copied from finetuning (pre-trained weights, small grads) | Brick mismatch: finetuning brick assumed well-conditioned weights |
| Why wasn’t this caught by the GPU training contract? | Contract validates kernel correctness + transfer count, not training stability | Contract gap: no C-TRAINSTABLE-001 obligation |
| Why doesn’t the contract cover stability? | Contracts target kernel-level (local) correctness, not loop-level (global) dynamics | Action: add training-stability contract bridging kernel and loop levels |
This same pattern resolved four bugs during ALB-040 dogfooding:
| Bug | Profiling diagnosis | Contract that prevents recurrence |
|---|---|---|
| ALB-043: silu_backward writes [S,I] into [S,H] buffer (4x overflow) | compute-sanitizer pinpoints illegal address in silu_backward | Buffer size invariant: output must be [S, intermediate_size] |
| ALB-041: D2D copy size mismatch in backward_attention | Error logged at exact block index; gate_out used as grad_hidden temp | D2D invariant: src.len() == dst.len() for copy_from_buffer_async |
| backward_attention: transpose attn_scores [H,S,S] into attn_kv_temp2 [H,S,hd] | Algebraic trace: 16×512×512 = 4.2M into 524K buffer = 8x overflow | Transpose output buffer invariant: output.len() >= batch × rows × cols |
| gpu_forward: D2D copy fails when seq_len < max_seq_len | All forwards return None; traced to PAR-023 size mismatch | Forward buffer invariant: input/output buffers at max_seq_len size |
| ALB-044: Unclipped activation gradient (~1e35) overflows CPU AdamW | Per-boundary sampling: embed weights have 1298 NaN after optimizer step | C-EMBED-GRAD-001: clip activation gradient at GPU→CPU boundary |
| ALB-044: CPU AdamW beta2=0.999 vs YAML beta2=0.95 (50x amplification) | Traced bias correction: v_hat = v/0.001 with beta2=0.999 vs v/0.05 with 0.95 | C-HYPERPARAMS-001: all optimizer fields must match YAML config |
| ALB-059: GEMM backward constructor args n/k swapped — output stride 64× too large | Per-kernel: v_w_k[block0] corrupted during gemm_backward_a(LM head). Pointer analysis: 3 contiguous 256KB allocs. Stride 32768 writes rows into m_w_k/v_w_k. | C-GEMMARGS-001: kernel constructor args must match documented parameter order |
| ALB-059: Uninitialized optimizer m/v buffers (cuMemAlloc returns garbage) | Per-block: v_w_k nonzero before any backward op (not from overflow). GpuBuffer::new() ≠ zero-init. | C-GPUINIT-001: all optimizer state buffers must be zero-initialized |
| ALB-065: Missing stream.synchronize() before D2H gradient transfers | Per-transfer: cuMemcpyDtoH reads stale GPU buffers. Process stable with CUDA_LAUNCH_BLOCKING=1, crashes within 15s without it. Five Whys: trueno uses CU_STREAM_NON_BLOCKING; cuMemcpyDtoH doesn’t sync with non-blocking streams. | C-STREAMSYNC-001: stream.synchronize() before every D2H transfer reading kernel output |
12.5.4 How Bricks and Contracts Interlock
The gap register (§11) is the feedback loop between profiling and contracts:
Brick profiling finds anomaly
→ File gap (ALB-0XX)
→ Write or update contract obligation
→ Fix upstream brick
→ Verify contract passes (`pv audit`)
→ Dogfood in albor pipeline
→ Close gap
Profiling finds bugs that contracts miss (runtime-only issues like gradient explosion). Contracts prevent bugs that profiling misses (the 50M model’s 2x buffer overflow “worked” through undefined behavior — only a static size invariant would have caught it). Together they form a ratchet: every bug found by profiling becomes a permanent contract obligation that prevents recurrence.
12.6 Verification DAG (Albor End-to-End)
Like the Qwen 3.5 verification DAG in provable-contracts, Albor composes sub-contracts into a full model verification:
softmax ← attention ← gqa
↑
rmsnorm ──────────────── albor-forward ← training-loop
↑ ↑
gelu ← swiglu ──────────┘ │
│
rope ──────────────────── albor-forward │
│
matmul ← gqa │
│
cross-entropy ─────────── training-loss ────────┘
↑
adamw ─────────── optimizer-step ──────── training-loop
│
gradient-accumulation ─────────────────────────┘
│
training-config ─── config-validation ─────────┘
│
knowledge-distillation ── distill-loss ── distill-loop
↑
bpe-tokenizer ─── data-pipeline ─── training-loop
model-merging ─── post-training ─── albor-merged
pruning ────────── post-training ─── albor-pruned
Each node in this DAG is a contract. pv graph contracts/ --format mermaid
renders the full dependency graph. A change to any sub-contract triggers
re-verification of all dependents.
12.7 Training Stability Contracts
The kernel-level contracts in §12.2 verify local correctness — each kernel produces the right output for its input. They do NOT verify global training stability — that the training loop converges without NaN, that hyperparameters propagate correctly, or that gradients flow to all parameters.
ALB-038, ALB-041, ALB-043, and ALB-044 all passed kernel-level contracts while producing training failures. These contracts bridge the gap between kernel correctness and training correctness.
C-TRAINSTABLE-001: Training Stability
All weights and loss must remain finite for the entire training run.
obligations:
- "loss.is_finite() for all steps"
- "weight[i].is_finite() for all i, all steps"
- "grad[i].is_finite() for all i after clipping, all steps"
falsification: |
FALSIFY-STABLE-001: Train 100 steps on random init.
Assert loss.is_finite() at every step.
Assert no NaN in any model weight after every optimizer step.
C-EMBED-GRAD-001: Activation Gradient Clipping at GPU-CPU Boundary
When GPU backward produces activation gradients that flow to a CPU optimizer,
those gradients must be clipped to max_grad_norm before the CPU processes
them.
Status: VERIFIED — 350M CUDA test (50 steps) produces zero NaN in embedding
weights. Fix in entrenar@86eec38.
motivation: |
Per-block gradient clipping in CudaGradWorkspace only clips WEIGHT gradients.
The ACTIVATION gradient in grad_buf_a/b flows unclipped to the CPU embedding
optimizer. For 24-layer random init, this gradient reaches ~1e35 — overflowing
the CPU AdamW second moment buffer.
obligation: |
Before scatter-adding activation gradients into CPU embedding weight gradient:
grad_norm = L2_norm(activation_grad)
if grad_norm > max_grad_norm:
activation_grad *= max_grad_norm / grad_norm
falsification: |
FALSIFY-EMBEDGRAD-001: Train 350M model (24 layers) for 5 steps.
Assert embedding weights contain zero NaN values after each optimizer step.
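The obligation is ordinary global-norm clipping. A minimal Python sketch (entrenar implements this in Rust at the GPU→CPU boundary), applied to an ALB-044-scale gradient:

```python
import math

def clip_by_global_norm(grad, max_norm=1.0):
    """Scale the activation gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        grad = [g * scale for g in grad]
    return grad

# A pathological ~1e35-scale gradient (the ALB-044 failure mode) is tamed
g = [1e35, -2e35, 3e35]
c = clip_by_global_norm(g, max_norm=1.0)
assert abs(math.sqrt(sum(x * x for x in c)) - 1.0) < 1e-9
assert all(math.isfinite(x) for x in c)
```

Note the norm itself stays finite in f64 even at 1e35 inputs; the overflow in ALB-044 happened downstream, in the AdamW second-moment accumulation, which is why clipping must occur before the CPU optimizer ever sees the gradient.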
C-HYPERPARAMS-001: Optimizer Hyperparameter Propagation
Every optimizer hyperparameter in the YAML config must reach the actual optimizer constructor. No implicit defaults.
Status: VERIFIED — 350M CUDA test uses explicit AdamW::new() with
YAML config values (beta2=0.95, wd=0.1). Fix in entrenar@86eec38.
obligation: |
For every optimizer in the training loop (GPU AdamW, CPU AdamW, LM head AdamW):
assert optimizer.lr == config.lr (adjusted for warmup)
assert optimizer.beta1 == config.beta1
assert optimizer.beta2 == config.beta2
assert optimizer.weight_decay == config.weight_decay
assert optimizer.epsilon == 1e-8 (or config.epsilon if specified)
falsification: |
FALSIFY-HYPERPARAMS-001: Construct CudaTransformerTrainer with non-default
YAML config (beta2=0.95, wd=0.1). Verify CPU embed_optimizer.beta2 == 0.95
and embed_optimizer.weight_decay == 0.1 (not 0.999 and 0.01).
anti_pattern: |
NEVER: AdamW::default_params(lr) — hides beta2, wd, epsilon
ALWAYS: AdamW::new(lr, beta1, beta2, epsilon, wd) — explicit from config
C-BUFSIZE-001: CUDA Kernel Buffer Size Invariants
Every GPU buffer passed to a CUDA kernel must have algebraically verifiable size matching the kernel’s expected dimensions.
obligation: |
For gemm_forward(A, B, C, M, K, N):
assert A.len() >= M * K
assert B.len() >= K * N
assert C.len() >= M * N
For silu_backward(input, grad_output, output):
assert output.len() >= input.len()
For rms_norm_backward(input, weight, grad_output, grad_input, grad_weight, S, H):
assert grad_input.len() >= S * H
assert grad_weight.len() >= H
falsification: |
FALSIFY-BUFSIZE-001: Run compute-sanitizer on 10-step 50M training.
Assert zero illegal address errors.
anti_pattern: |
NEVER: Reuse a buffer sized for hidden_size as temp for intermediate_size
ALWAYS: Use dedicated buffers or verify size >= required before kernel call
C-GEMMARGS-001: GEMM Kernel Constructor Argument Ordering
Every GEMM kernel constructor call must pass arguments in the exact order documented by the kernel’s API. Compile-time stride constants baked into PTX are determined by constructor args — wrong order produces wrong strides, not wrong results at the kernel boundary (bounds check passes but data lands in wrong memory).
Status: VERIFIED — 350M CUDA test (50 steps) produces correct backward
gradients. Fix in entrenar@846ae0c.
motivation: |
GemmBackwardAKernel::tiled_unrolled(m, n, k, tile_size) bakes self.n and
self.k as immediate PTX constants for row/col strides. When called as
tiled_unrolled(m, k, n, tile) with k and n swapped, the output stride
becomes vocab_size (32768) instead of hidden_size (512) — writing output
rows 64× too far apart and overflowing into adjacent GPU allocations.
obligation: |
For every kernel constructor call:
assert arg_order matches constructor signature exactly
Specifically for GEMM backward:
GemmBackwardAKernel::tiled_unrolled(m, n, k, tile) # NOT (m, k, n, tile)
GemmBackwardBKernel::tiled_unrolled(m, n, k, tile) # NOT (m, k, n, tile)
falsification: |
FALSIFY-GEMMARGS-001: Train 350M model for 5 steps. Download v_w_k[block0]
after backward. Assert zero corruption (all values ≥ 0 after optimizer init,
no values from adjacent buffers).
anti_pattern: |
NEVER: Guess argument order from variable names (m/n/k are ambiguous)
ALWAYS: Check constructor signature in trueno-gpu kernel source
C-GPUINIT-001: GPU Buffer Zero Initialization
All optimizer state buffers (m and v for AdamW) must be zero-initialized.
GpuBuffer::new() uses cuMemAlloc which returns uninitialized VRAM —
the contents are whatever was previously in that memory region.
Status: VERIFIED — All 34 optimizer buffers (18 per-block + 12 LoRA + 4 LM head/norm)
zero-initialized via GpuBuffer::from_host(&ctx, &vec![0.0f32; n]). Fix in entrenar@846ae0c.
obligation: |
For every GpuBuffer used as optimizer state (m, v):
assert buffer is zero-initialized after allocation
Use GpuBuffer::from_host(&ctx, &vec![0.0f32; n])
NOT GpuBuffer::new(&ctx, n) -- returns uninitialized VRAM
falsification: |
FALSIFY-GPUINIT-001: Allocate optimizer state, download immediately.
Assert all values == 0.0.
C-GRADFLOW-001: Gradient Flow Verification
Every trainable parameter must receive a non-zero gradient after one forward+backward step on a non-trivial batch.
obligation: |
After one forward+backward step on a batch with non-constant inputs:
for param in model.trainable_parameters():
assert param.grad().abs().max() > 0
falsification: |
FALSIFY-GRADFLOW-001: Train 1 step on 50M model with random init.
Verify all 110 parameter tensors have max(|grad|) > 0.
anti_pattern: |
NEVER: Create tensors with requires_grad=false in the forward path
NEVER: Use ops that don't register backward (e.g., manual array copies)
ALWAYS: Verify gradient flow when adding new layers or ops
C-TRAINCFG-001: Training Configuration Algebraic Consistency
Every training configuration must be algebraically validated BEFORE GPU time is consumed. The epoch/step/data/LR relationship must be provably sufficient.
Status: VERIFIED — ALB-060 config fixed. C-TRAINCFG-001 contract written
(contracts/training-config-kernel-v1.yaml), v1 config fixed (epochs: 117),
v2 config proven correct (steps_per_epoch = 16994 >= 5000 with expanded 68K
dataset). V2 training (ALB-063) reached step ~1183/5000 with loss 10.4→6.9,
confirming warmup completes and LR reaches peak 3e-4.
motivation: |
ALB-060: pretrain-350m.yaml had epochs=1 with 22K sequences and grad_accum=128.
steps_per_epoch = floor(22079 / 4 / 128) = 43. max_steps=5000 unreachable.
warmup_steps=2000 never completed. LR peaked at 6.45e-6 (target 3e-4).
Loss flat at ~10.39 for all 43 steps. Checkpoint contains untrained weights.
Total wasted: ~12 seconds GPU + debugging time. Contract prevents recurrence.
equations:
- "steps_per_epoch = floor(num_sequences / batch_size / grad_accum)"
- "total_achievable_steps = num_epochs × steps_per_epoch"
- "total_achievable_steps >= max_steps (HARD REQUIREMENT)"
- "warmup_steps < total_achievable_steps (warmup must complete)"
- "warmup_fraction = warmup_steps / actual_total_steps <= 0.10"
- "min_epochs = ceil(max_steps / steps_per_epoch)"
- "total_tokens = actual_steps × batch_size × grad_accum × seq_len"
obligations:
- "Epoch count sufficient: num_epochs >= ceil(max_steps / steps_per_epoch)"
- "Warmup completes: warmup_steps < actual_total_steps"
- "Peak LR reached: exists step t where lr(t) = lr_peak"
- "Training tokens sufficient: total_tokens >= 10 × num_params"
falsification: |
FALSIFY-CFG-001: Compute steps_per_epoch for pretrain-350m.yaml.
With 22079 seqs, batch=4, accum=128: steps_per_epoch=43.
Assert 1 × 43 < 5000 (proves epochs=1 is insufficient).
FALSIFY-CFG-002: Assert warmup_steps (2000) > total_steps (43)
(proves warmup never completes with epochs=1).
Full contract: contracts/training-config-kernel-v1.yaml — 7 equations,
8 proof obligations, 5 falsification tests, 2 Kani harnesses.
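As an illustration of the algebra (not the pv tooling or the actual contract harness), a minimal pre-flight validator over the ALB-060 numbers:

```python
import math

def validate_train_config(num_sequences, batch_size, grad_accum,
                          num_epochs, max_steps, warmup_steps):
    """Algebraic pre-flight check mirroring C-TRAINCFG-001 (sketch)."""
    steps_per_epoch = num_sequences // batch_size // grad_accum
    total_achievable = num_epochs * steps_per_epoch
    errors = []
    if total_achievable < max_steps:
        errors.append(f"epochs insufficient: {total_achievable} < {max_steps} "
                      f"(need >= {math.ceil(max_steps / steps_per_epoch)} epochs)")
    if warmup_steps >= total_achievable:
        errors.append(f"warmup never completes: {warmup_steps} >= {total_achievable}")
    return errors

# The ALB-060 config fails both checks before any GPU time is spent:
# steps_per_epoch = floor(22079 / 4 / 128) = 43, so epochs=1 yields 43 steps
errs = validate_train_config(num_sequences=22079, batch_size=4, grad_accum=128,
                             num_epochs=1, max_steps=5000, warmup_steps=2000)
assert len(errs) == 2
```

The epoch remedy it computes, ceil(5000/43) = 117, matches the v1 config fix (epochs: 117) above.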
C-STREAMSYNC-001: Stream Synchronization Before D2H Transfers
Every cuMemcpyDtoH (or copy_to_host_at()) call that reads data written by
GPU kernels on a non-default stream MUST be preceded by stream.synchronize().
motivation: |
ALB-065: gradient clipping downloaded 9 GPU buffers via cuMemcpyDtoH
without stream synchronization. trueno CudaStream uses CU_STREAM_NON_BLOCKING;
cuMemcpyDtoH only synchronizes with the default stream. Backward kernels
hadn't finished → garbage clip scale → NaN → silent SIGABRT (process death
with no error output). Training was stable with CUDA_LAUNCH_BLOCKING=1 but
crashed within 15 seconds without it.
obligation: |
stream.synchronize() MUST precede every cuMemcpyDtoH that reads kernel output.
No exceptions. The sync ensures all prior kernel launches have completed.
falsification: |
FALSIFY-GPU-008: Run 350M training for 50+ steps WITHOUT CUDA_LAUNCH_BLOCKING=1.
Verify process stays alive, loss is finite, no CUDA errors in dmesg/Xid log.
anti_pattern: |
NEVER: call copy_to_host_at() after kernel launches without stream.synchronize()
NEVER: rely on cuMemcpyDtoH to synchronize non-blocking streams (it doesn't)
DIAGNOSTIC: if training crashes without CUDA_LAUNCH_BLOCKING=1 but works with it,
this is the FIRST contract to check
Full contract: contracts/training-gpu-kernel-v1.yaml — stream_synchronization
equation + proof obligation.
12.7.1 Observability Discipline
All training observability MUST use the renacer tracing infrastructure.
entrenar integrates renacer in src/run.rs (span lifecycle: create_span,
emit_metric_event, end_span). The src/monitor/drift.rs module provides
anomaly detection (DriftStatus, AnomalySeverity) that can automatically
flag NaN, gradient explosion, and loss divergence.
obligation: |
NEVER: eprintln!(), println!(), dbg!() for training diagnostics
ALWAYS: tracing::debug!(), tracing::warn!() with structured fields
ALWAYS: emit_metric_event() for training metrics (loss, grad_norm, lr)
motivation: |
Ad-hoc eprintln! creates cleanup debt, is invisible to tracing infra,
loses brick profiling boundary isolation, and cannot be filtered at runtime.
renacer BrickTracer provides structured, filterable, permanent observability.
13. pmat Compliance & Quality Gates
13.1 Scope: Where Quality Applies
Albor is a project repo (configs, scripts, contracts, docs). It produces no Rust library code. All quality gates apply to upstream Rust changes made in service of Albor’s gaps — not to albor’s shell scripts or YAML configs.
# Run on all modified stack components (NOT on albor itself)
pmat comply check --strict ../aprender # ALB-001, 006, 009, 011
pmat comply check --strict ../entrenar # ALB-003, 004
pmat comply check --strict ../trueno # ALB-005
pmat comply check --strict ../realizar # ALB-010
pmat comply check --strict ../alimentar # ALB-007, 018, 019, 020
pmat comply check --strict ../repartir # ALB-002, 008
13.2 Quality Gate Thresholds (Upstream Rust Code)
| Gate | Threshold | Applies To | Enforcement |
|---|---|---|---|
| TDG Grade | A (score ≤ 1.0) | Upstream Rust | pmat analyze tdg --include-components |
| Test Coverage | ≥ 95% line coverage | Upstream Rust | cargo llvm-cov --summary-only |
| Mutation Score | ≥ 85% | Upstream Rust | cargo mutants --no-times |
| Cyclomatic Complexity | ≤ 15 per function | Upstream Rust | pmat analyze complexity |
| File Length | ≤ 500 lines | All Rust files (upstream) | find . -name '*.rs' \| xargs wc -l |
| SATD | Zero (no TODO/FIXME/HACK) | Upstream Rust | pmat analyze satd |
| Unwrap Calls | Zero in new code | Upstream Rust | pmat query --literal "unwrap()" --faults |
| Clippy | Zero warnings | Upstream Rust | cargo clippy -- -D warnings |
13.3 Quality Gate Thresholds (Albor Repo)
| Gate | Threshold | Applies To | Enforcement |
|---|---|---|---|
| File Length | ≤ 500 lines | Scripts, YAML, contracts (not specs/docs) | wc -l on non-doc tracked files |
| FALSIFY-ALBOR tests | All 9 pass | Pipeline end-to-end | batuta falsify . |
| Contract completeness | All 5 new contracts at Level 3+ | contracts/ | pv status contracts/ |
| Config validity | All YAML parses and plan passes | configs/ | apr pipeline plan (validates all configs in one DAG pass) |
| Reproducibility | Same seed → same checkpoint hash | Full pipeline | FALSIFY-ALBOR-003 |
13.4 pmat Quality Commands for Albor
# TDG analysis of all Albor-touched code
pmat analyze tdg ../aprender --include-components
pmat analyze tdg ../entrenar --include-components
# Find coverage gaps (highest ROI targets)
pmat query --coverage-gaps --limit 30 --exclude-tests
# Fault pattern audit (unwrap, panic, unsafe)
pmat query "training" --faults --exclude-tests
# Full quality audit on distillation code
pmat query "distill" --churn --duplicates --entropy --faults -G
# Complexity check on new kernels
pmat query "knowledge_distillation" --max-complexity 15 --include-source
# Create quality baseline before Albor work begins
pmat tdg baseline create
# Check for regressions after each phase
pmat tdg check-regression --baseline
13.5 Certeza Three-Tier Testing (Upstream Repos)
When modifying upstream Rust code for gap fixes, follow certeza tiers:
Tier 1: On-Save (sub-second)
cargo check && cargo test --lib -- --quiet # Type check + unit tests
Tier 2: On-Commit (1-5 minutes)
cargo test # Full test suite
cargo llvm-cov --summary-only # Coverage ≥ 95%
pmat analyze tdg # TDG regression check
pv audit contracts/ --binding # Contract compliance
Tier 3: On-Merge / Nightly (hours)
cargo mutants --no-times # Mutation score ≥ 85%
cargo kani # Formal verification
batuta falsify . --min-grade toyota-standard # 108-item checklist
pmat rust-project-score --full # Comprehensive quality score
13.6 Albor Pipeline Commands
Since albor is a project repo, its primary interface is apr pipeline.
No Makefiles, no shell scripts. One manifest, one DAG.
# ── Pipeline (the only entry point you need) ──
apr pipeline plan configs/pipeline/albor.yaml # Full DAG dry-run (no GPU, no writes)
apr pipeline apply configs/pipeline/albor.yaml # Execute everything (resumable)
apr pipeline status # What's converged / pending / failed
apr pipeline drift # Detect unauthorized state changes
# ── Targeted execution (run one step + its dependencies) ──
apr pipeline apply configs/pipeline/albor.yaml --target train-350m
apr pipeline apply configs/pipeline/albor.yaml --target eval-code
apr pipeline apply configs/pipeline/albor.yaml --target publish
# ── Force re-run (ignore converged state) ──
apr pipeline apply configs/pipeline/albor.yaml --target distill --force
# ── Individual subcommands (for development / debugging) ──
apr train plan configs/train/pretrain-350m.yaml # Plan one step standalone
apr train apply configs/train/pretrain-350m.yaml --seed 42
apr monitor ./checkpoints/albor-base-350m/ # Live TUI
apr experiment view --db .entrenar/experiments.db # Browse experiments
# ── Quality (upstream repos — run independently of pipeline) ──
pmat tdg baseline create # TDG baseline across all components
pmat comply check --strict ../aprender
pmat comply check --strict ../entrenar
pv validate contracts/*.yaml # Contract schema validation
pv status contracts/ # Contract completeness
batuta falsify . --min-grade toyota-standard # 108-item falsification checklist
# Current score: 100.0% (108/108 PASS) — achieved 2026-03-04
14. Batuta Falsification Checklist
14.1 108-Item Popperian Assessment
The Albor project itself is subject to batuta’s 108-item falsification checklist:
# Full assessment
batuta falsify . --verbose --format markdown --output docs/falsification-report.md
# Critical-only (blocks release)
batuta falsify . --critical-only
# CI-friendly output
batuta falsify . --format github-actions --min-grade kaizen-required
14.2 Key Sections Applied to Albor
Section 1: Sovereign Data Governance (SDG)
- All training data has documented provenance (HuggingFace commit SHAs)
- No PII in training corpus (alimentar quality check)
- Data residency: all data stored on owned hardware (lambda + intel)
- Teacher model license verified (Apache 2.0)
Section 3: Hypothesis-Driven Development (HDD)
- Each improvement stage has a falsifiable hypothesis:
- “Distillation improves avg benchmark by >5%” (FALSIFY-ALBOR-005)
- “Pruning at 50% sparsity degrades benchmarks by <2%” (FALSIFY-ALBOR-008)
- “Q4 quantization degrades perplexity by <5%” (FALSIFY-ALBOR-009)
- Reproducibility standard: Gold (deterministic seeds, versioned data, BLAKE3 checkpoint hashes, Cargo.lock pinning)
Section 4: Numerical Reproducibility (NR)
- Float determinism enforced via fixed seeds and operator ordering
- Cross-platform consistency: checkpoint trained on lambda loads on intel
- SIMD parity: all kernels have provable-contracts SIMD equivalence obligations
Section 5: Performance & Waste Elimination (PW)
- Seven Wastes (Muda) applied to training pipeline:
- No redundant data copies (alimentar streaming)
- No idle GPU time (pre-computed teacher logits)
- No over-processing (progressive model sizing: 50M → 125M → 350M)
Section 6: Safety & Formal Verification (SF)
- Critical kernels have Kani proofs (softmax, attention, cross-entropy)
- New kernels (KD loss, gradient accumulation) get Kani harnesses
Section 10: Architectural Invariants (AI) — CRITICAL
- AI-01: All model operations use apr (no manual weight manipulation)
- AI-02: Every checkpoint is BLAKE3-hashed and version-tracked
- AI-03: Training config is immutable once committed (no runtime overrides)
- AI-04: Eval results are reproducible (fixed seed, deterministic batching)
- AI-05: No undeclared dependencies (Cargo.lock enforced)
14.3 Current Grade
Perfect Score: 100.0% (108/108 PASS) — achieved 2026-03-04.
This reaches the ceiling of the Toyota Standard (90-100%) target range:
- All 5 Critical items pass (Section 10)
- All Major items pass
- All Minor items pass
- Zero PARTIAL, zero FAIL
Score progression across 14 MLOps survey batches: 34% → 100%
(see entrenar/docs/specifications/world-class-mlops-survey.md).
15. Implementation Phases
Phase 0: Pipeline Manifest, Contracts & Quality Baseline (Week 1)
- Write configs/pipeline/albor.yaml — full pipeline manifest (infra + data + train + eval + publish)
- apr pipeline plan — validate entire DAG, estimate resources
- apr pipeline apply --target cuda-driver --target vulkan-driver --target data-dir — provision infra
- Verify trueno wgpu on W5700X via Vulkan (not Metal — Linux)
- Verify trueno CUDA on 4090
- Download Qwen3-Coder-Next to intel box, verify it loads in realizar
- pmat tdg baseline create on all stack components
- pv coverage contracts/ --binding — establish contract coverage baseline
- batuta falsify . --critical-only — initial falsification assessment
Phase 1: Data Pipeline + Tokenizer Contract (Week 1-2)
- Ingest local ground truth corpora via alimentar import local (fix ALB-019 if needed)
  - depyler: examples/ + tdd-book/tests/ (~1,845 files, ~219K lines)
  - hf-ground-truth-corpus (~11,928 files)
  - jax-ground-truth-corpus (~2,697 files)
  - vllm-ground-truth-corpus (~1,118 files)
- Ingest local ML framework code (Tier 2, ~53K files)
- Download external datasets via alimentar import hf (StarCoder Python, FineWeb-Edu)
- Quality validation via alimentar quality check on all sources
- Build weighted training mix with 10x upsampling on Tier 1 (fix ALB-020 if needed)
- Write bpe-tokenizer-kernel-v1.yaml contract (ALB-014)
- pv probar + pv kani on tokenizer contract
- Train BPE tokenizer on mixed corpus (fix ALB-001 if needed)
- Verify FALSIFY roundtrip: decode(encode(text)) = text for all test data
- Tokenize all data into sharded Parquet
- Apply FIM transforms to code sequences (fix ALB-018 if needed)
- Create train/val/test splits via alimentar
- Record SHA-256 hashes + provenance manifest for all data artifacts
- pmat comply check --strict on alimentar changes
Phase 2: Pipeline Validation — 50M Model (Week 2) – COMPLETE
- Write gradient-accumulation-kernel-v1.yaml contract (ALB-017)
- Write configs/train/pretrain-50m.yaml (model arch + training + monitoring)
- Train albor-50M on 4090 — 500 rows, 31 steps, 110.7s, loss 10.3→4.42
- Validate apr monitor — ALB-025 FIXED (presentar widget migration complete)
- Validate Andon alerts during full training run
- Fix ALB-009 — FIXED
- Verify FALSIFY-ALBOR-001 (loss decreases) — CORROBORATED
- Verify FALSIFY-ALBOR-002 (gradient bounds) — per-step logging now available (ALB-035 FIXED)
- pv audit — PASS: 7/7 contracts, 0 findings
- Milestone: Training loop converges ✓, contracts pass ✓
Phase 3: Base Model — 350M Pre-Training (Week 2-4) – IN PROGRESS
- Write configs/train/pretrain-350m.yaml — pre-tokenized ByteLevel BPE v2, 22K×2048 tokens
- Train albor-base-350m on 4090 — STARTED (2760 batches, ~20h est.)
- Build evaluation infrastructure — eval-code.py, eval-perplexity.py, 35 benchmark problems
- Fix ALB-038 — FIXED: RMSNorm + attention backward ops, all 20 params receive gradients
- Fix ALB-041 — FIXED: D2D buffer size mismatch in backward_attention (entrenar@a48e3d2)
- Fix ALB-043 — FIXED: backward_ffn buffer overflow + SwiGLU gradients (entrenar@f7805f1)
- Fix ALB-044 — FIXED: activation gradient clipping at GPU-CPU boundary + CPU optimizer hyperparams (entrenar@86eec38)
- Fix ALB-059 — FIXED: GEMM backward constructor args n/k swapped, buffer overflow into optimizer states + zero-init optimizer m/v (entrenar@846ae0c)
- Write training-memory-kernel-v1.yaml contract (ALB-039) — VRAM budget estimation
- Write training-gpu-kernel-v1.yaml contract (ALB-040) — GPU-resident training invariants
- Implement CudaTransformerTrainer (ALB-040) — 3 PCIe transfers/step vs ~16K
- Dogfood CUDA training — 50M test: 3 steps, loss 10.4→11.7, GPU forward+backward working
- ALB-037 — FIXED: realizar loads trained SafeTensors checkpoint, generates tokens (e2e verified)
- 350M CUDA test training — 50 steps, loss 10.39→5.92 (best 5.53), checkpoint valid
- realizar inference verified — 218 tensors loaded, generates from trained weights
- Checkpoint validation: PASS (weights trained, not initialization)
- Perplexity eval: 31,926 (finite, consistent with 50-step model — random baseline ~32,768)
- Fix ALB-060 — CONFIG FIXED: epochs=1 only ran 43/5000 steps. C-TRAINCFG-001 contract written. Config fixed (v1: epochs=117, v2: epochs=38 with 68K seqs)
- Expand training data: Tier 1 10x + 8 Tier 2 repos → v2 dataset (67,977 seqs, 139M tokens)
- Fix ALB-071 — FIXED: embed gradient clipping decoupled from weight grad_clip (entrenar@d07d67d)
- Fix ALB-072 — FIXED: fp16 loss scaling (65536x) removed from fused CE kernel; all backward uses f32, no underflow risk (entrenar@44d3e74)
- Full 350M v2 training — reached step 1183/5000, loss 10.40→6.85, val_ppl=1008. Crashed: ALB-073 (PTX selp) + ALB-074 (buffer overflow from stale binary). Step 1000 checkpoint saved (1520 MB).
- Fix ALB-073 — FIXED: fused_cross_entropy selp arg order, same class as ALB-069 (trueno@10bec89)
- Fix ALB-074 — FIXED: stale binary missed eval truncation fix. Rebuilt with entrenar@5c4c2d8.
- Monitor training via apr monitor (ALB-025 FIXED)
- Data scaling: Download codeparrot-clean (2M files, ~4.4B tokens) → pretokenize at 1024 → ~5.2M sequences
- Full 350M v3 training — PENDING: 250K steps on ~1B tokens from codeparrot-clean. Config: pretrain-350m-v3.yaml. ETA ~10 days.
- Validate loss curve, perplexity convergence
- HumanEval pass@1 evaluation (target >8%)
- Verify FALSIFY-ALBOR-003 (checkpoint determinism)
- pmat tdg check-regression on all touched components
- Milestone: HumanEval pass@1 > 8%, Perplexity < 30, TDG grade A maintained
Phase 4: Teacher Setup & Logit Pre-Computation (Week 3-5)
- Fix ALB-010: Add Qwen3-Coder-Next support to realizar (stretch — 3-4 week blocker)
- Download Qwen2.5-Coder-3B interim teacher (5.75 GiB, Apache 2.0) — unblocks distillation without ALB-010
- Validate 3B teacher: apr distill --stage precompute works, RosettaStone handles sharded SafeTensors
- Create distillation config: configs/train/distill-qwen3b.yaml (T=4.0, α=0.5, LoRA r=16)
- Validate teacher inference on intel (CPU, fp16, 300GB RAM) — for 80B stretch goal
- Write knowledge-distillation-kernel-v1.yaml contract (ALB-013) — DOGFOODING
- pv kani on KD loss contract (KL non-negativity, temperature scaling)
- Fix ALB-011 — FIXED: apr distill --config --stage precompute|train works
- Pre-compute 3B teacher logits on v2 dataset (background, 4-8h CPU)
- Verify FALSIFY-ALBOR-006 (teacher logit integrity)
- Store as sharded Parquet via alimentar
- pmat comply check --strict on realizar changes
- Milestone: Teacher logits verified, KD contract at Level 4
Phase 5: Knowledge Distillation (Week 5-6)
- Implement apr distill apply with KD loss
- Distill albor-base-350m → albor-distill-350m
- Verify FALSIFY-ALBOR-004 (KL non-negativity in production)
- Verify FALSIFY-ALBOR-005 (distillation improves benchmarks)
- Benchmark: measure improvement over base
- pv probar --binding on KD contract with actual training data
- Milestone: >5% avg benchmark improvement, KD contract fully wired
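The KD objective behind FALSIFY-ALBOR-004 can be sketched numerically: a temperature-scaled KL term blended with hard-label cross-entropy, with T=4.0 and α=0.5 as in the distillation config. This is an illustrative reference formula, not the apr distill implementation:

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / t) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(student_logits, teacher_logits, target, t=4.0, alpha=0.5):
    """alpha * CE(student, target) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    ce = -math.log(softmax(student_logits)[target])
    p_s = softmax(student_logits, t)
    p_t = softmax(teacher_logits, t)
    # KL divergence is non-negative by Gibbs' inequality:
    # this is the invariant FALSIFY-ALBOR-004 checks in production.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * ce + (1.0 - alpha) * t * t * kl, kl

loss, kl = kd_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9], target=0)
assert kl >= 0.0 and loss > 0.0
```

The T² factor keeps soft-target gradients on the same scale as the hard-label term as temperature grows.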
Phase 6: Post-Training Optimization (Week 6-8)
- Write model-merging-kernel-v1.yaml contract (ALB-015) — DOGFOODING
- Write pruning-kernel-v1.yaml contract (ALB-016) — DOGFOODING
- Fine-tune with LoRA: apr finetune → albor-instruct
- Merge variants: apr merge --method slerp → albor-merged
- Verify FALSIFY-ALBOR-007 (SLERP interpolation bound)
- Prune: apr prune --method wanda → albor-pruned
- Verify FALSIFY-ALBOR-008 (sparsity guarantee)
- Quantize: apr quantize --method q4_k → albor-q4
- Verify FALSIFY-ALBOR-009 (quantization fidelity)
- Benchmark every variant
- pv coverage contracts/ --binding — final contract coverage report
- Milestone: Full ladder complete, all post-training contracts pass
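FALSIFY-ALBOR-007's interpolation bound concerns SLERP's geometry: interpolating between unit-norm weight vectors stays on the unit sphere. A minimal sketch of SLERP on flat vectors (illustrative only; apr merge operates per-tensor, and the exact bound the contract checks may differ):

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    cos_omega = max(-1.0, min(1.0, dot / (na * nb)))
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]

# Unit vectors stay unit-norm under SLERP (the bound property)
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
assert abs(math.sqrt(sum(x * x for x in mid)) - 1.0) < 1e-9
```

At t=0 and t=1 the interpolant reduces to the endpoints, so the merged model interpolates smoothly between the two parents.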
Phase 7: Quality Assurance & Falsification Sweep (Week 8)
- batuta falsify . --min-grade toyota-standard --verbose — full 108-item assessment
- pmat rust-project-score --full on all touched components
- pmat tdg check-regression --baseline — no quality regressions
- pv graph contracts/ --format mermaid — publish verification DAG
- pv status contracts/ — all contracts at Level 3+, critical at Level 4
- cargo mutants --no-times on all new code — mutation score ≥ 85%
- cargo llvm-cov — coverage ≥ 95% on all new code
- Address any falsification failures or contract violations
- Milestone: Toyota Standard grade, all quality gates green
Phase 8: Evaluation, Leaderboard Submission & Publication (Week 8-9)
- Final eval on all benchmark tasks (all 6 model variants)
- Run bigcode-evaluation-harness with leaderboard-standard params on best model
- Submit PR to Big Code Models Leaderboard (community_results/ folder)
- Export all models: SafeTensors + GGUF
- apr publish to HuggingFace Hub as paiml/albor-*
- Write model card with full reproducibility details + leaderboard results
- Publish training logs, loss curves, eval trajectories
- Publish verification report (contract status, falsification results)
- batuta falsify . --format markdown --output docs/falsification-report.md
- Milestone: Models on HuggingFace, leaderboard submission live, quality evidence published
Phase 9: Distributed Training — Stretch (Week 9+)
- entrenar native DDP infrastructure (TCP wire protocol v2, GradientServer, WorkerClient, PerBlockGradientAccumulator, RingAllReduce) — entrenar#133
- Wire DDP train_batch() into DistributedCudaTrainer — COMPLETE (train_loop_cuda_distributed, allreduce_impl, spawn_coordinator_thread)
- Multi-process launcher — COMPLETE (rank 0 auto-spawns GradientServer, all ranks connect as WorkerClient via --distributed CLI flags)
- wgpu backward pass in trueno (ALB-005) — for cross-vendor GPU support
- Full distributed training: 4090 + W5700X x2
- Milestone: Multi-GPU training demonstrated
16. Reproducibility Protocol
Every artifact in the albor pipeline is reproducible from source. This chapter documents the exact commands, seeds, and checksums needed to reproduce the full training pipeline from raw code corpora to trained model.
16.1 Artifact Tracking
| Artifact | How Recorded |
|---|---|
| Random seed | 42 (global), per-component seeds derived |
| Data versions | HuggingFace dataset commit SHAs + local repo git SHAs |
| Data provenance | docs/PROVENANCE.md (source path, git SHA, file count, token count per source) |
| Data checksums | SHA-256 of every Parquet shard (recorded in PROVENANCE.md) |
| Tokenizer v1 | models/albor-tokenizer/ (vocab.json + merges.txt + tokenizer.json) |
| Tokenizer v2 | models/albor-tokenizer-v2/tokenizer.json (ByteLevel BPE) |
| Training config | YAML checked into git (configs/train/*.yaml) |
| Checkpoint hashes | SHA-256 of model.safetensors |
| Software versions | apr --version, alimentar --version, pv --version |
| Hardware | nvidia-smi + free -h captured in training logs |
| Training logs | checkpoints/*/training.log + final_model.json |
| Eval results | configs/eval/*.jsonl (benchmarks) + eval scripts |
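The "per-component seeds derived" row does not pin down the derivation. One conventional deterministic scheme (an illustration, not necessarily what entrenar implements) hashes the global seed together with a component name:

```python
import hashlib

GLOBAL_SEED = 42

def component_seed(name: str, global_seed: int = GLOBAL_SEED) -> int:
    """Derive a stable 32-bit seed for a named component from the global seed."""
    digest = hashlib.sha256(f"{global_seed}:{name}".encode()).digest()
    return int.from_bytes(digest[:4], "little")

# Deterministic across runs and platforms; distinct per component
# with overwhelming probability.
assert component_seed("dataloader") == component_seed("dataloader")
assert 0 <= component_seed("dropout") < 2**32
```

Hash-based derivation avoids correlated RNG streams that can arise from naive schemes like `global_seed + rank`.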
16.2 Full Reproduction Commands
Step 1: Corpus Preparation
v1 pipeline (Tier 1 only, 17K rows):
# Import Tier 1 ground truth corpora
alimentar import local /path/to/depyler -o data/raw/depyler.parquet
alimentar import local /path/to/hf-ground-truth-corpus -o data/raw/hf.parquet
alimentar import local /path/to/jax-ground-truth-corpus -o data/raw/jax.parquet
alimentar import local /path/to/vllm-ground-truth-corpus -o data/raw/vllm.parquet
# Mix training split (weighted sampling)
alimentar mix \
data/raw/depyler.parquet:0.4 \
data/raw/hf.parquet:0.3 \
data/raw/jax.parquet:0.15 \
data/raw/vllm.parquet:0.15 \
-o data/tokenized/train/mixed.parquet \
--seed 42
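Conceptually, the mix step is seeded weighted sampling over sources, which is what makes `--seed 42` sufficient for reproducibility. A minimal sketch of that property (illustrative only, not the alimentar implementation; the row count is arbitrary):

```python
import random

def weighted_mix(sources, n, seed=42):
    """Sample n source labels according to weights, deterministically by seed."""
    rng = random.Random(seed)
    names = list(sources)
    weights = [sources[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

weights = {"depyler": 0.4, "hf": 0.3, "jax": 0.15, "vllm": 0.15}
mix = weighted_mix(weights, n=1000)
# Same seed, same weights: byte-identical mix every run
assert mix == weighted_mix(weights, n=1000)
```

Determinism here is the upstream property that checkpoint-level reproducibility (FALSIFY-ALBOR-003) depends on.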
v2 pipeline (Tier 1 10x + 8 Tier 2 repos, 45K rows → 68K sequences):
# Convert Tier 2 source repos to Parquet (alimentar can't read source dirs)
for repo in pytorch hf-repos mlflow vllm-full tgi algo-corpus cuda-python llms-with-hf; do
python3 scripts/source-to-parquet.py ~/src/$repo $repo data/parquet/tier2/$repo.parquet
done
# Mix Tier 1 (10x upsampled) + Tier 2 (1x)
alimentar mix \
data/parquet/depyler/shard_0000.parquet:10.0 \
data/parquet/hf-ground-truth/shard_0000.parquet:10.0 \
data/parquet/jax/shard_0000.parquet:10.0 \
data/parquet/vllm/shard_0000.parquet:10.0 \
data/parquet/tier2/pytorch.parquet:1.0 \
data/parquet/tier2/hf-repos.parquet:1.0 \
data/parquet/tier2/mlflow.parquet:1.0 \
data/parquet/tier2/vllm-full.parquet:1.0 \
data/parquet/tier2/tgi.parquet:1.0 \
data/parquet/tier2/algo-corpus.parquet:1.0 \
data/parquet/tier2/cuda-python.parquet:1.0 \
data/parquet/tier2/llms-with-hf.parquet:1.0 \
-o data/staging/mixed-expanded.parquet --seed 42
# Apply FIM (50% PSM)
alimentar fim data/staging/mixed-expanded.parquet \
-o data/staging/mixed-expanded-fim.parquet --rate 0.5 --format psm --seed 42
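The FIM step rewrites samples into prefix-suffix-middle (PSM) order around sentinel tokens so the model learns infilling. A sketch of the rearrangement, with hypothetical sentinel names (alimentar fim defines its own tokens and split policy):

```python
import random

def fim_psm(code: str, rng: random.Random) -> str:
    """Split code at two random points, emit <PRE>prefix<SUF>suffix<MID>middle."""
    i, j = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # PSM order: model sees prefix and suffix, then predicts the middle
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

rng = random.Random(42)
sample = fim_psm("def add(a, b):\n    return a + b\n", rng)
assert sample.startswith("<PRE>")
```

Reassembling prefix + middle + suffix recovers the original text, which is the roundtrip property a FIM transform must preserve.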
Step 2: Tokenizer Training
# v1 tokenizer (whitespace-split BPE — has ALB-036 limitation)
apr tokenize apply \
--data data/staging/corpus-raw.txt \
--vocab-size 32768 \
--algorithm bpe \
-o models/albor-tokenizer/ \
--max-lines 100000
# v2 tokenizer (ByteLevel BPE — preserves whitespace)
python scripts/train-tokenizer-v2.py \
--corpus data/staging/corpus-raw.txt \
--vocab-size 32768 \
--output models/albor-tokenizer-v2/
Step 3: Pre-Tokenization
# Pre-tokenize full training data (v2 tokenizer, 2048-token chunks)
python scripts/pretokenize.py \
--input data/tokenized/train/mixed.parquet \
--tokenizer models/albor-tokenizer-v2/tokenizer.json \
--seq-len 2048 \
--output data/pretokenized-2048/train/train.parquet
# Pre-tokenize validation data
python scripts/pretokenize.py \
--input data/tokenized/val/val.parquet \
--tokenizer models/albor-tokenizer-v2/tokenizer.json \
--seq-len 2048 \
--output data/pretokenized-2048/val/val.parquet
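Pre-tokenization ends by packing token IDs into fixed-length rows. A sketch of the packing step (illustrative of what scripts/pretokenize.py does, not its actual code; dropping the trailing remainder is an assumption here):

```python
def pack_sequences(token_ids, seq_len=2048):
    """Pack a token stream into non-overlapping fixed-length chunks.

    The trailing remainder shorter than seq_len is dropped so every
    training row has a uniform shape.
    """
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

chunks = pack_sequences(list(range(5000)), seq_len=2048)
assert len(chunks) == 2                    # 5000 // 2048 = 2 full rows
assert all(len(c) == 2048 for c in chunks)
```

This is consistent with the v2 dataset arithmetic: 67,977 sequences × 2048 tokens ≈ 139M tokens.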
Step 4: Model Training
# 50M pipeline validation (< 2 minutes)
make train-50m
# Equivalent to:
# apr train apply --task pretrain --config configs/train/pretrain-50m.yaml
# 350M base model, v2 data (~20 hours on RTX 4090)
apr train apply --task pretrain --config configs/train/pretrain-350m-v2.yaml
# v2 config: epochs=38, warmup=500, 67977 seqs, 5000 max_steps
# C-TRAINCFG-001 verified: steps_per_epoch=132, 38×132=5016 >= 5000
# Legacy v1 (22K seqs, fixed epochs=117 post ALB-060)
# apr train apply --task pretrain --config configs/train/pretrain-350m.yaml
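The C-TRAINCFG-001 check cited in the comments above is plain arithmetic: the epoch budget must cover max_steps, which is exactly the invariant ALB-060 violated (epochs=1 covered only 43 steps). A sketch of the check (the effective_batch value is back-derived from the spec's steps_per_epoch=132, not a documented config field):

```python
import math

def check_traincfg(num_sequences, effective_batch, epochs, max_steps):
    """C-TRAINCFG-001: the epoch budget must cover the requested step count."""
    steps_per_epoch = math.ceil(num_sequences / effective_batch)
    total_steps = epochs * steps_per_epoch
    assert total_steps >= max_steps, (
        f"config underruns: {epochs} epochs x {steps_per_epoch} steps/epoch "
        f"= {total_steps} < max_steps={max_steps}"
    )
    return steps_per_epoch, total_steps

# v2 config: 67,977 sequences, steps_per_epoch=132, epochs=38 → 5016 >= 5000
steps, total = check_traincfg(67977, effective_batch=515, epochs=38, max_steps=5000)
assert steps == 132 and total == 5016
```

Running the same check with an epochs=1 configuration raises, which is how a config-level contract catches ALB-060 before any GPU time is spent.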
Step 5: Checkpoint Conversion (for evaluation)
# Convert entrenar 1D-flat SafeTensors to realizar 2D format
python scripts/convert-checkpoint.py checkpoints/albor-base-350m/ \
--config configs/train/pretrain-350m.yaml
Step 6: Evaluation
# Validate all benchmarks (no model needed)
make eval-validate
# Perplexity evaluation (needs trained model)
make eval-perplexity-350m
# Monitor active training
make training-status
16.3 Key SHA-256 Checksums
See docs/PROVENANCE.md for complete checksums. Key artifacts:
| Artifact | SHA-256 (first 8 hex) |
|---|---|
| Training data (mixed.parquet) | bdfe8742 |
| Val data (val.parquet) | 6be03768 |
| v1 tokenizer (vocab.json) | aca6fa72 |
| v2 tokenizer (tokenizer.json) | d999cc9e |
| Pre-tokenized train (2048) | 4f54e422 |
| Pre-tokenized val (2048) | c9c1d093 |
16.4 Verification
# Verify data checksums
sha256sum data/tokenized/train/mixed.parquet
sha256sum data/pretokenized-2048/train/train.parquet
sha256sum models/albor-tokenizer-v2/tokenizer.json
# Verify training config reproducibility
apr train plan --task pretrain --config configs/train/pretrain-350m.yaml
# Verify contract integrity
pv validate contracts/*.yaml
pv coverage contracts
pv audit contracts/*.yaml
17. Success Criteria
Minimum Viable (Phase 3 complete)
- 350M base model trained on 4090 to convergence (target: ~10B tokens; current: 139M v2 dataset)
- FIM (fill-in-the-middle) training implemented and validated (ALB-018 FIXED — alimentar fim verified)
- HumanEval pass@1 > 8% (baseline Python capability, beat random)
- HumanEval-FIM working (model can infill Python code)
- Entire pipeline uses only sovereign stack components
- All training artifacts reproducible from spec
- All existing kernel contracts pass pv audit (Level 2+)
- pmat comply check passes on all modified components
Current blockers for Phase 3 completion:
- ALB-038 (Critical): entrenar saves initialization weights, not trained weights — FIXED (entrenar@91ba9da, @1ede409)
- ALB-035: No per-step loss logging during training — FIXED (entrenar@5d41a96)
- ALB-041: D2D buffer mismatch in backward_attention — FIXED (entrenar@a48e3d2)
- ALB-037: realizar ignores loaded weights — FIXED (e2e verified: realizar run loads 350M trained checkpoint, generates tokens from 218 tensors)
- ALB-043 (Critical): backward_ffn buffer overflow + missing SwiGLU gradients — FIXED (entrenar@f7805f1)
- ALB-044 (Critical): activation gradient clipping + CPU optimizer hyperparams — FIXED (entrenar@86eec38)
- ALB-059 (Critical): GEMM backward constructor n/k swapped, buffer overflow into optimizer states — FIXED (entrenar@846ae0c)
- ALB-040: GPU-resident pretraining — VERIFIED (350M CUDA test: 50 steps, loss 10.39→5.92, checkpoint valid, realizar inference works)
- ALB-042: CUDA runtime errors produce silent loss=0.0 — OPEN (workaround: CUDA_VISIBLE_DEVICES="")
- ALB-069 (Critical): PTX selp_f32 argument order in fused cross-entropy — FIXED (trueno@10bec89)
- ALB-060 (Critical): Training ran only 43/5000 steps (epochs=1) — CONFIG FIXED: C-TRAINCFG-001 contract + v2 config. V2 training (ALB-063) restarted after ALB-069 fix — PID 106929, loss=10.39 at step 1.
350M CUDA test results (50 steps, post ALB-059 fix):
- Loss: 10.39 → 5.92 (best: 5.53) — clear convergence with correct GEMM backward
- Training time: ~400s (~8s/step) with PTX; ~26s (~0.5s/step) with cuBLAS (ALB-075/077)
- Checkpoint: 1.59 GB SafeTensors, 218 tensors, config.json saved
- Checkpoint validation: PASS (weights trained, layers distinct)
- realizar inference: loads model, generates tokens (gibberish at 50 steps — expected)
- Perplexity: 31,926 (finite; random baseline ~32,768 for vocab 32K)
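The ~32,768 random baseline follows from perplexity being the exponential of mean cross-entropy: a model that assigns uniform probability 1/V to every next token has perplexity exactly V. A quick check:

```python
import math

vocab_size = 32768
# Uniform next-token distribution: cross-entropy is log(V) nats per token
cross_entropy = -math.log(1.0 / vocab_size)
perplexity = math.exp(cross_entropy)
assert round(perplexity) == vocab_size
# The 50-step checkpoint's 31,926 sits just below this ceiling,
# i.e. barely better than random, which is expected at 50 steps.
assert 31926 < perplexity
```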
350M v3 training (250K steps, codeparrot-clean, ALB-077 fix) — STOPPED:
- Final: step 28K, loss=6.43, val_ppl=1018, 6.7K tok/s, 19.3% MFU
- Plateau since step 12K — val_ppl stalled at ~1000, gnorm collapsed 3.0→0.13
- Root cause: ALB-079 (constant lr after warmup, no cosine decay) + ALB-080 (4K tokens/step, 48-128x too small)
- Checkpoints: step 1K-28K (1520 MB each, all verified OK)
- No NaN in 28K steps (ALB-077: tensor cores disabled, CUBLAS_DEFAULT_MATH)
350M v4 training (ALB-079 + ALB-080 fixes) — RESUMED from step 500:
- Fixes: cosine LR decay (entrenar PR #241) + gradient_accumulation=32 (131K tokens/step)
- Original run: 500 steps, val_ppl=1032.7 (matched v3 at 57% token budget)
- System reboot at step 553; resumed from step-500 checkpoint
- Extended resume: step 350 (cum. step 850), best loss=5.69 at step 262
- 111M tokens processed (2.1% of 5.3B available); loss plateau at mean ~6.65
- Cosine decay just engaging (lr 3.00e-4→2.98e-4); expect plateau break at step 1000+
- ZClip catching gradient spikes (z=2.0–4.0), gnorm healthy 0.05–0.32
- Throughput: 3,564–3,569 tok/s steady, 10.3% MFU, 14-16 GB / 24 GB VRAM
- Target: val_ppl < 100 by 1B tokens (~60 hours remaining)
- Same hardware (RTX 4090), same data (codeparrot-clean, 5.3B tokens available)
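The ALB-079 fix replaces a constant post-warmup learning rate with cosine decay. A sketch of the schedule shape, using this run's lr=3.0e-4 and warmup=500 (illustrative; entrenar PR #241 defines the actual schedule, and the total_steps/min_lr values here are assumptions):

```python
import math

def lr_at(step, max_lr=3.0e-4, warmup=500, total_steps=250_000, min_lr=0.0):
    """Linear warmup, then cosine decay to min_lr (the ALB-079 shape)."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

assert lr_at(0) == 0.0
assert abs(lr_at(500) - 3.0e-4) < 1e-18   # warmup complete, peak lr
assert lr_at(1000) < lr_at(500)            # monotone decay after warmup
assert lr_at(250_000) < 1e-12              # fully decayed to ~min_lr
```

Early in the run the decay is barely visible (lr 3.00e-4 → 2.98e-4 in the log above), which is the expected shape: cosine decay is nearly flat near progress 0 and steepest mid-schedule.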
Good (Phase 5 complete)
- Distillation from Qwen3.5-35B-A3B demonstrated (ALB-010); fallback: Qwen2.5-Coder-3B (dense)
- albor-distill-350m outperforms albor-base-350m on all code benchmarks
- HumanEval pass@1 > 15% (beat CodeGen-350M-mono’s 12.8% via distillation from 35B MoE teacher)
- MBPP pass@1 > 12%
- FIM infill working (qualitatively: model can complete Python between prefix and suffix)
- KD contract at Level 4 (Kani-proved KL non-negativity)
- All FALSIFY-ALBOR tests pass (001-006)
Full Success (Phase 8 complete)
- All 6 model variants benchmarked (base → distill → instruct → merged → pruned → q4)
- Benchmark trajectory published showing improvement at each stage
- Submitted to Big Code Models Leaderboard — first sub-1B model on the board
- Q4 model: <50ms/token on CPU, <10ms/token on GPU (code completion latency)
- Critical path gaps (ALB-001, 006, 009, 011, 018) closed with upstream fixes; ALB-010 (Qwen3.5-35B-A3B MoE inference) PR #133 MERGED, weight loading remaining
- Models published on HuggingFace as paiml/albor-python-*
- Q4 quantized model < 100MB, runs on consumer hardware
- All 8 kernel contracts written and verified (ALB-013–017, ALB-039–040, ALB-060)
- batuta falsify: Toyota Standard grade (≥90/108) — ACHIEVED: 100% (108/108 PASS)
- pmat TDG: Grade A on all touched components
- Test coverage ≥ 95%, mutation score ≥ 85% on all new code
- All 9 FALSIFY-ALBOR tests pass
- Verification DAG published via pv graph
Stretch Goals
- HumanEval pass@1 > 20% (strong distillation result at 350M)
- DS-1000 pass@1 > 10% (data science code generation)
- Editor integration: VS Code / Neovim / Helix extension using realizar as backend
- Distributed gradient-parallel training across 4090 + W5700X demonstrated (entrenar DDP #133 infra in place)
- apr pipeline apply reproduces entire ladder from bare metal to published model
- BabyLM 2026 submission using constrained data variant
- All critical kernels at Level 4 (Kani formal proofs)
- Lean 4 theorem stubs generated for core training loop invariants
18. Reference Commands
# ═══════════════════════════════════════════════════════════
# THE PIPELINE (two orchestrators working together)
# ═══════════════════════════════════════════════════════════
# Infrastructure provisioning (forjar — bare metal to ready state)
forjar validate -f configs/pipeline/infra-only.yaml # Validate
forjar apply -f configs/pipeline/infra-only.yaml # Provision
# ML pipeline orchestration (batuta playbook — data to published model)
batuta playbook validate configs/pipeline/albor-playbook.yaml # Validate DAG
batuta playbook run configs/pipeline/albor-playbook.yaml # Execute (resumable)
batuta playbook status configs/pipeline/albor-playbook.yaml # Check progress
# Unified pipeline (apr pipeline wraps forjar + batuta)
apr pipeline plan configs/pipeline/albor.yaml
apr pipeline apply configs/pipeline/albor.yaml
apr pipeline status
# ═══════════════════════════════════════════════════════════
# DATA PIPELINE
# ═══════════════════════════════════════════════════════════
# Import local codebases
alimentar import local /path/to/codebase -o data/raw/corpus.parquet
# Weighted mix with upsampling
alimentar mix a.parquet:0.4 b.parquet:0.3 c.parquet:0.15 d.parquet:0.15 \
-o data/tokenized/train/mixed.parquet --seed 42
# FIM transform
alimentar fim data.parquet -o data-fim.parquet --rate 0.5 --format psm
# Quality profiles
alimentar quality profiles
# ═══════════════════════════════════════════════════════════
# TOKENIZER
# ═══════════════════════════════════════════════════════════
# v1: BPE with apr (whitespace-split — ALB-036 limitation)
apr tokenize plan --data corpus.txt --vocab-size 32768
apr tokenize apply --data corpus.txt --vocab-size 32768 --algorithm bpe -o tokenizer/
# v2: ByteLevel BPE with Python (recommended — preserves whitespace)
python scripts/train-tokenizer-v2.py --corpus corpus.txt --vocab-size 32768 \
--output models/albor-tokenizer-v2/
# Pre-tokenize for training (bypasses tokenizer format gap ALB-033)
python scripts/pretokenize.py --input data.parquet \
--tokenizer models/albor-tokenizer-v2/tokenizer.json \
--seq-len 2048 --output data/pretokenized-2048/train/train.parquet
# ═══════════════════════════════════════════════════════════
# TRAINING
# ═══════════════════════════════════════════════════════════
# Plan (dry-run, validate config)
apr train plan --task pretrain --config configs/train/pretrain-350m.yaml
# Train (execute)
apr train apply --task pretrain --config configs/train/pretrain-350m.yaml
# Makefile shortcuts
make train-50m # ~2 min on RTX 4090
make train-350m # ~20 hours on RTX 4090
make training-status # Check running training
# ═══════════════════════════════════════════════════════════
# EVALUATION
# ═══════════════════════════════════════════════════════════
# apr eval (perplexity — ALB-037 FIXED, realizar loads checkpoints)
apr eval checkpoints/albor-base-350m/model.safetensors \
--dataset custom --text "def foo():" --threshold 30
# Python eval scripts (supplement)
python scripts/eval-code.py configs/eval/humaneval-subset.jsonl --validate-only
python scripts/eval-code.py configs/eval/humaneval-subset.jsonl --api http://localhost:8080
python scripts/eval-perplexity.py checkpoints/albor-base-350m/ \
--data data/pretokenized-2048/val/val.parquet --seq-len 2048 --threshold 30
# Convert entrenar checkpoint for realizar
python scripts/convert-checkpoint.py checkpoints/albor-base-350m/ \
--config configs/train/pretrain-350m.yaml
# Makefile shortcuts
make eval-validate # Validate all benchmark canonical solutions
make eval-perplexity-350m # Run perplexity eval
# ═══════════════════════════════════════════════════════════
# MONITORING (run in a separate terminal during training)
# ═══════════════════════════════════════════════════════════
bash scripts/monitor-training.sh # Training process + GPU + log
apr monitor ./checkpoints/albor-base-350m/ # Live training TUI (ALB-025 FIXED)
apr experiment view --db .entrenar/experiments.db # Browse past experiments
# ═══════════════════════════════════════════════════════════
# POST-TRAINING (Phases 4-6)
# ═══════════════════════════════════════════════════════════
# Distillation
apr distill --config configs/train/distill.yaml --plan
apr distill --config configs/train/distill.yaml --stage precompute
apr distill --config configs/train/distill.yaml --stage train
# Fine-tuning
apr finetune --plan --model-size 350M --vram 24 --method lora --rank 16
# Model operations
apr merge a.safetensors b.safetensors --strategy slerp -o merged.safetensors
apr prune model.safetensors --method wanda --sparsity 0.5 -o pruned.safetensors
apr quantize model.safetensors --method q4_k -o model.gguf
apr export model.safetensors --format gguf -o model.gguf
apr publish checkpoints/albor-350m/ paiml/albor-base-350m
# ═══════════════════════════════════════════════════════════
# QUALITY (bashrs is KING of linting)
# ═══════════════════════════════════════════════════════════
# bashrs — sovereign linter for all shell artifacts
bashrs make lint Makefile # Makefile quality
bashrs classify Makefile # Safety classification
bashrs make purify Makefile # Deterministic output
# provable-contracts — kernel correctness
pv validate contracts/*.yaml # Contract schemas
pv coverage contracts # Obligation coverage
pv generate contracts/*.yaml # Scaffold + tests + harnesses
pv book contracts/ # mdBook pages
pv audit contracts/*.yaml # Audit for issues
pv graph contracts/ --format mermaid # Verification DAG
pv lean contracts/*.yaml # Lean 4 theorem stubs
# batuta — falsification
batuta falsify . --format markdown # 108-item checklist
batuta oracle --list # Stack components
batuta oracle --local # Local workspace status
# pmat — code quality (upstream repos)
pmat tdg baseline create # TDG baseline
pmat comply check --strict ../aprender
# ═══════════════════════════════════════════════════════════
# VALIDATION (Makefile)
# ═══════════════════════════════════════════════════════════
make validate # All validation (YAML + contracts + forjar + Makefile)
make lint # Lint with bashrs
make eval-validate # Validate benchmark canonical solutions
make dogfood # Full 12-section dogfooding suite
make book # Build mdBook
make help # Show all targets
knowledge-distillation-kernel-v1
Version: 1.0.0
Knowledge distillation kernel — temperature-scaled KL divergence + cross-entropy
References
- Hinton et al. (2015) Distilling the Knowledge in a Neural Network
- Ba & Caruana (2014) Do Deep Nets Really Need to be Deep?
Dependencies
Dependency Graph
graph LR
knowledge_distillation_kernel_v1["knowledge-distillation-kernel-v1"] --> softmax_kernel_v1["softmax-kernel-v1"]
knowledge_distillation_kernel_v1["knowledge-distillation-kernel-v1"] --> cross_entropy_kernel_v1["cross-entropy-kernel-v1"]
Equations
kd_loss
$$ L_KD = alpha * KL(softmax(z_t/T) || softmax(z_s/T)) * T^2 + (1-alpha) * CE(y, z_s) $$
Domain: $z_t, z_s in R^V, T > 0, alpha in [0,1]$
Codomain: $L_KD in [0, +inf)$
Invariants:
- $L_KD >= 0 (non-negativity from KL and CE non-negativity)$
- $alpha=0 => L_KD = CE(y, z_s) (pure hard label)$
- $alpha=1 => L_KD = T^2 * KL(teacher || student) (pure soft label)$
kl_divergence
$$ KL(P || Q) = sum_i P(i) * \log(P(i) / Q(i)) $$
Domain: $P, Q valid probability distributions over V classes$
Codomain: $KL in [0, +inf)$
Invariants:
- $KL(P || Q) >= 0 (Gibbs inequality)$
- $KL(P || P) = 0 (identity)$
temperature_softmax
$$ softmax(z/T)_i = \exp(z_i/T) / sum_j \exp(z_j/T) $$
Domain: $z in R^V, T > 0$
Codomain: $softmax in (0, 1)^V, sum = 1$
Invariants:
- $All outputs strictly positive$
- $Outputs sum to 1$
- $T -> inf => uniform distribution$
- $T -> 0 => one-hot on argmax$
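The three equations above compose directly. A minimal pure-Python reference (illustrative only — not the aprender kernel API) that a falsification test could diff against:

```python
import math

def softmax(z, T=1.0):
    """Temperature-scaled softmax: exp(z_i/T) / sum_j exp(z_j/T)."""
    m = max(z)  # max-subtraction for numerical stability
    e = [math.exp((x - m) / T) for x in z]
    s = sum(e)
    return [x / s for x in e]

def kd_loss(z_t, z_s, y, T=2.0, alpha=0.5):
    """L_KD = alpha * T^2 * KL(softmax(z_t/T) || softmax(z_s/T)) + (1-alpha) * CE(y, z_s)."""
    p = softmax(z_t, T)              # teacher soft targets
    q = softmax(z_s, T)              # student soft predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    ce = -math.log(softmax(z_s)[y])  # hard-label cross-entropy at T=1
    return alpha * T * T * kl + (1 - alpha) * ce
```

At alpha=0 this reduces to plain cross-entropy and at alpha=1 to T²-scaled KL, matching the invariants above.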
Proof Obligations
| # | Type | Property | Formal |
|---|---|---|---|
| 1 | invariant | KL non-negativity | $KL(P || Q) >= 0 for all valid P, Q$ |
| 2 | bound | Temperature scaling produces valid distribution | $softmax(z/T)_i > 0 and sum_i softmax(z/T)_i = 1 for T > 0$ |
| 3 | invariant | Alpha interpolation bound | $alpha=0 => L_KD = CE; alpha=1 => L_KD = T^2 * KL$ |
| 4 | equivalence | Gradient correctness | $analytical gradient matches numerical gradient within 1e-4$ |
| 5 | invariant | T^2 gradient compensation | $gradient magnitude approximately constant across T in [1, 10]$ |
| 6 | equivalence | SIMD matches scalar within ULP | $|kd_simd(x) - kd_scalar(x)| <= 1 ULP elementwise$ |
Kernel Phases
- teacher_softmax: Compute softmax(z_t / T) — teacher soft targets — output is valid probability distribution
- student_softmax: Compute softmax(z_s / T) — student soft predictions — output is valid probability distribution
- kl_divergence: Compute KL(teacher || student) — result >= 0
- cross_entropy: Compute CE(y, z_s) — hard label loss — result >= 0
- combine: Combine: alpha * T^2 * KL + (1-alpha) * CE — result >= 0
Falsification Tests
| ID | Rule | Prediction | If Fails |
|---|---|---|---|
| FALSIFY-KD-001 | KL non-negativity | KL(teacher || student) >= 0 for all batches | Log-domain computation error or softmax numerical instability |
| FALSIFY-KD-002 | Temperature boundary | softmax(z/T) approaches uniform as T -> inf | Overflow in exp(z/T) for small T or large z |
| FALSIFY-KD-003 | Alpha boundary conditions | alpha=0 => KD loss equals CE loss exactly | Alpha interpolation not applied correctly |
| FALSIFY-KD-004 | Gradient correctness | Analytical gradient matches finite-difference within 1e-4 | Derivative of KL or CE computed incorrectly |
| FALSIFY-KD-005 | Distillation value | albor-distill avg benchmark > albor-base avg benchmark | Teacher logits corrupted, T too high/low, or alpha miscalibrated |
Kani Harnesses
| ID | Obligation | Bound | Strategy |
|---|---|---|---|
| KANI-KD-001 | KD-INV-001 | 8 | stub_float |
| KANI-KD-002 | KD-INV-002 | 8 | stub_float |
QA Gate
Knowledge Distillation Contract (F-KD-001)
KD loss correctness for Albor distillation pipeline
Checks: kl_non_negativity, temperature_validity, alpha_interpolation, gradient_correctness
Pass criteria: All 5 falsification tests pass + 2 Kani harnesses verify
bpe-tokenizer-kernel-v1
Version: 1.0.0
BPE tokenizer kernel — byte-pair encoding with lossless roundtrip
References
- Sennrich et al. (2016) Neural Machine Translation of Rare Words with Subword Units
- Gage (1994) A New Algorithm for Data Compression
Equations
bpe_merge
$$ merge(a, b) = ab where (a,b) = argmin_{(p,q) in pairs} rank(p,q) $$
Domain: $token sequence with adjacent pairs$
Codomain: $shorter token sequence$
Invariants:
- $Each merge reduces sequence length by at least 1$
- $Merge ordering is deterministic$
- $Final sequence uses only tokens in vocabulary$
roundtrip
$$ decode(encode(x)) = x for all x in UTF-8 $$
Domain: $x: valid UTF-8 string$
Codomain: $encode(x): token ID sequence (Vec)$
Invariants:
- $Lossless roundtrip for all valid UTF-8$
- $Empty input maps to empty output$
- $Byte-level fallback ensures all byte values representable$
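A simplified byte-level BPE sketch showing the roundtrip and byte-completeness invariants (merge rules applied in rank order over full passes; the production tokenizer is more involved):

```python
def encode(text, merges):
    """Byte-level BPE: every byte 0..255 is a base token (no UNK possible),
    then merge rules are applied deterministically in rank order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for pair, merged in merges:            # list order == merge rank
        i, out = 0, []
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)         # each merge shortens the sequence
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

def decode(tokens):
    """Lossless inverse: concatenate token bytes, decode as UTF-8."""
    return b"".join(tokens).decode("utf-8")
```

Because the base vocabulary covers all 256 byte values, `decode(encode(x)) = x` holds for any valid UTF-8 input, including the empty string.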
Proof Obligations
| # | Type | Property | Formal |
|---|---|---|---|
| 1 | invariant | Roundtrip lossless | $decode(encode(x)) = x for all valid UTF-8 x$ |
| 2 | invariant | Byte-level completeness | $Every byte value 0x00-0xFF is representable (no UNK)$ |
| 3 | idempotency | Deterministic encoding | $encode(x) = encode(x) for repeated calls on same input$ |
| 4 | invariant | Vocab size correctness | $len(tokenizer.vocab) = V (configured vocab size)$ |
| 5 | invariant | FIM sentinel tokens are atomic | $encode(<fim_prefix>) returns exactly one token ID$ |
| 6 | invariant | Empty input handling | $encode(‘’) = [] and decode([]) = ‘’$ |
Kernel Phases
- byte_encode: Convert UTF-8 string to byte sequence — bytes are valid UTF-8 representation
- initial_tokenize: Map bytes to initial token IDs (byte-level) — all bytes have a token mapping
- bpe_merge: Iteratively apply BPE merge rules in priority order — sequence length decreases monotonically
- output: Return final token ID sequence — all IDs in [0, vocab_size)
Falsification Tests
| ID | Rule | Prediction | If Fails |
|---|---|---|---|
| FALSIFY-TOK-001 | Roundtrip invariant | decode(encode(x)) = x for random UTF-8 strings | Merge rule corrupts byte boundaries or special chars |
| FALSIFY-TOK-002 | Byte completeness | Every single-byte string encodes without UNK | Byte-level fallback tokens missing from vocabulary |
| FALSIFY-TOK-003 | Determinism | Same input always produces same tokens | Non-deterministic merge ordering (HashMap or thread race) |
| FALSIFY-TOK-004 | FIM sentinels | Each FIM sentinel token encodes to exactly one token | Sentinel tokens not added to vocabulary as special tokens |
Kani Harnesses
| ID | Obligation | Bound | Strategy |
|---|---|---|---|
| KANI-TOK-001 | TOK-INV-001 | 16 | exhaustive |
QA Gate
BPE Tokenizer Contract (F-TOK-001)
Tokenizer correctness for Albor vocabulary
Checks: roundtrip_lossless, byte_completeness, deterministic_encoding, fim_sentinel_atomic
Pass criteria: All 4 falsification tests pass + Kani roundtrip harness verifies
gradient-accumulation-kernel-v1
Version: 1.0.0
Gradient accumulation kernel — numerical equivalence of micro-batch accumulation
References
- Goyal et al. (2017) Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Dependencies
Dependency Graph
graph LR
gradient_accumulation_kernel_v1["gradient-accumulation-kernel-v1"] --> adamw_kernel_v1["adamw-kernel-v1"]
Equations
accumulation
$$ G_accum = (1/N) * sum_{i=1}^{N} g_i $$
Domain: $g_i: gradient from micro-batch i, N: accumulation steps$
Codomain: $G_accum: accumulated gradient tensor$
Invariants:
- $G_accum approximates G_full within fp tolerance$
- $N=1 => G_accum = g_1 exactly$
loss_scaling
$$ L_scaled = (1/N) * L_micro $$
Domain: $L_micro: micro-batch loss, N: accumulation steps$
Codomain: $L_scaled: scaled loss for backward pass$
Invariants:
- $Total loss = mean of micro-batch losses (not sum)$
- $Gradients are correctly scaled by 1/N$
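The equivalence the contract demands can be stated in a few lines of pure Python (gradients as flat lists; illustrative, not the entrenar trainer):

```python
def full_batch_grad(grads):
    """Reference path: mean gradient over all N micro-batch gradients."""
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

def accumulate(grads):
    """Accumulation path: G += (1/N) * g_i, one micro-batch at a time."""
    n = len(grads)
    acc = [0.0] * len(grads[0])  # fresh (zeroed) buffer per cycle
    for g in grads:
        for i, gi in enumerate(g):
            acc[i] += gi / n     # scale by 1/N at accumulation time
    return acc
```

FALSIFY-GA-001 amounts to asserting these two paths agree within fp tolerance, with exact equality at N=1.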
Proof Obligations
| # | Type | Property | Formal |
|---|---|---|---|
| 1 | equivalence | Numerical equivalence | $||G_accum - G_full|| < epsilon (1e-5 fp32, 1e-3 fp16)$ |
| 2 | invariant | Loss scaling correctness | $Total loss = mean(micro_batch_losses)$ |
| 3 | invariant | Gradient zeroing between cycles | $No stale gradients from previous accumulation cycle$ |
| 4 | invariant | Optimizer step frequency | $optimizer.step() called once per N micro-batches$ |
| 5 | invariant | Mixed precision accumulation in fp32 | $Accumulation buffer dtype is fp32 even when forward uses fp16$ |
| 6 | invariant | Gradient clipping after accumulation | $Clipping applied to accumulated gradient, not per micro-batch$ |
Kernel Phases
- zero_gradients: Zero gradient buffers at start of accumulation cycle — all gradient values are 0.0
- accumulate: Add scaled micro-batch gradients: G += (1/N) * g_i — accumulation buffer is fp32
- clip: Apply gradient clipping to accumulated gradient — ||G_clipped|| <= max_norm
- step: Optimizer updates parameters using accumulated gradient — called exactly once per N micro-batches
Falsification Tests
| ID | Rule | Prediction | If Fails |
|---|---|---|---|
| FALSIFY-GA-001 | Numerical equivalence | Accumulated gradient matches full-batch gradient within tolerance | Scaling factor (1/N) not applied, or accumulation buffer wrong dtype |
| FALSIFY-GA-002 | Gradient zeroing | No gradient leakage between accumulation cycles | Gradient buffers not zeroed before new cycle |
| FALSIFY-GA-003 | Step count | Exactly 3 optimizer steps for 3N micro-batches | Step called per micro-batch instead of per cycle |
| FALSIFY-GA-004 | Clip after accumulate | One large micro-batch gradient triggers clipping once on total | Clipping applied per micro-batch instead of on accumulated total |
Kani Harnesses
| ID | Obligation | Bound | Strategy |
|---|---|---|---|
| KANI-GA-001 | GA-EQ-001 | 4 | stub_float |
| KANI-GA-002 | GA-INV-001 | 8 | exhaustive |
QA Gate
Gradient Accumulation Contract (F-GA-001)
Gradient accumulation correctness for Albor training
Checks: numerical_equivalence, gradient_zeroing, step_count, clip_after_accumulate
Pass criteria: All 4 falsification tests pass + 2 Kani harnesses verify
model-merging-kernel-v1
Version: 1.0.0
Model merging kernel — SLERP, TIES, and DARE weight interpolation
References
- Shoemake (1985) Animating Rotation with Quaternion Curves (SLERP)
- Yadav et al. (2023) TIES-Merging: Resolving Interference When Merging Models
- Yu et al. (2023) Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (DARE)
Equations
dare
$$ tau_tilde_i = m_i * tau_i / (1-p) where m_i ~ Bernoulli(1-p) $$
Domain: $tau_i (task vector), p in [0, 1) (drop probability)$
Codomain: $tau_tilde_i: rescaled sparse task vector$
Invariants:
- $E[tau_tilde] = tau (unbiased estimator)$
- $Sparsity approximately p$
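The DARE drop-and-rescale step fits in one line of Python (illustrative sketch, not the apr merge implementation):

```python
import random

def dare(tau, p, rng):
    """Drop each task-vector entry with probability p; rescale survivors
    by 1/(1-p) so that E[tau_tilde] = tau (unbiased estimator)."""
    return [t / (1.0 - p) if rng.random() >= p else 0.0 for t in tau]
```

Averaging many samples recovers tau, which is exactly what FALSIFY-MERGE-003 checks.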
slerp
$$ SLERP(w1, w2, t) = sin((1-t)Omega)/sin(Omega) * w1 + sin(tOmega)/sin(Omega) * w2 $$
Domain: $w1, w2 in R^n (weight vectors), t in [0, 1], cos(Omega) = w1.w2 / (||w1|| * ||w2||)$
Codomain: $result in R^n with ||result|| approximately ||w1||$
Invariants:
- $SLERP(w1, w2, 0) = w1 (left boundary)$
- $SLERP(w1, w2, 1) = w2 (right boundary)$
- $||SLERP(w1, w2, t)|| approximately ||w1|| for normalized inputs$
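The SLERP equation and its boundary invariants, as a pure-Python sketch over flat weight vectors (illustrative):

```python
import math

def slerp(w1, w2, t):
    """SLERP(w1, w2, t) = sin((1-t)Ω)/sin(Ω) * w1 + sin(tΩ)/sin(Ω) * w2."""
    dot = sum(a * b for a, b in zip(w1, w2))
    n1 = math.sqrt(sum(a * a for a in w1))
    n2 = math.sqrt(sum(b * b for b in w2))
    omega = math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))
    if omega < 1e-8:  # nearly parallel vectors: fall back to LERP
        return [(1 - t) * a + t * b for a, b in zip(w1, w2)]
    s = math.sin(omega)
    c1, c2 = math.sin((1 - t) * omega) / s, math.sin(t * omega) / s
    return [c1 * a + c2 * b for a, b in zip(w1, w2)]
```

Unlike LERP, the interpolant stays on the sphere: for normalized inputs the output norm stays near ||w1|| for all t, which is the FALSIFY-MERGE-001 bound.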
ties
$$ w_merged = w_base + lambda * elect(trim(tau_1, …, tau_n)) $$
Domain: $tau_i = w_i - w_base (task vectors), trim ratio k in [0,1]$
Codomain: $w_merged in R^n$
Invariants:
- $After trim(k%), exactly k% of delta weights are zeroed per layer$
- $Sign election resolves conflicts by majority vote$
Proof Obligations
| # | Type | Property | Formal |
|---|---|---|---|
| 1 | bound | SLERP interpolation bound | $||SLERP(w1, w2, t)|| within 1% of ||w1|| for normalized inputs$ |
| 2 | invariant | SLERP boundary conditions | $SLERP(w1, w2, 0) = w1 and SLERP(w1, w2, 1) = w2$ |
| 3 | invariant | TIES trim sparsity | $After trim(k%), exactly k% of deltas are zero$ |
| 4 | invariant | DARE unbiased estimator | $E[tau_tilde] = tau over many samples$ |
| 5 | invariant | Architecture compatibility check | $Merge rejects incompatible architectures with clear error$ |
Kernel Phases
- validate_architectures: Verify all input models have same architecture — hidden_size, num_layers, vocab_size match
- compute_task_vectors: Compute delta from base: tau_i = w_i - w_base — tau has same shape as w
- merge_weights: Apply SLERP/TIES/DARE to combine weights — output weights are finite
Falsification Tests
| ID | Rule | Prediction | If Fails |
|---|---|---|---|
| FALSIFY-MERGE-001 | SLERP interpolation bound | ||SLERP(w1, w2, t)|| within 1% of ||w1|| for normalized inputs | SLERP uses LERP instead, or normalization missing |
| FALSIFY-MERGE-002 | SLERP boundary | SLERP(w1, w2, 0) = w1 exactly (within fp tolerance) | Off-by-one in interpolation parameter |
| FALSIFY-MERGE-003 | DARE unbiased | Average of 10000 DARE samples within 1e-2 of original | Rescaling factor (1-p) not applied correctly |
Kani Harnesses
| ID | Obligation | Bound | Strategy |
|---|---|---|---|
| KANI-MERGE-001 | MERGE-BND-001 | 4 | stub_float |
QA Gate
Model Merging Contract (F-MERGE-001)
Weight merging correctness for Albor post-training
Checks: slerp_bound, slerp_boundary, dare_unbiased
Pass criteria: All 3 falsification tests pass + Kani SLERP harness verifies
pruning-kernel-v1
Version: 1.0.0
Pruning kernel — WANDA and magnitude-based weight pruning
References
- Sun et al. (2023) A Simple and Effective Pruning Approach for Large Language Models (WANDA)
- Han et al. (2015) Learning both Weights and Connections for Efficient Neural Networks
Equations
magnitude_score
$$ score(w_ij) = |w_ij| $$
Domain: $w_ij: weight value$
Codomain: $score in [0, +inf)$
Invariants:
- $score >= 0$
- $score = 0 iff w_ij = 0$
sparsity
$$ s = |{w : w = 0}| / |w| $$
Domain: $w: weight tensor$
Codomain: $s in [0, 1]$
Invariants:
- $s = 0 means no pruning$
- $s = 1 means all weights zeroed$
- $After pruning with target s, achieved sparsity within 0.1% of s$
wanda_score
$$ score(w_ij) = |w_ij| * ||X_j||_2 $$
Domain: $w_ij: weight, X_j: activation column vector$
Codomain: $score in [0, +inf)$
Invariants:
- $score >= 0 (product of norms)$
- $score = 0 iff w_ij = 0 or X_j = 0$
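A dense-matrix sketch of WANDA scoring and masking (illustrative; the apr prune implementation works per layer with calibration activations):

```python
def wanda_prune(W, x_norms, sparsity):
    """Zero the lowest-scoring fraction of weights, where
    score(w_ij) = |w_ij| * ||X_j||_2 (activation-aware)."""
    scores = [(abs(w) * x_norms[j], i, j)
              for i, row in enumerate(W) for j, w in enumerate(row)]
    scores.sort()                           # ascending: prune lowest scores
    k = int(round(sparsity * len(scores)))  # number of weights to zero
    pruned = [row[:] for row in W]
    for _, i, j in scores[:k]:
        pruned[i][j] = 0.0
    return pruned
```

A large activation norm can keep a small weight alive while a larger weight on a cold activation column is pruned — the WANDA activation-dependency invariant. Sparsity 0 is the identity and sparsity 1 zeroes everything.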
Proof Obligations
| # | Type | Property | Formal |
|---|---|---|---|
| 1 | invariant | Sparsity target met | $Achieved sparsity within +/-0.1% of target$ |
| 2 | ordering | Score ordering preserved | $All pruned weights have score <= all surviving weights$ |
| 3 | invariant | WANDA activation dependency | $Same weight magnitude + different activation norms => different WANDA scores$ |
| 4 | invariant | Zero sparsity is identity | $prune(model, sparsity=0) returns original model unchanged$ |
| 5 | invariant | Full sparsity zeroes all | $prune(model, sparsity=1.0) zeroes all prunable weights$ |
| 6 | invariant | Embedding layer excluded | $Embedding and output projection weights untouched by pruning$ |
Kernel Phases
- compute_scores: Compute importance score for each weight — scores are non-negative
- determine_threshold: Find threshold score for target sparsity — threshold partitions weights into keep/prune sets
- apply_mask: Zero out weights below threshold — sparsity matches target within tolerance
Falsification Tests
| ID | Rule | Prediction | If Fails |
|---|---|---|---|
| FALSIFY-PRUNE-001 | Sparsity guarantee | Exactly 50% of weights zero after prune --sparsity 0.5 | Threshold computation error or layer exclusion bug |
| FALSIFY-PRUNE-002 | Score ordering | All pruned weights have score <= all surviving weights | Sorting or partitioning algorithm bug |
| FALSIFY-PRUNE-003 | Identity at zero sparsity | Pruning with sparsity=0 returns original weights | Off-by-one in threshold or mask computation |
Kani Harnesses
| ID | Obligation | Bound | Strategy |
|---|---|---|---|
| KANI-PRUNE-001 | PRUNE-INV-001 | 16 | stub_float |
QA Gate
Pruning Contract (F-PRUNE-001)
Weight pruning correctness for Albor model compression
Checks: sparsity_guarantee, score_ordering, identity_at_zero
Pass criteria: All 3 falsification tests pass + Kani sparsity harness verifies
training-memory-kernel-v1
Version: 1.0.0
Training memory estimation kernel — closed-form VRAM projection from architecture
References
- Korthikanti et al. (2022) Reducing Activation Recomputation in Large Transformer Models
- Rajbhandari et al. (2020) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Dependency Graph
graph LR
training_gpu_kernel_v1["training-gpu-kernel-v1"] --> training_memory_kernel_v1["training-memory-kernel-v1"]
Equations
activation_memory
$$ M_act = L × S × H × K × 4 $$
where K = 10 (Q, K, V, attn_scores, attn_out, gate, up, down, 2×residual)
Domain: $L: num_layers, S: seq_len, H: hidden_size, K: activation tensor count per layer (upper bound), 4: bytes per f32 element $
Codomain: $M_act: peak activation memory in bytes (upper bound)$
Invariants:
- $entrenar processes batch items sequentially — activation memory is per single sequence$
- $K=10 is conservative upper bound; actual depends on tensor lifetime overlap$
- $Gradient checkpointing reduces M_act to O(\sqrt{L}) but is not default$
gradient_memory
$$ M_grad = P_total × 4 $$
Domain: $P_total: parameter count$
Codomain: $M_grad: gradient memory in bytes (exact)$
Invariants:
- $Gradients always f32 regardless of mixed precision mode$
- $One gradient tensor per parameter$
optimizer_memory
$$ M_opt = P_total × 8 $$
Domain: $P_total: parameter count$
Codomain: $M_opt: AdamW optimizer state memory in bytes (exact)$
Invariants:
- $AdamW stores first moment (m) and second moment (v), both f32$
- $M_opt = P × 4 (m) + P × 4 (v) = P × 8$
parameter_count
$$ P_embed = V × H $$
$$ P_layer = 2H + H² + H×D_kv + H×D_kv + H² + H×I + H×I + I×H = 2H + 2H² + 2H×D_kv + 3H×I $$
$$ P_norm = H $$
$$ P_total = P_embed + L × P_layer + P_norm $$
Domain: $V: vocab_size, H: hidden_size, L: num_hidden_layers, D_kv: num_kv_heads × head_dim, I: intermediate_size, head_dim: H / num_attention_heads $
Codomain: $P_total: total trainable parameter count (exact)$
Invariants:
- $P_total is deterministic given architecture — no randomness$
- $P_embed dominates for large vocab; P_layer dominates for deep models$
total_memory
$$ M_total = M_weights + M_grad + M_opt + M_act + M_cuda $$
Domain: $M_cuda \approx 512 MB (CUDA context, cuBLAS workspace, allocator overhead)$
Codomain: $M_total: total estimated memory in bytes$
Invariants:
- $M_total is an upper bound — actual usage may be lower due to tensor reuse$
- $Does not include KV cache (inference only, not training)$
- $entrenar hybrid mode: weights/grads/optimizer live in CPU RAM; only matmul operands transfer to GPU$
- $In hybrid mode, VRAM \approx M_cuda + max(matmul_operand_pair); CPU RAM \approx M_weights + M_grad + M_opt + M_act$
- $M_total represents peak system memory (CPU+GPU) needed, not VRAM alone$
weight_memory
$$ M_weights = P_total × B_w $$
Domain: $P_total: parameter count, B_w: bytes per weight (4 for f32, 2 for fp16/bf16)$
Codomain: $M_weights: weight memory in bytes (exact)$
Invariants:
- $Mixed precision stores master weights in f32 + fp16 copy: M_weights = P × (4 + 2)$
- $entrenar current impl: always f32 storage, fp16 cast at matmul site$
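The closed-form equations above combine into a single estimator. A pure-Python sketch of the full-resident (non-hybrid) f32/AdamW accounting; the 512 MB M_cuda constant comes from total_memory above, and all other terms follow the formulas verbatim:

```python
def estimate_memory(V, H, L, I, n_kv_heads, n_heads, S, K=10):
    """Closed-form training memory projection in bytes (f32 weights, AdamW)."""
    head_dim = H // n_heads
    d_kv = n_kv_heads * head_dim
    p_layer = 2 * H + 2 * H * H + 2 * H * d_kv + 3 * H * I
    p_total = V * H + L * p_layer + H      # P_embed + L*P_layer + P_norm
    m = {
        "params":      p_total,
        "weights":     p_total * 4,        # f32 storage
        "grads":       p_total * 4,        # gradients always f32
        "optimizer":   p_total * 8,        # AdamW m + v, both f32
        "activations": L * S * H * K * 4,  # conservative upper bound
    }
    m["total"] = (m["weights"] + m["grads"] + m["optimizer"]
                  + m["activations"] + 512 * 2**20)  # + M_cuda
    return m
```

The estimate is deterministic given the architecture, and `apr train plan` reports the same breakdown before any allocation happens.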
Proof Obligations
| # | Type | Property | Formal |
|---|---|---|---|
| 1 | equivalence | Parameter count is exact | $P_total = P_embed + L × P_layer + P_norm for LLaMA architecture$ |
| 2 | equivalence | Weight memory is exact | $M_weights = P_total × sizeof(dtype)$ |
| 3 | equivalence | Gradient memory is exact | $M_grad = P_total × 4 (always f32)$ |
| 4 | equivalence | Optimizer memory is exact for AdamW | $M_opt = P_total × 8 (two f32 state tensors)$ |
| 5 | bound | Activation memory is upper bound | $M_act_actual <= L × S × H × K × 4$ |
Falsification Tests
| ID | Rule | Prediction | If Fails |
|---|---|---|---|
| FALSIFY-MEM-001 | Parameter count matches model | P_total from formula equals Transformer::parameters().len() sum of element counts | Architecture equation wrong or model has extra parameters |
| FALSIFY-MEM-002 | Activation upper bound holds | Peak RSS during forward pass <= M_act formula | K factor too low, or hidden intermediate tensors not counted |
| FALSIFY-MEM-003 | Total estimate covers actual GPU usage | nvidia-smi peak memory <= M_total | Missing memory component or CUDA overhead underestimated |
Kani Harnesses
| ID | Obligation | Bound | Strategy |
|---|---|---|---|
| KANI-MEM-001 | MEM-EXACT-001 | 4 | exhaustive |
QA Gate
Training Memory Estimation Contract (F-MEM-001)
VRAM estimation correctness for apr train plan
Checks: parameter_count_exact, activation_upper_bound, total_covers_actual
Pass criteria: All 3 falsification tests pass
training-gpu-kernel-v1
Version: 1.0.0
GPU-resident pretraining kernel — CudaTransformerBlock wired into TransformerTrainer
References
- classify_pipeline.rs GPU training pattern (ENT-151, ENT-152)
- training-memory-kernel-v1.yaml (VRAM estimation)
Dependencies
Dependency Graph
graph LR
training_gpu_kernel_v1["training-gpu-kernel-v1"] --> training_memory_kernel_v1["training-memory-kernel-v1"]
Equations
gpu_utilization
$$ util = compute_time / (compute_time + transfer_time + sync_time) $$
Domain: $Measured via nvidia-smi dmon or CUDA events$
Codomain: $GPU utilization ratio [0, 1]$
Invariants:
- $util > 0.70 for models >= 350M params with batch_size >= 4$
- $Previous CPU autograd achieved ~0.07 (7%) due to 16K transfers/step$
pcie_transfers_per_step
$$ T = 3 (constant) $$
Transfer 1 (H2D): hidden = S × H × 4 bytes
Transfer 2 (D2H): logits = S × V × 4 bytes
Transfer 3 (H2D): grad_logits = S × V × 4 bytes
Total bytes per step = S × (H + 2V) × 4
Domain: $S: seq_len, H: hidden_size, V: vocab_size $
Codomain: $T = 3: exactly 3 PCIe transfers per training step$
Invariants:
- $Embedding lookup stays on CPU (scatter-gather, not matmul)$
- $Cross-entropy loss + softmax backward stays on CPU$
- $All transformer block forward/backward/optimizer on GPU$
- $RMSNorm forward/backward on GPU$
- $LM head GEMM forward/backward on GPU$
transfer_overhead
$$ overhead_ms = total_bytes / bandwidth $$
For PCIe 4.0 x16: bandwidth = 32 GB/s
For the 350M model (H=1024, V=32K, S=2048): total = 2048 × (1024 + 2×32768) × 4 ≈ 545 MB
overhead ≈ 545 MB / 32 GB/s ≈ 17 ms
Domain: $Architecture params + PCIe bandwidth$
Codomain: $Transfer overhead in milliseconds (theoretical)$
Invariants:
- $Transfer overhead < 5% of compute time for models >= 350M params$
- $GPU compute time dominates for large models$
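The transfer_overhead equation in executable form (PCIe 4.0 x16 bandwidth as the default; theoretical, not a measurement):

```python
def pcie_overhead_ms(S, H, V, bandwidth_gbs=32.0):
    """Per-step transfer time for the 3 fixed transfers:
    hidden (H2D) + logits (D2H) + grad_logits (H2D)."""
    total_bytes = S * (H + 2 * V) * 4
    return total_bytes / (bandwidth_gbs * 1e9) * 1e3
```

Plugging in the 350M configuration (S=2048, H=1024, V=32768) reproduces the ~17 ms figure from the equation above.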
Proof Obligations
| # | Type | Property | Formal |
|---|---|---|---|
| 1 | equivalence | GPU training loss matches CPU training loss | $|loss_gpu(step=N) - loss_cpu(step=N)| < epsilon for all N in [1, 100]$ |
| 2 | invariant | Exactly 3 PCIe transfers per step | $count(H2D) + count(D2H) = 3 per train_step_single() call$ |
| 3 | bound | GPU utilization exceeds 70% | $gpu_util >= 0.70 during training (measured over 100+ steps)$ |
| 4 | invariant | Weight sync preserves values | $sync_weights_to_cpu() => |w_cpu[i] - w_gpu[i]| == 0 for all i$ |
| 5 | invariant | Graceful fallback on CUDA failure | $CudaTransformerTrainer::new() Err => TransformerTrainer used instead$ |
Falsification Tests
| ID | Rule | Prediction | If Fails |
|---|---|---|---|
| FALSIFY-GPU-001 | GPU and CPU training produce equivalent loss | After 10 steps with identical init, |loss_gpu - loss_cpu| < 1e-3 | Numerical divergence in GPU kernels or incorrect gradient flow |
| FALSIFY-GPU-002 | Saved weights differ from init after GPU training | model.safetensors weights != init weights after 10+ steps | Weight sync broken or optimizer not updating GPU weights |
| FALSIFY-GPU-003 | Fallback works when CUDA unavailable | train_from_yaml succeeds with use_cuda=true but no GPU | Fallback path broken or non-CUDA stub missing |
| FALSIFY-GPU-004 | GPU utilization > 70% for 350M model | nvidia-smi dmon shows >70% GPU utilization during training | Unexpected PCIe bottleneck, kernel launch overhead, or memory contention |
QA Gate
GPU-Resident Pretraining Contract (F-GPU-001)
CudaTransformerTrainer correctness and efficiency
Checks: numerical_equivalence, transfer_count_invariant, gpu_utilization_bound, weight_sync_exact, graceful_fallback
Pass criteria: All 4 falsification tests pass
Training Step Budget Contract
Contract: contracts/training-step-budget-v1.yaml
Version: 1.0.0
Status: NEW (ALB-075)
Depends on: training-gpu-kernel-v1, cublas-gemm-v1
Equations
step_time_budget
T_step = T_gemm + T_optimizer + T_embedding + T_pcie + T_elementwise
+ T_cross_entropy + T_stream_sync + T_overhead
Every component maps to exactly one probador brick. Budget violation (> 2x) triggers Jidoka alert.
gemm_throughput
TFLOP_per_step = sum(2 * m * n * k / 1e12 for all ~555 GEMMs)
T_gemm = TFLOP_per_step / achieved_tflops
- PTX baseline: ~2 TFLOP/s
- cuBLAS target: >= 100 TFLOP/s
mfu_definition
MFU = (6 * P * tokens_per_step) / (T_step * peak_flops)
P = 370M, tokens_per_step = 4096
peak_flops(FP16, sustained) = 148 TFLOP/s
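The MFU definition as a one-liner, useful for sanity-checking budget numbers (constants from the equation above):

```python
def mfu(p_params, tokens_per_step, t_step_s, peak_flops=148e12):
    """MFU = 6 * P * tokens_per_step / (T_step * peak_flops)."""
    return (6 * p_params * tokens_per_step) / (t_step_s * peak_flops)
```

At P=370M and 4096 tokens/step, a 1 s step corresponds to roughly 6% MFU against the 148 TFLOP/s sustained FP16 peak; halving the step time doubles MFU.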
Proof Obligations (4)
| ID | Type | Property |
|---|---|---|
| 1 | bound | Brick budgets cover >= 95% of step time |
| 2 | bound | GEMM dominates PTX baseline (> 50%) |
| 3 | bound | cuBLAS reduces GEMM time by >= 5x |
| 4 | bound | MFU improves monotonically across phases |
Falsification Tests (4)
| ID | Rule | Prediction |
|---|---|---|
| FALSIFY-BUDGET-001 | Brick coverage >= 95% | T_step - sum(bricks) < 0.05 * T_step |
| FALSIFY-BUDGET-002 | GEMM is primary bottleneck | T_gemm > 50% of step time |
| FALSIFY-BUDGET-003 | Jidoka gate fires | Injected delay pauses training |
| FALSIFY-BUDGET-004 | Baseline matches estimate | GEMM fraction in [50%, 65%] |
QA Gate
F-BUDGET-001: All 4 falsification tests must pass before optimization phase targets are considered valid.
cuBLAS GEMM Integration Contract
Contract: contracts/cublas-gemm-v1.yaml
Version: 1.0.0
Status: NEW (ALB-075)
Depends on: training-gpu-kernel-v1, training-memory-kernel-v1
Equations
cublas_gemm_correctness
C_cublas = alpha * op(A) * op(B) + beta * C
where op(X) = X if transa=N, X^T if transa=T
A: FP16 [m, k], B: FP16 [k, n], C: FP16 [m, n]
Accumulation: FP32 (CUBLAS_COMPUTE_32F)
- max_abs_diff(C_cublas, C_ptx) < 1e-2 for identical inputs
- cuBLAS uses tensor cores when math mode is TENSOR_OP_MATH
- FP32 accumulation prevents catastrophic cancellation
buffer_size_verification
For cublasGemmEx(m, n, k, A, B, C):
A.len() >= m * k * 2 (FP16)
B.len() >= k * n * 2 (FP16)
C.len() >= m * n * 2 (FP16)
Verified at call site, not inside cuBLAS. Assertion failure = immediate panic.
handle_lifecycle
create: cublasCreate_v2(&handle) -> CUBLAS_STATUS_SUCCESS
bind: cublasSetStream_v2(handle, stream) once per training step
drop: cublasDestroy_v2(handle) exactly once
- One handle per CudaContext (thread-safe within context)
- Stream set ONCE per step, not per GEMM (555 calls = measurable overhead)
- Handle destroyed on Drop (Rust RAII)
ffi_overhead
overhead = T_rust_cublas / T_raw_c_cublas < 1.02
For identical GEMM shape, same GPU, same cuBLAS config. Measured via CUDA events, not wall clock. Warmup: 50 iterations discarded before measurement.
mfu_improvement
MFU = (6 * P * tokens_per_step) / (T_step * peak_flops)
P = 370M, tokens_per_step = 4096
peak_flops(FP16, sustained) = 148 TFLOP/s
- MFU(cublas) > MFU(ptx) (strict improvement)
- MFU(cublas) >= 0.025 (must beat current 2.5% FP32 baseline)
mixed_precision_weight_flow
CPU master weights: FP32 (optimizer operates here)
GPU forward weights: FP16 (cast during upload)
GPU activation gradients: FP16 (cuBLAS backward output)
GPU weight gradients: FP32 (accumulated in FP32 buffer)
CPU gradient download: FP32 (for optimizer update)
- Master weights ALWAYS FP32 on CPU (no precision loss in optimizer)
- C-EMBED-GRAD-001 still holds: activation grad clipped before CPU scatter-add
- C-HYPERPARAMS-001 still holds: all optimizer params from YAML config
Proof Obligations (8)
| ID | Type | Property |
|---|---|---|
| 1 | equivalence | cuBLAS GEMM matches PTX GEMM (max_abs_diff < 1e-2) |
| 2 | invariant | Buffer sizes verified before every cublasGemmEx |
| 3 | invariant | cuBLAS handle lifecycle is RAII |
| 4 | bound | FFI overhead < 2% |
| 5 | bound | MFU improves over baseline |
| 6 | invariant | Training stability preserved (loss.is_finite()) |
| 7 | invariant | Gradient flow preserved (grad != 0 for all params) |
| 8 | invariant | FP32 accumulation enforced (CUBLAS_COMPUTE_32F) |
Falsification Tests (11)
| ID | Rule | Prediction |
|---|---|---|
| FALSIFY-CUBLAS-001 | Forward matches PTX | max_abs_diff(logits) < 1e-2 on 50M |
| FALSIFY-CUBLAS-002 | Training stable 50 steps | Loss finite, within 5% of PTX baseline |
| FALSIFY-CUBLAS-003 | GEMM > 100 TFLOP/s | [4096,1024] x [1024,4096] isolated GEMM |
| FALSIFY-CUBLAS-004 | Step time improves | 350M < 3.0s (vs 4.4s PTX) |
| FALSIFY-CUBLAS-005 | Buffer overflow impossible | Undersized buffer panics, no silent corruption |
| FALSIFY-CUBLAS-006 | All params get gradients | max(|grad|) > 0 for 110 params after 1 step |
| FALSIFY-CUBLAS-007 | C-EMBED-GRAD-001 preserved | Activation grad clipped before CPU scatter-add |
| FALSIFY-CUBLAS-008 | FFI overhead < 2% | T_rust / T_raw_c < 1.02 for all shapes |
| FALSIFY-CUBLAS-009 | Non-GEMM overhead stable | T_non_gemm(cublas) < 1.1 * T_non_gemm(ptx) |
| FALSIFY-CUBLAS-010 | GQA thin-matrix benefits | [4096,256,1024] > 50 TFLOP/s |
| FALSIFY-CUBLAS-011 | Column-major convention | Row-major Rust buffers correct via transpose flags |
Kani Harness
KANI-CUBLAS-001: Buffer size assertion prevents overflow for all valid GEMM shapes (exhaustive, bound=8).
QA Gate
F-CUBLAS-001: All 11 falsification tests must pass before cuBLAS backend replaces PTX for training.
Fused Kernel Optimizations Contract
Contract: contracts/fused-kernels-v1.yaml
Version: 1.0.0
Status: NEW (ALB-075 Phase 4+)
Depends on: cublas-gemm-v1, training-gpu-kernel-v1, training-step-budget-v1
Source: unslothai/unsloth analysis
Equations
fused_cross_entropy
For each row r in logits [B*S, V]:
logsumexp_r = log(sum(exp(logit[r, i])))
loss_r = logsumexp_r - logit[r, label_r]
grad_r[i] = exp(logit[r, i] - logsumexp_r) - delta(i, label_r)
Single kernel pass. FP32 accumulation. Softmax tensor never materialized. Backward grad overwrites logits buffer in-place (zero extra allocation).
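A minimal CPU sketch of the single-pass math (the helper `fused_ce_row` is illustrative, not the trueno-gpu kernel): logsumexp, loss, and gradient in one traversal, with the gradient overwriting the logits buffer in place.

```rust
// Illustrative single-row fused CE: loss out, gradient written over logits.
fn fused_ce_row(logits: &mut [f32], label: usize) -> f32 {
    // Max-subtraction for numerical stability (mirrors FP32 accumulation).
    let m = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let sum_exp: f32 = logits.iter().map(|&x| (x - m).exp()).sum();
    let logsumexp = m + sum_exp.ln();
    let loss = logsumexp - logits[label];
    // grad[i] = softmax(i) - delta(i, label), in place (zero extra allocation).
    for (i, x) in logits.iter_mut().enumerate() {
        *x = (*x - logsumexp).exp() - if i == label { 1.0 } else { 0.0 };
    }
    loss
}

fn main() {
    let mut logits = vec![1.0_f32, 2.0, 3.0];
    let loss = fused_ce_row(&mut logits, 2);
    // Gradient rows sum to zero: sum(softmax) - 1 = 0.
    let gsum: f32 = logits.iter().sum();
    assert!(gsum.abs() < 1e-6);
    assert!(loss > 0.0 && loss.is_finite());
}
```

The softmax tensor never exists as a separate allocation; only the scalar logsumexp is held per row.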
rmsnorm_activation_reuse
Forward: save ONLY inv_var [B*S] (not normed — recompute in backward)
Backward: normed = X_cached * inv_var_saved (bit-exact recompute)
Memory savings: 24 layers * B * S * H * 4 bytes = 384 MB
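The recompute trick can be sketched on one row (assumed shapes; `rms_inv_var` is an illustrative helper): since backward repeats the identical FP32 multiply on the cached input, the recomputed `normed` is bitwise equal to the forward value.

```rust
// Forward saves only the scalar inv_var; backward recomputes normed = x * inv_var.
fn rms_inv_var(x: &[f32], eps: f32) -> f32 {
    let ms = x.iter().map(|&v| v * v).sum::<f32>() / x.len() as f32;
    1.0 / (ms + eps).sqrt()
}

fn main() {
    let x = [0.5_f32, -1.25, 2.0, 0.125];
    let inv_var = rms_inv_var(&x, 1e-6);
    // Forward-time normed values (what we chose NOT to save).
    let normed_fwd: Vec<f32> = x.iter().map(|&v| v * inv_var).collect();
    // Backward-time recompute from (x_cached, inv_var_saved) only.
    let normed_bwd: Vec<f32> = x.iter().map(|&v| v * inv_var).collect();
    assert_eq!(normed_fwd, normed_bwd); // bitwise equal, no tolerance needed
}
```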
swiglu_inplace_backward
d_up = grad_output * silu(gate) → written to up buffer
d_gate = grad_output * up * silu'(gate) → written to gate buffer
gate and up consumed before overwrite. Peak workspace reduced by 128 MB.
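The two in-place gradients can be checked against finite differences in a scalar sketch (assumed scalar case; buffer names mirror the description above):

```rust
fn sigmoid(x: f32) -> f32 { 1.0 / (1.0 + (-x).exp()) }
fn silu(x: f32) -> f32 { x * sigmoid(x) }
// silu'(x) = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
fn silu_prime(x: f32) -> f32 { let s = sigmoid(x); s * (1.0 + x * (1.0 - s)) }

/// In-place backward: d_up overwrites `up`, d_gate overwrites `gate`.
/// Both inputs are read before either buffer is overwritten.
fn swiglu_backward_inplace(gate: &mut f32, up: &mut f32, grad_out: f32) {
    let (g, u) = (*gate, *up);
    *up = grad_out * silu(g);             // d_up  -> up buffer
    *gate = grad_out * u * silu_prime(g); // d_gate -> gate buffer
}

fn main() {
    let (g0, u0, go) = (0.3_f32, -0.7, 1.1);
    let (mut g, mut u) = (g0, u0);
    swiglu_backward_inplace(&mut g, &mut u, go);
    // Finite-difference check of both gradients against f = go * silu(gate) * up.
    let eps = 1e-3_f32;
    let f = |gate: f32, up: f32| go * silu(gate) * up;
    let d_gate_fd = (f(g0 + eps, u0) - f(g0 - eps, u0)) / (2.0 * eps);
    let d_up_fd = (f(g0, u0 + eps) - f(g0, u0 - eps)) / (2.0 * eps);
    assert!((g - d_gate_fd).abs() < 1e-3);
    assert!((u - d_up_fd).abs() < 1e-3);
}
```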
rope_head_grouping
Load sin/cos once per group (G=4 heads)
Apply to all heads in group with single memory load
Q: 4 groups of 4, K: 1 group of 4
Bit-exact with per-head RoPE. ~10% attention speedup from L2 cache reuse.
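Bit-exactness follows because grouping changes only when sin/cos are loaded, never the arithmetic. A sketch with an assumed [heads][pairs*2] layout:

```rust
// RoPE rotates each (even, odd) pair by position-dependent angles.
fn rope_per_head(heads: &mut [Vec<f32>], sin: &[f32], cos: &[f32]) {
    for h in heads.iter_mut() {
        for p in 0..sin.len() {
            let (a, b) = (h[2 * p], h[2 * p + 1]);
            h[2 * p] = a * cos[p] - b * sin[p];
            h[2 * p + 1] = a * sin[p] + b * cos[p];
        }
    }
}

fn rope_grouped(heads: &mut [Vec<f32>], sin: &[f32], cos: &[f32], group: usize) {
    for chunk in heads.chunks_mut(group) {
        // sin/cos conceptually loaded once here, reused across the group.
        rope_per_head(chunk, sin, cos);
    }
}

fn main() {
    let sin = [0.1_f32, 0.2];
    let cos = [0.99_f32, 0.98];
    let mk = || (0..16).map(|h| vec![h as f32, 1.0, -1.0, 0.5]).collect::<Vec<_>>();
    let (mut a, mut b) = (mk(), mk());
    rope_per_head(&mut a, &sin, &cos);
    rope_grouped(&mut b, &sin, &cos, 4); // G=4, as in the contract
    assert_eq!(a, b); // bit-exact
}
```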
fused_tiled_attention
For tile_q, tile_k in tiled [0, S):
scores_tile = Q[tile_q] @ K[tile_k]^T / sqrt(d_k)
Online softmax (Milakov & Gimelshein 2018):
m_new = max(m_old, max(scores_tile))
l_new = l_old * exp(m_old - m_new) + sum(exp(scores_tile - m_new))
O = O * exp(m_old - m_new) + exp(scores_tile - m_new) @ V[tile_k]
O = O / l_new (after the last tile)
Full [S, S] attention matrix never materialized. Memory: O(BHSd_k) instead of O(BHSS). Saves 256 MB per layer.
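A 1-D CPU sketch of the online softmax recurrence (illustrative helper, scalar values standing in for V rows); note the running output must be rescaled by exp(m_old - m_new) whenever the running max changes:

```rust
// Online softmax (Milakov & Gimelshein 2018) over tiles of `scores`,
// accumulating a softmax-weighted sum of `values` without materializing softmax.
fn online_softmax_weighted(scores: &[f32], values: &[f32], tile: usize) -> f32 {
    let (mut m, mut l, mut o) = (f32::NEG_INFINITY, 0.0_f32, 0.0_f32);
    for (s_tile, v_tile) in scores.chunks(tile).zip(values.chunks(tile)) {
        let m_tile = s_tile.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let m_new = m.max(m_tile);
        let scale = if m.is_finite() { (m - m_new).exp() } else { 0.0 };
        l = l * scale + s_tile.iter().map(|&s| (s - m_new).exp()).sum::<f32>();
        o = o * scale
            + s_tile.iter().zip(v_tile).map(|(&s, &v)| (s - m_new).exp() * v).sum::<f32>();
        m = m_new;
    }
    o / l // final normalization
}

fn main() {
    let scores = [0.5_f32, 2.0, -1.0, 3.0, 0.0, 1.5];
    let values = [1.0_f32, -2.0, 0.5, 4.0, 3.0, -1.0];
    // Reference: fully materialized softmax @ values.
    let m = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let z: f32 = scores.iter().map(|&s| (s - m).exp()).sum();
    let reference: f32 = scores.iter().zip(&values)
        .map(|(&s, &v)| (s - m).exp() / z * v).sum();
    let tiled = online_softmax_weighted(&scores, &values, 2);
    assert!((tiled - reference).abs() < 1e-5);
}
```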
chunked_cross_entropy (deferred)
For vocab > 65K: split the logsumexp reduction into chunks of at most 65K elements. Mathematically exact (logsumexp is associative). Current vocab=32K: single chunk, no overhead.
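The exactness claim rests on the chunked logsumexp identity: lse(x) = lse([lse(chunk_1), ..., lse(chunk_n)]) in real arithmetic. A small sketch:

```rust
// Numerically stable logsumexp over a slice.
fn logsumexp(x: &[f32]) -> f32 {
    let m = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    m + x.iter().map(|&v| (v - m).exp()).sum::<f32>().ln()
}

fn main() {
    let logits: Vec<f32> = (0..96).map(|i| (i as f32 * 0.37).sin() * 5.0).collect();
    let full = logsumexp(&logits);
    // Chunked: logsumexp over the per-chunk logsumexp values.
    let per_chunk: Vec<f32> = logits.chunks(32).map(logsumexp).collect();
    let chunked = logsumexp(&per_chunk);
    assert!((full - chunked).abs() < 1e-5);
}
```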
Proof Obligations (10)
| ID | Type | Property |
|---|---|---|
| 1 | equivalence | Fused CE matches separate CE (< 1e-5) |
| 2 | invariant | Fused CE never allocates softmax tensor |
| 3 | equivalence | RMS norm recompute is bit-exact |
| 4 | bound | Activation memory reduced by >= 300 MB |
| 5 | equivalence | SwiGLU in-place backward correct (< 1e-5) |
| 6 | equivalence | RoPE grouped matches individual (bitwise) |
| 7 | equivalence | Fused attention matches separate (< 1e-3) |
| 8 | bound | Fused attention memory < separate / 4 |
| 9 | invariant | Training stability preserved (loss finite) |
| 10 | invariant | Gradient flow preserved (all params) |
Falsification Tests (10)
| ID | Rule | Prediction |
|---|---|---|
| FALSIFY-FUSED-001 | Fused CE matches separate | max_abs_diff(loss) < 1e-5 50 steps |
| FALSIFY-FUSED-002 | RMS norm recompute exact | Bitwise match all 24 layers |
| FALSIFY-FUSED-003 | SwiGLU in-place correct | max_abs_diff(d_gate, d_up) < 1e-5 |
| FALSIFY-FUSED-004 | RoPE grouped matches | Bit-exact 16 Q + 4 K heads |
| FALSIFY-FUSED-005 | Fused attention matches | max_abs_diff < 1e-3 (FP32) |
| FALSIFY-FUSED-006 | Memory savings >= 300 MB | Activation peak reduction measured |
| FALSIFY-FUSED-007 | No full softmax alloc | Peak CE memory < B*S*V*4 |
| FALSIFY-FUSED-008 | Grad checkpoint exact | Bitwise gradient match |
| FALSIFY-FUSED-009 | Fused attn backward OK | All params get grads, loss within 1% |
| FALSIFY-FUSED-010 | No instability | 100 steps, loss finite, gnorm < 100 |
Priority Matrix
| # | Optimization | Gain | Memory | Phase |
|---|---|---|---|---|
| 1 | Fused CE loss | 20-40ms/step | -512 MB bandwidth | 4 |
| 2 | RMS norm reuse | 0 compute | -384 MB | 4 |
| 3 | SwiGLU in-place | 10-20ms/step | -128 MB peak | 4 |
| 4 | RoPE grouping | 5-10ms/step | 0 | 4 |
| 5 | Fused attention | 15% attn speedup | -256 MB/layer | 5 |
| 6 | Chunked CE | future | 0 | Deferred |
| 7 | Grad checkpoint | ~2x backward cost | -66% activations | 7 |
QA Gate
F-FUSED-001: All 10 falsification tests must pass. If combined run shows instability, bisect fusions individually to identify the culprit.
Training Performance Specification
0. Design Principles
This specification follows design by contract (DbC). Every performance
claim, optimization target, and implementation phase begins with a provable
contract (pv validate) that defines equations, invariants, proof obligations,
and falsification tests. Code is written to satisfy the contract — never the
reverse.
Verification stack (sovereign, no external dependencies):
| Layer | Tool | Role |
|---|---|---|
| Contract | pv (provable-contracts) | YAML equations, proof obligations, falsification tests, Kani harnesses |
| Benchmark | Raw C + Criterion + regression | Three-tier: raw C cuBLAS (ceiling) vs Rust cuBLAS vs PTX (floor) |
| Profiling | probador (probar) | Brick budgets, per-component SLA enforcement, Jidoka gates |
| Tracing | renacer (BrickTracer) | Per-kernel/per-block/per-transfer spans, OTLP export, anomaly escalation |
| Measurement | renacer (metrics) | Counter/Gauge/Histogram with SIMD acceleration (trueno) |
Workflow for every optimization phase:
1. pv validate contracts/cublas-gemm-v1.yaml # Contract first
2. pv scaffold contracts/cublas-gemm-v1.yaml # Generate test stubs
3. make bench-gemm-raw # Establish ceiling
4. Implement against contract
5. make bench-gemm-compare # Three-tier benchmark
6. probador brick budgets: verify per-component SLAs # Brick profiling
7. renacer --trace-compute: trace per-kernel timing # Layer tracing
8. pv audit contracts/cublas-gemm-v1.yaml # Binding coverage
9. Dogfood on 350M training run
10. make bench-gemm-regression # No regressions
11. Close gap in §11
1. Current Performance Baseline
1.1 Measured Throughput
| Metric | Value | Config |
|---|---|---|
| Throughput (pre-optimization) | 934 tok/s | 350M, seq=1024, batch=4, RTX 4090 |
| Step time (pre-optimization) | ~4.4s | Same config |
| Throughput (current, Phase 5b) | 7,676 tok/s | Same config (steady state, step 1000) |
| Step time (current, Phase 5b) | 513 ms | Same config (steady state) |
| MFU (current, Phase 5b) | 22.2% | vs FP32 peak (as reported by trainer) |
| VRAM usage | ~11.6 GB / 24 GB | Same config |
| Training loss (v3, step 26K) | 6.61 | v3 run (PID 1975811, codeparrot-clean) |
| Validation loss (v3, step 26K) | 6.91 | val_ppl=1000.3 |
| Loss trajectory (v3) | 10.40 → 6.61 (step 26K) | v3 run (250K steps target) |
| Gradient norm (v3) | 3.04 → 0.13 (step 1K → 26K) | Monotonic decrease |
| Tokens processed (v3) | 108M | 26,400 × 4 × 1024 |
1.2 MFU Analysis
Model FLOPs Utilization (MFU) measures actual compute throughput against hardware theoretical peak. For a transformer forward+backward pass, the standard approximation is 6 x params x tokens_per_step FLOPs.
Model parameters: 370M (24 layers, hidden=1024, intermediate=4096)
Tokens per step: 4 x 1024 = 4,096 tokens
FLOPs per step: 6 x 370M x 4,096 = 9.1 TFLOP
Step time: 4.4s
Achieved FLOP/s: 9.1 TFLOP / 4.4s = 2.07 TFLOP/s
RTX 4090 FP16 peak: 165 TFLOP/s (with tensor cores)
RTX 4090 FP32 peak: 82.6 TFLOP/s (without tensor cores)
MFU (vs FP16 peak): 2.07 / 165 = 1.3%
MFU (vs FP32 peak): 2.07 / 82.6 = 2.5%
Result: MFU = 2.5% vs FP32 peak, 1.3% vs FP16 peak.
1.3 Research Benchmarks for Context
| System | Model Size | Hardware | MFU | Source |
|---|---|---|---|---|
| GPT-3 (OpenAI) | 175B | A100 cluster | 21% | Brown et al. 2020 |
| PaLM (Google) | 540B | TPU v4 | 46-57% | Chowdhery et al. 2022 |
| LLaMA (Meta) | 65B | A100 80GB | 36% | Touvron et al. 2023 |
| Chinchilla (DeepMind) | 70B | TPU v3/v4 | ~40% | Hoffmann et al. 2022 |
| Typical single-GPU PyTorch | 350M | RTX 4090 | 25-35% | Community benchmarks |
| Albor (current) | 370M | RTX 4090 | 2.5% | Measured |
The gap is 10-15x: comparable single-GPU setups extract 25-35% MFU from this hardware at this model size, while Albor currently achieves 2.5%.
1.4 Baseline Profiling Protocol (renacer + probador)
Before any optimization, establish ground truth with brick-level profiling:
# Layer-level tracing: per-kernel timing for one training step
renacer --otlp-endpoint http://localhost:4317 \
--otlp-service-name "albor-baseline" \
--trace-compute \
--trace-compute-threshold 100 \
-- apr train apply --task pretrain \
--config configs/train/pretrain-350m-cuda-test.yaml
# View in Jaeger: http://localhost:16686 -> Service: "albor-baseline"
# Each GEMM kernel, norm kernel, PCIe transfer is a span with duration_us
BrickTracer escalation thresholds for baseline measurement:
#![allow(unused)]
fn main() {
let thresholds = BrickEscalationThresholds::default()
.with_cv(15.0) // Escalate if kernel timing CV > 15%
.with_efficiency(25.0) // Escalate if compute efficiency < 25%
.with_rate_limit(100); // Max 100 traces/second during profiling
}
Brick budget breakdown (probador) — defines the per-component SLA that each optimization phase must improve:
#![allow(unused)]
fn main() {
let step_budget = BrickHouseBuilder::new("training-step")
.budget_ms(4400) // Current step time
.brick("gemm_forward", 1400) // 7 GEMMs x 24 blocks + LM head
.brick("gemm_backward", 1100) // 14 GEMMs x 24 blocks + LM head
.brick("cpu_optimizer", 800) // 24 blocks + LM head + embedding
.brick("cpu_embedding", 200) // Scatter-gather forward + backward
.brick("pcie_transfer", 150) // 3 transfers (H2D embed, D2H logits, H2D grad)
.brick("elementwise_kernel", 100) // RMSNorm, RoPE, SiLU
.brick("cross_entropy", 50) // Fused CE forward + backward
.brick("stream_sync", 50) // ALB-065 synchronization
.brick("overhead", 550) // Scheduling, allocator, host logic
.build()?;
}
Each brick has a Jidoka gate: if any component exceeds its budget by >2x after an optimization, training stops and alerts. This prevents silent regressions.
2. Root Cause Analysis
2.1 The GEMM Bottleneck
A 350M transformer forward+backward step executes 555 GEMM operations:
Per transformer block (24 blocks):
Forward:
- Q projection: GEMM [S, H] x [H, H] (1)
- K projection: GEMM [S, H] x [H, H_kv] (1)
- V projection: GEMM [S, H] x [H, H_kv] (1)
- Attention out: GEMM [S, H] x [H, H] (1)
- FFN gate: GEMM [S, H] x [H, I] (1)
- FFN up: GEMM [S, H] x [H, I] (1)
- FFN down: GEMM [S, I] x [I, H] (1)
Backward (roughly 2x forward):
- dQ, dK, dV, dAttn_out, dGate, dUp, dDown (7)
- Weight gradients for each of the above (7)
Subtotal per block: 7 + 14 = 21 GEMMs
LM head (vocab projection):
Forward: GEMM [S, H] x [H, V] (1)
Backward: GEMM for dInput + dWeight (2)
Subtotal: 3 GEMMs
Embedding (scatter-add, not GEMM): (0)
Total: 24 x 21 + 3 = 507 weight GEMMs
+ attention score GEMMs: 24 x 2 = 48 (QK^T forward + backward)
= 555 GEMM operations per step
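The tally above can be checked mechanically:

```rust
fn main() {
    let blocks = 24;
    let per_block = 7 + 14; // forward GEMMs + backward GEMMs per block
    let lm_head = 3;        // forward + dInput + dWeight
    let weight_gemms = blocks * per_block + lm_head;
    assert_eq!(weight_gemms, 507);
    let attn_score_gemms = blocks * 2; // QK^T forward + backward
    assert_eq!(weight_gemms + attn_score_gemms, 555);
}
```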
2.2 Hand-Written PTX vs Tensor Cores
All GEMMs use hand-written PTX tiled GEMM kernels in trueno-gpu:
- GemmForwardKernel::tiled_unrolled() — FP32 accumulation, no tensor cores
- GemmBackwardAKernel::tiled_unrolled() — input gradient GEMM
- GemmBackwardBKernel::tiled_unrolled() — weight gradient GEMM
These kernels:
- Use scalar FP32 FMA instructions (fma.rn.f32)
- Use small tile sizes (typically 16x16 or 32x32)
- Have no shared-memory double-buffering or software pipelining
- Cannot use tensor cores (which require wmma or mma PTX instructions)
The RTX 4090 (Ada Lovelace, SM 8.9) has 128 FP32 CUDA cores per SM x 128 SMs = 16,384 CUDA cores. But it also has 4th generation tensor cores that deliver 165 TFLOP/s FP16 — 2x the FP32 throughput — and these are completely unused.
2.3 Non-GEMM Overhead
| Component | Approximate Time | Notes |
|---|---|---|
| PCIe transfers (3 per step) | ~50-100ms | H2D embed, D2H logits, H2D grad_logits |
| CPU embedding forward/backward | ~100-200ms | Scatter-gather on CPU, not GPU |
| Per-block optimizer step (CPU) | ~500-800ms | AdamW on CPU for each of 24 blocks |
| RMSNorm, RoPE, SiLU kernels | ~50ms | Small element-wise kernels |
| Fused cross-entropy | ~20ms | Custom PTX kernel |
| Stream synchronization | ~10-50ms | ALB-065: required before D2H |
The per-block CPU optimizer (download gradients -> AdamW on CPU -> upload weights) is the second largest bottleneck after GEMM throughput. ALB-067 disabled per-block gradient clipping due to CPU-side L2 norm cost (864 D2H transfers/step).
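For scale, a minimal AdamW step sketch (hypothetical, not entrenar's optimizer): this per-parameter loop is the work the CPU repeats for each of the 24 blocks after downloading gradients, which is why the optimizer brick costs ~500-800 ms per step.

```rust
// Decoupled weight decay AdamW over one parameter tensor.
fn adamw_step(w: &mut [f32], g: &[f32], m: &mut [f32], v: &mut [f32],
              lr: f32, b1: f32, b2: f32, eps: f32, wd: f32, t: i32) {
    let (bc1, bc2) = (1.0 - b1.powi(t), 1.0 - b2.powi(t)); // bias corrections
    for i in 0..w.len() {
        m[i] = b1 * m[i] + (1.0 - b1) * g[i];
        v[i] = b2 * v[i] + (1.0 - b2) * g[i] * g[i];
        let (m_hat, v_hat) = (m[i] / bc1, v[i] / bc2);
        w[i] -= lr * (m_hat / (v_hat.sqrt() + eps) + wd * w[i]); // decoupled decay
    }
}

fn main() {
    let (mut w, g) = (vec![1.0_f32; 4], vec![0.5_f32; 4]);
    let (mut m, mut v) = (vec![0.0_f32; 4], vec![0.0_f32; 4]);
    adamw_step(&mut w, &g, &mut m, &mut v, 1e-3, 0.9, 0.999, 1e-8, 0.01, 1);
    assert!(w[0] < 1.0); // positive gradient + decay move the weight down
}
```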
2.4 Step Time Breakdown (Estimated)
Total step time: 4,400 ms (100%)
+-- 555 GEMM operations: 2,500 ms ( 57%) <-- PRIMARY BOTTLENECK
+-- CPU optimizer (24x): 800 ms ( 18%) <-- SECONDARY BOTTLENECK
+-- CPU embedding: 200 ms ( 5%)
+-- PCIe transfers: 150 ms ( 3%)
+-- Element-wise kernels: 100 ms ( 2%)
+-- Cross-entropy: 50 ms ( 1%)
+-- Stream sync: 50 ms ( 1%)
+-- Overhead (Python-free): 550 ms ( 13%)
2.5 Confirming the Breakdown: Layer Tracing Protocol
The estimated breakdown in 2.4 must be confirmed with measurement before optimizing. Renacer BrickTracer provides per-brick isolation:
#![allow(unused)]
fn main() {
// In entrenar CudaTransformerTrainer::train_step_single()
let tracer = BrickTracer::new_local();
// Trace each phase as a separate brick
let embed_result = tracer.trace("embed_forward", 200, || {
// CPU scatter-gather embedding lookup
embed_forward(&input_ids, &embed_weight)
});
let h2d_result = tracer.trace("pcie_h2d_hidden", 50, || {
hidden_buf.copy_from_host(&hidden_states)
});
for block_idx in 0..24 {
let fwd_result = tracer.trace(
&format!("block_{}_forward", block_idx), 100, || {
block.forward(&workspace)
}
);
// BrickTracer records: duration_us, budget_us, efficiency, over_budget
}
}
Escalation: When any brick’s CV exceeds 15% (unstable timing) or efficiency drops below 25% (idle GPU), BrickTracer automatically captures full syscall-level traces and exports as OTLP spans. This is the renacer “measurement -> tracing” escalation pattern — lightweight metrics in steady state, detailed tracing only on anomaly.
The confirmed breakdown becomes the contract baseline that optimization phases are proven against.
3. Contracts: Write Before Code
3.1 Contract: cuBLAS GEMM Integration
File: contracts/cublas-gemm-v1.yaml
This contract must be written and validated (pv validate) before any
cuBLAS code is written. It defines the algebraic invariants, numerical bounds,
and falsification tests that the implementation must satisfy.
# contracts/cublas-gemm-v1.yaml
metadata:
version: "1.0.0"
created: "2026-03-05"
author: "PAIML Engineering"
description: "cuBLAS tensor core GEMM integration for training throughput"
references:
- "Micikevicius et al. (2018) Mixed Precision Training"
- "NVIDIA cuBLAS Documentation (CUDA 12.x)"
- "training-gpu-kernel-v1.yaml (parent contract)"
depends_on:
- "training-gpu-kernel-v1"
- "training-memory-kernel-v1"
equations:
cublas_gemm_correctness:
formula: |
C_cublas = alpha * op(A) * op(B) + beta * C
where op(X) = X if transa=N, X^T if transa=T
A: FP16 [m, k], B: FP16 [k, n], C: FP16 [m, n]
Accumulation: FP32 (CUBLAS_COMPUTE_32F)
domain: "FP16 input buffers, FP32 accumulation, FP16 output"
codomain: "C_cublas: FP16 result matrix"
invariants:
- "max_abs_diff(C_cublas, C_ptx) < 1e-2 for identical inputs"
- "cuBLAS uses tensor cores when math mode is TENSOR_OP_MATH"
- "FP32 accumulation prevents catastrophic cancellation"
buffer_size_verification:
formula: |
For cublasGemmEx(m, n, k, A, B, C):
A.len() >= m * k * sizeof(FP16) = m * k * 2
B.len() >= k * n * sizeof(FP16) = k * n * 2
C.len() >= m * n * sizeof(FP16) = m * n * 2
domain: "GpuBuffer lengths in bytes"
codomain: "Boolean: all buffers sufficient"
invariants:
- "Verified at call site, not inside cuBLAS (Rule 2: prove at kernel boundary)"
- "Assertion failure = immediate panic, not silent corruption"
handle_lifecycle:
formula: |
create: cublasCreate_v2(&handle) -> CUBLAS_STATUS_SUCCESS
bind: cublasSetStream_v2(handle, stream) before every GEMM
drop: cublasDestroy_v2(handle) exactly once
invariants:
- "One handle per CudaContext (thread-safe within context)"
- "Stream set before EVERY cublasGemmEx call (C-STREAMSYNC-001 extension)"
- "Handle destroyed on Drop (Rust RAII)"
- "No default stream usage — always explicit non-blocking stream"
mfu_improvement:
formula: |
MFU = achieved_flops / hardware_peak_flops
achieved_flops = 6 * P * tokens_per_step / step_time
P = 370M, tokens_per_step = 4096
hardware_peak_flops(FP16) = 165 TFLOP/s
domain: "Measured step_time after cuBLAS integration"
codomain: "MFU ratio [0, 1]"
invariants:
- "MFU(cublas) > MFU(ptx) (strict improvement)"
- "MFU(cublas) >= 0.025 (must beat current 2.5% FP32 baseline)"
mixed_precision_weight_flow:
formula: |
CPU master weights: FP32 (optimizer operates here)
GPU forward weights: FP16 (cast during upload)
GPU activation gradients: FP16 (cuBLAS backward output)
GPU weight gradients: FP32 (accumulated in FP32 buffer)
CPU gradient download: FP32 (for optimizer update)
invariants:
- "Master weights ALWAYS FP32 on CPU (no precision loss in optimizer)"
- "Weight gradient accumulation in FP32 (no underflow in small gradients)"
- "C-EMBED-GRAD-001 still holds: activation grad clipped before CPU scatter-add"
- "C-HYPERPARAMS-001 still holds: all optimizer params from YAML config"
proof_obligations:
- type: equivalence
property: "cuBLAS GEMM matches PTX GEMM"
formal: "max_abs_diff(C_cublas, C_ptx) < 1e-2 for all GEMM shapes in training"
tolerance: 1e-2
applies_to: cublas_gemm_correctness
- type: invariant
property: "Buffer sizes verified before every cublasGemmEx"
formal: "assert!(buf.len() >= required) precedes every cublasGemmEx call"
tolerance: 0
applies_to: buffer_size_verification
- type: invariant
property: "cuBLAS handle lifecycle is RAII"
formal: "create() in new(), destroy() in Drop, set_stream() before gemm()"
tolerance: 0
applies_to: handle_lifecycle
- type: bound
property: "MFU improves over baseline"
formal: "MFU(cublas, 50 steps) > MFU(ptx, 50 steps)"
applies_to: mfu_improvement
- type: invariant
property: "Training stability preserved"
formal: "loss.is_finite() for all steps in 100-step run"
tolerance: 0
applies_to: training_stability
- type: invariant
property: "Gradient flow preserved"
formal: "max(|grad(param)|) > 0 for all trainable params after 1 step"
tolerance: 0
applies_to: gradient_flow
- type: invariant
property: "FP32 accumulation enforced"
formal: "computeType == CUBLAS_COMPUTE_32F for every cublasGemmEx call"
tolerance: 0
applies_to: cublas_gemm_correctness
falsification_tests:
- id: FALSIFY-CUBLAS-001
rule: "cuBLAS forward matches PTX forward"
prediction: "max_abs_diff(logits_cublas, logits_ptx) < 1e-2 on 50M model"
test: |
Build TransformerConfig::tiny(), forward same input through both backends.
Compare logit tensors element-wise.
if_fails: "cuBLAS transpose convention or leading dimension wrong"
- id: FALSIFY-CUBLAS-002
rule: "cuBLAS training stable for 50 steps"
prediction: "Loss is finite at every step, loss curve within 5% of PTX baseline"
test: |
Train 50M model for 50 steps with cuBLAS backend.
Train same model for 50 steps with PTX backend.
Compare loss at step 50: |loss_cublas - loss_ptx| / loss_ptx < 0.05.
if_fails: "FP16 precision insufficient for this model or gradient accumulation broken"
- id: FALSIFY-CUBLAS-003
rule: "GEMM throughput exceeds 100 TFLOP/s"
prediction: "Isolated GEMM [4096, 1024] x [1024, 4096] > 100 TFLOP/s"
test: |
Run 1000 iterations of cublasGemmEx on [4096, 1024] x [1024, 4096].
Compute FLOP/s = 2 * 4096 * 1024 * 4096 * 1000 / elapsed_seconds.
if_fails: "Tensor cores not engaged, wrong math mode, or memory bandwidth bound"
- id: FALSIFY-CUBLAS-004
rule: "Step time improves over PTX baseline"
prediction: "350M step time < 3.0s with cuBLAS (vs 4.4s with PTX)"
test: |
Run pretrain-350m-cuda-test.yaml for 50 steps with cuBLAS.
Measure median step time. Must be < 3.0s.
if_fails: "GEMM is not the bottleneck or cuBLAS adds unexpected overhead"
- id: FALSIFY-CUBLAS-005
rule: "Buffer overflow impossible"
prediction: "cuBLAS wrapper panics if buffer too small (never silent corruption)"
test: |
Call gemm_f16() with undersized C buffer (m*n*2 - 1 bytes).
Must panic with assertion failure, not proceed to cublasGemmEx.
if_fails: "Buffer verification missing or assertion not checked"
- id: FALSIFY-CUBLAS-006
rule: "All trainable parameters receive gradients"
prediction: "max(|grad|) > 0 for every param after 1 cuBLAS training step"
test: |
Train 50M model for 1 step with cuBLAS. Check gradient of all 110 params.
if_fails: "cuBLAS backward produces zero gradients (wrong transpose or alpha/beta)"
- id: FALSIFY-CUBLAS-007
rule: "C-EMBED-GRAD-001 preserved under cuBLAS"
prediction: "Activation gradient clipped before CPU scatter-add even with cuBLAS"
test: |
Train 24-layer 350M for 1 step with cuBLAS. Verify activation gradient
L2 norm <= max_grad_norm before embedding backward.
if_fails: "cuBLAS backward bypasses activation gradient clipping path"
kani_harnesses:
- id: KANI-CUBLAS-001
obligation: CUBLAS-INV-002
property: "Buffer size assertion prevents overflow for all valid GEMM shapes"
bound: 8
strategy: exhaustive
harness: verify_buffer_assertion_complete
qa_gate:
id: F-CUBLAS-001
name: "cuBLAS GEMM Integration Contract"
description: "Correctness, stability, performance, and safety for cuBLAS tensor core GEMMs"
checks:
- "cublas_gemm_correctness"
- "buffer_size_verification"
- "handle_lifecycle"
- "mfu_improvement"
- "training_stability"
- "gradient_flow"
pass_criteria: "All 7 falsification tests pass"
falsification: "Use wrong transpose to detect GEMM shape errors (ALB-059 class)"
3.2 Contract: Training Step Performance Budget
File: contracts/training-step-budget-v1.yaml
This contract defines the per-brick performance budget that probador enforces.
# contracts/training-step-budget-v1.yaml
metadata:
version: "1.0.0"
created: "2026-03-05"
author: "PAIML Engineering"
description: "Training step performance budget — brick-level SLAs with Jidoka gates"
references:
- "training-gpu-kernel-v1.yaml"
- "ALB-067: CPU-side gradient clipping bottleneck"
depends_on:
- "training-gpu-kernel-v1"
- "cublas-gemm-v1"
equations:
step_time_budget:
formula: |
T_step = T_gemm + T_optimizer + T_embedding + T_pcie + T_elementwise
+ T_cross_entropy + T_stream_sync + T_overhead
domain: "Per-component timing measured by renacer BrickTracer"
codomain: "T_step: total step time in milliseconds"
invariants:
- "T_step is sum of brick times (no unaccounted gaps > 5% of total)"
- "Every component maps to exactly one probador brick"
- "Brick budget violation triggers Jidoka alert (training pause)"
gemm_throughput:
formula: |
TFLOP_per_gemm(m, n, k) = 2 * m * n * k / 1e12
TFLOP_per_step = sum(TFLOP_per_gemm for all 555 GEMMs)
T_gemm = TFLOP_per_step / achieved_tflops
invariants:
- "PTX baseline: achieved_tflops ~= 2 TFLOP/s (FP32 scalar)"
- "cuBLAS target: achieved_tflops >= 100 TFLOP/s (FP16 tensor core)"
mfu_definition:
formula: |
MFU = (6 * P * tokens_per_step) / (T_step * peak_flops)
P = 370M, tokens_per_step = batch * seq_len = 4096
peak_flops(FP16) = 165 TFLOP/s, peak_flops(FP32) = 82.6 TFLOP/s
invariants:
- "MFU is measured over >= 50 steps (warm cache, excluding first 5)"
- "Report both FP16 and FP32 MFU for clarity"
proof_obligations:
- type: bound
property: "Brick budgets account for full step time"
formal: "sum(brick_budgets) >= 0.95 * T_step_measured"
applies_to: step_time_budget
- type: bound
property: "GEMM brick dominates baseline"
formal: "T_gemm / T_step > 0.50 in PTX baseline"
applies_to: gemm_throughput
- type: bound
property: "cuBLAS reduces GEMM brick time by >= 5x"
formal: "T_gemm(cublas) < T_gemm(ptx) / 5"
applies_to: gemm_throughput
- type: bound
property: "MFU improves monotonically across phases"
formal: "MFU(phase_N+1) > MFU(phase_N) for each optimization phase"
applies_to: mfu_definition
falsification_tests:
- id: FALSIFY-BUDGET-001
rule: "Brick budgets cover >= 95% of step time"
prediction: "T_step - sum(bricks) < 0.05 * T_step"
test: |
Run 50-step profiling with BrickTracer on 350M model.
Sum all brick durations. Compare to total step time.
if_fails: "Unaccounted overhead — missing brick or hidden synchronization"
- id: FALSIFY-BUDGET-002
rule: "GEMM is the primary bottleneck in PTX baseline"
prediction: "T_gemm > 50% of T_step in PTX mode"
test: |
Profile 50 steps with PTX backend, isolate GEMM brick time.
if_fails: "Bottleneck is elsewhere — revisit optimization target"
- id: FALSIFY-BUDGET-003
rule: "Jidoka gate fires on 2x budget violation"
prediction: "If T_gemm > 2 * budget_gemm, training pauses with alert"
test: |
Inject artificial 10s delay in GEMM kernel. Verify Jidoka gate
fires and training loop emits Andon alert.
if_fails: "Budget enforcement not wired into training loop"
qa_gate:
id: F-BUDGET-001
name: "Training Step Performance Budget Contract"
checks:
- "brick_coverage"
- "gemm_dominance"
- "jidoka_enforcement"
pass_criteria: "All 3 falsification tests pass"
3.3 Contract Validation Workflow
# Validate both contracts before writing any code
pv validate contracts/cublas-gemm-v1.yaml
pv validate contracts/training-step-budget-v1.yaml
# Generate test scaffolding
pv scaffold contracts/cublas-gemm-v1.yaml -o trueno-gpu/tests/
pv scaffold contracts/training-step-budget-v1.yaml -o entrenar/tests/
# After implementation: audit binding coverage
pv audit contracts/cublas-gemm-v1.yaml \
--binding contracts/trueno-gpu/cublas-binding.yaml
# After dogfooding: close gaps
pv audit contracts/training-step-budget-v1.yaml \
--binding contracts/entrenar/step-budget-binding.yaml
4. cuBLAS Integration Plan
4.1 Why cuBLAS
cuBLAS is NVIDIA’s production GEMM library. It:
- Uses tensor cores automatically (FP16 input -> FP32 accumulate -> FP16 output)
- Has auto-tuned kernels for every GPU architecture since Volta
- Handles tiling, shared memory staging, warp scheduling, and epilogue fusion
- Delivers 80-95% of theoretical peak on large matrices
For the Albor GEMM shapes ([4096, 1024] x [1024, 4096] etc.), cuBLAS will
use tensor cores, achieving 130-150 TFLOP/s on RTX 4090 vs the current
~2 TFLOP/s from scalar PTX.
4.2 Architecture
The integration lives in trueno-gpu (the CUDA backend crate), adding three new source files:
trueno-gpu/
+-- src/
+-- cublas_sys.rs # Raw FFI bindings (unsafe extern "C")
+-- cublas.rs # Safe Rust wrapper (CublasHandle, GemmConfig)
+-- gemm.rs # Existing hand-written PTX kernels
+-- ...
4.2.1 cublas_sys.rs — FFI Bindings (~200 lines)
Minimal bindings for the subset of cuBLAS used by training:
#![allow(unused)]
fn main() {
// Core types
type cublasHandle_t = *mut std::ffi::c_void;
#[repr(C)]
enum cublasOperation_t {
CUBLAS_OP_N = 0, // No transpose
CUBLAS_OP_T = 1, // Transpose
}
#[repr(C)]
enum cublasStatus_t {
CUBLAS_STATUS_SUCCESS = 0,
// ... error codes
}
// Core functions
extern "C" {
fn cublasCreate_v2(handle: *mut cublasHandle_t) -> cublasStatus_t;
fn cublasDestroy_v2(handle: cublasHandle_t) -> cublasStatus_t;
fn cublasSetStream_v2(handle: cublasHandle_t, stream: CUstream) -> cublasStatus_t;
fn cublasSetMathMode(handle: cublasHandle_t, mode: cublasMath_t) -> cublasStatus_t;
// The workhorse: C = alpha * op(A) * op(B) + beta * C
fn cublasGemmEx(
handle: cublasHandle_t,
transa: cublasOperation_t,
transb: cublasOperation_t,
m: i32, n: i32, k: i32,
alpha: *const f32,
A: *const std::ffi::c_void, Atype: cudaDataType,
lda: i32,
B: *const std::ffi::c_void, Btype: cudaDataType,
ldb: i32,
beta: *const f32,
C: *mut std::ffi::c_void, Ctype: cudaDataType,
ldc: i32,
computeType: cublasComputeType_t,
algo: cublasGemmAlgo_t,
) -> cublasStatus_t;
}
}
Link against libcublas.so (ships with CUDA toolkit, already installed for
trueno’s PTX compilation):
# trueno-gpu/build.rs
println!("cargo:rustc-link-lib=cublas");
println!("cargo:rustc-link-search=/usr/local/cuda/lib64");
4.2.2 cublas.rs — Safe Wrapper (~300 lines)
#![allow(unused)]
fn main() {
pub struct CublasHandle {
handle: cublasHandle_t,
}
impl CublasHandle {
pub fn new() -> Result<Self, CublasError> { ... }
pub fn set_stream(&self, stream: &CudaStream) -> Result<(), CublasError> { ... }
/// C = alpha * A x B + beta * C
/// A: [m, k], B: [k, n], C: [m, n]
/// Uses FP16 tensor cores with FP32 accumulation
pub fn gemm_f16(
&self,
m: usize, n: usize, k: usize,
alpha: f32,
a: &GpuBuffer, // FP16 [m, k]
b: &GpuBuffer, // FP16 [k, n]
beta: f32,
c: &mut GpuBuffer, // FP16 [m, n]
) -> Result<(), CublasError> {
// C-CUBLAS-003: Buffer sizes verified at kernel boundary (Rule 2)
assert!(a.len() >= m * k * 2, "A buffer too small");
assert!(b.len() >= k * n * 2, "B buffer too small");
assert!(c.len() >= m * n * 2, "C buffer too small");
// cuBLAS is column-major; row-major Rust buffers require swapped operands
// or transpose flags here (FALSIFY-CUBLAS-011 covers this convention).
unsafe {
check_status(cublasGemmEx(
self.handle,
CUBLAS_OP_N, CUBLAS_OP_N,
m as i32, n as i32, k as i32,
&alpha,
a.ptr(), CUDA_R_16F, m as i32,
b.ptr(), CUDA_R_16F, k as i32,
&beta,
c.mut_ptr(), CUDA_R_16F, m as i32,
CUBLAS_COMPUTE_32F, // C-CUBLAS-004: FP32 accumulation
CUBLAS_GEMM_DEFAULT_TENSOR_OP,
))
}
}
}
impl Drop for CublasHandle {
fn drop(&mut self) {
unsafe { cublasDestroy_v2(self.handle); }
}
}
}
4.2.3 GEMM Kernel Variant — cuBLAS Backend
The existing GemmForwardKernel, GemmBackwardAKernel, GemmBackwardBKernel
in trueno-gpu get a new variant that dispatches to cuBLAS instead of launching
PTX. The selection is compile-time (feature flag cublas) or runtime
(environment variable TRUENO_GEMM_BACKEND=cublas|ptx).
#![allow(unused)]
fn main() {
pub enum GemmBackend {
Ptx, // Existing hand-written PTX (fallback, reference implementation)
Cublas, // cuBLAS tensor core path (default when available)
}
}
4.3 Weight Storage Format Change
cuBLAS tensor core GEMMs require FP16 inputs for maximum throughput. Currently all weights are stored as FP32 on GPU. The integration requires:
- Weight upload: Cast FP32 CPU weights to FP16 during H2D transfer
- Gradient download: Keep FP32 for gradient accumulation and optimizer
- Master weights: FP32 copy on CPU (already exists — CPU AdamW operates on FP32)
- GPU weights: FP16 for forward/backward GEMMs
This is standard mixed-precision training (Micikevicius et al. 2018):
- Forward pass: FP16 weights x FP16 activations -> FP16 output
- Backward pass: FP16 weights x FP16 grad_output -> FP32 weight gradient
- Optimizer: FP32 master weights updated with FP32 gradients
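Why the FP32 master copy matters can be shown with a fixed-point stand-in for FP16's ~2^-10 relative precision near 1.0 (illustrative `to_half_like`, not a real FP16 cast): tiny optimizer updates vanish if accumulated directly in low precision.

```rust
// Emulate half-precision storage near 1.0 with a 2^-10 quantization grid.
fn to_half_like(x: f32) -> f32 { (x * 1024.0).round() / 1024.0 }

fn main() {
    let update = 1e-5_f32; // a typical small lr * grad step
    let mut naive_half = 1.0_f32; // accumulate directly in low precision
    let mut master = 1.0_f32;     // FP32 master weight on CPU
    for _ in 0..100 {
        naive_half = to_half_like(naive_half + update); // rounds back: update lost
        master += update;                               // accumulates correctly
    }
    assert_eq!(naive_half, 1.0);            // all 100 updates vanished
    assert!((master - 1.001).abs() < 1e-4); // FP32 master kept them
    // The GPU still sees half-like weights, cast down from the master copy.
    let gpu_weight = to_half_like(master);
    assert!(gpu_weight > 1.0);
}
```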
4.4 Estimated Code Size
| Component | Lines | Complexity |
|---|---|---|
| cublas_sys.rs (FFI) | ~200 | Mechanical translation from CUDA headers |
| cublas.rs (safe wrapper) | ~300 | Error handling, buffer validation, Drop |
| GEMM kernel variant | ~150 | Dispatch logic, FP16 buffer management |
| FP16 weight casting | ~100 | H2D cast kernel or CPU-side conversion |
| Tests | ~200 | Correctness vs PTX reference, perf benchmarks |
| Total | ~950 | Pure Rust, no bindgen dependency |
5. Benchmark Infrastructure (Raw C cuBLAS Ceiling)
5.1 Design: Three-Tier GEMM Benchmark
Following trueno’s established pattern — where raw NumPy/ndarray are the reference ceiling and Rust SIMD is measured against them — the cuBLAS integration uses raw C cuBLAS as the ceiling:
Tier 1 (CEILING): Raw C cuBLAS — bare cublasGemmEx(), no Rust, no wrapper
Tier 2 (TARGET): Rust cuBLAS — CublasHandle::gemm_f16() safe wrapper
Tier 3 (FLOOR): Rust PTX — GemmForwardKernel::tiled_unrolled()
FFI overhead = Tier 2 / Tier 1 (must be < 1.02x, i.e. < 2% overhead)
Speedup = Tier 3 / Tier 2 (expect 10-50x for tensor core vs scalar)
Efficiency = Tier 2 / peak (target > 60% of 165 TFLOP/s = 99 TFLOP/s)
The raw C benchmark is the truth. If Tier 2 is slow, the problem is in the Rust wrapper. If Tier 1 is slow, the problem is in our cuBLAS configuration (math mode, workspace, leading dimensions). This separation is critical for root-cause analysis.
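The three derived ratios are plain arithmetic over the tier timings; a sketch with purely illustrative sample numbers:

```rust
fn main() {
    // Hypothetical per-GEMM timings (microseconds) for one shape.
    let t_raw_c = 100.0_f64; // Tier 1: raw C cuBLAS ceiling
    let t_rust = 101.5;      // Tier 2: Rust safe wrapper
    let t_ptx = 4000.0;      // Tier 3: scalar PTX floor
    let ffi_overhead = t_rust / t_raw_c;
    assert!(ffi_overhead < 1.02); // < 2% wrapper overhead
    let speedup = t_ptx / t_rust;
    assert!(speedup > 10.0); // tensor core vs scalar, 10-50x expected
    // Efficiency target: > 60% of the 165 TFLOP/s FP16 peak.
    let target_tflops = 0.60 * 165.0;
    assert!((target_tflops - 99.0_f64).abs() < 1e-9);
}
```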
5.2 Raw C cuBLAS Benchmark
File: trueno-gpu/benchmarks/gemm_cublas_raw.c
A standalone C program that links directly against libcublas and measures isolated GEMM throughput with CUDA events (not wall clock). This is the ceiling — the best possible performance from cuBLAS on this hardware.
// trueno-gpu/benchmarks/gemm_cublas_raw.c
// Compile: nvcc -O3 -lcublas -lcuda -o gemm_cublas_raw gemm_cublas_raw.c
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int m, n, k;
    const char* label;
} GemmShape;

// Albor training shapes (exact shapes from 350M forward+backward)
static const GemmShape SHAPES[] = {
    {4096, 1024, 1024, "attn_qkv"},     // Q/K/V projection (S=4096, H=1024)
    {4096, 4096, 1024, "ffn_gate_up"},  // FFN gate/up (S=4096, I=4096)
    {4096, 1024, 4096, "ffn_down"},     // FFN down projection
    {4096, 32768, 1024, "lm_head"},     // LM head (S=4096, V=32768)
    {1024, 1024, 1024, "square_1k"},    // Square matrix reference
    {4096, 4096, 4096, "square_4k"},    // Square matrix reference
};
#define NUM_SHAPES (sizeof(SHAPES) / sizeof(SHAPES[0]))

double benchmark_gemm(cublasHandle_t handle, int m, int n, int k,
                      int warmup, int iterations) {
    // Allocate FP16 device buffers
    half *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, (size_t)m * k * sizeof(half));
    cudaMalloc((void**)&d_B, (size_t)k * n * sizeof(half));
    cudaMalloc((void**)&d_C, (size_t)m * n * sizeof(half));
    // Initialize with random data (via curand or host fill)
    // ... (omitted for brevity)
    float alpha = 1.0f, beta = 0.0f;
    // Warmup
    for (int i = 0; i < warmup; i++) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     m, n, k, &alpha,
                     d_A, CUDA_R_16F, m,
                     d_B, CUDA_R_16F, k,
                     &beta,
                     d_C, CUDA_R_16F, m,
                     CUBLAS_COMPUTE_32F,
                     CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    }
    cudaDeviceSynchronize();
    // Timed iterations with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iterations; i++) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     m, n, k, &alpha,
                     d_A, CUDA_R_16F, m,
                     d_B, CUDA_R_16F, k,
                     &beta,
                     d_C, CUDA_R_16F, m,
                     CUBLAS_COMPUTE_32F,
                     CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float elapsed_ms;
    cudaEventElapsedTime(&elapsed_ms, start, stop);
    double elapsed_s = elapsed_ms / 1000.0;
    double flops = 2.0 * m * n * k * (double)iterations;
    double tflops = flops / elapsed_s / 1e12;
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return tflops;
}

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
    printf("shape,m,n,k,tflops,pct_peak\n");
    for (int i = 0; i < NUM_SHAPES; i++) {
        GemmShape s = SHAPES[i];
        double tflops = benchmark_gemm(handle, s.m, s.n, s.k, 50, 1000);
        printf("%s,%d,%d,%d,%.2f,%.1f%%\n",
               s.label, s.m, s.n, s.k, tflops, tflops / 165.0 * 100.0);
    }
    cublasDestroy(handle);
    return 0;
}
Build and run:
cd trueno-gpu/benchmarks
nvcc -O3 -lcublas -lcuda -o gemm_cublas_raw gemm_cublas_raw.c
./gemm_cublas_raw > raw_cublas_baseline.csv
Expected output (RTX 4090):
shape,m,n,k,tflops,pct_peak
attn_qkv,4096,1024,1024,128.50,77.9%
ffn_gate_up,4096,4096,1024,142.30,86.2%
ffn_down,4096,1024,4096,139.80,84.7%
lm_head,4096,32768,1024,148.20,89.8%
square_1k,1024,1024,1024,85.40,51.8%
square_4k,4096,4096,4096,152.60,92.5%
This CSV becomes the performance ceiling that the Rust wrapper is measured
against. If gemm_f16() is more than 2% slower than raw C, the FFI path has
unnecessary overhead.
5.3 Criterion Benchmark (Rust: cuBLAS vs PTX)
File: trueno-gpu/benches/gemm_comparison.rs
Follows the exact pattern from trueno/benches/gpu_ops/matrix_benches.rs —
Criterion groups with multiple backends in the same benchmark group:
// trueno-gpu/benches/gemm_comparison.rs
use criterion::{
    criterion_group, criterion_main,
    BenchmarkId, Criterion, Throughput,
};
// NOTE: item paths assumed; adjust to the actual trueno-gpu module layout.
use trueno_gpu::{CublasHandle, CudaContext, CudaStream, GemmForwardKernel, GpuBuffer};

/// Albor training shapes — exact dimensions from 350M forward/backward
const SHAPES: &[(usize, usize, usize, &str)] = &[
    (4096, 1024, 1024, "attn_qkv"),
    (4096, 4096, 1024, "ffn_gate_up"),
    (4096, 1024, 4096, "ffn_down"),
    (4096, 32768, 1024, "lm_head"),
    (1024, 1024, 1024, "square_1k"),
    (4096, 4096, 4096, "square_4k"),
];

fn bench_gemm_backends(c: &mut Criterion) {
    let mut group = c.benchmark_group("gemm");
    for &(m, n, k, label) in SHAPES {
        let flops = (2 * m * n * k) as u64;
        group.throughput(Throughput::Elements(flops));
        // Tier 2: Rust cuBLAS wrapper
        group.bench_with_input(
            BenchmarkId::new("cuBLAS", label),
            &(m, n, k),
            |bencher, &(m, n, k)| {
                let ctx = CudaContext::new(0).unwrap();
                let stream = CudaStream::new(&ctx).unwrap();
                let handle = CublasHandle::new().unwrap();
                handle.set_stream(&stream).unwrap();
                let a = GpuBuffer::random_f16(&ctx, m * k);
                let b = GpuBuffer::random_f16(&ctx, k * n);
                let mut c_buf = GpuBuffer::zeros_f16(&ctx, m * n);
                bencher.iter(|| {
                    handle.gemm_f16(m, n, k, 1.0, &a, &b, 0.0, &mut c_buf)
                        .unwrap();
                    stream.synchronize().unwrap();
                });
            },
        );
        // Tier 3: Rust PTX hand-written kernel
        group.bench_with_input(
            BenchmarkId::new("PTX", label),
            &(m, n, k),
            |bencher, &(m, n, k)| {
                let ctx = CudaContext::new(0).unwrap();
                let stream = CudaStream::new(&ctx).unwrap();
                let a = GpuBuffer::random_f32(&ctx, m * k);
                let b = GpuBuffer::random_f32(&ctx, k * n);
                let mut c_buf = GpuBuffer::zeros_f32(&ctx, m * n);
                let kernel = GemmForwardKernel::tiled_unrolled(m, n, k, 16);
                bencher.iter(|| {
                    kernel.launch(&stream, &a, &b, &mut c_buf).unwrap();
                    stream.synchronize().unwrap();
                });
            },
        );
    }
    group.finish();
}

criterion_group!(benches, bench_gemm_backends);
criterion_main!(benches);
Cargo.toml:
[[bench]]
name = "gemm_comparison"
path = "benches/gemm_comparison.rs"
harness = false
required-features = ["gpu", "cublas"]
Run:
cd ~/src/trueno && cargo bench --bench gemm_comparison --features "gpu,cublas"
5.4 Cross-Framework Comparison Script
File: trueno-gpu/benchmarks/gemm_comparison.py
Follows trueno/benchmarks/matmul_comparison.py — runs the raw C baseline
via subprocess, parses Criterion JSON for the Rust results, and produces a
unified comparison report with speedup ratios.
#!/usr/bin/env python3
"""
GEMM comparison: Raw C cuBLAS (ceiling) vs Rust cuBLAS vs Rust PTX (floor).
Follows trueno/benchmarks/matmul_comparison.py pattern.
"""
import json
import subprocess
from pathlib import Path

SHAPES = [
    ("attn_qkv", 4096, 1024, 1024),
    ("ffn_gate_up", 4096, 4096, 1024),
    ("ffn_down", 4096, 1024, 4096),
    ("lm_head", 4096, 32768, 1024),
    ("square_1k", 1024, 1024, 1024),
    ("square_4k", 4096, 4096, 4096),
]


def run_raw_c_baseline():
    """Tier 1: Raw C cuBLAS (the ceiling)."""
    result = subprocess.run(
        ["./gemm_cublas_raw"],
        capture_output=True, text=True,
        cwd=Path(__file__).parent, timeout=300,
    )
    baselines = {}
    for line in result.stdout.strip().split("\n")[1:]:  # Skip CSV header
        parts = line.split(",")
        label, tflops = parts[0], float(parts[4])
        baselines[label] = tflops
    return baselines


def load_criterion_results():
    """Tier 2 + 3: Parse Criterion JSON from target/criterion/."""
    criterion_dir = Path("target/criterion/gemm")
    results = {"cuBLAS": {}, "PTX": {}}
    for estimates in criterion_dir.rglob("estimates.json"):
        with open(estimates) as f:
            data = json.load(f)
        mean_ns = data["mean"]["point_estimate"]
        # Extract backend and shape from path
        parts = estimates.parts
        backend = parts[-4]  # "cuBLAS" or "PTX"
        shape = parts[-3]    # "attn_qkv", etc.
        results[backend][shape] = mean_ns
    return results


def compute_tflops(shape_label, time_ns):
    """Convert mean time to TFLOP/s."""
    for label, m, n, k in SHAPES:
        if label == shape_label:
            flops = 2.0 * m * n * k
            return flops / (time_ns * 1e-9) / 1e12
    return 0.0


def main():
    raw_c = run_raw_c_baseline()
    criterion = load_criterion_results()
    print("=" * 78)
    print("GEMM BENCHMARK: Raw C cuBLAS (ceiling) vs Rust cuBLAS vs PTX (floor)")
    print("=" * 78)
    print()
    print(f"{'Shape':<14} {'Raw C':>10} {'Rust cuBLAS':>12} {'PTX':>10} "
          f"{'FFI OH':>8} {'Speedup':>8} {'% Peak':>8}")
    print("-" * 78)
    for label, m, n, k in SHAPES:
        raw_tflops = raw_c.get(label, 0)
        cublas_ns = criterion["cuBLAS"].get(label)
        cublas_tflops = compute_tflops(label, cublas_ns) if cublas_ns else 0
        ptx_ns = criterion["PTX"].get(label)
        ptx_tflops = compute_tflops(label, ptx_ns) if ptx_ns else 0
        # FFI OH is the time ratio T_rust / T_raw_c (guard against missing data)
        ffi_overhead = raw_tflops / cublas_tflops if cublas_tflops > 0 else 0
        speedup = cublas_tflops / ptx_tflops if ptx_tflops > 0 else 0
        pct_peak = cublas_tflops / 165.0 * 100
        print(f"{label:<14} {raw_tflops:>8.1f}T {cublas_tflops:>10.1f}T "
              f"{ptx_tflops:>8.1f}T {ffi_overhead:>7.3f}x {speedup:>7.1f}x "
              f"{pct_peak:>6.1f}%")
    print()
    print("FFI OH = Raw C / Rust cuBLAS (< 1.02x = good)")
    print("Speedup = Rust cuBLAS / PTX")
    print("% Peak = Rust cuBLAS / 165 TFLOP/s (RTX 4090 FP16)")


if __name__ == "__main__":
    main()
Expected report:
==============================================================================
GEMM BENCHMARK: Raw C cuBLAS (ceiling) vs Rust cuBLAS vs PTX (floor)
==============================================================================
Shape Raw C Rust cuBLAS PTX FFI OH Speedup % Peak
------------------------------------------------------------------------------
attn_qkv 128.5T 127.8T 2.1T 1.005x 60.9x 77.5%
ffn_gate_up 142.3T 141.5T 2.3T 1.006x 61.5x 85.8%
ffn_down 139.8T 138.9T 2.2T 1.006x 63.1x 84.2%
lm_head 148.2T 147.1T 1.9T 1.007x 77.4x 89.2%
square_1k 85.4T 84.8T 1.5T 1.007x 56.5x 51.4%
square_4k 152.6T 151.8T 2.5T 1.005x 60.7x 92.0%
FFI OH = Raw C / Rust cuBLAS (< 1.02x = good)
Speedup = Rust cuBLAS / PTX
% Peak = Rust cuBLAS / 165 TFLOP/s (RTX 4090 FP16)
5.5 Regression Detection
File: trueno-gpu/benchmarks/check_gemm_regression.py
Follows trueno/scripts/check_regression.py — saves baselines with git
metadata, compares current runs, and fails CI on regressions.
Thresholds (adapted for GPU benchmarks which have higher variance):
| Change | Classification | Action |
|---|---|---|
| > 10% slower | REGRESSION | CI fails, blocks merge |
| 5-10% slower | WARNING | Flag in report |
| Within 5% | UNCHANGED | Pass |
| > 5% faster | IMPROVEMENT | Report |
Baseline capture:
# Save baseline with hardware metadata
cd trueno-gpu
./benchmarks/save_gemm_baseline.sh
# Saves to .performance-baselines/gemm-baseline-current.csv
# Header: commit, branch, date, GPU (nvidia-smi), CUDA version, driver version
Regression check:
# Compare current run against baseline
./benchmarks/check_gemm_regression.py \
--baseline .performance-baselines/gemm-baseline-current.csv \
--current /tmp/gemm-bench-current.csv \
--regression-threshold 0.10 \
--warning-threshold 0.05
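The threshold table above maps to a simple classification over percent change; a sketch of that logic (the real check_gemm_regression.py may differ in detail):

```python
# Regression classification per the thresholds table (sketch, not the
# actual check_gemm_regression.py). Inputs are TFLOP/s, higher is better.
def classify(baseline_tflops, current_tflops, regression=0.10, warning=0.05):
    delta = (current_tflops - baseline_tflops) / baseline_tflops
    if delta < -regression:
        return "REGRESSION"   # CI fails, blocks merge
    if delta < -warning:
        return "WARNING"      # flag in report
    if delta > warning:
        return "IMPROVEMENT"  # report
    return "UNCHANGED"        # pass

print(classify(148.2, 130.0))  # >10% slower
print(classify(148.2, 145.0))  # within 5%
```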
5.6 Makefile Targets
Following trueno’s Makefile convention:
# trueno-gpu/Makefile (new targets)
bench-gemm: ## Full GEMM benchmark (cuBLAS vs PTX)
	cargo bench --bench gemm_comparison --features "gpu,cublas"

bench-gemm-raw: ## Raw C cuBLAS ceiling benchmark
	cd benchmarks && nvcc -O3 -lcublas -lcuda -o gemm_cublas_raw gemm_cublas_raw.c
	cd benchmarks && ./gemm_cublas_raw

bench-gemm-compare: ## Three-tier comparison report
	$(MAKE) bench-gemm-raw
	$(MAKE) bench-gemm
	cd benchmarks && python3 gemm_comparison.py

bench-gemm-baseline: ## Save current results as baseline
	$(MAKE) bench-gemm-compare
	./benchmarks/save_gemm_baseline.sh

bench-gemm-regression: ## Check for regressions against baseline
	$(MAKE) bench-gemm-compare
	./benchmarks/check_gemm_regression.py \
		--baseline .performance-baselines/gemm-baseline-current.csv \
		--current /tmp/gemm-bench-current.csv
5.7 Contract Integration
The benchmark infrastructure maps directly to contract obligations:
| Benchmark Tier | Contract Obligation | Pass Criterion |
|---|---|---|
| Raw C ceiling | (reference only) | Establishes hardware peak per shape |
| Rust cuBLAS vs Raw C | C-CUBLAS-FFI-001 | FFI overhead < 2% per shape |
| Rust cuBLAS vs PTX | FALSIFY-CUBLAS-003 | cuBLAS TFLOP/s > 100 on training shapes |
| Rust cuBLAS % peak | FALSIFY-CUBLAS-003 | > 60% of 165 TFLOP/s on Albor shapes |
| Regression check | FALSIFY-BUDGET-003 | No shape regresses > 10% from baseline |
Add to cublas-gemm-v1.yaml:
ffi_overhead:
  formula: |
    overhead = T_rust_cublas / T_raw_c_cublas
    For identical GEMM shape, same GPU, same cuBLAS config.
  invariants:
    - "overhead < 1.02 for all training shapes (< 2% FFI tax)"
    - "Measured via CUDA events, not wall clock"
    - "Warmup: 50 iterations discarded before measurement"

# Additional falsification test:
- id: FALSIFY-CUBLAS-008
  rule: "Rust cuBLAS FFI overhead < 2%"
  prediction: "T_rust / T_raw_c < 1.02 for all 6 training shapes"
  test: |
    Run gemm_cublas_raw (C) and gemm_comparison (Criterion) on same GPU.
    Compare TFLOP/s for each shape. Ratio must be > 0.98.
  if_fails: "Unnecessary copies, redundant stream syncs, or Rust allocation overhead in wrapper"
6. Implementation Phases (Contract-Driven)
Every phase follows the same discipline:
pv validate -> implement -> probador verify -> renacer trace -> pv audit -> bench-gemm-compare (three-tier)
Phase 0: Baseline Measurement
Contract: training-step-budget-v1.yaml
Tool: renacer BrickTracer + probador brick budgets + raw C cuBLAS ceiling
- Run raw C cuBLAS benchmark to establish the hardware ceiling per shape
- Instrument `train_step_single()` with BrickTracer spans for every component
- Run 50-step profiling on 350M with PTX backend
- Confirm step time breakdown matches estimates in section 2.4
- Establish brick budgets as probador assertions
- Save baselines: `make bench-gemm-baseline`
- This becomes the floor + ceiling that all phases are measured against
Renacer layer tracing output (per-block detail):
albor-baseline / training-step [4400ms]
+-- embed_forward [180ms]
+-- pcie_h2d_hidden [12ms]
+-- block_0_forward [95ms]
| +-- gemm_qkv [42ms] # 3 GEMMs: Q, K, V projections
| +-- attention_scores [8ms] # QK^T GEMM
| +-- attention_output [14ms] # attn_out GEMM
| +-- ffn_forward [28ms] # 3 GEMMs: gate, up, down
| +-- rmsnorm [3ms]
+-- block_0_backward [190ms]
| +-- gemm_backward [165ms] # 14 weight + activation GEMMs
| +-- elementwise [25ms] # SiLU backward, RMSNorm backward
+-- block_0_optimizer [33ms] # CPU AdamW (D2H + update + H2D)
+-- ... (blocks 1-23)
+-- lm_head_forward [45ms]
+-- pcie_d2h_logits [35ms]
+-- cross_entropy [22ms]
+-- pcie_h2d_grad_logits [35ms]
+-- lm_head_backward [90ms]
Each span is an OTLP trace viewable in Jaeger. Anomalous spans (CV > 15%) trigger automatic escalation to syscall-level profiling.
Phase 1: FFI + Forward Pass — COMPLETE
Contract: cublas-gemm-v1.yaml (FALSIFY-CUBLAS-001, -003, -008)
Status: ✅ Implemented in trueno#165, entrenar#231
- ✅ `cublas_sys.rs`: FFI bindings (libloading + OnceLock, ~270 lines)
- ✅ `cublas.rs`: safe RAII wrapper with `gemm_f32()`, `gemm_f16()`, row-major helpers
- ✅ Forward GEMM dispatch: cuBLAS when available, PTX fallback transparent
- ✅ Verified: 152.3 TFLOP/s isolated (FALSIFY-CUBLAS-003), loss matches PTX
Phase 2: Backward Pass — COMPLETE
Contract: cublas-gemm-v1.yaml (FALSIFY-CUBLAS-002, -006, -007)
Status: ✅ Implemented in entrenar#231
- ✅ `cublas_gemm_backward_a()`: Trans/NoTrans cuBLAS dispatch
- ✅ `cublas_gemm_backward_b()`: NoTrans/Trans cuBLAS dispatch
- ✅ Gradient accumulation stays FP32 (cuBLAS uses FP32 compute)
- ✅ Verified: 50M 5-step regression — loss 10.41 (was 10.39), all params get gradients
Phase 3: Optimization — COMPLETE
Contract: training-step-budget-v1.yaml (FALSIFY-BUDGET-001, -002)
Status: ✅ Verified on 50M and 350M
- ✅ `CUBLAS_TENSOR_OP_MATH` enabled (TF32 tensor cores on sm_89)
- ✅ cuBLAS handle reused across steps (RAII, one per cache)
- ✅ Stream binding once per step (`set_forward_cublas_stream`)
- ✅ Measured results:
- 50M: 1,744 tok/s (was 890), 293ms/step (was 575ms), 1.96x
- 350M: 1,485 tok/s (was 934), 1,379ms/step (was 4,400ms), 3.19x
- VRAM: +4 MB overhead (negligible)
6. Performance After cuBLAS (Measured)
6.1 Measured Throughput (Phase 1-3 Complete)
cuBLAS integration verified on both 50M and 350M models (RTX 4090, seq=1024, batch=4):
50M model (12 layers, hidden=512):
| Metric | Before (PTX) | After (cuBLAS) | Improvement |
|---|---|---|---|
| Throughput | 890 tok/s | 1,744 tok/s | 1.96x |
| Step time | 575 ms | 293 ms | 1.96x |
| Loss (step 1) | 10.39 | 10.41 | <0.2% diff |
| VRAM | 1,696 MB | 1,700 MB | +4 MB |
350M model (24 layers, hidden=1024, seq=512, batch=4):
| Metric | Before (PTX) | After (cuBLAS) | Improvement |
|---|---|---|---|
| Throughput | 934 tok/s | 1,485 tok/s | 1.59x |
| Step time | 4,400 ms | 1,379 ms | 3.19x |
| MFU | 2.5% | 4.3% | 1.72x |
| Loss (step 1) | 10.39 | 10.40 | <0.1% diff |
| VRAM | ~11.8 GB | 7.9 GB | -33% |
| 50-step run | 50 steps, checkpoint OK | No NaN, gnorm healthy | ✅ |
Verified via `apr train apply --config pretrain-350m-cuda-test.yaml` (entrenar PR #233).
350M step budget (cuBLAS):
GEMM compute: ~500 ms (was ~2500 ms with PTX — 5x speedup on large matrices)
Attention (PTX): ~400 ms (batched_4d_gemm, still scalar)
CPU optimizer: ~300 ms (D2H + AdamW + H2D per block)
Elementwise: ~100 ms (RMSNorm, SiLU, residual, etc.)
PCIe transfers: ~136 ms (embed H2D + grad transfers)
Total: ~1436 ms/step
Note: Attention GEMMs (batched_4d_gemm_forward) remain PTX. Converting
these to cublasGemmStridedBatched would give an additional 1.3-1.5x.
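The 1.3-1.5x estimate is an Amdahl's-law projection; a quick sanity check, assuming cuBLAS accelerates the ~400 ms attention portion by roughly 10-20x (in line with the kernel-level speedups measured elsewhere in this section):

```python
# Amdahl's-law estimate for converting attention GEMMs to
# cublasGemmStridedBatched (assumed 10-20x kernel-level speedup).
step_ms, attn_ms = 1436.0, 400.0  # from the step budget above

def overall_speedup(kernel_speedup):
    new_step = (step_ms - attn_ms) + attn_ms / kernel_speedup
    return step_ms / new_step

lo, hi = overall_speedup(10), overall_speedup(20)
print(f"{lo:.2f}x - {hi:.2f}x overall")  # falls inside the quoted 1.3-1.5x band
```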
6.2 cuBLAS Raw Capability
Measured with bench_cublas_vs_ptx example (isolated, no training overhead, TF32 mode):
| Shape [M,K]×[K,N] | cuBLAS TFLOP/s | PTX TFLOP/s | Speedup | % TF32 Peak | Description |
|---|---|---|---|---|---|
| [4096,1024]×[1024,1024] | 131.4 | 5.6 | 23.4x | 79.6% | Q/O attn projection |
| [4096,1024]×[1024,256] | 74.4 | 6.1 | 12.1x | 45.1% | GQA K/V projection |
| [4096,1024]×[1024,4096] | 130.8 | 5.8 | 22.5x | 79.3% | FFN gate/up |
| [4096,4096]×[4096,1024] | 132.2 | 5.9 | 22.3x | 80.1% | FFN down |
| [4096,1024]×[1024,32768] | 131.8 | 4.9 | 26.7x | 79.9% | LM head |
| [1024,1024]×[1024,1024] | 91.7 | 4.8 | 19.1x | 55.6% | Square 1K ref |
| [4096,4096]×[4096,4096] | 141.8 | 6.0 | 23.8x | 85.9% | Square 4K ref |
Key findings:
- 12-27x kernel-level speedup (cuBLAS TF32 vs scalar PTX FP32)
- Large training shapes (>1024) achieve 80-86% of TF32 tensor core peak (165 TFLOP/s)
- GQA thin-matrix shape `[4096,256,1024]` achieves only 45% peak (memory-bandwidth bound)
- End-to-end training speedup is 3.06x because GEMMs are only part of the step
6.3 MFU Analysis (Post-cuBLAS, Measured)
50M model (measured):
FLOPs per step: 6 × 62M × 4096 = 1.52 TFLOP
Step time: 293 ms
Achieved FLOP/s: 1.52 / 0.293 = 5.19 TFLOP/s
MFU (vs FP16): 5.19 / 165 = 3.1%
MFU (vs FP32): 5.19 / 82.6 = 6.3%
350M model (measured, seq=512, batch=4):
FLOPs per step: 6 × 370M × 2048 = 4.55 TFLOP
Step time: 1,379 ms (measured, not projected)
Achieved FLOP/s: 4.55 / 1.379 = 3.30 TFLOP/s
MFU (vs FP16): 3.30 / 165 = 2.0% → reported as 4.3% (runtime measurement includes seq_len scaling)
MFU (vs FP32): 3.30 / 82.6 = 4.0%
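The MFU arithmetic above can be reproduced directly from the standard 6 × params × tokens FLOPs-per-step approximation:

```python
# MFU from the standard 6 * params * tokens_per_step approximation.
def mfu(params, tokens_per_step, step_s, peak_tflops):
    flops = 6.0 * params * tokens_per_step
    achieved_tflops = flops / step_s / 1e12
    return achieved_tflops, achieved_tflops / peak_tflops

ach50, m50 = mfu(params=62e6, tokens_per_step=4096, step_s=0.293, peak_tflops=165.0)
print(f"50M:  {ach50:.2f} TFLOP/s, MFU {m50:.1%}")
ach350, m350 = mfu(params=370e6, tokens_per_step=2048, step_s=1.379, peak_tflops=165.0)
print(f"350M: {ach350:.2f} TFLOP/s, MFU {m350:.1%}")
```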
After cuBLAS fixes the linear GEMM bottleneck, the attention GEMMs (PTX) and CPU optimizer become the dominant bottlenecks (~400ms + ~300ms = ~700ms of 1379ms). To reach research-grade MFU, further phases are needed:
6.4 Full Optimization Path
| Phase | Change | Step Time | Tok/s | MFU (TF32) | Contract |
|---|---|---|---|---|---|
| Baseline | PTX GEMMs, CPU optimizer | 4,400 ms | 934 | 0.6% | training-gpu-kernel-v1 |
| Phase 1-3 | cuBLAS linear GEMMs | 1,379 ms | 1,485 | 2.0% | cublas-gemm-v1 (MEASURED) |
| Phase 4 | + cuBLAS attention GEMMs | 1,347 ms | 1,520 | 2.0% | cublas-attention-v1 (MEASURED) |
| Phase 5b | + Batched RMSNorm | 444 ms | 9,216 | 26.7% | batched-rmsnorm-v1 (MEASURED) |
| Phase 6 | + Fused GPU grad clip (ALB-078, §6.14) | ~500 ms | ~8.2K | ~24% | fused-grad-clip-v1 (IMPLEMENTED) |
| Phase 7 | + CUDA Graphs (eliminate remaining dispatch) | ~200 ms | ~20K | ~58% | cuda-graphs-v1 (future) |
| Phase 8 | + Flash Attention (fuse softmax+scale) | ~130 ms | ~31K | ~79% | flash-attn-v1 (future) |
*Phase 5a: 257ms uses seq=512 profile config vs seq=1024 for Phases 1-4. TF32 provides 0% measurable improvement at 350M (compute <15% of step time).
*Phase 5b measured at seq=1024 (production config). Step 1 = 444ms (async) / 638ms (blocking, true GPU time). Includes JIT warmup (~200ms). Forward GPU time 347ms → 14ms (24.8x) at seq=512. At seq=1024: 9,216 tok/s (9.9x vs baseline). 100,352 kernel launches → ~550 (182x fewer). nsys-verified.
Fused QKV (originally Phase 5): CANCELLED — all GEMMs already use cuBLAS. Identical FLOP count, negligible dispatch saving (0.1%), high implementation cost.
Current position: Phase 5b achieves 26.7% MFU at seq=1024 — within 2x of research-grade throughput. Remaining bottleneck is per-kernel dispatch overhead (~550 launches/step) and host↔device synchronization.
Each future phase gets its own contract before implementation begins.
6.5 Phase 4 Results: Attention GEMMs (MEASURED)
cuBLAS `cublasSgemmStridedBatched` replaces hand-written PTX for multi-head attention score computation (QK^T and attn·V). Implemented in trueno-gpu 0.4.25 + entrenar PR #234 (merged).
Measured results (350M, seq=512, batch=4, RTX 4090):
| Metric | Phase 1-3 | Phase 4 | Improvement |
|---|---|---|---|
| Throughput | 1,485 tok/s | 1,520 tok/s | +2.4% |
| Step time | 1,379 ms | 1,347 ms | -32ms (2.3%) |
| MFU | 4.3% | 4.4% | +0.1pp |
| VRAM | 7,961 MB | 7,937 MB | -24 MB |
Analysis: The improvement is modest (2.3%) because at seq=512 the attention matrices are small (512×512×64 per head, batch_count=64). At seq=1024 or seq=2048 the improvement would be larger as attention GEMMs scale as O(seq²).
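The O(seq²) scaling is easy to verify: attention score and context GEMMs cost 2·S²·d FLOPs each per head, so doubling the sequence length quadruples attention GEMM work (head_dim=64, 16 query heads, batch 4, per the model config):

```python
# Attention GEMM FLOPs per layer: QK^T plus attn*V, each 2 * S^2 * d per head.
def attn_gemm_flops(seq, head_dim=64, num_heads=16, batch=4):
    per_head = 2 * 2 * seq * seq * head_dim  # QK^T + attn*V
    return per_head * num_heads * batch

f512, f1024 = attn_gemm_flops(512), attn_gemm_flops(1024)
print(f"seq=512: {f512/1e9:.1f} GFLOPs/layer, "
      f"seq=1024: {f1024/1e9:.1f} GFLOPs/layer, ratio {f1024/f512:.0f}x")
```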
Implementation (trueno-gpu 0.4.25, entrenar PR #234):
- `cublasSgemmStridedBatched` FFI in trueno-gpu `cublas_sys.rs`
- Safe wrapper `gemm_f32_strided_batched_row_major()` in `cublas.rs`
- `batch_count = batch_size * num_heads` (4 × 16 = 64)
- Fast path in `batched_4d_gemm_forward` with PTX fallback
6.6 Step Time Profiling (KAIZEN-047, MEASURED)
Per-phase wall-clock breakdown from StepProfiler (KAIZEN-047). Profiled on
350M model, seq=512, batch=4, RTX 4090, cuBLAS enabled. Combined forward-only
(NaN-skipped) and full forward+backward samples.
Forward-only steps (200 profiled samples, avg 255.7 ms/step):
| Phase | pct | avg_ms | Notes |
|---|---|---|---|
| forward | 93.9% | 240.0 | 24 blocks × 5 GEMMs + attention + norms |
| norm_lm | 1.8% | 4.7 | Final RMSNorm + LM head GEMM |
| other | 4.0% | 10.2 | Kernel launch overhead, dispatch |
| embed | 0.1% | 0.2 | CPU embedding lookup |
| h2d | 0.1% | 0.2 | Hidden state H2D transfer |
Full forward+backward step (1 sample, 323 ms):
| Phase | pct | avg_ms | Notes |
|---|---|---|---|
| forward | 80.3% | 259.4 | Same as above |
| blk_bwd | 12.9% | 41.7 | 24 blocks backward (cuBLAS GEMMs) |
| loss | 3.3% | 10.5 | Fused cross-entropy (GPU) |
| norm_lm | 1.6% | 5.3 | Final RMSNorm + LM head GEMM |
| lm_bwd | 0.7% | 2.2 | LM head GEMM backward |
| embed_bwd | 0.4% | 1.5 | D2H + clip + scatter-add |
| norm_bwd | 0.2% | 0.7 | Final RMSNorm backward |
Key finding: Forward pass dominates at 80-94% of step time. Each block dispatches ~20 GPU operations (7 GEMMs + attention pipeline + norms + activations + residual adds) = 480+ kernel launches per step.
Critical observation: ALL GEMMs already use cuBLAS (Phase 1-4, ALB-075):
forward gemm_forward, backward gemm_backward_a/gemm_backward_b, AND
attention batched cublasSgemmStridedBatched. There are no remaining PTX GEMMs
in the training loop.
Anomaly: The forward phase measures 240ms of CPU wall-clock time for what should be purely async GPU dispatches. At ~5μs per cuBLAS dispatch for ~480 operations, expected CPU time is ~2.4ms — a 100x discrepancy. Possible causes:
- CUDA command queue backpressure (driver blocks CPU when queue is full)
- Implicit cuBLAS synchronization between GEMMs on the same stream
- cuBLAS workspace allocation/reallocation between differently-sized GEMMs
- Kernel cache mutex contention (unlikely — single-threaded)
Fused QKV analysis (CANCELLED): Since all GEMMs use cuBLAS, merging 3 QKV GEMMs into 1 fused GEMM yields identical FLOP count and saves only 2 dispatches per block (48 total, ~240μs, 0.1% of step time). The implementation requires GPU split/concat kernels, backward pass rewrite, and optimizer restructuring. Cost-benefit ratio is unfavorable.
Next bottleneck: Not dispatch count, not CPU optimizer — it’s understanding
why async GPU dispatches appear to block the CPU for 240ms. Requires nsys
profiling or CUDA_LAUNCH_BLOCKING=1 timing.
Optimization targets (revised):
- nsys profiling — identify actual GPU kernel vs idle vs sync time
- Reduce implicit synchronization — eliminate any cuBLAS sync barriers
- CUDA Graphs — capture forward/backward as graph, eliminate per-kernel dispatch
- Kernel fusion — merge element-wise ops (residual_add + RMSNorm) to reduce memory traffic
6.7 Fused QKV Analysis (CANCELLED)
Phase 5 was originally planned as fused QKV projection (3 GEMMs → 1 per block). Analysis during implementation revealed this is not impactful:
Why fused QKV doesn’t help:
- All GEMMs already use cuBLAS (ALB-075, Phases 1-4). Forward, backward, and attention batched GEMMs all dispatch via tensor core paths.
- Identical FLOP count: 3 separate GEMMs (Q, K, V) = 1 fused GEMM in total floating point operations. No compute savings.
- Negligible dispatch saving: 48 fewer kernel launches × ~5μs = 240μs. Against a 240ms forward pass, this is 0.1% improvement.
- High implementation cost: Requires GPU split/concat kernels (trueno lacks cuMemcpy2D), backward pass rewrite (concatenated gradient assembly), optimizer restructuring (merged w_qkv states), and checkpoint format changes.
- GQA complicates layout: Q dim (1024) ≠ K/V dim (256), so the output [seq, 1536] cannot be trivially sliced without strided copies.
What matters instead: The 240ms forward measurement is 100x slower than expected for async GPU dispatches. Understanding and fixing this anomaly would yield far greater improvement than any kernel-level fusion.
6.8 Forward Pass Anomaly — ROOT CAUSE FOUND (ALB-076, FIXED)
Observation: The StepProfiler measures 240ms of CPU wall-clock time for
the 24-block forward loop. Expected CPU dispatch time: ~2.4ms. nsys profiling
was used to identify the root cause.
nsys profiling results (50 steps, RTX 4090):
GPU Kernel Time Breakdown (nsys --stats=true):
97.1% 46.6s 5,017,600 instances rmsnorm avg=9.3μs
0.8% 0.4s 9,600 instances cutlass GEMM avg=37.8μs
0.6% 0.3s 19,200 instances cutlass GEMM avg=14.1μs
0.4% 0.2s 4,800 instances cutlass GEMM avg=42.3μs
...remaining kernels < 0.2% each
Root cause: Per-row RMSNorm kernel launches
The rms_norm_forward() in normalization.rs launched RmsNormKernel in a
CPU loop:
// BEFORE (97.1% of GPU time):
let config = LaunchConfig { grid: (1, 1, 1), block: (32, 1, 1), shared_mem: 0 };
for batch_idx in 0..batch_size { // 2,048 iterations per norm call!
    stream.launch_kernel(module, kernel_name, &config, &mut args)?;
}
- 49 norm calls/step × 2,048 launches each = 100,352 kernel launches/step
- Each launch: grid=(1,1,1), block=(32,1,1) = 1 warp on 1 SM out of 128
- At ~9.3μs per launch: 933ms of GPU time per step just in RMSNorm
- Meanwhile, all cuBLAS GEMMs total only ~22ms per step
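The launch-count arithmetic behind the root cause, using the figures from the nsys profile:

```python
# Kernel-launch arithmetic for the per-row RMSNorm bug.
norm_calls_per_step = 49      # norm call sites executed per training step
rows_per_call = 2048          # batch 4 x seq 512 rows, one launch each
launches = norm_calls_per_step * rows_per_call
gpu_ms = launches * 9.3e-3    # ~9.3 us avg per tiny 1-warp launch (nsys)
print(f"{launches} launches/step, ~{gpu_ms:.0f} ms of GPU time in RMSNorm alone")
```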
Five Whys:
- Why is forward 240ms? GPU backpressure from 100K RMSNorm kernel launches
- Why 100K launches? `rms_norm_forward` loops `batch_size=2048` times
- Why per-row loop? `RmsNormKernel` processes one row (grid=(1,1,1))
- Why single-row kernel? Written before `BatchedVectorizedRmsNormKernel`
- Why not updated? Backward module already used batched variant; forward wasn't
Fix (entrenar PR #238, merged):
// AFTER (single launch, all rows in parallel):
let kernel = BatchedVectorizedRmsNormKernel::new(hidden_size, batch_size);
let config = LaunchConfig {
    grid: (1, batch_size, 1), // One block per row
    block: (256, 1, 1),       // 8 warps per block
    shared_mem: 8 * 4,
};
stream.launch_kernel(module, "batched_rmsnorm_vectorized", &config, &mut args)?;
Measured impact (350M, seq=512, batch=4, RTX 4090):
| Metric | Before (per-row) | After (batched) | Speedup |
|---|---|---|---|
| Forward GPU time (blocking) | 347 ms | 14.0 ms | 24.8x |
| Forward CPU dispatch (async) | 241 ms | 2.66 ms | 91x |
| Total step GPU time | 356 ms | 15.1 ms | 23.6x |
| Step 1 (with warmup) | 1,357 ms | 339 ms | 4.0x |
| MFU (step 1) | 4.4% | 17.5% | 4.0x |
| 50-step training | 53.2s | 2.2s | 24x |
| Kernel launches/step | 100,352 | ~550 | 182x fewer |
Lesson: Always profile with nsys before optimizing. The per-GEMM analysis
(TF32, fused QKV, attention GEMMs) was looking at the wrong bottleneck. A
single for loop in a support kernel consumed 97% of GPU time.
6.9 TF32 Tensor Core Investigation (Phase 5a, MEASURED)
Discovery: cuBLAS gemm_f32() was using CUBLAS_COMPUTE_32F (strict FP32,
82.6 TFLOPS on RTX 4090) instead of CUBLAS_COMPUTE_32F_FAST_TF32 (TF32 tensor
cores, 165 TFLOPS). TF32 uses 10-bit mantissa for FP32 GEMMs — standard for NN
training (PyTorch default since v1.7).
Implementation (trueno-gpu 0.4.26, entrenar PR #236):
| Change | File | Before | After |
|---|---|---|---|
| Compute type | cublas.rs:gemm_f32() | CUBLAS_COMPUTE_32F (68) | CUBLAS_COMPUTE_32F_FAST_TF32 (74) |
| Algorithm | cublas.rs:gemm_f32() | CUBLAS_GEMM_DEFAULT (-1) | CUBLAS_GEMM_DEFAULT_TENSOR_OP (99) |
| Math mode | cublas.rs:CublasHandle::new() | CUBLAS_TENSOR_OP_MATH (1, deprecated) | CUBLAS_TF32_TENSOR_OP_MATH (3) |
Dogfood results (350M, seq=512, batch=4, RTX 4090, 50 steps):
| Metric | Pre-TF32 (§6.6) | Post-TF32 | Delta |
|---|---|---|---|
| Step time (p50) | 255.7 ms | 256.9 ms | +0.5% (noise) |
| Forward time | 240.0 ms | 241.2 ms | +0.5% (noise) |
| Tok/s (steady state) | ~8,020 | ~7,966 | -0.7% (noise) |
| Step time (p95) | N/A | 265.5 ms | — |
Result: No measurable improvement from TF32 at 350M model size.
Root cause analysis (Five Whys):
- Why no improvement? GEMM compute time is a small fraction of total step time.
- Why is GEMM compute small? At seq=512/batch=4, the largest GEMM is [2048,1024]×[1024,4096] = 17.2 GFLOPs. At TF32 peak (165 TFLOPS): 0.10ms. At FP32 peak (82.6 TFLOPS): 0.21ms. Saving: 0.11ms per GEMM.
- Why doesn’t 0.11ms × 168 GEMMs/fwd = 18ms saving matter? Because total step time is 257ms. GEMM compute is ~35ms (TF32) vs ~55ms (FP32). The 20ms saving is ~8% of step time.
- Why isn’t 8% saving visible? Per-kernel launch overhead (~10-30μs per cuBLAS dispatch) and element-wise kernels add ~200ms of overhead that TF32 does not reduce. The 20ms is within measurement noise of this overhead.
- Why so much overhead? The forward pass anomaly (§6.8): 168 GEMM dispatches + ~300 element-wise kernel dispatches per forward, each with CUDA driver overhead.
Arithmetic intensity analysis (determines whether TF32 helps per-GEMM):
| GEMM | Shape | AI (FLOPs/byte) | TF32 crossover (164) | Bound |
|---|---|---|---|---|
| Q/O projection | [2048,1024]×[1024,1024] | 215 | Above | Compute → TF32 helps |
| K/V projection | [2048,1024]×[1024,256] | 95 | Below | Memory → TF32 no help |
| gate/up FFN | [2048,1024]×[1024,4096] | 307 | Above | Compute → TF32 helps |
| down FFN | [2048,4096]×[4096,1024] | 307 | Above | Compute → TF32 helps |
K/V GEMMs (GQA, N=256) are memory-bandwidth bound at TF32 rate — the tensor cores finish faster than data can be loaded. TF32 only helps the 5 larger GEMMs per block, not all 7.
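The crossover and per-GEMM intensities follow from a simple roofline model; a rough sketch, assuming FP32 operands read once, the result written once, and ~1008 GB/s memory bandwidth for the RTX 4090 (exact values depend on cache behavior, so they differ slightly from the table):

```python
# Arithmetic intensity sketch (FP32; each operand read once, result written once).
BYTES = 4  # FP32

def arithmetic_intensity(m, n, k):
    flops = 2.0 * m * n * k
    traffic = BYTES * (m * k + k * n + m * n)
    return flops / traffic

# Roofline crossover: below this AI, TF32 tensor cores are bandwidth-bound.
crossover = 165e12 / 1.008e12  # peak FLOP/s over ~1008 GB/s => ~164 FLOPs/byte
kv = arithmetic_intensity(2048, 256, 1024)    # GQA K/V projection
gate = arithmetic_intensity(2048, 4096, 1024) # FFN gate/up
print(f"crossover ~{crossover:.0f}, K/V AI {kv:.0f} (below), gate/up AI {gate:.0f} (above)")
```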
Confirmation: The raw cuBLAS benchmarks (§6.2) already demonstrate TF32 working at kernel level — 131 TFLOPS (80% of TF32 peak) for large matrices. The issue is not TF32 implementation but that compute is not the bottleneck in end-to-end training at 350M.
When TF32 will matter: At larger models (>1B) or longer sequences (seq≥2048), GEMMs are larger and GEMM compute becomes a larger fraction of step time. The optimization is “banked” for future scaling.
MFU at steady state (corrected):
350M model (seq=512, batch=4, TF32 enabled):
FLOPs per step: 6 × 370M × 2048 = 4.55 TFLOP
Step time: 257 ms (p50, steady state)
Achieved FLOP/s: 4.55 / 0.257 = 17.7 TFLOP/s
MFU (vs TF32 peak): 17.7 / 165 = 10.7%
MFU (vs FP32 peak): 17.7 / 82.6 = 21.4%
Note: The runtime-reported MFU of 4.4% at step 1 is based on the 1357ms step-1 latency (includes JIT warmup). Steady-state MFU is 10.7% (vs TF32) / 21.4% (vs FP32). The §6.6 profiler reports forward-only measurements because most samples skip backward (NaN loss from mixed-precision scaling with random init).
6.10 Post-ALB-076 Kernel Profile (nsys, seq=1024)
With the RMSNorm bottleneck eliminated, nsys profiling reveals the actual performance landscape at production seq_len=1024:
nsys profile --stats=true --trace=cuda,cublas (50 steps, seq=1024, batch=4)
GPU Kernel Time Breakdown:
21.9% 725ms 9,800 cutlass GEMM 256x128 nn (FFN gate/up/down)
13.0% 431ms 4,800 batched_softmax ← MAJOR BOTTLENECK
12.2% 404ms 4,824 scale (attention scores) ← MAJOR BOTTLENECK
10.7% 356ms 4,800 cutlass GEMM 128x128 nn (QKV projections)
9.4% 313ms 4,824 cutlass GEMM 256x64 nn (output proj)
7.1% 236ms 9,600 cutlass GEMM 128x64 nn
5.7% 190ms 4,872 cutlass GEMM 64x64 nn
4.5% 149ms 4,920 batched_transpose ← attention overhead
3.3% 110ms 9,600 cutlass GEMM 64x64x32 nn
2.8% 92ms 200 fused_cross_entropy
2.6% 85ms 10,272 residual_add
2.2% 72ms 4,800 fused_swiglu
1.6% 53ms 9,800 batched_rmsnorm_vectorized ← was 97.1%!
CUDA API Time:
59.2% 2.86s 228 cuStreamSynchronize ← BIGGEST time sink
11.0% 530ms 637 cuMemcpyDtoH
9.2% 444ms 170,480 cuMemcpyDtoDAsync
5.7% 274ms 1,054 cuMemcpyHtoD
5.3% 256ms 103,469 cuLaunchKernel ← still 103K launches
Key observations:
- GEMMs dominate GPU compute (~70%): as expected after eliminating the RMSNorm bottleneck. cuBLAS tensor core GEMMs are the core workload.
- Attention non-GEMM overhead = 29.7%: softmax (13.0%) + scale (12.2%) + transpose (4.5%). Flash Attention would fuse all three into the GEMM.
- Stream sync = 59% of CUDA API time: 228 syncs × 12.5ms avg = 2.86s. The per-block interleaved training pattern requires a sync between each block's forward/backward. CUDA Graphs would eliminate this.
- 103K kernel launches: still high (2,069/step). Each costs ~2.5μs in `cuLaunchKernel` overhead. CUDA Graphs would batch these.
- 170K D2D copies: memory layout conversions (interleaved↔batched), 102 GB total — optimizing the data layout would eliminate most of them.
Next optimization targets (in priority order):
| Target | Current Impact | Expected Gain | Approach |
|---|---|---|---|
| Flash Attention | 29.7% of GPU kernel time | ~25% step time | Fused Q×K→softmax→×V kernel |
| CUDA Graphs | 59% of API time (2.86s) | ~40% step time | Graph capture for fwd/bwd |
| D2D copy reduction | 9.2% of API time | ~8% step time | Unified memory layout |
6.11 v3 Training Time Impact (Updated)
Post-ALB-076 at seq=1024, batch=4, grad_accum=1:
| Scenario | Step Time | Tok/s | Wall Clock (250K steps) |
|---|---|---|---|
| Baseline (PTX GEMMs) | 4,400 ms | 934 | 12.7 days |
| Phase 1-4 (cuBLAS) | 1,379 ms | 1,485 | 4.0 days |
| Phase 5b (+ batched RMSNorm) | 444 ms | 9,216 | 1.3 days |
| Phase 6 (+ CUDA Graphs) | ~200 ms | ~20K | ~14 hours |
| Phase 7 (+ Flash Attention) | ~130 ms | ~31K | ~9 hours |
Note: Phase 5b step time of 444ms includes JIT warmup. Steady-state estimated ~250-350ms based on profiler forward pass timing. With grad_accum=128 (production), effective training time is per micro-batch × accum_steps.
6.12 Tensor Core NaN in Backward GEMMs — ROOT CAUSE FOUND (ALB-076, FIXED)
Discovery: cuBLAS tensor core GEMM algorithms (CUBLAS_GEMM_DEFAULT_TENSOR_OP,
algorithm 99) produce ALL NaN output for transposed backward GEMMs when
input gradient magnitudes reach ~1e5. Forward GEMMs (NoTrans/NoTrans) are
unaffected. This was the root cause of complete NaN corruption in v3 training.
Symptom: ALL GPU-resident transformer block weights become NaN after the first optimizer step. Every gradient produced by cuBLAS backward is NaN.
Five Whys analysis:
- Why NaN weights? Optimizer reads NaN weight gradients from cuBLAS backward
- Why NaN gradients? cuBLAS `gemm_backward_a`/`gemm_backward_b` output ALL NaN starting at backward call #36 (first backward of block 18, FFN down_proj)
- Why NaN output from valid finite inputs? The tensor core GEMM algorithm (`CUBLAS_GEMM_DEFAULT_TENSOR_OP`) has a numerical fault for transposed operands
- Why only backward and not forward? Backward uses `Trans/NoTrans` and `NoTrans/Trans` transpose flags; forward uses `NoTrans/NoTrans` (unaffected)
- Why only after ~5 blocks (call #36)? Gradient magnification through the 24-layer backward reaches ~1e5 magnitude at block 18, triggering the fault
Diagnostic evidence (NaN scan on every cuBLAS backward call):
| Call # | Block | Direction | grad_out max | cuBLAS output | Status |
|---|---|---|---|---|---|
| 0 | 23 | bwd_a | small | max=3.24e-5 | Valid |
| 8 | 22 | bwd_a | ~1e-2 | max=1.04e-2 | Valid |
| 29 | 19 | bwd_b | ~1e2 | max=9.40e2 | Valid |
| 35 | 19 | bwd_b | ~1e-3 | max=1.49e-3 | Valid |
| 36 | 18 | bwd_a | 2.56e5 | ALL 4.2M NaN | BUG |
| 37+ | 18-0 | all | — | ALL NaN | Cascading |
Key observation: Call #36 inputs are entirely valid (grad_out: 0 NaN, max=2.56e5; weight_b: 0 NaN, max=1.98e-2). The tensor core algorithm converts valid finite inputs to NaN.
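The diagnostic pattern above (scan every backward output for NaN, record input magnitudes) can be sketched in a few lines. `scan_backward_output` is a hypothetical helper for illustration, not entrenar's actual instrumentation API:

```python
import numpy as np

# Hypothetical NaN-scan helper: after each backward GEMM, count NaN in the
# output and record the max finite magnitude of the incoming gradient.
def scan_backward_output(call_id, grad_out, output):
    nan_count = int(np.isnan(output).sum())
    finite = grad_out[~np.isnan(grad_out)]
    return {
        "call": call_id,
        "grad_out_max": float(np.abs(finite).max()),
        "nan_count": nan_count,
        "status": "BUG" if nan_count else "Valid",
    }

grad = np.array([2.56e5, -1.2e4], dtype=np.float32)   # valid finite inputs
bad_out = np.full(4, np.nan, dtype=np.float32)        # all-NaN GEMM output
print(scan_backward_output(36, grad, bad_out)["status"])   # -> BUG
```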
Falsified hypotheses (before root cause found):
- TF32 precision: changing `CUBLAS_COMPUTE_32F_FAST_TF32` → `CUBLAS_COMPUTE_32F` alone did NOT fix NaN — the algorithm, not precision, was the issue
- Stream synchronization: `CUDA_LAUNCH_BLOCKING=1` still produced NaN
- Buffer size mismatch: oversized buffers verified to be within-bounds access
Fix (trueno #170, entrenar #239):
| Change | File | Before | After |
|---|---|---|---|
| Math mode | cublas.rs:CublasHandle::new() | CUBLAS_TF32_TENSOR_OP_MATH (3) | CUBLAS_DEFAULT_MATH (0) |
| Compute type | cublas.rs:gemm_f32() | CUBLAS_COMPUTE_32F_FAST_TF32 (74) | CUBLAS_COMPUTE_32F (68) |
| Algorithm | cublas.rs:gemm_f32() | CUBLAS_GEMM_DEFAULT_TENSOR_OP (99) | CUBLAS_GEMM_DEFAULT (-1) |
Result (350M, seq=1024, batch=4, RTX 4090, 2 steps):
| Metric | With tensor cores | Without tensor cores | Delta |
|---|---|---|---|
| NaN in gradients | ALL (4.2M elements) | 0 | Fixed |
| Loss (step 1) | NaN | 10.4007 | Fixed |
| Tok/s | — | 5,216 | 5.9x over PTX |
| MFU (step 1) | — | 15.1% | vs FP32 peak |
| gnorm | NaN | 2.05 | Healthy |
Performance impact: cuBLAS SIMD (no tensor cores) is still 5.9x faster than hand-written PTX (5,216 vs 890 tok/s). The tensor core advantage (~2x theoretical) is irrelevant when it produces NaN.
Phase 5a status: REVERTED. TF32 tensor cores (§6.9) provided 0% measurable improvement at 350M AND cause NaN in backward. The optimization is removed entirely. Phase numbering unchanged; Phase 5a is now a null operation.
Lesson: Tensor core GEMM algorithms have undocumented numerical edge cases with large-magnitude transposed operands. The NVIDIA documentation does not warn about this failure mode. Always validate full backward pass (all layers, production gradient magnitudes) before enabling tensor cores in training.
6.13 v3 Training Results (LIVE, step 1000+)
Config: 350M model, seq=1024, batch=4, codeparrot-clean (5.29B tokens, 20 shards × ~260K sequences), max_steps=250K, save_interval=1000.
Loss curve (v3, measured):
| Step | Loss | Val Loss | Val PPL | Tok/s | MFU | gnorm | lr |
|---|---|---|---|---|---|---|---|
| 1 | 10.40 | — | — | 5,606 | 16.2% | 2.19 | 1.5e-7 |
| 100 | 8.26 | — | — | 7,648 | 22.1% | 5.08 | 1.5e-5 |
| 200 | 6.89 | — | — | 7,194 | 20.8% | 2.43 | 3.0e-5 |
| 700 | 6.78 | — | — | 7,608 | 22.0% | 2.49 | 1.1e-4 |
| 900 | 6.90 | — | — | 7,653 | 22.2% | 2.32 | 1.4e-4 |
| 1000 | 6.93 | 7.38 | 1607.6 | 7,676 | 22.2% | 3.04 | 1.5e-4 |
| 1800 | 6.71 | — | — | 6,977 | 20.2% | 3.12 | 2.7e-4 |
| 1900 | 6.50 | — | — | 6,974 | 20.2% | 2.01 | 2.9e-4 |
| 2000 | 6.36 | 7.19 | 1331.7 | 6,972 | 20.2% | 2.85 | 3.0e-4 |
| 2200 | 7.63 | — | — | 6,807 | 19.7% | 2.44 | 3.0e-4 |
| 2500 | 6.84 | — | — | 6,824 | 19.8% | 3.04 | 3.0e-4 |
| 3000 | 7.24 | 7.20 | 1341.2 | 6,783 | 19.6% | 2.17 | 3.0e-4 |
| 3500 | 6.54 | — | — | 6,681 | 19.3% | 2.62 | 3.0e-4 |
| 4000 | 7.85 | 7.10 | 1208.7 | 6,695 | 19.4% | 1.53 | 3.0e-4 |
| 4500 | 7.28 | — | — | 6,609 | 19.1% | 2.10 | 3.0e-4 |
| 5000 | 6.98 | 7.13 | 1244.0 | 6,632 | 19.2% | 1.83 | 3.0e-4 |
| 5500 | 6.49 | — | — | 6,565 | 19.0% | 1.65 | 3.0e-4 |
| 6000 | 7.16 | 7.05 | 1157.3 | 6,586 | 19.1% | 2.13 | 3.0e-4 |
| 7000 | 7.44 | 6.99 | 1084.9 | 6,586 | 19.1% | 1.19 | 3.0e-4 |
| 8000 | 7.14 | 7.02 | 1117.8 | 6,583 | 19.1% | 2.42 | 3.0e-4 |
| 9000 | 6.79 | 7.02 | 1114.0 | 6,561 | 19.0% | 0.89 | 3.0e-4 |
| 10000 | 6.35 | 7.07 | 1180.1 | 6,564 | 19.0% | 1.02 | 3.0e-4 |
| 12000 | 6.66 | 6.94 | 1036.7 | 6,570 | 19.0% | 0.84 | 3.0e-4 |
| 14000 | 6.48 | 6.93 | 1026.8 | 6,567 | 19.0% | 0.78 | 3.0e-4 |
| 16000 | 6.88 | 6.94 | 1036.4 | 6,578 | 19.0% | 0.37 | 3.0e-4 |
| 18000 | 6.56 | 6.96 | 1051.0 | 6,595 | 19.1% | 0.44 | 3.0e-4 |
| 20000 | 7.15 | 6.93 | 1023.1 | 6,621 | 19.2% | 0.36 | 3.0e-4 |
| 22000 | 6.77 | 6.92 | 1012.7 | 6,632 | 19.2% | 0.32 | 3.0e-4 |
| 24000 | 6.83 | 6.92 | 1010.5 | 6,651 | 19.3% | 0.22 | 3.0e-4 |
| 26000 | 6.61 | 6.91 | 1000.3 | 6,682 | 19.3% | 0.15 | 3.0e-4 |
Warmup-phase performance (steps 100-2000 average):
- 7,600 tok/s ± 200 (during warmup, steps 100-1000)
- 22.1% MFU vs FP32 peak (RTX 4090, 82.6 TFLOP/s)
- 516 ms/step (p50, warmup phase)
Post-warmup performance (steps 2000-26000, constant lr):
- 6,630 tok/s ± 80 (steady state)
- 19.2% MFU (post-warmup average)
- ~560 ms/step (p50)
- VRAM: 11.4 GB / 24 GB (47% utilization)
- 0 NaN in 26,400 steps (ALB-077 fix verified)
Checkpoints (every 1000 steps, 1520 MB SafeTensors each):
- step-1000 through step-26000 — all verified OK (26 checkpoints total).
Training dynamics:
- Loss converges from 10.4 to ~6.9 in 1000 steps (during warmup)
- Post-warmup spike at step 2200 (loss=7.63) — lr reached max (3e-4), recovered by step 2500
- Val loss improving: 7.38 → 7.05 → 6.94 → 6.93 → 6.92 → 6.91 (plateau since step 12K)
- Val PPL: 1608 → 1157 → 1037 → 1027 → 1013 → 1000 (slow convergence, nearing floor)
- Gradient norm collapse: 3.04 (step 1K) → 1.02 (10K) → 0.15 (26K) — 20x decrease
- Expected for well-initialized transformers as loss landscape flattens
- ZClip spikes infrequent post-15K (z≤3.4, ema=0.14)
- B_noise decreasing: 0.22 → 0.08 (gradient signal/noise ratio improving)
Token efficiency: 108M tokens seen at step 26K. Val PPL=1000 at 108M tokens. Reference: codeparrot-small (110M) achieved val_loss ~3.5 after 50B tokens. The 350M model is undertrained — 108M tokens is <1% of typical training budget.
ETA: 250K steps × 0.56s = 38.9 hours (~1.6 days from start). At step 26K: ~10.4% complete, ~34.5 hours remaining. Compare: PTX baseline would be 250K × 4.4s = 12.7 days.
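The ETA and token-count arithmetic above can be reproduced directly from the measured figures (0.56 s/step p50, 4,096 tokens/step, 250K max steps):

```python
# Sketch of the wall-clock and token-budget arithmetic for the v3 run.
def eta_and_tokens(max_steps, step_time_s, tokens_per_step, steps_done):
    total_hours = max_steps * step_time_s / 3600
    remaining_hours = (max_steps - steps_done) * step_time_s / 3600
    tokens_seen = steps_done * tokens_per_step
    return total_hours, remaining_hours, tokens_seen

total_h, remaining_h, tokens = eta_and_tokens(250_000, 0.56, 4_096, 26_400)
print(round(total_h, 1))     # -> 38.9 hours total
print(round(tokens / 1e6))   # -> 108 (million tokens seen at step 26.4K)
```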
6.14 Stream Sync Bottleneck Analysis (ALB-078, Five Whys)
Observation: v3 training at step 1500 shows step time increased to 618ms (from 516ms at step 1000). The difference correlates with gradient clipping becoming active as gnorm grows.
Five Whys:
- Why 618ms/step? Per-block gradient clipping introduces stream syncs
- Why per-block syncs? `compute_workspace_clip_scale_gpu` calls `stream.synchronize()` after launching 9 `squared_sum` kernels per block
- Why is the sync needed? The CPU must download 9 partial-sum buffers to compute `clip_scale = min(1, max_norm / sqrt(sum_of_squared_norms))`
- Why CPU-side? No fused GPU kernel exists for norm reduction + clip
- Why 24 syncs? One per transformer block (interleaved backward+optimizer)
Sync budget (per step, with grad_clip: 1.0):
| Sync Point | Count/step | Location | Necessary? |
|---|---|---|---|
| Per-block clip norm | 24 | compute_workspace_clip_scale_gpu | REDUNDANT |
| LM head norm | 1 | squared_sum_cuda | REDUNDANT |
| Final global norm | 1 | compute_clip_scale_with_norm | REDUNDANT |
| CE loss D2H | 1 | fused_cross_entropy_cuda | YES (NaN guard) |
| Pre-embed sync | 1 | gpu_backward:1134 | YES (C-STREAMSYNC-001) |
| Total | 28 | — | 2 necessary, 26 redundant |
Fix (entrenar #240, trueno #171) — IMPLEMENTED:
Two new PTX kernels in trueno-gpu/src/kernels/optimizer/fused_clip.rs:
- `ClipScaleReduceKernel`: single-CTA, single-thread. Reads a contiguous `f32[total_partials]` buffer of squared-sum partial results and computes `clip_scale = min(1.0, max_norm / sqrt(sum))`. IEEE 754 handles the zero-norm case without branching (`div(x, 0.0) = +inf`, `min(+inf, 1.0) = 1.0`). Writes `output[0] = scale, output[1] = norm` for observability.
- `GradientClipGpuScaleKernel`: element-wise. Reads the scale from a GPU pointer (not a host param). Early exit when `scale ≈ 1.0` (within 1e-7) to avoid unnecessary memory bandwidth when no clipping is needed.
Integration in entrenar/src/autograd/cuda_optim.rs:
- `FusedClipState`: pre-allocated contiguous partials buffer + scale buffer
- `squared_sum_launch_into`: writes partial sums at an offset into the contiguous buffer
- `clip_scale_reduce_cuda`: launches ClipScaleReduceKernel (grid 1×1, block 1×1)
- `gradient_clip_gpu_scale_cuda`: launches GradientClipGpuScaleKernel
Pipeline (per block): 9× squared_sum_launch_into → 1× clip_scale_reduce → 9× gradient_clip_gpu_scale. Zero sync points, zero D2H transfers.
This eliminates 26 of 28 syncs/step. The 2 remaining are irreducible:
- CE loss download for NaN guard
- Final sync before embed gradient D2H (C-STREAMSYNC-001)
Status: Implemented, compiles, awaiting dogfood on next training restart. Expected impact: step time 618ms → ~500ms (~20% improvement).
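The reduction math performed by ClipScaleReduceKernel is a one-liner. A sketch in Python, with the caveat that the PTX kernel relies on IEEE semantics (`max_norm / 0.0 = +inf`, `min(+inf, 1.0) = 1.0`) while Python raises on float division by zero, so the sketch adds an explicit guard:

```python
import math

# Sketch of the clip-scale reduction: sum the per-parameter squared-sum
# partials, take the global norm, and clamp the scale to at most 1.0.
def clip_scale(partial_squared_sums, max_norm=1.0):
    norm = math.sqrt(sum(partial_squared_sums))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return scale, norm

print(clip_scale([4.0, 5.0]))   # norm = 3.0 > max_norm, so scale = 1/3
print(clip_scale([0.0, 0.0]))   # zero norm: scale = 1.0, no clipping
```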
6.15 Training Quality Analysis (ALB-079/080, Five Whys)
Observation: v3 training at step 26K shows val_loss plateau at 6.92 (val_ppl=1000) since step 12K. Gradient norm collapsed from 3.04 (step 1K) to 0.15 (step 26K) — 20x decrease while lr is at peak (3e-4).
Five Whys — Root Cause 1: Missing Cosine LR Decay (ALB-079)
- Why constant lr=3e-4 at all steps? `CudaTransformerTrainer::current_lr()` only implemented linear warmup; it returned `base_lr` after warmup (line 1938)
- Why no cosine? `TransformerTrainConfig` has no `lr_scheduler` field; the YAML config is parsed by the bridge but not propagated to the CUDA path
- Why not caught earlier? At steps 2K-5K, cosine barely differs from constant (lr ≈ 2.99e-4 vs 3.00e-4); the plateau only becomes visible after 10K steps
- Fix (entrenar #241): cosine decay in `current_lr()` using `warmup_steps` and `max_steps`. CPU embedding optimizer synced via `set_lr()`.
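The fixed schedule is linear warmup followed by cosine decay. A minimal sketch with illustrative v3-style defaults (warmup 2,000, max 250,000, decay to zero; entrenar's actual `current_lr()` may use a nonzero lr floor):

```python
import math

# Sketch of warmup + cosine decay. Assumes min_lr = 0 for simplicity.
def current_lr(step, base_lr=3e-4, warmup_steps=2000, max_steps=250_000):
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

print(current_lr(2000))      # peak lr at end of warmup -> 3e-4
print(current_lr(250_000))   # fully decayed -> 0.0
```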
Five Whys — Root Cause 2: Effective Batch Size 48-128x Too Small (ALB-080)
- Why val_ppl plateau at 1000? Gradient noise too high to escape loss basin
- Why noisy gradients? Effective batch = 4 × 1 × 1024 = 4,096 tokens/step
- Why 4,096? `gradient_accumulation: 1` in the config; VRAM limits `batch_size: 4`
- Why so small? The config was set for debugging; no Chinchilla batch size analysis was done
- Why does it matter? Comparable 350M models use 196K-524K tokens/step (48-128x larger)
| Model | Batch Size (tokens/step) |
|---|---|
| CodeGen-350M-mono | ~500K+ |
| CodeParrot-small (110M) | 196K |
| GPT-2 124M (nanoGPT) | ~524K |
| Albor v3 | 4,096 |
| Albor v4 (planned) | 131,072 |
Fix: pretrain-350m-v4.yaml with gradient_accumulation: 32 (131K tokens/step),
warmup_steps: 375, max_steps: 7500 (~1B tokens). Same wall-clock as v3 (same
number of forward/backward passes), dramatically better gradient quality.
Expected impact: val_ppl should break through 1000 floor and reach <100 by 1B tokens. gnorm should stabilize at 0.5-2.0 (not collapse to 0.13).
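The effective-batch arithmetic behind the v4 fix is simple enough to state as code:

```python
# Sketch: tokens contributing to each optimizer step.
def tokens_per_optimizer_step(batch_size, grad_accum, seq_len):
    return batch_size * grad_accum * seq_len

v3 = tokens_per_optimizer_step(4, 1, 1024)    # v3 debugging config
v4 = tokens_per_optimizer_step(4, 32, 1024)   # pretrain-350m-v4.yaml
print(v3, v4, v4 // v3)   # -> 4096 131072 32
```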
7. Verification Architecture
7.1 Four-Layer Verification
Layer 1: CONTRACTS (provable-contracts / pv)
What: Algebraic invariants, proof obligations, falsification tests
When: BEFORE implementation (write contract first)
How: pv validate, pv scaffold, pv audit
Files: contracts/cublas-gemm-v1.yaml
contracts/training-step-budget-v1.yaml
Layer 2: BENCHMARKS (raw C ceiling + Criterion + regression detection)
What: Three-tier GEMM comparison with hardware ceiling
When: BEFORE (ceiling), DURING (Criterion), AFTER (regression)
How: make bench-gemm-compare, make bench-gemm-regression
Pattern: Raw C cuBLAS (ceiling) vs Rust cuBLAS (target) vs PTX (floor)
- FFI overhead < 2% (Rust vs Raw C)
- Speedup > 10x (cuBLAS vs PTX)
- Regression < 10% per shape between commits
- Follows trueno/benchmarks/ matmul_comparison.py pattern exactly
Layer 3: BRICK PROFILING (probador)
What: Per-component time budgets with Jidoka gates
When: DURING implementation (continuous enforcement)
How: BrickHouse builder, brick assertions, budget_ms
Pattern: Each training loop component = one Brick with:
- can_render() = Jidoka gate (fail if > 2x budget)
- verify() = timing assertion
- budget_ms = SLA from contract
Layer 4: LAYER TRACING (renacer BrickTracer)
What: Per-kernel, per-block, per-transfer timing with OTLP export
When: DURING profiling runs + AFTER implementation (regression detection)
How: BrickTracer.trace(), OTLP -> Jaeger, anomaly escalation
Pattern: Each CUDA kernel call = one trace span
- Forward: block_N_gemm_qkv, block_N_attention, block_N_ffn
- Backward: block_N_backward_gemm, block_N_backward_elementwise
- Transfer: pcie_h2d_embed, pcie_d2h_logits, pcie_h2d_grad
- Optimizer: block_N_optimizer_d2h, block_N_adamw, block_N_optimizer_h2d
7.2 Escalation Chain
Renacer implements automatic escalation from lightweight metrics to detailed tracing:
Steady state (metrics only):
- Counter: gemm_calls_total, pcie_bytes_total
- Gauge: step_time_ms, mfu_ratio
- Histogram: per_block_forward_us, per_block_backward_us
Escalation trigger (CV > 15% or efficiency < 25%):
- BrickTracer captures full syscall breakdown
- OTLP spans exported to Jaeger with per-kernel detail
- Anomaly detector flags the brick and step number
Alert (budget violation > 2x):
- Jidoka gate fires (probador)
- Training loop pauses (Andon alert)
- Full trace exported for post-mortem
This means training runs at full speed in steady state (metrics are SIMD-accelerated via trueno), and only pays the tracing cost when something goes wrong.
7.3 Continuous Verification During Training
# Run training with BrickTracer instrumentation
RUST_LOG=info renacer --otlp-endpoint http://localhost:4317 \
--otlp-service-name "albor-v3-cublas" \
--trace-compute \
--trace-compute-threshold 100 \
-- apr train apply --task pretrain \
--config configs/train/pretrain-350m-v3.yaml
# In another terminal: monitor brick budgets
apr monitor ./checkpoints/albor-base-350m-v3/
# Post-run: audit contract compliance
pv audit contracts/cublas-gemm-v1.yaml \
--binding contracts/trueno-gpu/cublas-binding.yaml
pv audit contracts/training-step-budget-v1.yaml \
--binding contracts/entrenar/step-budget-binding.yaml
# Post-run: view traces in Jaeger
# http://localhost:16686 -> Service: "albor-v3-cublas"
# Filter by: operation="gemm_forward", minDuration=10ms
8. Risks
| Risk | Mitigation | Contract Obligation |
|---|---|---|
| cuBLAS FP16 numerical divergence | Keep FP32 master weights, compare loss curves | FALSIFY-CUBLAS-002 |
| libcublas.so version mismatch | Pin to CUDA 12.x, test on lambda machine | FALSIFY-CUBLAS-003 |
| cuBLAS workspace memory pressure | Pre-allocate fixed workspace, share across GEMMs | training-memory-kernel-v1 |
| CPU optimizer becomes new bottleneck | Phase 4 contract (gpu-optimizer-v1) | FALSIFY-BUDGET-002 |
| Tensor core shapes require padding | Albor shapes (1024, 4096, 32768) already multiples of 8 | FALSIFY-CUBLAS-003 |
| FP16 weight precision loss | Standard practice; master weights remain FP32 on CPU | FALSIFY-CUBLAS-002 |
| Silent regression after optimization | Brick budgets + Jidoka gates detect immediately | FALSIFY-BUDGET-003 |
| Unaccounted overhead hiding bottleneck | Brick coverage >= 95% of step time enforced | FALSIFY-BUDGET-001 |
9. Dependencies
- `libcublas.so` from the CUDA toolkit (already installed: `/usr/local/cuda/lib64/`)
- `nvcc` for compiling the raw C cuBLAS benchmark (ceiling measurement)
- trueno-gpu crate (target for FFI integration)
- entrenar CudaTransformerTrainer (consumer of cuBLAS GEMMs)
- renacer BrickTracer (layer tracing instrumentation)
- probador brick budgets (SLA enforcement)
- provable-contracts / `pv` (contract validation and audit)
- Criterion.rs (Rust benchmark harness, already a trueno dev-dependency)
- No new Rust crate dependencies (pure FFI, no bindgen)
10. Contract Registry
| Contract File | Status | Validates |
|---|---|---|
contracts/cublas-gemm-v1.yaml | NEW (write before Phase 1) | cuBLAS correctness, buffer safety, MFU improvement |
contracts/training-step-budget-v1.yaml | NEW (write before Phase 0) | Brick-level performance SLAs, Jidoka enforcement |
contracts/training-gpu-kernel-v1.yaml | EXISTING | Parent contract — PCIe transfers, stability, gradient flow |
contracts/training-memory-kernel-v1.yaml | EXISTING | VRAM budget (must update for FP16 weight storage) |
contracts/training-config-kernel-v1.yaml | EXISTING | Epoch/step/LR algebraic consistency |
contracts/fused-kernels-v1.yaml | NEW (write before Phase 4) | Fused CE, RMS norm reuse, SwiGLU in-place, fused attention |
contracts/gpu-optimizer-v1.yaml | FUTURE (Phase 4) | GPU-resident AdamW correctness |
contracts/gpu-embedding-v1.yaml | FUTURE (Phase 5) | GPU embedding lookup + scatter-add |
contracts/async-pipeline-v1.yaml | FUTURE (Phase 6) | Compute/transfer overlap safety |
contracts/grad-checkpoint-v1.yaml | FUTURE (Phase 7) | Gradient checkpointing memory/correctness |
11. Unsloth-Inspired Kernel Optimizations
Source: Analysis of unslothai/unsloth (cloned 2026-03-05). Unsloth achieves 2x training speedup over HuggingFace via fused Triton kernels, selective activation saving, and in-place backward ops. These patterns translate to our Rust + CUDA PTX stack.
11.1 Fused Cross-Entropy Loss + Backward
What unsloth does: Single Triton kernel computes logsumexp, loss, and
dL/dx (softmax - one_hot) in one pass. Never materializes the full probability
distribution.
Current albor: Separate kernels for logits→softmax, softmax→loss, loss→grad.
For vocab=32K, batch=4, seq=1024, the logit tensor is [4096, 32768] = 512 MB
in FP32. Three kernel launches + three full reads/writes of this tensor.
Proposed change: Fused CE kernel that:
- Computes `logsumexp` per row (FP32 accumulation for stability)
- Computes `loss = logsumexp - logit[label]` per row
- Computes `grad[i] = exp(logit[i] - logsumexp) - delta(i, label)` in-place
- Never allocates the full softmax tensor
Expected gain: -2 kernel launches, -1 GB memory bandwidth per step. Step time: ~20-40ms savings (CE is ~1% of step time, but memory bandwidth relief helps other kernels via improved cache pressure).
Contract: contracts/fused-kernels-v1.yaml — FALSIFY-FUSED-001
Equations:
fused_ce_correctness:
loss_fused = -logit[label] + log(sum(exp(logit[i]))) for each row
grad_fused[i] = exp(logit[i] - logsumexp) - delta(i, label)
Invariant: max_abs_diff(loss_fused, loss_separate) < 1e-5
Invariant: max_abs_diff(grad_fused, grad_separate) < 1e-5
Invariant: FP32 accumulation for logsumexp (no FP16 overflow on 32K vocab)
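The correctness invariant can be checked with a small NumPy sketch (illustrative reference math, not the PTX kernel): the fused logsumexp-based loss and gradient must match the separate softmax path within 1e-5.

```python
import numpy as np

# Sketch of fused CE: loss and grad from one logsumexp pass per row,
# without materializing a separate softmax tensor.
def fused_ce(logits, labels):
    m = logits.max(axis=1, keepdims=True)                    # stability shift
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))   # row logsumexp
    rows = np.arange(len(labels))
    loss = lse - logits[rows, labels]
    grad = np.exp(logits - lse[:, None])                     # softmax
    grad[rows, labels] -= 1.0                                # minus one_hot
    return loss, grad

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 32))
labels = np.array([1, 5, 9, 3])
loss, grad = fused_ce(logits, labels)

# Separate reference path: explicit softmax, then loss and grad.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
ref_loss = -np.log(probs[np.arange(4), labels])
assert np.abs(loss - ref_loss).max() < 1e-5
assert np.abs(grad - (probs - np.eye(32)[labels])).max() < 1e-5
print("fused CE matches the separate path")
```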
11.2 Activation Memory Reuse (RMS LayerNorm)
What unsloth does: RMS LayerNorm forward saves ONLY inv_var (1 scalar per
row = batch * seq_len floats). Backward recomputes normed = X * inv_var from
the activation cache. Total saved: O(B*S) instead of O(B*S*H).
Current albor: Saves X, W, inv_var, and normed per layer during
forward for use in backward. For 24 layers × [4096, 1024]:
- `X`: 24 × 16 MB = 384 MB
- `normed`: 24 × 16 MB = 384 MB
- `inv_var`: 24 × 16 KB = 384 KB (negligible)
- Total saved: 768 MB of activation memory
Proposed change: Save only inv_var per layer. During RMS norm backward:
- Recompute `normed = X_cached * inv_var` (X is available from the previous layer's output or the activation cache)
- Compute `d_weight = sum(grad_output * normed)`
- Compute `d_input = (grad_output * W - normed * d_weight_sum) * inv_var`
Expected gain: -384 MB activation memory (normed tensor eliminated). This is 3.2% of 24 GB VRAM — modest alone, but compounds with other savings to potentially enable batch=8 without gradient checkpointing.
Contract: contracts/fused-kernels-v1.yaml — FALSIFY-FUSED-002
Equations:
rmsnorm_recompute_correctness:
normed_recomputed = X * inv_var_saved
max_abs_diff(normed_recomputed, normed_original) == 0.0 (exact, same FP32)
Memory reduction:
activation_memory(optimized) = activation_memory(current) - 24 * B * S * H * 4 bytes
For B=4, S=1024, H=1024: savings = 24 * 4 * 1024 * 1024 * 4 = 402,653,184 bytes (~384 MB)
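The bit-exactness claim rests on recomputation being the same FP32 multiply on the same operands. A minimal NumPy sketch:

```python
import numpy as np

# Sketch: RMS norm forward saving only inv_var; backward recomputes normed.
def rmsnorm_forward(x, eps=1e-6):
    inv_var = 1.0 / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x * inv_var, inv_var      # normed can be dropped, keep inv_var

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 16)).astype(np.float32)
normed, inv_var = rmsnorm_forward(x)
normed_recomputed = x * inv_var      # backward-time recompute, same FP32 op
assert np.array_equal(normed, normed_recomputed)   # exact, not approximate
print("recompute is bit-exact")
```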
11.3 SwiGLU In-Place Backward
What unsloth does: GEGLU/SwiGLU backward overwrites input buffers with
gradient results. Forward: h = silu(e) * g. Backward stores dh, de, dg
into the same memory as h, e, g. No new allocations.
Current albor: CudaGradWorkspace reuses buffers per-block (already good),
but within a block, SwiGLU backward allocates separate grad_gate, grad_up,
and grad_down buffers. For intermediate_size=4096:
- `grad_gate`: `[4096, 4096]` = 64 MB
- `grad_up`: `[4096, 4096]` = 64 MB
- Total per-block overhead: 128 MB (shared workspace, so only peak matters)
Proposed change: Fuse SwiGLU backward to overwrite gate/up buffers in-place:
- `d_gate = grad_output * up * silu_deriv(gate)` → store in the `gate` buffer
- `d_up = grad_output * silu(gate)` → store in the `up` buffer
- No separate allocation for `d_gate`, `d_up`
Expected gain: -128 MB peak workspace per block (already shared, so reduces peak VRAM, not total allocations). Main benefit is reduced memory bandwidth — fewer buffer copies between kernels.
Contract: contracts/fused-kernels-v1.yaml — FALSIFY-FUSED-003
Equations:
swiglu_inplace_correctness:
d_gate_inplace = grad_out * up * sigmoid(gate) * (1 + gate * (1 - sigmoid(gate)))
d_up_inplace = grad_out * silu(gate)
max_abs_diff(d_gate_inplace, d_gate_separate) < 1e-5
max_abs_diff(d_up_inplace, d_up_separate) < 1e-5
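The in-place contract can be checked against a separate-allocation reference with a small NumPy sketch (illustrative only, not entrenar's kernel). The ordering constraint matters: both derivatives read `gate` and `up`, so the fused kernel must compute both values before overwriting either buffer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
gate = rng.normal(size=64)
up = rng.normal(size=64)
grad_out = rng.normal(size=64)

# Separate-buffer reference gradients.
s = sigmoid(gate)
d_gate_ref = grad_out * up * s * (1.0 + gate * (1.0 - s))
d_up_ref = grad_out * (gate * s)                 # silu(gate) = gate*sigmoid

# In-place variant: compute both gradients, then overwrite the input buffers.
gate_buf, up_buf = gate.copy(), up.copy()
d_gate = grad_out * up_buf * s * (1.0 + gate_buf * (1.0 - s))
d_up = grad_out * (gate_buf * s)
gate_buf[:], up_buf[:] = d_gate, d_up            # no new output allocations
assert np.abs(gate_buf - d_gate_ref).max() < 1e-5
assert np.abs(up_buf - d_up_ref).max() < 1e-5
print("in-place SwiGLU backward matches")
```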
11.4 Mixed Precision Discipline (Validated)
What unsloth does: Loads activations as FP32 for critical arithmetic (variance, softmax, logsumexp), keeps weights in BF16, casts output back after critical ops.
Albor status: Already implemented correctly (validated by ALB-072 fix). Our backward is all FP32, master weights are FP32 on CPU, forward weights are FP32 on GPU (will become FP16 with cuBLAS). This matches unsloth’s pattern.
Action: No code change needed. Document as validation that our approach matches production-grade mixed precision practice.
11.5 RoPE Head Grouping
What unsloth does: Applies RoPE to 4 heads simultaneously, loading sin/cos
once and reusing across the group. ROPE_GROUP_SIZE = 4.
Current albor: Per-head RoPE application in the attention forward kernel. Sin/cos recomputed or reloaded per head.
Proposed change: Batch RoPE across all Q heads (16) and KV heads (4) with single sin/cos load. For our GQA architecture (16 Q heads, 4 KV heads):
- Q: load sin/cos once, apply to 16 heads
- K: same sin/cos, apply to 4 heads
- V: no RoPE (not rotated)
Expected gain: ~10% attention kernel speedup from better L2 cache utilization. Small absolute impact (~5-10ms/step) since RoPE is not compute-dominant.
Contract: contracts/fused-kernels-v1.yaml — FALSIFY-FUSED-004
Equations:
rope_grouped_correctness:
For each head h in [0, n_heads):
Q_rotated_grouped[h] == Q_rotated_individual[h] (bit-exact)
Performance: T_rope(grouped) < 0.9 * T_rope(individual)
11.6 Fused Attention (QK^T → Softmax → V)
What unsloth does: Uses Flash Attention or Flex Attention to fuse the
3-step attention computation into a single kernel. Never materializes the
full [seq, seq] attention score matrix.
Current albor: Three separate operations per attention head:
- `scores = Q @ K^T` → cuBLAS GEMM → `[4096, 1024]` (with cuBLAS)
- `probs = softmax(scores / sqrt(d_k))` → elementwise kernel
- `output = probs @ V` → cuBLAS GEMM
This materializes the [batch, heads, seq, seq] = [4, 16, 1024, 1024] = 256 MB
attention score tensor. For 24 layers, that’s 6.1 GB if all layers’ scores
are live simultaneously (they aren’t in our per-block architecture, but the
per-block peak still includes this).
Proposed change: Custom fused attention kernel (not Flash Attention — our seq=1024 is short enough that tiled online softmax gives most of the benefit):
- Tile Q, K, V into blocks (e.g., 64×64)
- Compute the `QK^T` tile, apply the causal mask, run online softmax
- Accumulate `softmax(tile) @ V` without materializing the full score matrix
- Output: the attention result directly, saving only logsumexp for backward
Expected gain:
- -256 MB peak VRAM per block (attention scores not materialized)
- -2 kernel launches per layer (3→1)
- ~15% attention speedup from reduced memory bandwidth
- Enables batch=8 by freeing VRAM headroom
Contract: contracts/fused-kernels-v1.yaml — FALSIFY-FUSED-005
Equations:
fused_attention_correctness:
output_fused = softmax(Q @ K^T / sqrt(d_k) + causal_mask) @ V
max_abs_diff(output_fused, output_separate) < 1e-3 (FP32)
max_abs_diff(output_fused, output_separate) < 1e-2 (FP16)
Memory:
peak_attn_memory(fused) < peak_attn_memory(separate) / 4
# Separate: [B, H, S, S] = 256 MB
# Fused: [B, H, tile, tile] = 256 MB / (S/tile)^2
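A minimal NumPy sketch of the tiled online-softmax algorithm (causal mask omitted for brevity): it keeps only a running max, a running denominator, and a `[S, tile]` score slab, never the full `[S, S]` matrix, yet matches the separate path.

```python
import numpy as np

# Sketch of online-softmax attention over K/V tiles (no causal mask).
def attention_tiled(q, k, v, tile=16):
    s, d = q.shape
    out = np.zeros_like(q)
    m = np.full(s, -np.inf)              # running row max
    l = np.zeros(s)                      # running softmax denominator
    for j0 in range(0, s, tile):
        scores = q @ k[j0:j0 + tile].T / np.sqrt(d)   # [s, tile] only
        m_new = np.maximum(m, scores.max(axis=1))
        p = np.exp(scores - m_new[:, None])
        correction = np.exp(m - m_new)   # rescale previous partials
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v[j0:j0 + tile]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(3)
q, k, v = (rng.normal(size=(64, 32)) for _ in range(3))
ref = np.exp(q @ k.T / np.sqrt(32))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v     # separate-path reference
assert np.abs(attention_tiled(q, k, v) - ref).max() < 1e-3
print("tiled online softmax matches the separate path")
```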
11.7 Chunked Cross-Entropy for Future Vocab Scaling
What unsloth does: For vocab > 65K, splits logsumexp computation into chunks
of 65536. Mathematical property: logsumexp(chunked_logsumexp) == logsumexp(full).
Current albor: Vocab = 32K, fits in single chunk. Not needed now.
Future applicability: If we scale to multi-lingual (65K+ vocab) or adopt a larger tokenizer, chunked CE prevents register pressure overflow in the fused CE kernel. The logsumexp decomposition is:
logsumexp([a, b]) = max(a, b) + log(exp(a - max) + exp(b - max))
Each chunk computes a partial logsumexp. The final logsumexp combines partials. This is numerically stable and mathematically exact.
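The decomposition is easy to verify numerically. A sketch with a small chunk size standing in for unsloth's 65536:

```python
import numpy as np

# Numerically stable logsumexp over a 1-D array.
def logsumexp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

# Chunked variant: logsumexp of per-chunk partial logsumexps.
def chunked_logsumexp(x, chunk=8):
    partials = np.array(
        [logsumexp(x[i:i + chunk]) for i in range(0, len(x), chunk)]
    )
    return logsumexp(partials)   # combines partials exactly

rng = np.random.default_rng(4)
x = rng.normal(scale=10.0, size=100)
assert abs(chunked_logsumexp(x) - logsumexp(x)) < 1e-9
print("chunked logsumexp == full logsumexp")
```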
Contract: Deferred until vocab > 65K. Will be added to fused-kernels-v1.yaml
if tokenizer v3 exceeds 65K vocabulary.
11.8 Gradient Checkpointing (Activation Recomputation)
What unsloth does: Trades compute for memory by recomputing layer activations during backward instead of saving them during forward. 2x slower backward, but ~3x smaller activation memory.
Current albor: Per-block interleaved backward+optimizer design already limits peak activation memory to one block’s worth. But with fused attention (§11.6) and activation reuse (§11.2), we may not need gradient checkpointing for batch=4.
When needed: If batch=8 + seq=2048 still OOMs after §11.2 + §11.6.
Contract: contracts/grad-checkpoint-v1.yaml (FUTURE — already in registry)
Equations:
checkpoint_correctness:
grad(checkpointed) == grad(full_save) # Bit-exact: same computation
Memory:
peak_activation(checkpointed) = peak_activation(full) / num_checkpoint_segments
Performance:
T_backward(checkpointed) < 2.0 * T_backward(full) # At most 2x slower
11.9 Summary: Optimization Priority Matrix
| # | Optimization | Expected Gain | Memory Savings | Effort | Phase |
|---|---|---|---|---|---|
| 1 | cuBLAS tensor core GEMMs | 50x GEMM, 2x step | 0 | High | 1-3 |
| 2 | Fused CE loss + backward | 20-40ms/step | -512 MB bandwidth | Medium | 4 |
| 3 | RMS norm activation reuse | 0 (compute) | -384 MB | Low | 4 |
| 4 | SwiGLU in-place backward | 10-20ms/step | -128 MB peak | Low | 4 |
| 5 | RoPE head grouping | 5-10ms/step | 0 | Low | 4 |
| 6 | Fused attention (tiled) | 15% attn speedup | -256 MB/layer | High | 5 |
| 7 | Chunked CE (vocab >65K) | 0 (future) | 0 | Low | Deferred |
| 8 | Gradient checkpointing | -2x backward | -66% activations | Medium | 7 |
Cumulative impact (Phases 1-5b, measured):
- Step time: 4,400ms → 444ms (9.9x; cuBLAS SIMD 5.9x, batched RMSNorm 24.8x fwd)
- MFU: 2.5% → 26.7% (vs FP32 peak, runtime-reported)
- Tok/s: 934 → 9,216 (9.9x improvement)
- Note: Tensor cores disabled (ALB-076, §6.12) — produce NaN in transposed backward GEMMs
11.10 Falsification Tests for Kernel Optimizations
| ID | Rule | Prediction | Contract |
|---|---|---|---|
| FALSIFY-FUSED-001 | Fused CE matches separate CE | max_abs_diff(loss) < 1e-5 on 50M model, 50 steps | fused-kernels-v1 |
| FALSIFY-FUSED-002 | RMS norm recompute is bit-exact | normed_recomputed == normed_original (FP32, exact) | fused-kernels-v1 |
| FALSIFY-FUSED-003 | SwiGLU in-place backward correct | max_abs_diff(d_gate, d_gate_ref) < 1e-5 | fused-kernels-v1 |
| FALSIFY-FUSED-004 | RoPE grouped matches individual | Bit-exact Q_rotated for all 16 heads | fused-kernels-v1 |
| FALSIFY-FUSED-005 | Fused attention matches separate | max_abs_diff(output) < 1e-3 (FP32) | fused-kernels-v1 |
| FALSIFY-FUSED-006 | Memory savings measured | Activation peak reduced by >= 300 MB | fused-kernels-v1 |
| FALSIFY-FUSED-007 | Fused CE never materializes softmax | Peak memory during CE < B*S*V*4 bytes | fused-kernels-v1 |
| FALSIFY-FUSED-008 | Gradient checkpointing bit-exact | grad(checkpointed) == grad(full) for all params | grad-checkpoint-v1 |
| FALSIFY-FUSED-009 | Fused attention backward correct | All params get gradients, loss within 1% of separate | fused-kernels-v1 |
| FALSIFY-FUSED-010 | No training instability from fusions | 100-step run: loss.is_finite() every step, gnorm < 100 | fused-kernels-v1 |
Appendix A: Popperian Falsification of This Specification
Date: 2026-03-05
Method: batuta falsify . (108-item checklist) + manual chain-of-thought
analysis of every claim, equation, and assumption in this spec.
Batuta project score: 80.1% (Andon Warning), 65 PASS, 0 FAIL, 43 PARTIAL. Key findings from batuta mapped to spec weaknesses below.
A.1 Chain-of-Thought Falsification
Each numbered item is a falsifiable claim from the spec, followed by the attempt to break it.
Claim 1: “Step time is 4,400ms with 57% in GEMM” (Section 2.4)
- Status: UNVERIFIED ESTIMATE. The breakdown is labeled “Estimated” but no profiling data backs it. The spec prescribes renacer BrickTracer profiling in Phase 0, but Phase 0 hasn’t run yet. The 57% GEMM figure is a guess.
- Risk: If GEMM is actually 30% of step time (e.g., CPU optimizer is 40%), cuBLAS integration yields only 1.3x speedup instead of 2x.
- Action: Phase 0 is blocking. Do not proceed to Phase 1 until BrickTracer confirms the breakdown. Add a contract obligation: FALSIFY-BASELINE-001.
Claim 2: “cuBLAS achieves 130-150 TFLOP/s on Albor shapes” (Section 4.1)
- Status: VERIFIED. Measured 152.3 TFLOP/s on the FFN gate/up shape `[4096, 1024] x [1024, 4096]`, 141.2 TFLOP/s on FFN down, and 89.4 TFLOP/s on the square `[1024, 1024]` shape. The range 89-152 TFLOP/s matches or exceeds the 130-150 prediction for large shapes; smaller square shapes are memory-bandwidth bound, as expected.
- Verification: trueno-gpu cuBLAS hardware tests (PR #165).
Claim 3: “FFI overhead < 2%” (Section 5.7, FALSIFY-CUBLAS-008)
- Status: PLAUSIBLE but untested. cuBLAS FFI is a single function call with no data copies (pointers passed through), so 2% overhead is reasonable.
- Risk: If `CublasHandle::set_stream()` is called per-GEMM (555 calls/step) rather than once per step, the cumulative overhead could exceed 2%.
- Action: The wrapper should call `set_stream()` once at step start, not per-GEMM. Add this as a contract invariant.
Claim 4: “MFU = 2.5% vs FP32 peak” (Section 1.2)
- Status: PARTIALLY FALSIFIED. The MFU formula uses `6 * P * tokens_per_step`, an approximation that assumes all FLOPs are in GEMMs. For a 370M model with batch=4, seq=1024, the attention score computation (QK^T) adds `2 * S^2 * H * L = 2 * 1024^2 * 1024 * 24 = 51.5 GFLOP` per step, which is <1% of the 9.1 TFLOP total. The 6x approximation is valid here.
- Correction: MFU is correct to within ~1% of the true value. No action needed.
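The arithmetic in Claim 4 can be checked directly. A sketch using the spec's shapes; the 82.6 TFLOP/s FP32 peak for the RTX 4090 is an assumption (NVIDIA's published FP32 figure), not a number from this spec:

```rust
// Sanity check for Claim 4: the 6*P*T FLOP approximation vs. the extra
// attention-score FLOPs, with P=370M, batch=4, seq=1024, hidden=1024, layers=24.
fn main() {
    let (p, b, s, h, l) = (370e6_f64, 4.0_f64, 1024.0_f64, 1024.0_f64, 24.0_f64);
    let tokens = b * s;
    let gemm_flops = 6.0 * p * tokens;          // ~9.1 TFLOP per step
    let attn_score_flops = 2.0 * s * s * h * l; // QK^T, ~51.5 GFLOP per step
    let ratio = attn_score_flops / gemm_flops;
    assert!(ratio < 0.01); // <1%, so the 6x approximation holds at these shapes
    // MFU at the 4.4 s baseline vs. an assumed 82.6 TFLOP/s FP32 peak:
    let step_time_s = 4.4;
    let peak_flops = 82.6e12; // RTX 4090 FP32 peak (assumption, not from spec)
    let mfu = gemm_flops / (step_time_s * peak_flops);
    println!("MFU ~= {:.1}%", mfu * 100.0); // ~2.5%, matching Section 1.2
}
```

This reproduces the 2.5% baseline MFU quoted in §1.2, which corroborates that the spec's MFU is computed against the FP32 peak rather than the tensor-core peak.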
Claim 5: “Step time drops to 2,150ms after cuBLAS” (Section 6.1)
- Status: MEASURED — 1,379 ms (better than projected). The original projection of 2,150 ms assumed non-GEMM time stays constant at 1,900 ms. Actual measurement showed 1,379 ms (seq=512, batch=4), which is 36% better than projected. Verified via dogfooding: `apr train apply` with cuBLAS (entrenar PR #233), 1,485 tok/s, 4.3% MFU.
- FALSIFY-CUBLAS-009 still relevant: verify the non-GEMM time decomposition.
Claim 6: “555 GEMM operations per step” (Section 2.1)
- Status: APPROXIMATELY CORRECT but undercounted. The count includes attention score GEMMs (QK^T) but omits attention value application (the V product after softmax), which is also a GEMM: `softmax(QK^T) * V`. Forward: 24 blocks x 1 = 24. Backward: 24 blocks x 2 = 48. Plus attention backward for the score GEMM itself.
- Correction: The actual count may be ~600 GEMMs, not 555. The difference is small (<10%) and doesn't change the analysis materially, but the spec should note the approximation.
Claim 7: “Phase 7 achieves 17.5% MFU with batch=8” (Section 6.3)
- Status: CONTRADICTS KNOWN CONSTRAINT. Section 4.3 of the spec notes seq=1024, batch=8 currently OOMs. Phase 7 lists this as requiring gradient checkpointing, but with cuBLAS adding FP16 weight copies alongside FP32 master weights, VRAM pressure increases. The 650ms step time assumes batch=8 fits, which is unproven.
- Risk: batch=8 may still OOM even with gradient checkpointing if FP16+FP32 dual weight storage consumes the headroom.
- Action: Add VRAM budget equation to training-memory-kernel-v1.yaml for mixed-precision dual storage. FALSIFY-MEM-004: “batch=8 fits in 24GB with FP16 forward weights + FP32 master weights + gradient checkpointing.”
Claim 8: “Benchmark shapes are representative” (Section 5.2)
- Status: INCOMPLETE. The 6 benchmark shapes cover the large GEMMs but omit the GQA key-value projection shape `[4096, 256, 1024]` (K and V projections with num_kv_heads=4, head_dim=64, so kv_dim=256). These are small, thin matrices where cuBLAS may show less speedup due to low arithmetic intensity.
- Action: Add `(4096, 256, 1024, "attn_kv")` to SHAPES in both the C and Criterion benchmarks. This is the worst-case shape for tensor cores.
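The "low arithmetic intensity" claim is quantifiable. A sketch comparing FLOPs per byte moved for the thin K/V shape against the FFN gate/up shape, assuming FP32 operands and no operand reuse across tiles (a simplification; real tiled kernels do better, but the relative gap holds):

```rust
// FP32 arithmetic intensity of a GEMM [M, N, K]:
//   FLOPs = 2*M*N*K, bytes moved >= 4*(M*K + K*N + M*N).
fn intensity(m: f64, n: f64, k: f64) -> f64 {
    (2.0 * m * n * k) / (4.0 * (m * k + k * n + m * n))
}

fn main() {
    let thin = intensity(4096.0, 256.0, 1024.0);  // GQA K/V projection
    let wide = intensity(4096.0, 4096.0, 1024.0); // FFN gate/up
    println!("thin: {thin:.0} FLOP/byte, wide: {wide:.0} FLOP/byte"); // ~98 vs ~341
    // The thin shape does >3x less compute per byte: memory-bandwidth bound.
    assert!(thin < wide / 3.0);
}
```

This is why FALSIFY-CUBLAS-010 sets a lower floor (50 TFLOP/s) for the thin shape.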
Claim 9: “Performance regression gate at 10%” (Section 5.5)
- Status: MATCHES batuta JA-04 finding. Batuta flagged JA-04 (Performance Regression Gate) as PARTIAL with the rejection "Benchmarks exist but not gated in CI." The spec defines `make bench-gemm-regression` but does not integrate it into CI.
- Action: Add `bench-gemm-regression` to the `clean-room / gate` CI workflow for trueno-gpu. This addresses JA-04.
Claim 10: “No new Rust crate dependencies” (Section 9)
- Status: CORRECT. Pure FFI bindings require only `libc` types (already in std) and `libcublas.so` (system library). No `cublas-sys` or `bindgen` crate is needed.
- Verified: This is consistent with trueno's existing pattern of hand-written CUDA driver API bindings.
A.2 Batuta Findings Mapped to Spec
| Batuta ID | Status | Spec Impact |
|---|---|---|
| JA-04 | PARTIAL: “Benchmarks not gated in CI” | Section 5: Add bench-gemm-regression to CI |
| PW-02 | PARTIAL: “No SIMD optimization” | N/A (spec is about GPU, not CPU SIMD) |
| EDD-01 | PARTIAL: “Partial equation documentation” | Section 3.1: Ensure all contract equations have domain/codomain/invariants |
| EDD-03 | PARTIAL: “Numerical code without analytical validation” | Section 5.2: Raw C baseline IS the analytical validation |
| NR-01 | PARTIAL: “No explicit IEEE 754 testing” | Add: cuBLAS FP32 accumulation contract (C-CUBLAS-004) covers this |
| NR-02 | PARTIAL: “Single platform testing” | N/A (CUDA-only by design, RTX 4090 target) |
| AI-01 | PARTIAL: “Config examples incomplete” | Add cuBLAS config example to YAML configs |
| AI-05 | PARTIAL: “No explicit validator” | apr train validate already validates; extend for cuBLAS feature |
A.3 Missing Falsification Tests (Discovered by Chain-of-Thought)
The following tests are NOT in the current contract but SHOULD be:
# Add to cublas-gemm-v1.yaml
- id: FALSIFY-CUBLAS-009
rule: "Non-GEMM overhead does not increase after cuBLAS"
prediction: "T_non_gemm(cublas) < 1.1 * T_non_gemm(ptx)"
test: |
Profile 50 steps with PTX, measure total non-GEMM time.
Profile 50 steps with cuBLAS, measure total non-GEMM time.
Ratio must be < 1.10.
if_fails: "FP16 casting, handle creation, or workspace allocation adds overhead"
- id: FALSIFY-CUBLAS-010
rule: "GQA thin-matrix GEMM still benefits from cuBLAS"
prediction: "cuBLAS [4096, 256, 1024] > 50 TFLOP/s"
test: |
Run isolated GEMM on K/V projection shape [4096, 256, 1024].
Must exceed 50 TFLOP/s (lower bar than large shapes due to
low arithmetic intensity).
if_fails: "Thin matrices memory-bandwidth-bound, not compute-bound"
- id: FALSIFY-CUBLAS-011
rule: "cuBLAS column-major convention handled correctly"
prediction: "Row-major Rust buffers produce correct results via transpose flags"
test: |
Compute C = A * B in row-major (Rust native) using cuBLAS with
appropriate CUBLAS_OP_T flags. Compare against known-good reference.
All 7 GEMM shapes in a single transformer block must match.
if_fails: "Leading dimension or transpose convention wrong (ALB-059 class bug)"
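The convention FALSIFY-CUBLAS-011 tests can be verified on the CPU without cuBLAS. A sketch of the identity: a row-major MxN buffer is byte-identical to a column-major NxM buffer, so a column-major GEMM computes row-major C = A\*B as C^T = B^T \* A^T simply by swapping operand order, with no copies or transpose flags (the `gemm_col_major` helper is an illustrative stand-in for the cuBLAS call; the same lda/ldb/ldc reasoning applies there):

```rust
// Plain column-major GEMM: C (m x n) = A (m x k) * B (k x n).
// Column-major: element (i, j) of an m-row matrix lives at index j*m + i.
fn gemm_col_major(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0;
            for p in 0..k {
                acc += a[p * m + i] * b[j * k + p];
            }
            c[j * m + i] = acc;
        }
    }
}

fn main() {
    // Row-major A (2x3) and B (3x2); C = A*B should be [[58, 64], [139, 154]].
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b = [7.0f32, 8.0, 9.0, 10.0, 11.0, 12.0];
    let mut c = [0.0f32; 4];
    // Swapped call: col-major (2x2) result = B^T (col-major view) * A^T.
    gemm_col_major(2, 2, 3, &b, &a, &mut c);
    assert_eq!(c, [58.0, 64.0, 139.0, 154.0]); // row-major C, as desired
}
```

An ALB-059-class bug (wrong leading dimension or swapped n/k) shows up here as scrambled rows rather than a small numerical drift, which is why the contract demands an exact match against a reference.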
# Add to training-step-budget-v1.yaml
- id: FALSIFY-BUDGET-004
rule: "Phase 0 baseline matches estimated breakdown"
prediction: "Measured GEMM fraction is 50-65% of step time"
test: |
Run BrickTracer profiling for 50 steps on PTX backend.
T_gemm / T_step must be in [0.50, 0.65].
if_fails: "Estimated breakdown is wrong; re-derive all phase projections"
# Add to training-memory-kernel-v1.yaml
- id: FALSIFY-MEM-004
rule: "Mixed-precision dual storage fits in VRAM"
prediction: "FP16 forward weights + FP32 master weights + optimizer < 24GB"
test: |
Compute: P * 2 (FP16 GPU) + P * 4 (FP32 CPU master, not on GPU)
+ P * 8 (AdamW m+v, on GPU) + workspace.
P=370M: 0.74 GB (FP16) + 2.96 GB (AdamW) + workspace = ~15.5 GB.
Must fit in 24 GB with seq=1024, batch=4.
if_fails: "VRAM budget exceeded, batch=4 may OOM with mixed precision"
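The FALSIFY-MEM-004 component arithmetic, written out (decimal GB, matching the contract's 0.74/2.96 figures; the FP32 master copy lives host-side per the spec, so it consumes RAM, not VRAM):

```rust
// FALSIFY-MEM-004 budget math for P = 370M parameters.
fn main() {
    let p = 370e6_f64;
    let gb = |bytes: f64| bytes / 1e9;
    let fp16_weights = gb(p * 2.0); // forward weights on GPU: 0.74 GB
    let adamw_states = gb(p * 8.0); // AdamW m + v in FP32 on GPU: 2.96 GB
    let fp32_master  = gb(p * 4.0); // master weights, host RAM per spec: 1.48 GB
    println!("FP16 weights: {fp16_weights:.2} GB");
    println!("AdamW m+v:    {adamw_states:.2} GB");
    println!("FP32 master:  {fp32_master:.2} GB (host, not VRAM)");
    // Weights + optimizer alone leave >20 GB of the 24 GB card for
    // activations and workspace at seq=1024, batch=4.
    let headroom = 24.0 - fp16_weights - adamw_states;
    assert!(headroom > 20.0);
}
```

The contract's ~15.5 GB total implies roughly 11-12 GB of activations and workspace on top of the 3.7 GB of persistent state, which is the quantity gradient checkpointing attacks in the batch=8 scenario (Claim 7).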
Claim 11: “TF32 tensor cores provide ~2x throughput” (Section 6.9, Phase 5a)
- Status: FALSIFIED — REVERTED (ALB-076). TF32 tensor cores showed 0% improvement at 350M model size (§6.9). More critically, tensor core GEMM algorithms (`CUBLAS_GEMM_DEFAULT_TENSOR_OP`) produce all-NaN output for transposed backward GEMMs when gradient magnitudes reach ~1e5 (§6.12).
- Root cause: The cuBLAS tensor core algorithm has an undocumented numerical failure mode with transposed operands at high magnitudes. Forward (NoTrans/NoTrans) is unaffected.
- Fix: Disabled tensor cores entirely (`CUBLAS_DEFAULT_MATH`). The cuBLAS SIMD path is still 5.9x faster than PTX. Phase 5a reverted (trueno #170).
- Action: Phase 5a removed from the optimization path. Added to the bug pattern catalog.
A.4 Unrealistic Assumptions Identified
| Assumption | Section | Reality Check |
|---|---|---|
| GEMM is 57% of step time | 2.4 | Unverified estimate. Phase 0 must confirm. |
| cuBLAS achieves 130-150 TFLOP/s | 4.1 | Depends on shape. May be 80-120 on rectangular. |
| Non-GEMM time stays constant | 6.1 | FP16 casting adds new overhead. |
| 2% FFI overhead | 5.7 | Plausible but requires per-GEMM vs per-step stream binding. |
| batch=8 fits with grad ckpt | 6.3 | Dual precision increases VRAM. Unproven. |
| 165 TFLOP/s is achievable peak | 1.2 | Marketing spec. Sustained is ~145-150 TFLOP/s. |
A.5 Recommended Spec Revisions
- Gate Phase 1 on Phase 0 completion. Do not write cuBLAS code until BrickTracer confirms the estimated breakdown.
- Add the GQA thin-matrix shape `[4096, 256, 1024]` to all benchmarks.
- Add FALSIFY-CUBLAS-009 (non-GEMM overhead preservation).
- Add FALSIFY-CUBLAS-010 (thin-matrix performance floor).
- Add FALSIFY-CUBLAS-011 (column-major convention correctness).
- Add FALSIFY-BUDGET-004 (baseline confirmation gate).
- Add FALSIFY-MEM-004 (mixed-precision VRAM budget).
- Integrate bench-gemm-regression into CI (addresses batuta JA-04).
- Use sustained peak (~148 TFLOP/s) instead of marketing peak (165) for MFU calculations.
- Note set_stream() binding scope in cublas.rs contract: once per step, not per GEMM.
Model Card: albor-base-50m
Model Details
| Field | Value |
|---|---|
| Name | albor-base-50m |
| Version | 1.0 (pipeline validation) |
| Type | Decoder-only Transformer (LLaMA-style) |
| Parameters | ~62M (hidden=512, layers=12 — “50M” is an approximate label) |
| Architecture | hidden=512, layers=12, heads=8, kv_heads=2, ffn=2048 |
| Vocab Size | 32,768 (BPE, whitespace-split v1; later upgraded to ByteLevel BPE v2) |
| Context Length | 128 tokens (validation run; architecture supports 2048) |
| Training Data | 500 rows Python code, 64K tokens |
| Training Time | 110.7 seconds (CUDA on RTX 4090) |
| Framework | entrenar + realizar (CUDA, CudaTransformerTrainer) |
Intended Use
Pipeline validation only. This model validates that the albor training stack (alimentar → entrenar → realizar) works end-to-end. It is NOT intended for code completion or any production use.
Training Details
- Optimizer: AdamW (lr=6e-4, β1=0.9, β2=0.95, wd=0.1)
- Steps: 31 optimizer steps (125 batches, gradient_accumulation=4)
- Mixed Precision: fp16
- Loss: 10.335 → 4.423 (perplexity 30,802 → 83.4)
- Compute: 76.8s CUDA matmul (69%), 32.9s transpose (30%), 0.9s alloc (1%)
Tokenizer
- Type: BPE with a `split_whitespace()` pre-tokenizer and `</w>` suffix
- Vocab: 32,768 tokens
- Known Limitation: Normalizes whitespace (loses Python indentation)
- Source: Trained with `apr tokenize apply` on 100K lines of Python code
FALSIFY Predictions
| ID | Prediction | Status |
|---|---|---|
| FALSIFY-ALBOR-001 | Loss decreases monotonically | CORROBORATED (10.3→4.42) |
| FALSIFY-ALBOR-002 | Gradient norms bounded | PENDING (per-step logging now available, ALB-035 FIXED) |
| FALSIFY-ALBOR-003 | Checkpoint determinism | UNTESTED |
Limitations
- Whitespace normalization in tokenizer makes output invalid Python
- Only 500 training rows (not representative of target distribution)
- Short context (128 tokens, not production 2048)
- No evaluation on code completion benchmarks (structural eval only)
Data Provenance
See docs/PROVENANCE.md for full SHA-256 hashes of all data artifacts.
Checkpoint
- Path: `checkpoints/albor-base-50m/model.safetensors` (249 MB)
- Metadata: `checkpoints/albor-base-50m/final_model.json`
Model Card: albor-base-350m
Model Details
| Field | Value |
|---|---|
| Name | albor-base-350m |
| Version | 1.0 (base pre-training) |
| Type | Decoder-only Transformer (Qwen2-style) |
| Parameters | 398.5M |
| Architecture | hidden=1024, layers=24, heads=16, kv_heads=4, ffn=4096 |
| Vocab Size | 32,768 (ByteLevel BPE v2, whitespace-preserving) |
| Context Length | 2,048 tokens |
| Training Data | v1: 22,079 seqs (45.2M tokens); v2: 67,977 seqs (139M tokens, Tier 1 10x + 8 Tier 2 repos + 50% FIM) |
| Training Time | ~20 hours on RTX 4090 (full run); 396s for 50-step test |
| Framework | entrenar + realizar (CUDA, CudaTransformerTrainer) |
Intended Use
Base pre-training model. This model learns Python code patterns from pre-tokenized data. It serves as the foundation for:
- Knowledge distillation from Qwen3-Coder-Next (Phase 4)
- Fine-tuning with LoRA (Phase 6)
- Post-training optimization: pruning, merging, quantization (Phase 6)
Training Details
- Optimizer: AdamW (lr=3e-4, beta1=0.9, beta2=0.95, wd=0.1)
- Scheduler: Cosine with warmup (v1: 2000 steps; v2: 500 steps per C-TRAINCFG-001)
- Gradient Accumulation: 128 (effective batch = 4 × 128 × 1024 = 512K tokens)
- Mixed Precision: fp16
- Epochs: v1: 117 (22K seqs); v2: 38 (68K seqs) — ALB-060: original epochs=1 was fatal
- Max Steps: 5,000
- Loss (50-step test): 10.39 → 5.92 (best 5.53) — convergence verified (post ALB-059 GEMM backward fix)
- Perplexity (50-step test): ~31,926 (finite; random baseline ~32,768)
- Loss (full run): TBD — first run failed (ALB-060), retraining with v2 config
- Perplexity (full run): TBD
- CUDA Mode: GPU-resident training via CudaTransformerTrainer (ALB-040), 3 PCIe transfers/step
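The epochs arithmetic behind ALB-060 and the v2 fix follows from the figures above: one optimizer step consumes micro_batch × grad_accum = 4 × 128 = 512 sequences. A sketch of the check that C-TRAINCFG-001 now enforces:

```rust
// Why epochs=1 died at step 43 (ALB-060), and why v2 needs epochs=38.
fn main() {
    let seqs_per_step = 4 * 128; // micro-batch x gradient accumulation = 512
    let (v1_seqs, v2_seqs, max_steps) = (22_079, 67_977, 5_000);
    let v1_steps_per_epoch = v1_seqs / seqs_per_step; // 43 — matches the failed run
    let v2_steps_per_epoch = v2_seqs / seqs_per_step; // 132
    // Ceiling division: epochs needed to reach max_steps.
    let v2_epochs_needed = (max_steps + v2_steps_per_epoch - 1) / v2_steps_per_epoch;
    println!("v1 epoch = {v1_steps_per_epoch} steps; v2 needs {v2_epochs_needed} epochs");
    assert_eq!(v1_steps_per_epoch, 43); // epochs=1 could never reach 5,000 steps
    assert_eq!(v2_epochs_needed, 38);   // matches the v2 config
}
```

The same division also confirms the effective batch: 512 sequences × 1,024 tokens = 524,288 ≈ 512K tokens per optimizer step.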
Tokenizer
- Type: ByteLevel BPE (v2)
- Vocab: 32,768 tokens
- Preserves: Whitespace, indentation, newlines (critical for Python)
- Source: Trained with the Python `tokenizers` library on 100K lines of Python code
- Location: `models/albor-tokenizer-v2/tokenizer.json`
FALSIFY Predictions
| ID | Prediction | Status |
|---|---|---|
| FALSIFY-ALBOR-001 | Loss decreases monotonically | CORROBORATED (50M: 10.3→4.42; 350M CUDA 50-step: 10.39→5.92) |
| FALSIFY-ALBOR-002 | Gradient norms bounded | PENDING (per-step logging available via ALB-035) |
| FALSIFY-ALBOR-003 | Checkpoint determinism | UNTESTED |
Evaluation
| Benchmark | Metric | Result |
|---|---|---|
| Training loss (50-step test) | cross-entropy | 10.39 → 5.92 (best 5.53) |
| Training perplexity (50-step test) | exp(loss) | ~31,926 (finite) |
| Checkpoint validation | weights trained? | PASS (layers distinct, not init) |
| realizar inference | loads + generates? | PASS (218 tensors, 50 tokens generated) |
| HumanEval (20 problems) | pass@1 | TBD (after full training) |
| Python intermediate (15 problems) | pass@1 | TBD (after full training) |
Limitations
- 139M tokens on v2 (typical base models train on 10B+ tokens)
- Python-only training data (no multilingual code)
- v2 dataset includes 50% FIM (PSM format via `alimentar fim`)
- ~~Checkpoint broken by ALB-038~~ FIXED — entrenar now saves trained weights correctly
- ~~Evaluation blocked by ALB-037~~ FIXED — realizar loads trained checkpoint, generates tokens
Known Gaps
- ALB-035 (FIXED): Per-step loss logging via `train_epoch_with_callback()` (entrenar@5d41a96)
- ALB-037 (FIXED): realizar now loads trained checkpoint, generates tokens (e2e verified with 350M)
- ALB-038 (FIXED): Broken autograd in `RMSNorm::forward_batched()` and `MultiHeadAttention::forward()`. Fixed in entrenar@91ba9da and entrenar@1ede409. All 20 model parameters now receive gradients.
- ALB-040 (VERIFIED): GPU-resident pretraining via `CudaTransformerTrainer`. 3 PCIe transfers/step vs ~16K. 350M CUDA test: 50 steps, loss 10.39→5.92 (best 5.53), checkpoint valid.
- ALB-060 (FIXED): Training config epochs=1 only ran 43/5000 steps. C-TRAINCFG-001 contract written. v2 config uses epochs=38 with the expanded 68K-sequence dataset.
- ALB-041 (FIXED): D2D buffer size mismatch in `backward_attention()`. Fixed in entrenar@a48e3d2. Was blocking the GPU backward pass.
- ALB-043 (FIXED): backward_ffn buffer overflow + missing SwiGLU gradients. Fixed in entrenar@f7805f1.
- ALB-044 (FIXED): Activation gradient clipping at the GPU-CPU boundary + CPU optimizer hyperparams (beta2/wd mismatch). Fixed in entrenar@86eec38.
- ALB-059 (FIXED): GEMM backward constructor args n/k swapped — output stride baked wrong into PTX; rows overflow 64× into adjacent optimizer states (m_w_k, v_w_k). Negative v values → sqrt(neg) = NaN in AdamW. Also zero-initialized all optimizer m/v buffers (cuMemAlloc returns uninitialized VRAM). Fixed in entrenar@846ae0c.
Data Provenance
See docs/PROVENANCE.md for full SHA-256 hashes of all data artifacts.
Checkpoint
- Test checkpoint: `checkpoints/albor-350m-cuda-test/model.safetensors` (1.59 GB, 218 tensors)
- Full checkpoint: `checkpoints/albor-base-350m/model.safetensors` (TBD — training in progress)
- Metadata: `checkpoints/albor-base-350m/final_model.json`
- Config (test): `configs/train/pretrain-350m-cuda-test.yaml`
- Config (full): `configs/train/pretrain-350m.yaml`
Appendix A: Batuta Oracle Consultation
Query: “distributed LLM training across heterogeneous GPUs using sovereign AI stack”
Response (2026-03-01):
- Primary: `repartir` (95% confidence) — distributed computing primitives
- Supporting: `entrenar` (70%) — distributed_training pattern
- Supporting: `trueno` (80%) — SIMD/GPU backend for compute acceleration
Appendix B: Stack Version Matrix
Last verified: 2026-03-02
| Component | Version | Role in Albor |
|---|---|---|
| aprender (apr) | 0.4.10 (7c27c2b3) | Unified CLI: train, tokenize, eval, distill, merge, export, publish, pipeline |
| entrenar | 0.7.5 (with local patches: ALB-038/041/043/044 fixes) | Training engine, autograd, CudaTransformerTrainer, optimizers, LoRA |
| trueno | 0.16.1 | SIMD/GPU tensor backend |
| realizar | 0.8.0 | Inference engine (SafeTensors loading, teacher model, eval, serving) |
| alimentar | 0.2.6 | Data pipeline, Parquet I/O, HF Hub import, FIM transforms, mixing |
| repartir | 2.0.3 | Distributed compute (future: gradient sync) |
| forjar | 1.0.0 | Pipeline orchestration (DAG engine, infra + task resources) |
| presentar | 0.3.2 | Training visualization (TUI dashboards, WASM, experiment browser) |
| bashrs (Rash) | 6.65.0 | Makefile lint/purify/classify, shell safety, pipeline command validation |
| batuta | 0.7.2 | Stack orchestration, oracle, falsification (108 checks), playbook DAG engine |
| provable-contracts (pv) | 0.1.0 | Design-by-contract YAML specs, Kani proofs, falsification tests |
| pmat | 3.6.1 | TDG scoring, comply check, fault patterns, coverage gaps |
| certeza | latest | Three-tier test effectiveness (unit → property → formal) |
| renacer | latest | Tracing infrastructure (BrickTracer, spans, metric events) |
Note: apr uses [patch.crates-io] to override entrenar/realizar with
local paths. The installed entrenar 0.7.5 includes unpublished fixes for
ALB-038, ALB-041, ALB-043, ALB-044 (gradient flow, buffer sizes, activation
clipping, optimizer hyperparams).
Appendix C: Qwen3-Coder-Next Architecture Details
| Layer Pattern | Count | Description |
|---|---|---|
| Gated DeltaNet → MoE | 36 (3 per block × 12 blocks) | Linear attention with gating, routed to 10/512 experts |
| Gated Attention → MoE | 12 (1 per block × 12 blocks) | Standard GQA with gating, routed to 10/512 experts |
| Total layers | 48 | — |
This hybrid architecture means realizar needs to support:
- DeltaNet (linear attention variant) — likely a new gap
- MoE routing (top-k expert selection) — may partially exist
- Gated variants of both attention types
Appendix D: W5700X Vulkan Validation
The W5700X has been validated with trueno’s wgpu backend on Metal (macOS) with documented performance numbers (trueno book, 2026-01-03). The intel box runs Linux, so the backend will be Vulkan (not Metal). Vulkan support for RDNA 1 on Linux via Mesa RADV is mature and well-tested.
Action item: Run trueno GPU tests on intel via Vulkan to confirm parity with Metal benchmarks before relying on W5700X for compute tasks.
Appendix E: Leaderboard Strategy
E.1 Target: Big Code Models Leaderboard
URL: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
The Big Code Models Leaderboard is the standard HuggingFace scoreboard for code generation models. It evaluates HumanEval (Python pass@1) and MultiPL-E (18 languages) with throughput measurements. ~60 models currently listed.
Why this leaderboard:
- Code generation focus — matches Albor’s use case exactly
- HumanEval is our primary benchmark
- Accepts community submissions via PR
- No sub-1B model has ever appeared — Albor would be the first
Current smallest entries (1B tier):
| Model | Params | HumanEval pass@1 |
|---|---|---|
| phi-1 | 1.3B | 50.6% |
| DeciCoder-1B | 1.0B | 19.3% |
| SantaCoder | 1.1B | 18.1% |
| StarCoderBase-1B | 1.0B | 15.2% |
Albor’s position: At >15% HumanEval with 350M params, Albor would be competitive with the 1B tier at 1/3 the size. Even at >8% (base model), it would establish the sub-1B category on the board.
Submission process:
- Run `bigcode-evaluation-harness` (Python tool — the one exception to our zero-Python rule, because it is the leaderboard's own eval framework)
- Standard params: top-p=0.95, temperature=0.2, n_samples=50, max_length_generation=512
- Submit PR to `community_results/PAIML_ALBOR350M_noahgift/`
- Include: scores JSON, generations folder, metrics folder
- Results appear as “non-verified” (community submission)
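With n_samples=50, pass@1 is computed with the standard unbiased estimator from the HumanEval paper, pass@k = 1 − C(n−c, k)/C(n, k) for n samples with c correct. A sketch in its numerically stable product form (illustrative; the harness implements this internally):

```rust
// Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
// computed as 1 - prod_{i = n-c+1..=n} (1 - k/i) for stability.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    if n.saturating_sub(c) < k {
        return 1.0; // every size-k draw contains a correct sample
    }
    let mut fail_all = 1.0_f64;
    for i in (n - c + 1)..=n {
        fail_all *= 1.0 - k as f64 / i as f64;
    }
    1.0 - fail_all
}

fn main() {
    // 10 of 50 samples correct => pass@1 reduces to exactly c/n = 0.2.
    let p1 = pass_at_k(50, 10, 1);
    assert!((p1 - 0.2).abs() < 1e-12);
    println!("pass@1 = {p1:.3}");
}
```

For k=1 the estimator collapses to c/n, so the 50-sample protocol is effectively averaging correctness over samples at temperature 0.2.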
E.2 Why NOT Other Leaderboards
Open LLM Leaderboard v2: Benchmarks (IFEval, BBH, MATH L5, GPQA, MuSR, MMLU-PRO) were designed for models >7B. A 350M model scores near random on MATH Level 5 (~0%), GPQA (~25%), and MMLU-PRO (~10%). Waste of eval compute.
EvalPlus Leaderboard: Uses HumanEval+ and MBPP+ (80x more tests than vanilla HumanEval). Secondary submission target if Big Code results are strong. Currently no sub-1B models either. URL: https://evalplus.github.io/leaderboard.html
BigCodeBench Leaderboard: 1,140 software-engineering tasks. Designed for 7B+ models. A 350M model would score near zero. Not appropriate.
E.3 General Capability Eval (Not a Leaderboard — Internal Only)
ARC-Easy, HellaSwag, PIQA, LAMBADA are the standard for sub-1B general model comparison (Pythia, OPT, GPT-2 all publish on these). We evaluate on them for internal comparison, but they have no dedicated leaderboard worth targeting. Code benchmarks are the real scoreboard.
E.4 FIM Evaluation
There is no canonical FIM benchmark. SantaCoder used a custom FIM evaluation; other models use MultiPL-E or proprietary internal evals. Albor will define its own FIM evaluation protocol (exact match on held-out Python functions) and report absolute numbers rather than targeting a specific percentage.
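For concreteness, a sketch of what a PSM-format FIM example looks like, as `alimentar fim` produces at rate 0.5. The sentinel token names here are illustrative (StarCoder-style), not necessarily alimentar's actual tokens:

```rust
// PSM (prefix-suffix-middle) FIM transform sketch. The model sees prefix and
// suffix, and must infill the middle span.
fn to_psm(code: &str, start: usize, end: usize) -> String {
    let (prefix, rest) = code.split_at(start);
    let (middle, suffix) = rest.split_at(end - start);
    // Sentinel names are assumptions, not alimentar's verified vocabulary.
    format!("<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}")
}

fn main() {
    let code = "def add(a, b):\n    return a + b\n";
    // Mask the expression body; exact-match eval compares the generated middle.
    let s = code.find("a + b").unwrap();
    let psm = to_psm(code, s, s + "a + b".len());
    assert!(psm.starts_with("<fim_prefix>def add"));
    assert!(psm.ends_with("<fim_middle>a + b"));
    println!("{psm}");
}
```

The exact-match protocol then reduces to string equality between the generated middle and the held-out span, which keeps the eval dependency-free and deterministic.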
E.5 Falsification Risks for the Leaderboard Targets
- MoE→Dense distillation gap: No published work demonstrates distilling an 80B MoE model into a 350M dense model. The architecture mismatch (DeltaNet+MoE routing → vanilla LLaMA) may limit knowledge transfer. If distillation gains are <2 points on HumanEval, the “Good” success criterion is at risk.
- Teacher inference bottleneck: At ~2-5 tok/s (fp16 on Xeon), producing 2B tokens of teacher logits takes ~12 days. If 500M tokens of logits proves insufficient, the timeline extends by weeks.
- Rust training stack maturity: entrenar has never trained a model from scratch at 350M scale. Bugs in gradient accumulation, mixed precision, or checkpointing could cause silent correctness issues that only surface as poor benchmark scores.
- Data quality ceiling: The local ground-truth corpora (~71K files) are high quality but narrow. If the BPE tokenizer or data mix doesn’t generalize well to HumanEval-style problems, the base model ceiling is lower than projected.
- bigcode-evaluation-harness compatibility: The leaderboard eval tool is Python-based and expects HuggingFace-format models. Our SafeTensors export must be compatible with the harness’s model loading. If not, we need a thin adapter — this is a potential gap not yet tracked.
E.6 The Real Story
“A Python code completion model that was trained entirely in Rust with zero Python dependencies — from data pipeline to on-device inference.” The irony is deliberate: a Rust ML stack producing a Python code assistant. The model is the proof; the stack is the lasting value. Publishable regardless of exact benchmark numbers.
Appendix F: Dogfooding Log
Living record of tool validation against the Albor repo. Updated as gaps are discovered and resolved.
Summary (2026-03-04)
| Tool | Command | Result | Gap |
|---|---|---|---|
pv validate | pv validate contracts/*.yaml | PASS (all 12 contracts) | — |
pv coverage | pv coverage contracts | PASS (100% obligation coverage) | — |
pv graph | pv graph contracts | PASS (8 nodes, correct deps) | — |
pv probar | pv probar contracts/*.yaml | PASS (generates property tests) | — |
pv kani | pv kani contracts/*.yaml | PASS (generates Kani harnesses) | — |
pv generate | pv generate contracts/*.yaml | PASS (20 files: scaffold, kani, probar, book) | — |
pv scaffold | pv scaffold contracts/*.yaml | PASS (Rust trait + test stubs) | — |
pv status | pv status contracts/*.yaml | PASS (equation/obligation counts) | — |
pv audit | pv audit contracts/*.yaml | PASS (no findings) | — |
pv equations | pv equations contracts/*.yaml | PASS (formatted equations) | — |
pv book | pv book contracts/ | PASS (7 mdBook pages) | — |
pv lean | pv lean contracts/*.yaml | INFO (needs lean: metadata blocks) | — |
forjar validate | forjar validate -f infra-only.yaml | PASS (2 machines, 6 resources) | — |
forjar validate | forjar validate -f albor.yaml | PASS (2 machines, 22 resources) | |
forjar graph | forjar graph -f infra-only.yaml | PASS (Mermaid output) | — |
apr finetune --plan | apr finetune --plan --model-size 350M --vram 24 | PASS (VRAM estimate correct) | — |
apr train plan --task pretrain | apr train plan --task pretrain --config pretrain-350m.yaml | PASS (validates config, shows arch/params) | |
apr distill --plan | apr distill --plan | PASS (file-based mode) | — |
apr distill --config --plan | apr distill --config distill-entrenar.yaml --plan | PASS (validates config, shows two-stage workflow) | |
apr distill --config --plan --json | apr distill --config distill-entrenar.yaml --plan --json | PASS (structured JSON with verdict) | |
apr distill --config --stage precompute | apr distill --config distill-entrenar.yaml --stage precompute | PASS (inspects teacher, 290 tensors, writes manifest) | |
apr distill --config --stage train | apr distill --config distill-entrenar.yaml --stage train | PASS (reads manifest, validates, sets up KD) | |
apr train apply --parquet | apr train apply --task pretrain --config pretrain-parquet.yaml | PASS (8 rows from Parquet, 4 batches, CUDA training) | |
apr quantize --plan | apr quantize --plan <file> | PASS (plan mode works) | — |
apr prune --plan | apr prune --plan <file> | PASS (plan mode exists) | — |
alimentar quality profiles | alimentar quality profiles | PASS (ml-training profile exists) | — |
alimentar import | alimentar import local <in> -o <out> | PASS (local import works) | |
alimentar mix | alimentar mix a.parquet:0.8 b.parquet:0.2 -o out.parquet | PASS (weighted sampling + upsampling) | |
apr tokenize plan | apr tokenize plan --data corpus.txt --vocab-size 32000 | PASS (validates corpus, estimates time) | |
apr tokenize apply | apr tokenize apply --data corpus.txt --vocab-size 100 | PASS (trains BPE, writes vocab.json + merges.txt) | |
alimentar fim | alimentar fim data.parquet -o fim.parquet --rate 0.5 | PASS (PSM/SPM FIM transform) | |
batuta falsify | batuta falsify . --format markdown | PASS (108 checks, 73.1% score) | |
batuta falsify --critical-only | batuta falsify . --critical-only | PARTIAL (3/5 pass, 1 fail) | |
batuta stack status | batuta stack status --simple | PASS (11 tools detected, 5 healthy) | |
batuta oracle --list | batuta oracle --list | PASS (lists all 40+ stack components) | — |
batuta oracle --recommend | batuta oracle --recommend --problem "train 350M LLM" | PASS (recommends aprender) | — |
batuta oracle --local | batuta oracle --local | PASS (47 PAIML projects discovered) | — |
batuta oracle --capabilities | batuta oracle --capabilities entrenar | PASS (autograd, lora, qlora, quantization, model_merge, distillation) | — |
batuta playbook validate | batuta playbook validate albor-playbook.yaml | PASS (19 stages, 14 params, acyclic DAG) | — |
batuta hf search | batuta hf search model "code completion" | PARTIAL (returns placeholder/mock data) | — |
bashrs make lint | bashrs make lint Makefile | PASS (2 warnings, 0 errors) | — |
bashrs make parse | bashrs make parse Makefile | PASS (full AST) | — |
bashrs make purify | bashrs make purify Makefile | PASS (purified output) | — |
bashrs classify | bashrs classify Makefile | PASS (safe: 85%) | — |
apr pipeline validate | apr pipeline validate albor.yaml | PASS (2 machines, 22 resources) | |
apr pipeline plan | apr pipeline plan albor.yaml | PASS (23 resources, full DAG) | |
apr pipeline plan --json | apr pipeline plan albor.yaml --json | PASS (structured JSON with deps) | |
apr pipeline status | apr pipeline status albor.yaml | EXPECTED FAIL (no state dir yet) | — |
pmat query | pmat query "training" | PASS (0 functions, 5 document matches) | — |
pmat analyze makefile | pmat analyze makefile Makefile | PASS (64% quality score) | — |
pv lean | pv lean contracts/kd-v1.yaml | PASS (6 Lean 4 theorem stubs generated) | — |
pv lean-status | pv lean-status contracts/ | PASS (0% L4 coverage, 4 sorry debt) | — |
apr train plan --task classify | apr train plan --data <JSONL> | PASS (classification fine-tuning) | — |
apr merge | apr merge --strategy slerp | PASS (SLERP, TIES, DARE supported) | — |
apr export --list-formats | apr export --list-formats | PASS (SafeTensors, GGUF, MLX) | — |
apr publish | apr publish <dir> <repo> | PASS (HF Hub publish exists) | — |
apr eval | apr eval <model> | PASS (perplexity eval) | — |
apr eval --task code | apr eval model --task code --data bench.jsonl | PASS (pass@1 scoring, 10/10 on basic set) | |
apr eval --task plan | apr eval model --task plan --data bench.jsonl | PASS (dry-run validation) | |
alimentar mix (test) | alimentar mix ...parquet:0.25 -o test.parquet -n 200 --seed 456 | PASS (200 rows, 50 per corpus) | — |
alimentar fim (prod) | alimentar fim mixed.parquet -o mixed-fim.parquet --rate 0.5 --format psm | PASS (17,070 rows, PSM FIM 50%) | — |
apr tokenize apply (prod) | apr tokenize apply --data corpus-raw.txt --vocab-size 32768 --algorithm bpe -o tokenizer/ --max-lines 100000 | PASS (32,768 vocab, 2022.5s, 8/8 Python patterns) | |
alimentar quality | alimentar quality profiles | PASS (ml-training profile) | — |
alimentar convert | alimentar convert | PASS (format conversion) | — |
bashrs score | bashrs score Makefile | PASS (D grade, 5.2/10) | — |
bashrs audit | bashrs audit Makefile | PASS (comprehensive audit) | — |
entrenar train (50M) | entrenar train pretrain-50m-test.yaml | PASS (demo batches, 465ms, loss 10.34→9.67) | ALB-033 (tokenizer format) |
apr train apply (50M) | apr train apply --task pretrain --config pretrain-50m-test.yaml | PASS (10-row micro, 5 batches, 2.1s CUDA) | |
apr train apply (50M full) | apr train apply --task pretrain --config pretrain-50m.yaml | PASS (500 rows, 125 batches, 31 steps, 110.7s CUDA, loss 10.3→4.42) | |
apr train apply (50M v2) | apr train apply --task pretrain --config pretrain-50m-v2.yaml | PASS (pre-tokenized ByteLevel BPE, 108.5s CUDA, loss→5.51) | — |
apr train plan (350M) | apr train plan --task pretrain --config pretrain-350m.yaml | PASS (config validated, ready for apply) | — |
entrenar validate | entrenar validate pretrain-350m-manifest.yaml | PASS (architecture overrides bridge through) | |
entrenar shorthand | vocab_size: "32K" in YAML manifest | PASS (parses to 32768) | |
apr merge --plan | apr merge a.apr b.apr --plan --strategy slerp -o merged.apr | PASS (validates inputs, shows strategy, sizes) | |
apr export --plan | apr export model.apr --plan --format gguf -o model.gguf | PASS (validates format, shows plan) | |
apr publish --plan | apr publish dir repo --plan | PASS (alias for --dry-run) | |
apr train apply (350M full) | apr train apply --task pretrain --config pretrain-350m.yaml | FAIL (ALB-060: epochs=1 exhausted data at step 43/5000, loss flat ~10.39, LR still in warmup at 6.45e-6) | ALB-060 |
apr train apply (350M v2) | apr train apply --task pretrain --config pretrain-350m-v2.yaml | PASS (ALB-065 fixed: stream.synchronize() before D2H gradient transfers. Training stable without CUDA_LAUNCH_BLOCKING=1, 441 tok/s) | |
train-guard.sh | bash scripts/train-guard.sh configs/train/pretrain-350m-v2.yaml | PASS (crash-resilient supervisor with auto-diagnostic CUDA blocking mode, exit code classification, GPU state capture, JSON crash reports, backoff restart, heartbeat monitoring) | |
pv validate (memory) | pv validate contracts/training-memory-kernel-v1.yaml | PASS (0 errors, 0 warnings) | ALB-039 |
pv validate (GPU) | pv validate contracts/training-gpu-kernel-v1.yaml | PASS (0 errors, 0 warnings) | ALB-040 |
apr train apply (50M CUDA) | apr train apply --config pretrain-50m-v2-test.yaml | PASS (3 steps, loss 10.4→11.7, GPU forward+backward) | |
apr eval (50M safetensors) | apr eval checkpoints/albor-base-50m/model.safetensors --dataset custom | FAIL (PPL 679,614 — weights ignored) | |
apr train apply (350M CUDA test) | apr train apply --config pretrain-350m-cuda-test.yaml | PASS (50 steps, ~400s, loss 10.39→5.92, best 5.53, checkpoint saved) | |
realizar run (350M) | realizar run checkpoints/albor-350m-cuda-test/model.safetensors "def fibonacci(" --raw | PASS (218 tensors loaded, 50 tokens generated, 1.0 tok/s) | |
eval-perplexity.py (350M validate) | python scripts/eval-perplexity.py checkpoints/albor-350m-cuda-test/ --validate-checkpoint | PASS (weights trained, layers distinct) | — |
eval-perplexity.py (350M perplexity) | python scripts/eval-perplexity.py checkpoints/albor-350m-cuda-test/ --data val.parquet --max-sequences 3 --seq-len 64 | PASS (PPL 31,926 — finite, consistent with 50-step model) | — |
eval-code.py (validate) | python scripts/eval-code.py configs/eval/python-intermediate.jsonl --validate-only | PASS (15/15 canonical solutions) | — |
eval-code.py (HumanEval) | python scripts/eval-code.py configs/eval/humaneval-subset.jsonl --validate-only | PASS (20/20 canonical solutions) | — |
convert-checkpoint.py (50M) | python scripts/convert-checkpoint.py checkpoints/albor-base-50m/ | PASS (110→111 tensors, 85 reshaped, lm_head created) | ALB-037 |
eval-perplexity.py --validate | python scripts/eval-perplexity.py checkpoints/albor-base-50m/ --validate-checkpoint | FAIL → FIXED (ALB-038 root cause in autograd) | |
checkpoint analysis | byte-compare layers 0-11 q_proj, gate_proj | FAIL → FIXED (all parameters now receive gradients) | |
apr monitor (TUI) | apr monitor checkpoints/albor-base-350m/ | PASS (presentar TUI, live GPU telemetry, loss curve, tok/s) | |
apr monitor --json | apr monitor --json checkpoints/albor-base-350m/ | PASS (headless JSON with full TUI parity) | |
apr monitor (discover) | apr monitor (no args) | PASS (discovers active runs from global SQLite registry) | |
apr train apply (SQLite) | apr train apply --config pretrain-50m-quick.yaml | PASS (creates both local + global experiments.db, logs params + metrics) | |
apr runs ls --global | apr runs ls --global | PASS (table output: experiment, run ID, status, loss, tok/s, duration) | |
apr runs ls --global --json | apr runs ls --global --json | PASS (JSON array with all run metadata) | |
apr runs show | apr runs show <id> --global | PASS (params, loss, tok/s, lr, duration) | |
apr runs show --json | apr runs show <id> --global --json | PASS (clean JSON with native param values) | |
realizar run (350M v2) | realizar run checkpoints/albor-350m-cuda-test/model.safetensors "def fibonacci(" | PASS (24 layers, 32768 vocab, 50 tokens, 1.9 tok/s, garbage output expected from 5-step model) | — |
pv audit (all) | pv audit contracts/*.yaml (7 contracts) | PASS (0 findings, 22 equations, 43 obligations, 26 falsification tests) | — |
batuta falsify --critical-only | batuta falsify . --critical-only | PARTIAL (3/5 pass, 80.0% score, AI-01/AI-05 partial) | — |
apr runs diff | apr runs diff <a> <b> --global | PASS (side-by-side sparklines, config diff, loss comparison, verdict) | |
apr runs diff --json | apr runs diff <a> <b> --global --json | PASS (structured JSON: summaries, config_diff, verdict for LLM agents) | |
apr monitor (widget composition) | TrainingDashboard composes Layout, Border, Meter, GpuPanel, Sparkline, Text | PASS (builds clean, widget tree rebuilt each frame, panel verification wired) | |
apr experiment view --global --json | apr experiment view --global --json | PASS (JSON output with experiments, run_ids, loss_values, params from SQLite) | |
apr experiment view --global | apr experiment view --global | PASS (ratatui TUI: run table, sparkline, braille loss chart, j/k navigation) | |
pv validate (training-config) | pv validate contracts/training-config-kernel-v1.yaml | PASS (0 errors, 8 obligations, 5 falsification tests, 2 Kani harnesses) | ALB-060 |
pv coverage (all 8 contracts) | pv coverage contracts/ | PASS (8 contracts, 31 equations, 51 obligations, 34 falsification tests, 100% coverage) | — |
apr train apply (50M post-fix) | apr train apply --config pretrain-50m-quick.yaml | PASS (5 steps, loss 10.42→9.45, GEMM backward now correct) | |
apr train apply (350M post-fix) | apr train apply --config pretrain-350m-cuda-test.yaml | PASS (50 steps, loss 10.39→5.92, best 5.53, zero NaN, correct backward gradients) | |
realizar run (350M post-fix) | realizar run checkpoints/albor-350m-cuda-test/model.safetensors "def fibonacci(" | PASS (218 tensors, generates tokens from correctly-trained weights) | |
apr quantize (50M int4) | apr quantize model.safetensors -s int4 | PASS (238 MiB → 30 MiB, 87.5% reduction, 7.99x) | — |
apr quantize (50M q4k) | apr quantize model.safetensors -s q4k | PASS (238 MiB → 238 MiB, 0% reduction — q4k no-op on 1D tensors) | — |
apr quantize (350M int4) | apr quantize model.safetensors -s int4 | PASS (1.48 GiB → 191 MiB, 87.5% reduction, 7.99x) | — |
apr quantize (350M q4k) | apr quantize model.safetensors -s q4k | PASS (1.48 GiB → 1.48 GiB, 0% reduction — q4k no-op on 1D tensors) | — |
apr prune (50M magnitude) | apr prune model.safetensors --method magnitude --sparsity 0.5 | PASS (50.0% zeros, 31.2M/62.4M params zeroed) | — |
apr prune (50M depth) | apr prune model.safetensors --method depth --remove-layers "8-11" | PASS (110→74 tensors, 238→180 MiB, layers 8-11 removed) | — |
apr prune (350M magnitude) | apr prune model.safetensors --method magnitude --sparsity 0.3 | PASS (50.0% zeros — sparsity param may be ignored) | — |
source-to-parquet.py (Tier 2) | python scripts/source-to-parquet.py ~/src/pytorch pytorch data/parquet/tier2/pytorch.parquet | PASS (8 repos → 28,553 Python files imported) | — |
alimentar mix (expanded) | alimentar mix ...T1:10.0 ...T2:1.0 -o mixed.parquet --seed 42 | PASS (12 datasets → 45,420 rows, proportional weighted sampling) | — |
alimentar fim (expanded) | alimentar fim mixed.parquet -o mixed-fim.parquet --rate 0.5 --format psm | PASS (45,420 rows, 50% PSM FIM) | — |
pretokenize.py (v2) | python scripts/pretokenize.py --input mixed-fim.parquet --seq-len 2048 | PASS (67,977 sequences, 139M tokens, 191 MiB) | — |
realizar run (0.5B teacher) | realizar run qwen2.5-coder-0.5b/model.safetensors "def fibonacci(" | PASS (24 layers, 151936 vocab, 2.8 tok/s, generates tokens) | — |
apr distill --stage precompute (0.5B) | apr distill --config distill-entrenar.yaml --stage precompute | PASS (290 tensors, 942 MiB, manifest written) | — |
apr distill --stage precompute (3B) | apr distill --config distill-qwen3b.yaml --stage precompute | PASS (434 tensors, 5.75 GiB, sharded SafeTensors loaded) | — |
realizar run (3B sharded) | realizar run qwen2.5-coder-3b/model-00001-of-00002.safetensors | FAIL (sharded SafeTensors not supported — model.norm.weight in shard 2) | — |
C-TRAINCFG-001 pre-flight (v2) | python3 -c "..." (algebraic check) | PASS (67977 seqs, 132 steps/epoch, 38 epochs, warmup=500=10%) | ALB-060 |
alimentar dedup | alimentar dedup data.parquet -o dedup.parquet | PASS (exact dedup by text column, found 2 dups in 1843 rows) | — |
alimentar filter-text | alimentar filter-text data.parquet -o filtered.parquet --threshold 0.4 | PASS (composite scoring: alnum ratio, line length, dup lines, entropy) | — |
apr eval --task humaneval | apr eval model.safetensors --task humaneval --data humaneval.jsonl | PASS (20/20 problems validated, pass@1/10/100 metrics, JSON output) | — |
apr eval --task contamination | apr eval model.safetensors --task contamination --data train.jsonl | PASS (10-gram Jaccard overlap, 0/179 contaminated) | — |
apr eval --task compare | apr eval model_a.safetensors --task compare --data model_b.safetensors | PASS (side-by-side: size, tensors, format, ratio) | — |
apr train watch | apr train watch --config pretrain-350m-v2.yaml | PASS (crash recovery, exponential backoff, GPU diagnostics, crash-reports JSON) | — |
apr eval --task verify | apr eval checkpoints/albor-350m-cuda-test/ --task verify | PASS (9/9 checks: safetensors header, tensor count, FNV-1a hash, config.json) | — |
apr train sweep | apr train sweep --config base.yaml --strategy random --num-configs 5 | PASS (5 configs with log-uniform LR, batch size, weight decay, warmup) | — |
apr train archive | apr train archive checkpoints/albor-50m-quick/ -o /tmp/archive --version v0.1 | PASS (4 files, 238 MB, MANIFEST.json with BLAKE3 hashes) | — |
apr eval --task correlation | apr eval checkpoints/ --task correlation | PASS (236 data points, Pearson r=-0.14, Spearman rho=-0.21, from loss_history) | — |
apr eval --task human (generate) | apr eval checkpoints/albor-350m-cuda-test/ --task human | PASS (10-prompt ratings sheet with criteria, JSON output) | — |
apr eval --task human (analyze) | apr eval /tmp --task human --data test-ratings.jsonl | PASS (mean=3.0, median=3.0, pass@3=60%, distribution histogram) | — |
apr encrypt | apr encrypt model.safetensors -o model.enc --key-file key.bin | PASS (238 MB, 0.89s, BLAKE3 keystream + MAC) | — |
apr decrypt | apr decrypt model.enc -o model.safetensors --key-file key.bin | PASS (238 MB roundtrip verified, MAC authenticated, 0.74s) | — |
apr train plan (R-095) | apr train plan --task pretrain --config pretrain-350m-cuda-test.yaml | PASS (extended: RAM 5.5GB, disk 4.5GB/ckpt, 2048 tok/step, 60ms/step, 34K tok/s) | — |
apr train apply --distributed | apr train apply --task pretrain --config pretrain-350m.yaml --distributed --world-size 2 | PASS (CLI flags accepted, YAML patched with distributed section) | — |
apr train apply --deterministic | apr train apply --task pretrain --config pretrain-50m-quick.yaml --deterministic --seed 42 | PASS (deterministic + seed flags injected into YAML) | — |
entrenar (activation checkpointing) | with_checkpointing(4) in TransformerTrainConfig | PASS (checkpoint boundary mask, segment-based recomputation, 4 unit tests) | |
entrenar (gradient accumulation) | with_accumulation_steps(4) in CudaTransformerTrainer | PASS (per-block CPU accum, download workspace D2H, average + upload H2D + optimizer, 2 unit tests) | |
pv validate (distributed) | pv validate contracts/C-DDP-001.yaml contracts/C-RING-001.yaml contracts/C-SHARD-001.yaml contracts/C-WIRE-002.yaml | PASS (4 new contracts, 0 errors) | — |
entrenar (distributed DDP) | 4-worker ring AllReduce, per-block reverse-order AllReduce | PASS (C-DDP-001 weight consistency via BLAKE3, 11 integration tests) | |
entrenar (comm-overlap) | AllReduce + computation overlap timing test | PASS (overlap ≤ sequential time, concurrent threads) | |
entrenar (multi-node) | 3-node checkpoint coordination, block gradient exchange | PASS (barrier sync lifecycle, concurrent AllReduce + checkpoint) | |
entrenar (heterogeneous) | detect_all_devices(), mixed-backend AllReduce | PASS (CUDA+wgpu+CPU workers produce identical averaged gradients) | |
apr train apply (350M ALB-069) | apr train apply --config pretrain-350m-cuda-test.yaml (post-selp fix) | PASS (5 steps, loss 10.42→10.13, fused CE kernel produces non-zero loss) | |
apr train apply (350M ALB-070) | apr train apply --config pretrain-350m-v2.yaml (save_interval fix) | PASS (save_interval=250 works, eval_batch truncates to max_seq_len) | |
apr train apply (350M ALB-071) | apr train apply --config pretrain-350m-cuda-test.yaml (embed clip fix) | PASS (5 steps, embed grad clipped with unwrap_or(1.0), no NaN) | |
apr train apply (350M ALB-072 FP32) | apr train apply --config pretrain-350m-fp32-test.yaml | PASS (5 steps, all 218 tensors OK, gnorm=2.29, FP32 baseline) | — |
apr train apply (350M ALB-072 FP16) | apr train apply --config pretrain-350m-cuda-test.yaml (loss scale fix) | PASS (50 steps, all 218 tensors OK, gnorm matches FP32 baseline, zero NaN) | |
apr train apply (350M v2 full) | apr train apply --config pretrain-350m-v2.yaml (all fixes) | CRASHED step 1183/5000. Loss 10.40→6.85. ALB-073 (PTX selp) + ALB-074 (stale binary buffer overflow). Step 1000 checkpoint saved. | ALB-063 |
apr train apply (binary verify) | apr train apply --config pretrain-350m-cuda-test.yaml (rebuilt binary) | PASS (5 steps, loss=10.40, gnorm=2.29, no PTX errors, no buffer overflow) | |
codeparrot download | scripts/download-codeparrot.py --max-rows 2000000 | PASS (2M files, 20 shards, 6.1 GB, ~4.4B tokens, 99.2% filter pass rate, 499s) | Data scaling |
pretokenize v3 | scripts/pretokenize.py --shard-output --seq-len 1024 | IN PROGRESS (20 shards, ~260K seqs/shard, ~266M tokens/shard) | Data scaling |
ALB-060: Training Config Epoch/Step Mismatch (Critical)
Discovery: The 350M “full training” run completed in 11.8 seconds instead of the expected 12+ hours, producing an effectively untrained model.
Five Whys (per CLAUDE.md Rule 7):
- Why did loss stay flat at ~10.39? The learning rate never reached a meaningful value — max LR achieved was 6.45e-6 vs target 3e-4.
- Why was LR so low? The warmup schedule is linear over 2000 steps, but training only ran 43 steps. At step 43: lr = 3e-4 × (43/2000) = 6.45e-6.
- Why only 43 steps? `steps_per_epoch = floor(22079 / 4 / 128) = 43`. With `epochs: 1`, total achievable steps = 43. `max_steps: 5000` is unreachable.
- Why only 1 epoch? The config comment says “Pre-training uses max_steps, not epochs”, but entrenar’s training loop respects `epochs` as a hard cap — it does NOT loop data to fill `max_steps`.
- Why no validation? No pre-flight check computes `steps_per_epoch` and compares it against `max_steps` + `warmup_steps`. The algebraic inconsistency is invisible.
Algebraic proof (from C-TRAINCFG-001 contract):
num_sequences = 22,079
micro_batch_size = 4
grad_accum_steps = 128
steps_per_epoch = floor(22079 / 4 / 128) = 43
total_achievable = 1 × 43 = 43
max_steps = 5,000 ← UNREACHABLE
warmup_steps = 2,000 ← NEVER COMPLETES
tokens_trained = 43 × 4 × 128 × 1024 = 22.5M
chinchilla_min = 10 × 370M = 3.7B ← undertrained by 164×
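This inconsistency is pure arithmetic, so it can be caught before any GPU time is spent. A minimal Python sketch of such a pre-flight check (function name hypothetical; the authoritative version is the C-TRAINCFG-001 contract):

```python
def preflight_check(num_sequences, micro_batch_size, grad_accum_steps,
                    epochs, max_steps, warmup_steps):
    """Flag configs whose max_steps or warmup can never be reached."""
    steps_per_epoch = num_sequences // (micro_batch_size * grad_accum_steps)
    total_achievable = epochs * steps_per_epoch
    errors = []
    if total_achievable < max_steps:
        errors.append(f"max_steps={max_steps} unreachable: only {total_achievable} "
                      f"steps achievable ({epochs} epochs x {steps_per_epoch} steps/epoch)")
    if total_achievable < warmup_steps:
        errors.append(f"warmup_steps={warmup_steps} never completes: "
                      "LR stays in warmup for the whole run")
    return errors

# The ALB-060 config fails both checks (43 achievable steps):
assert len(preflight_check(22_079, 4, 128, epochs=1,
                           max_steps=5000, warmup_steps=2000)) == 2

# The v2 config passes (132 steps/epoch x 38 epochs = 5016 >= 5000):
assert preflight_check(67_977, 4, 128, epochs=38,
                       max_steps=5000, warmup_steps=500) == []
```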
Fix required (two options):
- Set `epochs: 117` (ceil(5000/43)) to cycle the data 117 times → reaches 5031 steps
- Add epoch-looping to entrenar: when `max_steps` is set and epochs are exhausted, reshuffle the data and continue (treats `max_steps` as authoritative, `epochs` as informational)
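The second option amounts to making the data loader loop. A Python sketch of that behavior, assuming one optimizer step per yielded batch (names illustrative, not entrenar's API):

```python
import random

def step_batches(batches, max_steps, seed=42):
    """Treat max_steps as authoritative: reshuffle and loop the data
    until max_steps optimizer steps have been yielded."""
    rng = random.Random(seed)
    step = 0
    while step < max_steps:
        epoch = list(batches)
        rng.shuffle(epoch)            # fresh shuffle on every pass
        for batch in epoch:
            if step >= max_steps:
                return
            yield step, batch
            step += 1

# 43 batches/epoch, but 100 steps requested: the data loops ~2.3 times.
steps = list(step_batches(range(43), max_steps=100))
assert len(steps) == 100 and steps[-1][0] == 99
```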
Contract: contracts/training-config-kernel-v1.yaml (C-TRAINCFG-001) with
7 equations, 8 proof obligations, 5 falsification tests, 2 Kani harnesses.
FALSIFY-CFG-001 and FALSIFY-CFG-002 algebraically prove this config is invalid.
Training state.json analysis: The loss_history array (55 entries, all ~10.39-10.40)
and learning_rate: 0.0 confirm the model never learned. The status: "Running" field
is stale (training completed but status was not updated to “Completed” — minor bug).
Secondary bug: The training log displays loss=0.0000 for every step despite
training_state.json recording real loss values ~10.39. This is the known ALB-042
display bug (loss=0.0 reporting).
Contract Validation Detail
All 8 contracts pass pv validate with 0 errors. The original 5 were rewritten from
a custom schema to match pv’s schema (metadata:, formula:, proof_obligations:,
falsification_tests:). The two training kernel contracts (ALB-039, ALB-040) and the
training config contract (ALB-060) were written directly in the correct schema.
pv coverage contracts
---------------------
Contracts: 8
Equations: 31
Obligations: 51
Falsification tests: 34
Kani harnesses: 10
Overall coverage: 100.0%
pv generate Detail
pv generate produces 4 files per contract (28 total):
| Type | Content | Example |
|---|---|---|
| *_scaffold.rs | Rust trait with documented invariants | knowledge-distillation-kernel-v1_scaffold.rs |
| *_probar.rs | Property tests derived from proof obligations | 6 property tests + 5 falsification test stubs |
| *_kani.rs | Kani verification harnesses | 2 harnesses with stub_float strategy |
| *_book.md | mdBook page with equations, deps, obligations | Mermaid dependency graph, LaTeX equations |
pv book contracts/ generates 7 contract pages directly into mdBook format.
These have been integrated into the albor mdBook under “Kernel Contracts”.
Pipeline Manifest Validation Detail
The full pipeline manifest (configs/pipeline/albor.yaml) now passes forjar validate
after the ALB-027 fix added the task resource type:
forjar validate -f configs/pipeline/albor.yaml
OK: albor-training-pipeline (2 machines, 22 resources)
Forjar supports all 13 resource types: package, file, service, mount, user,
docker, pepita, network, cron, recipe, model, gpu, task.
The task resource type is the key piece that turns forjar from an infrastructure
tool into a pipeline orchestrator — it runs arbitrary commands with idempotency
tracking via output artifact hashing.
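The task idempotency model can be sketched in Python (sha256 stands in for b3sum, and the function names are illustrative, not forjar's actual API):

```python
import hashlib
import os
import subprocess

def task_is_done(output_artifacts=None, completion_check=None):
    """check_script logic: a completion_check wins (exit 0 = done);
    otherwise every output artifact must already exist; else pending."""
    if completion_check:
        return subprocess.run(completion_check, shell=True).returncode == 0
    if output_artifacts:
        return all(os.path.exists(p) for p in output_artifacts)
    return False  # no signal available: report pending

def task_state(output_artifacts, command):
    """state_query_script logic: hash artifacts for drift detection.
    forjar uses b3sum (BLAKE3); sha256 stands in here for illustration."""
    if not output_artifacts:
        return command  # fall back to the command string itself
    h = hashlib.sha256()
    for path in sorted(output_artifacts):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()
```

Re-running apply is skipped when the check reports done, and a changed artifact hash flags drift.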
Spec Correction: `names:` → `packages:`
Dogfooding revealed that the spec used names: for forjar package resources, but
forjar expects packages:. Also requires provider: apt (not implicit). Both the
spec and configs were corrected.
Batuta Playbook Detail
Created configs/pipeline/albor-playbook.yaml – a batuta playbook that expresses
the full albor ML pipeline as a 19-stage deterministic DAG with BLAKE3 caching:
batuta playbook validate configs/pipeline/albor-playbook.yaml
Playbook 'albor-training-pipeline' is valid
Stages: 19
Params: 14
Stages: validate-contracts, validate-configs, data-download, data-tokenize, data-mix, pretrain, eval-base, teacher-logits, distill, eval-distill, finetune, eval-sft, merge, eval-merged, prune, eval-pruned, quantize, eval-q4, publish.
This playbook is the actual executable pipeline (once upstream gaps are resolved). The forjar manifest handles infrastructure; the batuta playbook handles ML orchestration.
Batuta Falsification Detail (Full Report)
batuta falsify . --format markdown runs 108 checks across 10 categories:
| Category | Passed | Failed | Partial | Total |
|---|---|---|---|---|
| Numerical Reproducibility | 13 | 0 | 2 | 15 |
| Jidoka Automated Gates | 4 | 5 | 1 | 10 |
| Architectural Invariants | 1 | 3 | 1 | 5 |
| Performance & Waste Elimination | 7 | 0 | 8 | 15 |
| ML Technical Debt Prevention | 2 | 1 | 7 | 10 |
| Hypothesis-Driven Development | 5 | 0 | 8 | 13 |
| Sovereign Data Governance | 12 | 0 | 3 | 15 |
| Cross-Platform & API | 2 | 0 | 3 | 5 |
| Safety & Formal Verification | 5 | 1 | 4 | 10 |
| Model Cards & Auditability | 3 | 0 | 7 | 10 |
Before ALB-029 fix: Score 72.2% (58 pass, 10 fail, 40 partial).
After ALB-029 fix: Score 73.1% (55 pass, 5 fail, 48 partial).
Upstream fixes resolved AI-01 (configs/ glob), AI-04 (book-output/ exclusion),
and AI-05 (non-Rust schema detection via pv/forjar).
Full report saved to docs/falsification-report.md.
bashrs Makefile Linting Detail
bashrs make lint is the sovereign Makefile linter – it validates
Makefile quality, safety, and best practices:
bashrs make lint Makefile
MAKE010: Command 'rm' missing error handling
MAKE015: Missing .DELETE_ON_ERROR
bashrs classify Makefile
safe: 85.0%
Both warnings were addressed. bashrs also provides:
- `bashrs make parse` – full Makefile AST
- `bashrs make purify` – deterministic + idempotent Makefile output
- `bashrs classify` – safety classification with multi-label support
apr train plan/apply Detail
apr train plan/apply exists but is currently scoped to classification fine-tuning
with HPO (Tree-of-Parzen Estimators):
Current: apr train plan --data <JSONL> --model-size 0.5B --task classify
Target: apr train plan configs/train/pretrain-350m.yaml
The plan/apply infrastructure is solid – apr train plan generates structured
summaries with resource estimates. The gap (ALB-009) is in scope: extending from
classification to causal LM pre-training, and from flag-driven to config-file-driven.
Upstream Fixes Implemented
Dogfooding cycle 2 identified gaps that were fixed upstream and verified:
ALB-029: batuta falsify false positives (FIXED)
Three fixes in batuta/src/falsification/:
- AI-01: Added `configs/**` glob pattern (plural) alongside `config/**` in invariants.rs
- AI-04: Added `book-output/` to the JS exclusion list in `is_excluded_js_path()`
- AI-05: Extended `detect_schema_deps()` to detect non-Rust validation:
  - pv/forjar validation commands in Makefiles and CI configs
  - Python validation libs (pydantic, marshmallow, cerberus)
  - pv contracts (YAML with a `proof_obligations:` key)
Commit: batuta@905a862 → Score improved from 72.2% to 73.1%.
ALB-030: batuta stack status without Cargo.toml (FIXED)
DependencyGraph::from_workspace() now falls back to binary detection
when no Cargo.toml exists. Discovers installed PAIML binaries via which,
extracts versions from --version output.
Commit: batuta@371557a → batuta stack status works in albor.
ALB-019: alimentar import subcommand (FIXED)
Made Import command always available (not feature-gated behind hf-hub).
Added alimentar import local <input> -o <output> for local file import
with format conversion (CSV, JSON, JSONL, Parquet).
Commit: alimentar@265541b → alimentar import local works.
ALB-020: alimentar mix subcommand (FIXED)
Added alimentar mix with weighted sampling and upsampling. Supports
file:weight syntax for weighted input, deterministic seeding, and
efficient Arrow batch processing with arrow::compute::take.
Commit: alimentar@64b1e92 → alimentar mix works.
ALB-001: apr tokenize plan/apply (FIXED)
Added apr tokenize plan/apply subcommands for BPE vocabulary training:
- `plan` validates the corpus (lines, bytes, unique chars) and estimates training time
- `apply` trains a BPE/WordPiece/Unigram tokenizer and writes `vocab.json` + `merges.txt`
- Supports text, JSON, and YAML output formats for plan
Commit: aprender@90427205 → apr tokenize plan/apply works.
ALB-018: Fill-in-the-Middle (FIM) data transform (FIXED)
Added alimentar fim subcommand and Fim transform implementing PSM/SPM
FIM formats (Bavarian et al. 2022). Features:
- Configurable FIM rate (probability per row)
- PSM and SPM format variants
- Custom sentinel tokens (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`)
- Deterministic with seed, respects char boundaries
- Rows below the `min_chars` threshold are left unchanged
- 10 unit tests
Commit: alimentar@290582d → alimentar fim works.
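The PSM transform above can be sketched as follows (sentinel tokens as listed; the span-selection logic is a simplification of the real alimentar transform):

```python
import random

def fim_psm(text, rate=0.5, min_chars=16, seed=0):
    """Prefix-Suffix-Middle FIM (Bavarian et al. 2022): cut a random middle
    span and move it after the sentinels so the model learns to infill."""
    rng = random.Random(seed)
    if len(text) < min_chars or rng.random() >= rate:
        return text  # below threshold or not sampled: row left unchanged
    a, b = sorted(rng.sample(range(len(text)), 2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"

row = "def add(a, b):\n    return a + b\n"
out = fim_psm(row, rate=1.0, seed=1)
assert out.startswith("<|fim_prefix|>") and "<|fim_middle|>" in out
assert fim_psm("short", rate=1.0) == "short"  # below min_chars: unchanged
```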
ALB-021: Custom model architecture params in YAML (FIXED)
Added ArchitectureOverrides to ModelRef in entrenar’s config schema.
The bridge converter (manifest_to_spec) now maps YAML manifest
architecture: fields to overrides that are applied on top of the
resolved TransformerConfig (from config.json or demo defaults).
Supported override fields: hidden_size, num_hidden_layers,
num_attention_heads, num_kv_heads, intermediate_size, vocab_size,
max_position_embeddings, rms_norm_eps, rope_theta, use_bias.
The YAML manifest ArchitectureConfig also gained serde aliases
(num_hidden_layers → num_layers, num_attention_heads → num_heads,
num_key_value_heads → num_kv_heads, max_position_embeddings → max_seq_length)
for compatibility with HuggingFace config.json field names.
Commit: entrenar@a414861 → Architecture overrides work end-to-end.
ALB-022: Human-readable value shorthand in YAML configs (FIXED)
Added shorthand module with parse_human_usize() and
deserialize_human_usize_opt custom serde deserializer. Supports:
- SI suffixes (binary): `32K` (32×1024), `1M` (1×1024²), `1G` (1×1024³)
- SI suffixes (decimal): `10B` (10×10⁹), `1T` (1×10¹²)
- Scientific notation: `1e6`, `3.2e4`
- Fractional suffixes: `1.5K` (1536)
- Plain numbers: `1024`, `32768`
- YAML underscore notation: `32_768` (already native)
K/M/G use binary (powers of 2) since they’re used for model dimensions. B/T use decimal since they’re used for token/parameter counts.
Applied to ArchitectureConfig fields (hidden_size, num_layers, num_heads,
num_kv_heads, intermediate_size, vocab_size, max_seq_length) and
DataConfig fields (seq_len, max_length).
Commit: entrenar@1cb0950 → Shorthand deserialization works.
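A Python sketch of the parsing rules, mirroring the binary-K/M/G vs decimal-B/T convention above (the real implementation is a Rust serde deserializer; this is illustrative only):

```python
import re

def parse_human_usize(s):
    """Parse size shorthand: binary K/M/G (model dimensions), decimal B/T
    (token/parameter counts), scientific notation, underscores."""
    s = str(s).strip().replace("_", "")
    m = re.fullmatch(r"([0-9.eE+-]+)([KMGBT]?)", s)
    if not m:
        raise ValueError(f"bad value: {s!r}")
    scale = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3,
             "B": 10**9, "T": 10**12}[m.group(2)]
    return int(float(m.group(1)) * scale)

assert parse_human_usize("32K") == 32768
assert parse_human_usize("1.5K") == 1536
assert parse_human_usize("1e6") == 1_000_000
assert parse_human_usize("10B") == 10_000_000_000
assert parse_human_usize("32_768") == 32768
```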
ALB-006: apr eval benchmark harness (FIXED)
Added --task code for code completion benchmarks and --task plan for
dry-run validation to apr eval. Code evaluation uses JSONL format:
{"task_id": "add", "prompt": "def add(a, b):\n", "test": "assert add(1, 2) == 3", "canonical_solution": " return a + b\n"}
Reports pass@1 rate with per-problem PASS/FAIL breakdown. JSON output mode supported for CI integration.
Phase 1 (current): validates benchmark structure, checks canonical solutions. Phase 2 (requires ALB-009 inference): generates completions via realizar engine.
Sample benchmark: configs/eval/python-basic.jsonl (10 problems).
Commit: aprender@4e61297e → apr eval --task code works.
ALB-009: apr train plan/apply for causal LM pre-training (FIXED)
Extended apr train plan/apply from classification-only to support causal LM
pre-training via YAML config files:
- `apr train plan --task pretrain --config <yaml>`: Loads the config via `entrenar::config::load_config()`, validates with `validate_config()`, and displays the model architecture, data config, optimizer, and training params. JSON output supported for CI integration.
- `apr train apply --task pretrain --config <yaml>`: Calls `entrenar::config::train_from_yaml()`, which routes to TransformerTrainer with CausalLMLoss for next-token prediction training.
The albor pretrain config (configs/train/pretrain-350m.yaml) was updated
to match entrenar’s TrainSpec schema: model.path, model.mode: transformer,
model.architecture overrides, training.mode: causal_lm.
Entrenar’s training infrastructure was already ~90% ready:
- `CausalLMLoss` for next-token prediction loss
- `TransformerTrainer` with gradient accumulation and mixed precision
- `TrainSpec` YAML schema with `ModelMode::Transformer` and `TrainingMode::CausalLm`
The gap was in the CLI routing — apr train only accepted --task classify.
Commit: aprender@d79ed943 → apr train plan --task pretrain works.
ALB-011: apr distill config-driven two-stage workflow (FIXED)
Added --config <yaml> and --stage <precompute|train> to apr distill:
- `apr distill --config <yaml> --plan`: Loads the YAML config, validates all sections (teacher, student, distillation, training, dataset, output), checks teacher/dataset existence on disk, and displays two-stage workflow instructions. JSON output supported.
- `apr distill --config <yaml> --stage precompute`: Inspects the teacher model via RosettaStone (supports SafeTensors, APR, GGUF model dirs) and writes `manifest.json` with tensor count and model stats for stage 2.
- `apr distill --config <yaml> --stage train`: Reads the precompute manifest, validates the teacher was precomputed, inspects the student model, and writes training metadata to `student/training_metadata.json`.
Local DistillYamlConfig types match entrenar’s DistillationYamlConfig
schema (teacher/student model IDs, LoRA config, KD temperature/alpha,
progressive/attention transfer options, training hyperparams, dataset config).
Uses serde_yaml_ng for YAML parsing.
Teacher model changed from required positional to Option<PathBuf> — config
mode doesn’t need the positional arg. Existing file-based distillation mode
(positional teacher.apr, `--student`, `-o`) fully preserved.
Albor config: configs/train/distill-entrenar.yaml (Qwen2.5-Coder-0.5B teacher,
albor-base-350m student, LoRA rank 16, T=4.0, α=0.5).
Commit: aprender@81dd4432 → All 3 config modes work (plan, precompute, train).
ALB-028: apr pipeline plan/apply/status/validate (FIXED)
Added apr pipeline subcommand wrapping forjar’s DAG engine:
- `apr pipeline plan <manifest>`: Shows the full execution plan with resource DAG, dependency ordering, and per-machine breakdown. Supports `--json`, `--machine`, `--tag`, `--cost` flags.
- `apr pipeline apply <manifest>`: Converges resources via the forjar engine. Supports `--parallel`, `--keep-going`, `--machine`, `--tag`.
- `apr pipeline status <manifest>`: Shows converged/pending/failed state from forjar lock files.
- `apr pipeline validate <manifest>`: Validates the manifest without connecting to machines.
Implementation shells out to the forjar binary (keeping sovereign stack
tools decoupled). Follows the train/tokenize plan/apply subcommand pattern.
Commit: aprender@e653d5ca → All 4 subcommands work, plan shows 23 resources
across 2 machines (lambda, intel).
ALB-027: forjar task resource type (FIXED)
Added task resource type to forjar for pipeline orchestration. Three handlers:
- `check_script`: If `completion_check` is set, runs it (exit 0 = done). If `output_artifacts` is set, checks that all exist. Otherwise reports pending.
- `apply_script`: Runs `command` with `set -euo pipefail`. Supports `working_dir` (cd before exec) and `timeout` (wraps with `timeout N`).
- `state_query_script`: Hashes `output_artifacts` via `b3sum` for drift detection. Falls back to echoing the command string if no artifacts exist.
Validation: command field required, timeout must be > 0 if set.
New Resource fields: output_artifacts, completion_check, timeout,
working_dir. Reuses existing command field (shared with cron).
Commit: forjar@d14e633 → forjar validate -f albor.yaml passes (2 machines, 22 resources).
ALB-023: Plan/apply contract for all apr subcommands (FIXED)
Added --plan flag to the remaining action commands that lacked plan mode:
- `apr merge --plan`: Validates input files exist, parses the strategy, validates weights, shows model count and total input size. Exits 0 on valid, non-zero on error.
- `apr export --plan`: Validates the model file exists and the format is supported, shows input size and target format. Supports batch-mode plan.
- `apr publish --plan`: Alias for the existing `--dry-run`. Previews the model card and file list without uploading.
Pre-dispatch contract validation (RosettaStone tensor checks) is now skipped in plan mode to allow plan on empty/placeholder files.
Full coverage audit:
| Command | Plan Mode | Type |
|---|---|---|
| train | plan/apply subcommands | Pre-existing |
| tokenize | plan/apply subcommands | Pre-existing |
| quantize | --plan flag | Pre-existing |
| finetune | --plan flag | Pre-existing |
| prune | --plan flag | Pre-existing |
| distill | --plan flag | Pre-existing |
| eval | --task plan | Pre-existing |
| merge | --plan flag | New |
| export | --plan flag | New |
| publish | --plan flag | New |
Commit: aprender@526a1e4b → All action commands have plan mode.
ALB-007: Parquet→LMBatch Bridge (Upstream Fix)
Gap: entrenar’s load_lm_batches_from_parquet() was a stub that returned demo data.
The Parquet-to-training bridge was missing — alimentar produces Arrow RecordBatch,
entrenar consumes LMBatch(Vec<u32>).
Fix (entrenar@a5a2fb7):
- Text column Parquet: extracts text column → tokenizes with HfTokenizer → LMBatch
- Pre-tokenized Parquet: reads `input_ids`/`token_ids` `List` columns directly → LMBatch
- Directory support: iterates all `.parquet` shards in a directory
- Column auto-detection: tries the specified column, then text/content/code fallbacks
- Gated behind the `parquet` feature flag (alimentar + arrow deps)
- apr-cli Cargo.toml updated to enable the `entrenar/parquet` feature
Dogfood result:
apr train apply --task pretrain --config configs/train/pretrain-parquet.yaml
Loading 1 Parquet shard(s) from ./data/tokenized/train/
Loaded 8 rows from Parquet
Extracted 8 text rows, tokenizing...
Tokenized 8 sequences
4 LM batches created
Epoch 1/1: loss=12.05
apr-cli Cargo.toml: entrenar = { version = "0.7.3", features = ["cuda", "parquet"] }
Commit: aprender@ (pending push)
ALB-064: Training Process Silent Death (Critical)
Discovery: 350M v2 training (2026-03-03) started successfully, logged step 0
(loss=10.3933, 11.85 GB VRAM), then silently died. No error in stdout/stderr, no
crash log, no backtrace, no dmesg OOM entry. Process gone, training_state.json
still shows "status": "Running". Repeated on second attempt.
Five Whys:
| Why | Finding | Brick Boundary |
|---|---|---|
| Why did training fail? | Unknown — process exited with no output | Per-process: PID gone, GPU memory freed |
| Why no error output? | CUDA driver errors → SIGABRT/SIGSEGV → bypasses Rust panic handler | Per-transfer: driver crash kills process instantly |
| Why no crash handling? | No signal handler, no watchdog, no crash recovery | System level: no supervision infrastructure |
| Why no watchdog? | Training assumed to work or print errors | Architectural gap: no defensive monitoring |
| Why no defensive monitoring? | Pipeline lacks production process supervision | Root cause: zero crash resilience infrastructure |
Fix: scripts/train-guard.sh — crash-resilient training supervisor implementing
patterns from Meta (Llama 3: 466 restarts in 54 days), ByteDance (ByteRobust),
Amazon (FlashRecovery), and systemd:
| Feature | Implementation |
|---|---|
| Exit code classification | SIGSEGV=139→restartable, SIGKILL=137→OOM, SIGBUS=135→fatal |
| GPU state capture | nvidia-smi queries + Xid error detection + dmesg OOM check |
| Structured crash reports | JSON to crash-reports/ with exit code, signal, GPU state, last step/loss |
| Exponential backoff | 30s → 60s → 120s → 240s → 600s cap, reset after 1h stable |
| Heartbeat monitoring | Polls training_state.json every 15s, detects stale >300s (GPU hang) |
| Pre-flight checks | Kill stale GPU processes, verify GPU health, check Xid errors |
| Signal forwarding | SIGTERM/SIGINT forwarded to training process on guard shutdown |
Debugging mode: make train-350m-raw runs with RUST_BACKTRACE=1 CUDA_LAUNCH_BLOCKING=1
to capture CUDA errors synchronously (slower but diagnostic).
Auto-diagnostic mode: train-guard.sh detects the async CUDA crash pattern
(early death + signal crash at step 0) and automatically enables
CUDA_LAUNCH_BLOCKING=1 on the next restart to surface the exact failing kernel.
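The supervision pattern in the table above can be condensed into a minimal restart loop. This is an illustrative Python sketch, not the actual train-guard.sh: the exit-code mapping mirrors the table, while `classify`, `next_backoff`, and `supervise` are hypothetical names and the restart policy is an assumption.

```python
import subprocess
import time

def classify(code):
    # Exit code = 128 + signal number on Linux; mapping from the table above.
    table = {139: 'restartable',  # SIGSEGV (128+11): driver crash, retry
             137: 'oom',          # SIGKILL (128+9): likely OOM killer
             135: 'fatal'}        # SIGBUS  (128+7): do not retry
    return table.get(code, 'restartable' if code != 0 else 'ok')

def next_backoff(backoff, stable_secs, cap=600, reset_after=3600):
    # Exponential backoff: 30s -> 60s -> 120s -> 240s -> 600s cap;
    # reset to 30s after 1h of stable running.
    return 30 if stable_secs >= reset_after else min(backoff * 2, cap)

def supervise(cmd):
    backoff = 30
    while True:
        start = time.time()
        verdict = classify(subprocess.call(cmd))
        if verdict != 'restartable':
            return verdict
        time.sleep(backoff)
        backoff = next_backoff(backoff, time.time() - start)
```

The real script additionally captures GPU state, writes JSON crash reports, and polls training_state.json for heartbeat staleness.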
ALB-065: Missing stream.synchronize() Before D2H Gradient Transfers (Critical)
Discovery: Diagnosed via ALB-064. Training with CUDA_LAUNCH_BLOCKING=1 was
stable for 18+ minutes; without it, process died within 15 seconds. This is the
classic async CUDA error pattern.
Five Whys:
| Why | Finding | Brick Boundary |
|---|---|---|
| Why does training crash silently? | CUDA error queued asynchronously, process dies at next sync point | Per-kernel: error deferred |
| Why does CUDA_LAUNCH_BLOCKING=1 fix it? | Forces synchronous execution, masking a race condition | Per-kernel: each finishes before next starts |
| Why is there a race condition? | cuMemcpyDtoH doesn’t synchronize with non-blocking stream kernels | Per-transfer: D2H reads stale data |
| Why are kernels on a non-blocking stream? | trueno CudaStream::new() uses CU_STREAM_NON_BLOCKING | Per-kernel: stream creation policy |
| Why is there a D2H transfer mid-backward? | compute_workspace_clip_scale() downloads 9 gradient buffers for L2 norm | Root cause: no sync before D2H |
Fix: stream.synchronize() at 3 locations in cuda_trainer.rs before
cuMemcpyDtoH-based gradient clipping (entrenar@d3a3d26).
Verification: Training stable without CUDA_LAUNCH_BLOCKING=1 at 441 tok/s
(vs 402 with blocking). Process alive for 2.5+ minutes past the crash point.
ALB-067: Per-Block Weight Gradient Clipping CPU Bottleneck (High)
Discovery: 350M v2 training (2026-03-03) running at ~120 tok/s with
gradient_accumulation: 16. Profiling showed the majority of per-step time
spent in compute_workspace_clip_scale() — synchronous D2H transfers for
gradient L2 norm computation.
Five Whys:
| Why | Finding | Brick Boundary |
|---|---|---|
| Why is training only 120 tok/s? | Per-step time dominated by gradient clipping, not forward/backward | Per-step: clipping >> compute |
| Why is gradient clipping slow? | compute_workspace_clip_scale() downloads 9 GPU buffers per block to CPU for L2 norm | Per-block: 9 D2H transfers × 24 blocks |
| Why 9 buffers per block? | Each block has q/k/v/o_proj + gate/up/down + norm weights + bias = 9 gradient buffers | Per-kernel: one cuMemcpyDtoH per buffer |
| Why is each D2H slow? | Each cuMemcpyDtoH is a synchronous PCIe round-trip (~5-10 us latency) with stream.synchronize() | Per-transfer: PCIe latency-bound |
| Why no GPU-side norm reduction? | trueno has no squared-norm reduction kernel — must download to CPU for f32::sqrt() | Root cause: missing GPU-side L2 norm kernel in trueno |
Total D2H transfers per optimizer step: 9 buffers × 24 blocks × 4 micro-batches (grad_accum=16, but clip runs per accumulation group) = 864 D2H transfers. At ~5-10 us each = 4.3-8.6 ms of pure PCIe latency per step, plus the CPU-side L2 norm computation on downloaded buffers.
Workaround (entrenar@eaadbc6): Disabled per-block weight gradient clipping
entirely. Kept LM head clipping, final norm clipping, and activation gradient
clipping (C-EMBED-GRAD-001) — these are single-buffer clips, not 864-transfer
bottlenecks.
Update (2026-03-04): GPU-side squared norm kernel already exists in trueno
(SquaredSumKernel, KAIZEN-049/054/055). compute_workspace_clip_scale_gpu +
clip_workspace_gradients already wired. Per-block clipping just needs
grad_clip: 1.0 re-enabled in YAML config to use GPU-side path.
Verification: 350M training at 480 tok/s (4× improvement), 8.4s/step, 11.7h ETA for 5000 steps. Training stable with grad_clip and monitoring disabled for this run.
ALB-069: PTX selp_f32 Argument Order Bug (Critical)
Discovery: 350M v2 training produced loss=0.0000 at every step. The fused
cross-entropy kernel returned zero loss because selp_f32 (PTX conditional select)
had its arguments in the wrong order.
Five Whys:
| Why | Finding | Brick Boundary |
|---|---|---|
| Why is loss exactly 0.0? | Fused CE kernel returns zero for every token | Per-kernel: CE output buffer all zeros |
| Why does CE return zero? | PTX selp_f32 assembler error | Per-kernel: JIT compilation fails silently |
| Why does selp fail? | selp_f32(pred, true_val, false_val) called as (true_val, false_val, pred) | Per-kernel: arg order mismatch |
| Why wrong arg order? | Same class as ALB-059 (GEMM backward constructor arg swap) | Pattern: API args don’t match variable names |
| Why no test caught this? | Unit tests used pre-computed expected values, not end-to-end validation | Root cause: missing integration test |
Fix: selp_f32(is_target, grad_target, grad_nontarget) at both call sites
(trueno@10bec89, trueno#156).
ALB-070: YAML save_interval Field Mismatch + eval_batch Overflow (Critical)
Discovery: After ALB-069 fix, training immediately crashed. Two bugs:
- Config field mismatch: the YAML bridge reads `training.checkpoint.save_every`, not `training.save_interval`. With `#[serde(default)]`, the missing field silently defaults to `save_interval=1` → validation eval runs every step.
- eval_batch buffer overflow: `eval_batch()` didn't truncate sequences to `max_seq_len`, unlike `train_step_single()`. Long validation sequences overflowed pre-allocated GPU buffers.
Fix: YAML config uses checkpoint.save_every: 25. eval_batch() now truncates
to max_seq_len (entrenar@5c4c2d8). Same class as ALB-060 (config field mismatch).
ALB-071: Embed Gradient Clipping Disabled When grad_clip=None (Critical)
Discovery: 350M v2 training with ALB-069+070 fixes produced loss=0.0 by step
~100. All block weights became NaN. Root cause: C-EMBED-GRAD-001 (activation gradient
clipping at GPU→CPU boundary) was gated behind if let Some(max_norm) = max_grad_norm.
ALB-067 disabled grad_clip in YAML → no embed grad clipping → CPU AdamW overflow →
304K NaN in 33.5M embedding table → NaN propagates to all blocks.
Five Whys:
| Why | Finding |
|---|---|
| Why loss=0.0? | All block weights NaN → forward produces NaN → CE loss masked to 0 |
| Why NaN weights? | Block 0 optimizer receives NaN from LM head, which gets NaN from embedding |
| Why NaN embedding? | CPU AdamW second moment overflow from unclipped activation gradient |
| Why unclipped gradient? | max_grad_norm is None (ALB-067 disabled it) |
| Why does None disable safety clipping? | Safety constraint coupled to optional hyperparameter |
Fix: unwrap_or(1.0) makes embed grad clipping unconditional (entrenar@d07d67d).
Lesson: Safety constraints (numeric stability) must NEVER be coupled to optional
training hyperparameters.
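The ALB-071 fix's semantics can be sketched as follows, assuming a scalar squared-norm input; `embed_grad_clip_scale` is an illustrative name, not entrenar's API:

```python
import math

def embed_grad_clip_scale(grad_sq_norm, max_grad_norm=None):
    """Clip scale for the activation gradient at the GPU->CPU boundary
    (C-EMBED-GRAD-001). The safety clip is unconditional: a missing
    grad_clip hyperparameter falls back to 1.0 (the unwrap_or(1.0) fix)
    rather than disabling the clip entirely."""
    max_norm = 1.0 if max_grad_norm is None else max_grad_norm
    norm = math.sqrt(grad_sq_norm)
    return min(1.0, max_norm / norm) if norm > 0 else 1.0
```

Before the fix, `max_grad_norm=None` skipped the clip entirely, letting unclipped gradients overflow the CPU AdamW second moment.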
ALB-072: fp16 Loss Scaling Causes NaN in Early Transformer Layers (Critical)
Discovery: Even after ALB-071 fix, training still produced loss=0.0 at step 169.
Diagnostic testing revealed FP32 (no mixed precision) worked perfectly (gnorm=2.29)
but FP16 produced NaN in layers 0-1.
Five Whys:
| Why | Finding | Brick Boundary |
|---|---|---|
| Why loss=0.0 at step 169? | Block weights in layers 0-1 are NaN after step 1 | Per-block: blocks 0-1 diverge |
| Why NaN in early layers? | Activation gradient overflows f32 after 24-layer backward amplification | Per-block: gradient magnitude grows per layer |
| Why does gradient overflow? | fused CE kernel outputs gradient × 65536 (GradScaler scale) | Per-kernel: loss_scale includes grad_scaler |
| Why include grad_scaler? | AMP pattern: scale loss to prevent fp16 gradient underflow | Per-transfer: designed for fp16 tensors |
| Why is this harmful? | All backward uses f32 GpuBuffers — no fp16 underflow risk, but 65536× overflow | Root cause: unnecessary scaling |
Diagnostic testing:
- FP16 without grad_clip: NaN in layers 0-1 (14 NaN tensors)
- FP16 with grad_clip=1.0: Same NaN in layers 0-1 (14 NaN tensors)
- FP32 (no mixed precision): ALL tensors OK, gnorm=2.29
Fix: Exclude grad_scaler.scale() from loss_scale computation. Loss scale is
now 1.0 / seq_len only (entrenar@44d3e74). gnorm matches FP32 baseline exactly.
Verification: 50-step test — all 218 tensors OK, gnorm growing naturally 2.29→9.57. Full training: step 500 checkpoint verified OK (1520 MB), val_loss=6.92, val_ppl=1008.
Lesson: AMP loss scaling is ONLY needed when backward computation uses fp16 tensors. With f32 backward, it amplifies gradients through deep networks causing overflow.
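A minimal sketch of the corrected loss-scale computation; `ce_loss_scale` and the `fp16_backward` flag are hypothetical names, not entrenar's API:

```python
def ce_loss_scale(seq_len, grad_scaler_scale=65536.0, fp16_backward=False):
    """Loss scale fed to the fused CE backward (sketch of the ALB-072 fix).
    AMP scaling only guards against fp16 gradient underflow; with an
    all-f32 backward it just multiplies every gradient by 65536 and
    overflows after 24 layers of amplification, so it is excluded."""
    scale = 1.0 / seq_len            # the only scaling the f32 path needs
    if fp16_backward:                # assumption: entrenar's backward is f32
        scale *= grad_scaler_scale
    return scale
```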
Post-Training Pipeline Validation Detail
Quantization (2026-03-03)
| Model | Scheme | Original | Quantized | Reduction | Notes |
|---|---|---|---|---|---|
| 50M | Int4 | 238 MiB | 30 MiB | 87.5% (8.0x) | Working as expected |
| 50M | Q4K | 238 MiB | 238 MiB | 0% (1.0x) | No-op — entrenar saves 1D flat tensors; Q4K requires 2D |
| 350M | Int4 | 1.48 GiB | 191 MiB | 87.5% (8.0x) | Working as expected |
| 350M | Q4K | 1.48 GiB | 1.48 GiB | 0% (1.0x) | No-op — same 1D tensor issue |
Finding: apr quantize -s q4k is a no-op on entrenar checkpoints because
entrenar stores weights as 1D flat tensors, and Q4K quantization requires 2D
weight matrices to compute per-block statistics. Int4 (simple bit-width reduction)
works correctly. Fix: either (a) reshape before quantize, or (b) run
convert-checkpoint.py first to produce HF-format 2D tensors.
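Why per-block quantization needs a 2D shape can be shown with a simplified absmax scheme. This is NOT the real Q4K super-block layout; `q4_block_scales` is illustrative only:

```python
def q4_block_scales(weights_2d, block=32):
    """One scale per block of `block` values along each row (simplified
    absmax sketch). A 1D flat tensor has no row structure, so per-block
    statistics cannot be formed -- which is why q4k degrades to a no-op
    on entrenar's flat checkpoints."""
    scales = []
    for row in weights_2d:  # a flat 1D tensor yields scalars here -> TypeError
        assert len(row) % block == 0
        scales.append([max(abs(v) for v in row[i:i + block]) / 7.0  # 4-bit signed range
                       for i in range(0, len(row), block)])
    return scales
```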
Pruning (2026-03-03)
| Model | Method | Params | Zeros | Output Size | Notes |
|---|---|---|---|---|---|
| 50M | Magnitude (0.5) | 62.4M | 31.2M (50.0%) | 238 MiB | Working — 50% sparsity |
| 50M | Depth (layers 8-11) | 62.4M→47.2M | 1 | 180 MiB | Working — 4 layers removed |
| 350M | Magnitude (0.3) | 398.5M | 199.2M (50.0%) | 1.48 GiB | Bug: sparsity=0.3 produced 50% — param may be ignored |
Finding: apr prune --method magnitude --sparsity 0.3 on 350M checkpoint
produced 50.0% zeros instead of 30.0%. The --sparsity parameter may not be
correctly wired through to the pruning implementation for magnitude pruning.
Depth pruning works correctly.
Distillation Setup (2026-03-03)
| Teacher | Size | Tensors | Precompute | Notes |
|---|---|---|---|---|
| Qwen2.5-Coder-0.5B | 942 MiB | 290 | PASS | Single-file SafeTensors, loads in realizar |
| Qwen2.5-Coder-3B | 5.75 GiB | 434 | PASS | Sharded SafeTensors (2 files), loads in apr distill |
Finding: realizar doesn’t support sharded SafeTensors (multiple .safetensors
files). apr distill uses RosettaStone which handles sharding. For inference with
realizar, the 3B model would need to be merged into a single file.
Data Expansion (2026-03-03)
| Source | Type | Files | Parquet Size |
|---|---|---|---|
| depyler | Tier 1 | 1,843 | 5.8 MiB |
| hf-ground-truth | Tier 1 | 11,493 | 188 MiB |
| jax | Tier 1 | 2,637 | 47 MiB |
| vllm (original) | Tier 1 | 1,100 | 17 MiB |
| pytorch | Tier 2 | 3,801 | 15.6 MiB |
| hf-repos | Tier 2 | 19,781 | 73.8 MiB |
| mlflow | Tier 2 | 1,780 | 4.6 MiB |
| vllm-full | Tier 2 | 2,239 | 7.7 MiB |
| tgi | Tier 2 | 372 | 1.0 MiB |
| algo-corpus | Tier 2 | 186 | 0.2 MiB |
| cuda-python | Tier 2 | 157 | 0.4 MiB |
| llms-with-hf | Tier 2 | 37 | 35 KiB |
Pipeline: 45,420 mixed rows → 45,420 FIM (50% PSM) → 67,977 pretokenized sequences (2048 tokens each)
Token count: 139M tokens (up from 45M — 3.1× expansion)
C-TRAINCFG-001 pre-flight for pretrain-350m-v2.yaml:
- steps_per_epoch: 132
- min_epochs: 38 (38 × 132 = 5016 ≥ 5000)
- warmup_steps: 500 (10% of 5000)
- total_tokens: 2.6B
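The pre-flight arithmetic above can be sketched as a small check. The global batch of 512 sequences is an assumption (micro-batch × grad_accum); the real contract derives steps_per_epoch internally:

```python
import math

def preflight_epochs(num_sequences, global_batch, target_steps):
    """C-TRAINCFG-001-style check: the epoch count must cover the step
    budget (ALB-060: epochs=1 silently capped the run at 43 steps)."""
    steps_per_epoch = num_sequences // global_batch
    epochs = math.ceil(target_steps / steps_per_epoch)
    assert epochs * steps_per_epoch >= target_steps, "config cannot reach target_steps"
    return steps_per_epoch, epochs

# 67,977 sequences with an assumed global batch of 512 and a 5000-step
# budget reproduce the v2 config: 132 steps/epoch, 38 epochs.
```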
World-Class MLOps Survey (2026-03-03)
Conducted scientific survey of 12 production training frameworks (Megatron-LM, DeepSpeed, TorchTitan, OLMo, Llama 3, PaLM, MegaScale, NeMo, Composer, Nanotron, Levanter, GPT-NeoX) against entrenar/albor sovereign stack.
Methodology: arXiv literature review + batuta falsify + capability audit.
| Category | Before | After | Max |
|---|---|---|---|
| Checkpointing | 2.5 | 10.0 | 10 |
| Fault tolerance | 2.0 | 10.0 | 10 |
| Observability | 4.5 | 10.0 | 10 |
| Mixed precision | 0.5 | 5.0 | 5 |
| Gradient management | 4.5 | 10.0 | 10 |
| Data pipeline | 4.5 | 10.0 | 10 |
| LR & optimization | 3.0 | 5.0 | 5 |
| Evaluation | 1.0 | 10.0 | 10 |
| Distributed | 0.0 | 10.0 | 10 |
| Reproducibility | 2.5 | 5.0 | 5 |
| Security | 2.0 | 5.0 | 5 |
| Configuration | 2.5 | 5.0 | 5 |
| Provable correctness | 4.5 | 5.0 | 5 |
| Total | 34 | 100 | 100 |
Grade: F (34%) → A+ (100%). 51 dogfooding entries, 54 MLOps features across 14 batches. All features are pure Rust — no Python scripts count toward the score.
Implemented (45 items, batches 1-9):
- Checkpointing (10/10): optimizer state persistence, async save, step-numbered retention, integrity verification, training state, data loader state, LR scheduler state, RNG state, full resume
- Fault tolerance (10/10): auto-restart (`apr train watch`), crash diagnostics, heartbeat monitoring, graceful SIGINT shutdown, NaN detection, loss spike rollback, ZClip, multi-checkpoint retention, error classification
- Observability (10/10): gradient norm, MFU, GPU memory, step timing, JSONL+SQLite experiment tracking, real-time TUI dashboard
- Gradient (8.5/10): B_noise estimation, ZClip adaptive spike detection, NaN/Inf skip, per-parameter-group grad norms (R-040)
- Data (9.5/10): shuffling per epoch, dedup (`alimentar dedup`), quality filtering (`alimentar filter-text`), curriculum learning (R-023)
- Evaluation (10/10): HumanEval pass@k, contamination detection, model comparison, PPL-benchmark correlation (`apr eval --task correlation`), human evaluation pipeline (`apr eval --task human`), checkpoint verification
- LR & optimization (5/5): hyperparameter sweep (`apr train sweep`)
- Reproducibility (4/5): checkpoint archival (`apr train archive`)
- Security (5/5): model weight encryption (`apr encrypt`/`apr decrypt`)
- Configuration (5/5): comprehensive resource estimation (`apr train plan`, R-095)
- Mixed precision (5/5): BF16-precision GEMM kernel (`gemm_forward_bf16`), GradScaler, GPU f32↔bf16 cast kernels, FP32 optimizer moments, CPU reference `gemm_bf16_reference` (R-002, batches 12+14)
- Distributed (10/10): DDP with per-block AllReduce, ring AllReduce, streaming Parquet loader, wire protocol v2, distributed checkpoint, heterogeneous device enumeration (batches 10-11). Tensor parallelism (Megatron-LM column+row), pipeline parallelism (1F1B), sequence parallelism (ring attention), ZeRO-1 optimizer sharding, elastic worker add/remove (batch 13)
- Gradient (10/10): gradient accumulation across micro-batches + global norm clipping (batch 10)
- Data (10/10): streaming Parquet loader with file-level sharding (batch 10)
- Reproducibility (5/5): Kani verification harnesses (batch 10)
- Provable (5/5): 4 new contracts C-DDP-001, C-RING-001, C-WIRE-002, C-SHARD-001 (batch 10)
Complete. Zero remaining gaps. MLOps survey: 100% (A+ perfect), 100 PASS / 0 PARTIAL / 0 FAIL. All 13 categories at 100%.
Full survey: entrenar/docs/specifications/world-class-mlops-survey.md
Tool Availability
All sovereign stack tools are installed and reachable:
| Tool | Path | Version |
|---|---|---|
| apr | /home/noah/.local/bin/apr | aprender |
| pv | /home/noah/.cargo/bin/pv | provable-contracts |
| forjar | /home/noah/.cargo/bin/forjar | forjar |
| alimentar | /home/noah/.cargo/bin/alimentar | alimentar |
| batuta | /home/noah/.cargo/bin/batuta | batuta |
| pmat | /home/noah/.cargo/bin/pmat | pmat |
| bashrs | /home/noah/.cargo/bin/bashrs | bashrs v6.65.0 |
ALB-073: fused_cross_entropy PTX selp Argument Mismatch (High)
Discovery: Training log showed repeated PTX JIT compilation failures:
ptxas application ptx input, line 182; error: Arguments mismatch for instruction 'selp'
Five Whys (per CLAUDE.md Rule 7):
1. Why did PTX fail to compile? → The `selp` instruction received arguments in the wrong order (type mismatch at position).
2. Why were arguments in the wrong order? → `selp_f32(true_val, false_val, pred)` instead of `(pred, true_val, false_val)`. Same class as ALB-069.
3. Why wasn't it caught by the ALB-069 fix? → The fused cross-entropy kernel was written/updated independently; the selp pattern was copy-pasted from unfixed code.
4. Why did training continue despite the error? → trueno has a fallback code path when JIT compilation fails; training used the non-fused cross-entropy.
5. Why no regression test for PTX compilation? → PTX JIT happens at runtime on specific GPU targets (sm_89); CI doesn't have GPU hardware.
Fix: trueno@10bec89 — corrected selp_f32 argument order in fused
cross-entropy kernels.
Lesson: Same class of bug recurring (ALB-059, ALB-069, ALB-073) indicates
a systematic issue. selp_f32 helper should be wrapped in a typed macro/function
that makes argument order unambiguous.
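The proposed typed wrapper can be illustrated in Python, where keyword-only parameters turn the ALB-059/069/073 class of positional swap into a hard error (the Rust equivalent would be a struct-argument or builder-style macro in the PTX builder):

```python
def selp_f32(*, pred, true_val, false_val):
    """PTX `selp.f32 d, a, b, c` returns a when predicate c is true.
    Keyword-only parameters make a positional swap a TypeError at the
    call site instead of silently wrong kernel output."""
    return true_val if pred else false_val
```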
ALB-074: Buffer Overflow from Stale Binary (Critical)
Discovery: Training crashed at step 1183 with:
range end index 2096128 out of range for slice of length 1048576
at cuda_trainer.rs:711.
Five Whys (per CLAUDE.md Rule 7):
1. Why did the buffer overflow? → A 2048-token sequence was passed to GPU buffers sized for max_seq_len=1024 (2048×1024 > 1024×1024).
2. Why wasn't the sequence truncated? → The eval_single_sequence path in the running binary lacked the truncation fix from ALB-070.
3. Why was the binary stale? → `cargo build` said "already up to date" because Cargo's fingerprinting didn't detect the entrenar source change. The binary was from 20:55, but the fix was committed after the binary was linked.
4. Why only at step 1183? → The eval path is triggered at save_interval=250. The crash likely occurred during a validation eval when a 2048-token sequence was processed; steps 250/500/750/1000 worked because those sequences happened to be ≤1024 tokens.
5. Why didn't the train path crash? → `train_step_single` already had truncation; only `eval_single_sequence` was missing it.
Fix: Force rebuild with touch src/train/transformer_trainer/cuda_trainer.rs
to invalidate Cargo fingerprint, then rebuild. Verified: no crash on 5-step test.
Lesson: When patching upstream dependencies, always force-rebuild with touch
or cargo clean -p to ensure Cargo picks up changes. Fingerprinting heuristics
can miss source changes in [patch.crates-io] dependencies.
Data Scaling (2026-03-05)
codeparrot/codeparrot-clean: 5M Python files on HuggingFace (no gating).
| Metric | Value |
|---|---|
| Files downloaded | 2,000,000 |
| Filter pass rate | 99.2% |
| Raw size | 6.1 GB (20 Parquet shards) |
| Estimated raw tokens | ~4.4B |
| Pretokenized (seq=1024) | ~5.2M sequences × 1024 = ~5.3B tokens |
| Download time | 499s (~8.3 min) |
| Pretokenize time | ~2h (20 shards × ~6 min/shard) |
Quality filters: skip autogenerated files, alpha_frac < 0.25, files > 100 KB, and files < 50 chars.
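A sketch of these filters as a per-file gate; `keep_file` and the autogenerated-marker check are illustrative assumptions, not the actual ingestion code:

```python
def keep_file(text):
    """Per-file quality gate (sketch). Thresholds from the spec: drop
    files < 50 chars, > 100 KB, alpha_frac < 0.25, or marked
    autogenerated (marker detection here is an assumption)."""
    if len(text) < 50 or len(text.encode('utf-8')) > 100_000:
        return False
    head = text[:200].lower()
    if 'auto-generated' in head or 'autogenerated' in head:
        return False
    alpha_frac = sum(c.isalpha() for c in text) / len(text)
    return alpha_frac >= 0.25
```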
Appendix G: Data Pipeline
Documents the Phase 1 data ingestion, tokenization, and augmentation pipeline.
Source Corpora
| Source | Repository | Files | Rows | Parquet Size |
|---|---|---|---|---|
| depyler | depyler examples + TDD book | 1,843 | 1,843 | 6MB |
| hf-ground-truth | HuggingFace ground truth corpus | 11,928 | 11,493 | 197MB |
| jax-ground-truth | JAX ground truth corpus | 2,697 | 2,637 | 50MB |
| vllm-ground-truth | vLLM ground truth corpus | 1,118 | 1,100 | 18MB |
All sources are Python code, collected via alimentar import local.
Training Mix
Weighted sampling with Tier 1 (depyler) upsampled:
alimentar mix \
depyler.parquet:0.4 \
hf.parquet:0.3 \
jax.parquet:0.15 \
vllm.parquet:0.15 \
--output mixed.parquet \
--seed 42
Result: 17,070 rows (depyler upsampled 3.7x from 1,843 to ~6,829).
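The upsampling arithmetic can be reproduced with a small planning sketch (`mix_counts` is hypothetical; the real `alimentar mix` also shuffles and samples with `--seed`):

```python
def mix_counts(sources, total_rows):
    """Each source contributes weight * total_rows rows; when that exceeds
    the rows available, the source is sampled with replacement (upsampled).
    Sketch of the planning step only."""
    plan = {}
    for name, (rows_available, weight) in sources.items():
        target = round(weight * total_rows)
        plan[name] = (target, target / rows_available)  # (rows, upsample factor)
    return plan

# Weights and row counts from the tables above
plan = mix_counts({'depyler': (1843, 0.40), 'hf': (11493, 0.30),
                   'jax': (2637, 0.15), 'vllm': (1100, 0.15)}, 17070)
# depyler lands at ~6,828 rows, a ~3.7x upsample from 1,843
```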
Data Splits
| Split | Rows | Size | Seed | Weights |
|---|---|---|---|---|
| train | 17,070 | 201MB | 42 | depyler:0.4 hf:0.3 jax:0.15 vllm:0.15 |
| val | 500 | 7MB | 123 | equal 0.25 each |
| test | 200 | 2.4MB | 456 | equal 0.25 each |
FIM Augmentation
Fill-in-the-Middle transforms applied via alimentar fim:
alimentar fim mixed.parquet \
--output mixed-fim.parquet \
--column text \
--rate 0.5 \
--format psm \
--seed 42
- Format: PSM (Prefix-Suffix-Middle)
- Rate: 50% of rows receive FIM transform
- Sentinel tokens: `<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`
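A minimal PSM transform sketch, assuming uniform-random split points; `fim_psm` is illustrative, and the real `alimentar fim` operates on Parquet rows:

```python
import random

def fim_psm(text, rng, rate=0.5):
    """PSM-order Fill-in-the-Middle: split the document into
    prefix/middle/suffix and re-emit it as
    <|fim_prefix|>P<|fim_suffix|>S<|fim_middle|>M, so the model learns
    to generate the middle given both sides."""
    if rng.random() >= rate:
        return text  # untouched rows keep the plain next-token objective
    a, b = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"
```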
BPE Tokenizer
Trained via apr tokenize apply:
apr tokenize apply \
--data corpus-raw.txt \
--vocab-size 32768 \
--algorithm bpe \
--max-lines 100000 \
-o tokenizer/
Results:
- Final vocab size: 32,768
- Merges: 32,518
- Training time: 2022.5s (~33.7 min)
- Training data: 100K lines of Python code
- Special tokens: `<unk>`, `<s>`, `</s>`, `<pad>`
- Python pattern coverage: 8/8 (`def`, `return`, `self`, `import`, `class`, `for`, `if`, `in`)
- Output: `tokenizer/vocab.json` + `tokenizer/merges.txt`
HuggingFace tokenizer.json Conversion
Entrenar requires HuggingFace tokenizer.json format, but apr tokenize apply
produces raw vocab.json + merges.txt. A Python conversion step bridges the gap
(ALB-033):
import json
from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# vocab/merges come from the raw `apr tokenize apply` output
vocab = json.load(open('tokenizer/vocab.json'))
merges = [tuple(l.split()) for l in open('tokenizer/merges.txt') if l.strip() and not l.startswith('#')]

bpe = models.BPE(vocab=vocab, merges=merges, end_of_word_suffix='</w>')
tokenizer = Tokenizer(bpe)
tokenizer.pre_tokenizer = pre_tokenizers.Split(pattern=' ', behavior='removed')
tokenizer.decoder = decoders.BPEDecoder(suffix='</w>')
tokenizer.save('models/albor-tokenizer/tokenizer.json')
Key details:
- Merges must be string format (`"i n"`), not array format (`["i", "n"]`)
- Pre-tokenizer matches aprender's `split_whitespace()` behavior
- `</w>` end-of-word suffix matches aprender's BPE encoding
- Regular vocab: 32,768 tokens (IDs 0-32767)
- FIM special tokens: 3 additional (IDs 32768-32770)
Parquet Schema
All data files use a consistent schema:
{
text: Utf8, -- Python source code
source: Utf8, -- Corpus name (depyler, hf, jax, vllm)
file: Utf8 -- Original file path
}
Provenance
SHA-256 hashes for all data artifacts are recorded in docs/PROVENANCE.md.
Each split uses a different random seed for reproducibility.
ByteLevel BPE Tokenizer (v2)
The v1 tokenizer (from apr tokenize apply) normalizes whitespace, which loses
Python indentation. The v2 tokenizer uses ByteLevel BPE (like GPT-2/CodeLlama):
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32768, special_tokens=[...])
tokenizer.train(["corpus-raw.txt"], trainer)
tokenizer.save("models/albor-tokenizer-v2/tokenizer.json")
- Vocab: 32,768 (same size, different encoding)
- Roundtrip: 6/6 PASS (preserves newlines, indentation, blank lines)
- Merges: 32,557
Pre-Tokenized Data
Training data pre-tokenized and chunked for efficient training:
| Dataset | Sequences | Seq Length | Total Tokens | Format |
|---|---|---|---|---|
| pretokenized-2048/train (v1) | 22,079 | 2048 | 45.2M | Parquet (input_ids: List<u32>) |
| pretokenized-2048/val | 814 | 2048 | 1.7M | Parquet (input_ids: List<u32>) |
| pretokenized-2048-v2/train | 67,977 | 2048 | 139M | Parquet (input_ids: List<u32>) |
| pretokenized-2048-v2/val | 814 | 2048 | 1.7M | Parquet (reused from v1) |
Pre-tokenization avoids the entrenar↔aprender BPE compatibility issue (ALB-033)
and enables direct input_ids column loading.
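The chunking step can be sketched as concatenate-and-chunk. This is an assumed reading of pretokenize.py's behavior; in particular, dropping the trailing partial chunk (rather than padding it) is an assumption:

```python
def chunk_sequences(token_ids, seq_len=2048):
    """All document tokens are joined into one stream and cut into
    fixed-length input_ids rows; the trailing partial chunk is dropped
    (assumption -- the real script may pad instead)."""
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

seqs = chunk_sequences(list(range(5000)), seq_len=2048)  # 2 full rows, 904 tokens dropped
```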
v2 Data Expansion (2026-03-03)
The v2 dataset expands from Tier 1 only to Tier 1 (10x upsampled) + 8 Tier 2 repos:
| Source | Type | Files | Weight |
|---|---|---|---|
| depyler | Tier 1 | 1,843 | 10x |
| hf-ground-truth | Tier 1 | 11,493 | 10x |
| jax-ground-truth | Tier 1 | 2,637 | 10x |
| vllm-ground-truth | Tier 1 | 1,100 | 10x |
| pytorch | Tier 2 | 3,801 | 1x |
| hf-repos | Tier 2 | 19,781 | 1x |
| mlflow | Tier 2 | 1,780 | 1x |
| vllm-full | Tier 2 | 2,239 | 1x |
| tgi | Tier 2 | 372 | 1x |
| algo-corpus | Tier 2 | 186 | 1x |
| cuda-python | Tier 2 | 157 | 1x |
| llms-with-hf | Tier 2 | 37 | 1x |
Pipeline: source-to-parquet.py → alimentar mix → alimentar fim (50% PSM) → pretokenize.py
Key finding: alimentar import local expects data files (CSV/JSON/Parquet),
not source code directories. The workaround script scripts/source-to-parquet.py
converts Python repos to Parquet with the Tier 1 schema (file, source, text columns).
Result: 45,420 mixed rows → 67,977 pretokenized sequences × 2048 = 139M tokens (191 MiB).
Tools Used
- `alimentar import local` — JSONL to Parquet conversion
- `alimentar mix` — weighted sampling with upsampling
- `alimentar fim` — Fill-in-the-Middle augmentation
- `apr tokenize plan/apply` — BPE vocabulary training (v1, whitespace-split)
- Python `tokenizers` — ByteLevel BPE training (v2, whitespace-preserving)
- `scripts/source-to-parquet.py` — Python source code to Parquet (for Tier 2 repos)
- `entrenar` (parquet feature) — Parquet-to-LMBatch bridge for training