# 11. Gap Register
Every gap discovered during development is tracked here. Each gap maps to a specific upstream component, a GitHub issue (where one has been filed), and a clear acceptance criterion.
Lifecycle: Gap discovered → GitHub issue filed → implemented upstream → wired into apr → dogfooded in albor pipeline → FALSIFY/pmat verified → closed.
| Status | Meaning |
|---|---|
| OPEN | Gap identified, not yet implemented |
| IN PROGRESS | GitHub issue filed, work underway |
| IMPLEMENTED | Code landed upstream, awaiting dogfood validation |
| DOGFOODING | Implemented, being validated in albor pipeline |
| FIXED | Implemented upstream and meets the acceptance criterion |
| VERIFIED | Fix confirmed end-to-end in a real training run |
| CLOSED | Verified working end-to-end, issue closed |
## 11.1 Critical Path Gaps (Block the Improvement Ladder)
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-001 | #6 | apr (aprender) | apr tokenize plan/apply subcommand | Medium | FIXED | apr tokenize plan validates inputs + estimates time; apr tokenize apply trains BPE/WordPiece/Unigram tokenizer (aprender@90427205). Writes vocab.json + merges.txt. |
| ALB-006 | #7 | apr (aprender) | apr eval plan/apply benchmark harness | High | FIXED | apr eval --task code --data benchmark.jsonl evaluates code completion with pass@1 scoring. apr eval --task plan validates model + data exist. JSONL format with prompt/test/canonical_solution. Phase 1: structural validation. Phase 2: full inference (ALB-009 prerequisite). (aprender@4e61297e) |
| ALB-007 | #8 | entrenar | Parquet→LMBatch bridge via alimentar | Medium | FIXED | load_lm_batches_from_parquet() reads text or pre-tokenized Parquet (single file or directory of shards) via alimentar. Text columns tokenized with HfTokenizer. Column auto-detection (input_ids/token_ids for pre-tokenized, text/content/code for text). Gated behind parquet feature. (entrenar@a5a2fb7) |
| ALB-009 | #1 | apr (entrenar) | apr train plan/apply for pre-training from scratch | Critical | FIXED | apr train plan --task pretrain --config <yaml> validates config via entrenar, shows model architecture and training params. apr train apply --task pretrain --config <yaml> runs full pre-training via train_from_yaml() (TransformerTrainer + CausalLMLoss). Config updated to match entrenar TrainSpec schema. (aprender@d79ed943) |
| ALB-010 | #2 | realizar | Qwen3.5-35B-A3B MoE inference (teacher for distillation) | Critical | DOGFOODING | Steps 1-5b MERGED (PR #133): types, router, expert dispatch, forward integration, shared expert gate, architecture registration, config fields. Step 6 (PR #135): SafeTensors MoE weight loading — detect_model_prefix (ConditionalGeneration wrapper), extract_layer_generic_with_prefix, load_moe_weights (router, packed experts, shared expert), GPU adapter wiring. 15,054 tests pass. Remaining: end-to-end dogfood with Qwen3.5-35B-A3B model files. |
| ALB-011 | #3 | apr (entrenar + realizar) | apr distill plan/apply (precompute + train stages) | Critical | FIXED | apr distill --config <yaml> --plan validates config, shows teacher/student/training params. apr distill --config <yaml> --stage precompute inspects teacher, writes manifest. apr distill --config <yaml> --stage train validates precompute manifest, sets up KD training. Local DistillYamlConfig matches entrenar schema. (aprender@81dd4432) |
| ALB-018 | #19 | entrenar/alimentar | Fill-in-the-Middle (FIM) data transform (PSM/SPM) | High | FIXED | alimentar fim transform with PSM/SPM formats, configurable rate/seed (alimentar@290582d). Fim struct implements Transform trait for pipeline integration. |
| ALB-019 | #20 | alimentar | alimentar import local for local Python files | Medium | FIXED | alimentar import local subcommand now available (alimentar@265541b). Supports CSV/JSON/JSONL/Parquet format conversion. |
| ALB-020 | #21 | alimentar | alimentar mix with weighted upsampling | Medium | FIXED | alimentar mix with weighted sampling and upsampling now available (alimentar@64b1e92). Syntax: alimentar mix a.parquet:0.8 b.parquet:0.2 -o out.parquet. |
| ALB-021 | #22 | entrenar | Custom model architecture params in YAML | High | FIXED | ArchitectureOverrides struct carries YAML manifest architecture: params through bridge converter to TransformerConfig. Supports all fields: hidden_size, num_layers, num_heads, num_kv_heads, intermediate_size, vocab_size, max_seq_length, rms_norm_eps, rope_theta, use_bias. (entrenar@a414861) |
| ALB-022 | #23 | entrenar | Human-readable value shorthand in YAML configs | Low | FIXED | parse_human_usize() and deserialize_human_usize_opt support SI suffixes (32K, 1M, 10B, 1T), scientific notation (1e6), and fractional suffixes (1.5K). Applied to ArchitectureConfig and DataConfig fields. (entrenar@1cb0950) |
| ALB-023 | #24 | apr (aprender) | Plan/apply contract for all subcommands | High | FIXED | Every apr <cmd> action command now exposes plan mode: merge --plan, export --plan, publish --plan added to join existing train plan/apply, tokenize plan/apply, quantize --plan, finetune --plan, prune --plan, distill --plan, eval --task plan. Pre-dispatch contract validation skipped in plan mode. (aprender@526a1e4b) |
| ALB-024 | #25 | apr (aprender) | apr experiment view — interactive SQLite experiment browser | Medium | FIXED | apr experiment view --global opens ratatui TUI with run table, sparkline, and braille loss chart. --json mode for CI. Reads local or global ~/.entrenar/experiments.db. (aprender@1196d244) |
| ALB-025 | #26 | presentar + apr | apr monitor upgrade — presentar widgets for live training TUI | Medium | FIXED | TrainingDashboard composes presentar-terminal Meter, GpuPanel, Sparkline, Text, Border, Layout (ALB-057). TuiApp handles resize/Ctrl+C/diffing (ALB-047/048). WASM compilation deferred to ALB-026. (entrenar@0ad416e) |
| ALB-026 | #27 | presentar | WASM training dashboard — albor-dashboard.yaml | Medium | OPEN | Declarative YAML dashboard config that renders training metrics, experiment comparison, and model card via presentar serve. Embeddable in HuggingFace model card as static WASM artifact. |
| ALB-027 | #4 | forjar | task resource type for pipeline orchestration | Critical | FIXED | New forjar resource type: runs arbitrary command, tracks exit code, hashes output_artifacts for idempotency via b3sum, supports completion_check and timeout. Handlers: check_script (completion_check or artifact existence), apply_script (set -euo pipefail, working_dir, timeout), state_query_script (b3sum artifacts). Validation: command required, timeout > 0. (forjar@d14e633) |
| ALB-028 | #5 | apr (aprender) | apr pipeline plan/apply wrapping forjar DAG engine | Critical | FIXED | apr pipeline plan shows full DAG with 23 resources across 2 machines. apr pipeline apply converges via forjar engine. apr pipeline status shows state. apr pipeline validate checks manifest. Shells out to forjar binary (decoupled). (aprender@e653d5ca) |
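ALB-022's shorthand grammar (SI suffixes, scientific notation, fractional suffixes) is easiest to grasp from a sketch. The following is a hypothetical re-implementation for illustration only; entrenar's actual parse_human_usize() may differ in detail. It assumes ASCII input and a 64-bit usize.

```rust
// Hypothetical sketch of ALB-022's value shorthand; not entrenar's code.
fn parse_human_usize(s: &str) -> Option<usize> {
    let s = s.trim();
    if s.is_empty() || !s.is_ascii() {
        return None;
    }
    // Plain integers and scientific notation ("1e6") go through f64 parsing.
    if let Ok(f) = s.parse::<f64>() {
        return (f >= 0.0 && f.fract() == 0.0).then(|| f as usize);
    }
    // Otherwise split off a one-letter SI suffix: 32K, 1M, 10B, 1T, 1.5K.
    let (num, suffix) = s.split_at(s.len() - 1);
    let mult: f64 = match suffix.to_ascii_uppercase().as_str() {
        "K" => 1e3,
        "M" => 1e6,
        "B" => 1e9,
        "T" => 1e12,
        _ => return None,
    };
    let base: f64 = num.parse().ok()?; // fractional values like "1.5K" allowed
    Some((base * mult).round() as usize)
}

fn main() {
    assert_eq!(parse_human_usize("32K"), Some(32_000));
    assert_eq!(parse_human_usize("1e6"), Some(1_000_000));
    assert_eq!(parse_human_usize("1.5K"), Some(1_500));
    println!("ok");
}
```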
## 11.2 Distributed Training Gaps (Stretch / Future)
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-002 | #9 | repartir | Ring all-reduce implementation | High | OPEN | Gradient tensors synchronized across 2+ workers with <5% overhead |
| ALB-003 | #10 | entrenar | repartir integration for distributed training | High | OPEN | Training loop calls repartir::GradientSync for multi-worker training |
| ALB-004 | #11 | entrenar | Unified CUDA + wgpu backend dispatch | Medium | OPEN | Same training config runs on CUDA (4090) and wgpu (W5700X) |
| ALB-005 | #12 | trueno | wgpu backward pass (gradient WGSL shaders) | High | OPEN | Compute shaders for matmul_backward, gelu_backward, rmsnorm_backward, attention_backward |
| ALB-008 | #13 | repartir | Heterogeneous worker throughput balancing | Medium | OPEN | Workers with different GPU speeds get proportional workload |
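The core of ALB-002 is the reduce-scatter + all-gather schedule of ring all-reduce. Since the repartir implementation is still OPEN, the following is a single-process simulation of the schedule only: no networking, hypothetical function names, and per-step snapshots standing in for simultaneous sends.

```rust
// Single-process simulation of the ring all-reduce schedule (ALB-002).
// After reduce-scatter + all-gather every worker holds the element-wise sum.
fn ring_all_reduce(grads: &mut [Vec<f32>]) {
    let w = grads.len();
    let n = grads[0].len();
    assert_eq!(n % w, 0, "pad gradients so chunks divide evenly");
    let chunk = n / w;
    let span = |c: usize| c * chunk..(c + 1) * chunk;

    // Phase 1: reduce-scatter. At step s, worker r sends chunk (r - s) mod w
    // to its right neighbor, which accumulates it. After w-1 steps worker r
    // owns the fully reduced chunk (r + 1) mod w.
    for s in 0..w - 1 {
        let sends: Vec<(usize, usize, Vec<f32>)> = (0..w)
            .map(|r| {
                let c = (r + w - s) % w;
                ((r + 1) % w, c, grads[r][span(c)].to_vec())
            })
            .collect();
        for (dst, c, payload) in sends {
            for (i, v) in payload.into_iter().enumerate() {
                grads[dst][c * chunk + i] += v;
            }
        }
    }
    // Phase 2: all-gather. Fully reduced chunks circulate and overwrite.
    for s in 0..w - 1 {
        let sends: Vec<(usize, usize, Vec<f32>)> = (0..w)
            .map(|r| {
                let c = (r + 1 + w - s) % w;
                ((r + 1) % w, c, grads[r][span(c)].to_vec())
            })
            .collect();
        for (dst, c, payload) in sends {
            grads[dst][span(c)].copy_from_slice(&payload);
        }
    }
}

fn main() {
    let mut g = vec![vec![1.0; 4], vec![2.0; 4], vec![3.0; 4], vec![4.0; 4]];
    ring_all_reduce(&mut g);
    assert!(g.iter().all(|v| v.iter().all(|&x| x == 10.0)));
    println!("all workers hold the sum");
}
```

Each of the 2(w-1) steps moves n/w elements per worker, which is what makes the ring schedule bandwidth-optimal and relevant to the "<5% overhead" acceptance criterion.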
## 11.3 Quality & Verification Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-013 | #14 | provable-contracts | Knowledge distillation contract | High | DOGFOODING | knowledge-distillation-kernel-v1.yaml — committed and passes pv validate. 3 equations, 6 obligations, 5 falsification tests, 2 Kani harnesses. Needs binding to entrenar implementation. |
| ALB-014 | #15 | provable-contracts | BPE tokenizer contract | Medium | DOGFOODING | bpe-tokenizer-kernel-v1.yaml — committed and passes pv validate. Roundtrip invariant, FIM sentinel tests. Needs binding to aprender BPE. |
| ALB-015 | #16 | provable-contracts | Model merging contract (SLERP, TIES, DARE) | Medium | DOGFOODING | model-merging-kernel-v1.yaml — committed and passes pv validate. SLERP bound, DARE unbiased estimator. Needs binding. |
| ALB-016 | #17 | provable-contracts | Pruning contract (WANDA, magnitude) | Medium | DOGFOODING | pruning-kernel-v1.yaml — committed and passes pv validate. Sparsity invariant, score ordering. Needs binding. |
| ALB-017 | #18 | provable-contracts | Gradient accumulation contract | High | DOGFOODING | gradient-accumulation-kernel-v1.yaml — committed and passes pv validate. Numerical equivalence, gradient zeroing. Needs binding. |
Contract coverage report (pv coverage contracts): 8 contracts, 31 equations, 51 obligations, 34 falsification tests, 10 Kani harnesses, 100% obligation coverage. All contracts at impl=0/N — waiting for upstream bindings.
## 11.4 Dogfooding-Discovered Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-029 | #28 | batuta | batuta falsify false positives on project repos | Medium | FIXED | Fixed upstream in batuta@905a862: AI-01 searches configs/, AI-04 excludes book-output/, AI-05 detects pv/forjar validation. Score: 72.2% → 73.1%. |
| ALB-030 | #29 | batuta | batuta stack status fails without Cargo.toml | Low | FIXED | Fixed upstream in batuta@371557a: Falls back to binary detection, discovers 11 installed PAIML tools with versions. |
| ALB-031 | #30 | batuta | batuta hf search returns mock/placeholder data | Low | OPEN | batuta hf search model "code completion" returns live HuggingFace Hub results instead of placeholder models. |
| ALB-033 | #31 | apr (aprender) | apr tokenize → entrenar tokenizer.json format gap | Medium | DOGFOODING | apr tokenize apply produces vocab.json + merges.txt but entrenar expects HuggingFace tokenizer.json. Workaround: Python tokenizers lib. |
| ALB-034 | #32 | entrenar | max_steps config not respected in training loop | Medium | FIXED | max_steps wired through YAML manifest → bridge → TrainingParams → TransformerTrainConfig → trainer loop. Training stops when optimizer step count reaches limit (entrenar@07db101). |
| ALB-035 | #33 | entrenar | Does not write training_state.json during training | Medium | FIXED | Added train_epoch_with_callback() and per-step logging (~100 lines/epoch) in entrenar@5d41a96. |
| ALB-036 | #34 | apr (aprender) | BPE tokenizer normalizes whitespace | Medium | DOGFOODING | split_whitespace() pre-tokenizer destroys Python indentation. Workaround: ByteLevel BPE v2. |
| ALB-037 | #35 | realizar | SafeTensors inference ignores loaded weights | High | FIXED | Root cause chain: ALB-038 (no gradient flow) → ALB-043 (backward_ffn buffer overflow + wrong SwiGLU gradients). Secondary: entrenar didn’t save config.json (entrenar@6097780). Verified e2e: realizar run loads 350M trained checkpoint (218 tensors), generates tokens from learned weights. |
| ALB-038 | #36 | entrenar | Saves initialization weights, not trained weights | Critical | FIXED | Root cause: RMSNorm::forward_batched() created tensors with no backward op, blocking all gradient flow. Attention forward() also broke Q/K/V gradients. Fixed in entrenar@91ba9da (norm backward) and entrenar@1ede409 (attention backward). All 20 model parameters now receive gradients. |
| ALB-040 | #38 | entrenar | GPU-resident pretraining — wire CudaTransformerBlock into TransformerTrainer | Critical | VERIFIED | CudaTransformerTrainer in cuda_trainer.rs follows classify_pipeline.rs pattern. 3 PCIe transfers/step vs 16K. Auto-detect CUDA with graceful CPU fallback. Contract: training-gpu-kernel-v1.yaml. 350M verified: 50-step test loss 10.39→6.07, checkpoint valid, realizar loads + generates. Full training running (seq=1024, batch=4, accum=128). |
| ALB-041 | #39 | entrenar | D2D buffer size mismatch in CudaTransformerBlock backward_attention | High | FIXED | backward_attention() used gate_out (intermediate_size) as temp buffer for grad_hidden accumulation, but D2D copy requires exact size match. Fixed: use o_proj_out (hidden_size). Also added seq_len truncation and error logging in CudaTransformerTrainer. (entrenar@a48e3d2) |
| ALB-042 | #40 | entrenar | CudaTransformerTrainer runtime errors → silent loss=0.0 instead of CPU fallback | Medium | OPEN | When CUDA operations fail during training (e.g., VRAM contention), trainer should detect N consecutive failures and gracefully fall back to CPU mode. Currently reports loss=0.0 and saves garbage checkpoint. Workaround: CUDA_VISIBLE_DEVICES="". |
| ALB-043 | #41 | entrenar | backward_ffn buffer overflow + missing SwiGLU gradients | Critical | FIXED | Two bugs: (1) silu_backward wrote [S,I] output into [S,H] buffer (4× overflow → CUDA_ERROR_ILLEGAL_ADDRESS). (2) SwiGLU backward missing ×up factor in gate gradient; grad_up/grad_w_up completely absent (w_up never trained). Fixed with correct 10-step decomposition using elementwise_mul_forward, silu_forward, silu_backward. (entrenar@f7805f1) |
| ALB-044 | #42 | entrenar | Unclipped activation gradients + CPU optimizer hyperparameter mismatch cause 350M NaN | Critical | FIXED | Two bugs: (1) Activation gradient from block[0] backward (~1e35) unclipped — per-block clipping only applies to weight gradients in CudaGradWorkspace. (2) CPU AdamW used default_params(lr) (β₂=0.999, wd=0.01) instead of YAML config (β₂=0.95, wd=0.1) — 50× bias correction amplification overflows f32. Fixed: C-EMBED-GRAD-001 clips activation gradient before scatter-add; CPU optimizer matches YAML hyperparams. 350M now trains without NaN. |
| ALB-045 | — | entrenar | train_loop_cuda does not write training_state.json — apr monitor blind to pretraining | Critical | FIXED | write_training_snapshot() helper in src/config/train/loader.rs writes TrainingSnapshot to training_state.json on every log interval. Both train_loop_cuda and train_loop_cpu now emit Initializing→Running→Completed snapshots. Verified: apr monitor checkpoints/albor-base-350m/ shows live TUI with loss curve, GPU name, tok/s, progress during CUDA 350M pretraining. (entrenar@2ddc11c) |
| ALB-046 | — | entrenar | GPU telemetry all zeros in training_state.json — no live NVML/nvidia-smi data | High | FIXED | query_gpu_telemetry() shells out to nvidia-smi --query-gpu with CSV output, populates all GpuTelemetry fields. Wired into write_training_snapshot(). Verified: util=5%, VRAM=12.0G/24.0G, temp=41°C, power=94W/480W during 350M training (entrenar@9b53c13). |
| ALB-047 | — | entrenar | TUI monitor hardcodes width=80, no terminal resize handling | Medium | FIXED | Replaced hand-rolled renderer with presentar-terminal TuiApp. Gets terminal resize detection for free from crossterm backend + presentar’s smart diffing. TuiMonitorConfig.width/height retained for headless mode only (entrenar@9b53c13). |
| ALB-048 | — | entrenar | No signal handling in TUI monitor — Ctrl+C leaves cursor hidden | Medium | FIXED | presentar-terminal TuiApp::run() handles Ctrl+C/q with clean cursor restore, screen cleanup, and status message. No raw signal handlers needed — crossterm event loop + Drop impl (entrenar@9b53c13). |
| ALB-049 | — | entrenar | No keyboard input in TUI monitor — can’t scroll/pause/interact | Low | FIXED | presentar-terminal TuiApp provides crossterm event loop with q quit and Ctrl+C. Scroll/pause deferred to presentar widget-level interaction (GpuPanel, LossCurve already support focus). |
| ALB-050 | — | apr (aprender) | No apr runs ls — can’t list past training experiments | High | FIXED | apr runs ls reads local/global SQLite registry, shows table of runs with status, final loss, tok/s, duration. apr runs show <id> shows detailed metrics + hyperparameters. Supports --global, --json, --status filter. (aprender@91641f2e) |
| ALB-051 | — | apr (aprender) | No run comparison — can’t overlay loss curves from two runs | Medium | FIXED | apr runs diff <a> <b> shows side-by-side comparison: inline sparklines, loss trajectory overlay, config diff (only changed params), final metric comparison with verdict (winner by final loss). Supports --json for LLM agents. (aprender@9f9e9f63) |
| ALB-052 | — | entrenar | SQLite experiment tracking exists but not wired to pretraining | Medium | FIXED | PretrainTracker in config/train/loader.rs writes to both local and global SQLite stores. Uses existing SqliteBackend with ExperimentStorage trait. Logs experiment metadata, hyperparameters, and per-step metrics (loss, lr, tok/s). Best-effort — storage failures never block training. (entrenar@daa0afc) |
| ALB-053 | — | entrenar | HeadlessOutput JSON missing fields present in TUI | High | FIXED | HeadlessOutput now has full field parity with TUI: global_step, progress_percent, loss_history, lr_history, elapsed_seconds, optimizer_name, batch_size, model_path, checkpoint_path, executable_path, accuracy, samples_per_second, HeadlessSample. From<&TrainingSnapshot> populates all fields. All 6 headless tests pass. (entrenar@9b53c13) |
| ALB-054 | — | entrenar + apr | No multi-job monitoring — can’t watch multiple concurrent training runs | High | FIXED | apr monitor (no args) discovers active training runs from global SQLite registry (~/.entrenar/experiments.db). Checks for live training_state.json in registered output dirs. Lists active runs with experiment name, directory, run ID, start time. apr monitor <dir> attaches to specific run. Supports --json output for LLM agents. (aprender@91641f2e) |
| ALB-055 | — | entrenar | No local SQLite experiment DB per training run | High | FIXED | PretrainTracker opens <output_dir>/.entrenar/experiments.db for local per-experiment metrics history. Logs experiment metadata, hyperparameters (task, model, optimizer, lr, epochs, batch_size, seq_len, max_steps, device), and per-step metrics (loss, lr, tok/s). All best-effort via SqliteBackend. (entrenar@daa0afc) |
| ALB-056 | — | entrenar | No global SQLite experiment registry | High | FIXED | PretrainTracker opens ~/.entrenar/experiments.db for global cross-machine experiment registry. Same schema as local: experiment + run + hyperparams + per-step metrics. apr runs ls --global reads it. apr monitor (no args) discovers active runs from it. (entrenar@daa0afc) |
| ALB-057 | — | entrenar | Dashboard paints raw text instead of composing presentar widgets | Medium | FIXED | TrainingDashboard composes presentar-terminal widgets via Layout::rows(): Border for section panels, Meter for progress bar, GpuPanel for GPU telemetry (with GpuDevice/GpuProcess conversion from entrenar types), Sparkline for loss history, Text for info lines. Widget tree rebuilt each frame from snapshot. Panel verification wired into Brick::verify() via layout_can_render(). (entrenar@0ad416e) |
| ALB-058 | — | apr (aprender) | apr monitor --json flag missing | Medium | FIXED | apr monitor --json <dir> streams headless JSON output with full TUI parity (ALB-053). apr monitor --format text <dir> for human-readable log lines. --json flag overrides --format. Routes to HeadlessMonitor for JSON/text, TuiMonitor for TUI. (aprender@91641f2e) |
| ALB-059 | — | entrenar | GEMM backward constructor args n/k swapped — buffer overflow into optimizer states | Critical | FIXED | GemmBackwardAKernel::tiled_unrolled(m, k, n, tile) called with k and n swapped vs trueno constructor (m, n, k, tile_size). Bakes wrong stride constants into PTX: output stride = vocab_size (32768) instead of hidden_size (512) for LM head backward. Rows overflow 64× into adjacent VRAM (m_w_k, v_w_k of block 0). Negative values in v_w_k → sqrt(negative) = NaN in AdamW. Same bug in backward_b. Also zero-initialized all optimizer m/v buffers (cuMemAlloc returns uninitialized VRAM). (entrenar@846ae0c) |
| ALB-060 | — | entrenar / albor config | epochs: 1 exhausts data before max_steps reached — 350M trains only 43/5000 steps | Critical | CONFIG FIXED | Root cause: 22K seqs, batch=4, accum=128 → 43 steps/epoch, max_steps=5000 unreachable. Fix: C-TRAINCFG-001 contract + v2 config (pretrain-350m-v2.yaml) with 68K seqs, accum=1, steps_per_epoch=16994 >= 5000. v1 config also fixed with epochs=117. V2 training partially completed (ALB-063). |
| ALB-061 | #43 | albor docs | Monolithic spec stale — diverges from mdBook chapters | Medium | FIXED | scripts/generate-spec.sh regenerates docs/specifications/albor-llm-spec.md from mdBook chapters. make spec target added. |
| ALB-062 | #44 | albor docs | Stale spec chapters — §3 VRAM, §15/18 blockers, §16 repro, model card, intro | Medium | FIXED | All chapters updated to match reality: VRAM budget, ALB-025/037 no longer blockers, v2 pipeline in §16, ALB-060 context in model card and introduction. |
| ALB-063 | #45 | albor training | Retrain 350M with v2 config (corrected epochs + expanded data) | Critical | IN PROGRESS | ALB-069→072 all fixed. Training running: PID 1775202, ~4.4s/step (934 tok/s), save_interval=250, 5000 steps, ~11.8 GB VRAM. Loss 10.40→7.13 (step 169)→6.77 (step 338). Step 250 eval: val_loss=6.92, val_ppl=1008. Step 500 checkpoint verified OK (1520 MB). gnorm stable 2-9 range. |
| ALB-064 | #46 | albor / entrenar | Training process dies silently — no crash detection, no watchdog, no recovery | Critical | FIXED | scripts/train-guard.sh: crash-resilient supervisor with exit code classification, GPU state capture, structured JSON crash reports, exponential backoff restart, heartbeat monitoring, pre-flight GPU health checks. Auto-diagnostic mode: detects async CUDA crash pattern, enables CUDA_LAUNCH_BLOCKING=1 on restart. Five Whys: CUDA driver crash → SIGABRT/SIGSEGV → bypasses Rust panic handler → no stderr output → no diagnosis. Root cause: ALB-065. |
| ALB-065 | #47 | entrenar / trueno | Missing stream.synchronize() before D2H gradient transfers — async CUDA crash | Critical | FIXED | compute_workspace_clip_scale() and compute_clip_scale() call cuMemcpyDtoH without synchronizing the non-blocking CUDA stream. cuMemcpyDtoH only synchronizes with the default stream, but trueno creates streams with CU_STREAM_NON_BLOCKING. Result: backward kernels not finished when gradient buffers are read → garbage clip scale → NaN/crash. Fix: stream.synchronize() at 3 locations before D2H transfers (entrenar@d3a3d26). |
| ALB-066 | #48 | albor config | gradient_accumulation: 128 makes training take 68.8 days on single GPU | Critical | FIXED | CudaTransformerTrainer does per-sequence optimizer updates (per-block interleaved backward+optimize). gradient_accumulation just increases sequences per “step” without changing update granularity. Fix: reduced 128→16→1, epochs from 38→5→1. New estimate: ~11.7h at 480 tok/s. |
| ALB-067 | #49 | entrenar / trueno | Per-block weight gradient clipping CPU bottleneck — 864 D2H transfers/step | High | FIXED (via ALB-078) | compute_workspace_clip_scale downloaded 9 buffers × 24 blocks × 4 seqs = 864 D2H transfers/step. Workaround: disabled per-block clipping (entrenar@eaadbc6). Proper fix: ALB-078 fused GPU clip pipeline (zero D2H, zero sync). grad_clip: 1.0 re-enabled in v3 config. |
| ALB-068 | #50 | entrenar | save_interval dead code — no intermediate checkpoint saving during CUDA training | Critical | FIXED | save_interval read from config, validated, but never used in train_loop_cuda(). Checkpoints only saved at training completion. 24h crash = total loss. Fix: manual batch loop with trainer.save() at save_interval boundaries (entrenar@d8dfab7). |
| ALB-069 | #51 | trueno | PTX selp_f32 argument order bug in fused cross-entropy kernels — training produces loss=0.0 | Critical | FIXED | selp_f32(pred, true_val, false_val) called as selp_f32(grad_target, grad_nontarget, is_target) — f32 values in pred slot, predicate in false_val slot. PTX JIT fails: “Arguments mismatch for instruction ‘selp’”. Same class as ALB-059 (constructor arg ordering). Fix: selp_f32(is_target, grad_target, grad_nontarget) at both call sites (trueno@10bec89, trueno#156). |
| ALB-070 | #52 | entrenar / albor config | save_interval YAML field ignored — bridge reads checkpoint.save_every, default=1 causes eval every step | Critical | FIXED | YAML bridge reads training.checkpoint.save_every, not training.save_interval. Default=1 → validation eval runs every step → eval_batch() crashes on long sequences (missing max_seq_len truncation). Two fixes: (1) YAML config moved to checkpoint.save_every: 25 (2) eval_batch() now truncates to max_seq_len (entrenar@5c4c2d8). Same class as ALB-060 (config field mismatch). |
| ALB-071 | #53 | entrenar | Embed gradient clipping disabled when grad_clip=None — NaN weights, loss=0.0 by step ~100 | Critical | FIXED | C-EMBED-GRAD-001 was gated behind if let Some(max_norm) = max_grad_norm. ALB-067 disabled grad_clip → embed activation gradients unclipped → CPU AdamW overflow → 304K NaN in embeddings, block weights ALL NaN. Fix: always clip with unwrap_or(1.0) + always compute LM head grad norm for observability (entrenar@d07d67d). Same class as ALB-044. |
| ALB-072 | #54 | entrenar | fp16 loss scaling causes NaN in early layers — gradient overflow in f32 backward | Critical | FIXED | fp16 GradScaler (scale=65536) multiplied into fused CE kernel’s loss_scale. All backward uses f32 GpuBuffers — no fp16 underflow risk, but 65536x scaling caused activation gradient overflow by layers 0-1. Five Whys: loss=0.0 → NaN blocks 0-1 → first optimizer step NaN → FP32 works/FP16 doesn’t → unnecessary 65536x scaling. Fix: exclude grad_scaler.scale() from loss_scale (entrenar@44d3e74). gnorm now matches FP32 baseline (2.29). |
| ALB-073 | #55 | trueno | fused_cross_entropy PTX selp argument mismatch — JIT compilation failure | High | FIXED | Same class as ALB-069. selp_f32(true_val, false_val, pred) instead of (pred, true_val, false_val) in fused cross-entropy kernel. Training fell back to non-fused path. Fix: trueno@10bec89. |
| ALB-074 | #56 | entrenar | Buffer overflow — 2048-token seq hits 1024-sized GPU buffer during eval | Critical | FIXED | Stale binary missed ALB-070 eval truncation fix. 2048-token pretokenized sequence passed to eval_single_sequence without max_seq_len truncation → slice overflow at cuda_trainer.rs:711 (2096128 > 1048576). Crashed at step 1183. Fix: binary rebuild with entrenar@5c4c2d8. |
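ALB-060's failure mode reduces to integer arithmetic: optimizer steps per epoch equal sequences / (batch_size × gradient_accumulation), and max_steps is reachable only if epochs × steps_per_epoch covers it. A minimal version of the C-TRAINCFG-001 check, with hypothetical helper names:

```rust
// Worked check of the ALB-060 config trap (hypothetical helpers, not the
// actual C-TRAINCFG-001 contract code).
fn steps_per_epoch(sequences: usize, batch: usize, accum: usize) -> usize {
    sequences / (batch * accum)
}

fn max_steps_reachable(
    sequences: usize,
    batch: usize,
    accum: usize,
    epochs: usize,
    max_steps: usize,
) -> bool {
    epochs * steps_per_epoch(sequences, batch, accum) >= max_steps
}

fn main() {
    // v1 config: ~22K seqs, batch=4, accum=128 -> ~43 steps/epoch, so a
    // single epoch can never reach max_steps=5000.
    assert!(!max_steps_reachable(22_000, 4, 128, 1, 5_000));
    // v2 config: ~68K seqs, accum=1 -> ~17K steps/epoch, 5000 is reachable.
    assert!(max_steps_reachable(68_000, 4, 1, 1, 5_000));
    println!("config check ok");
}
```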
## 11.5 Performance Optimization Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-075 | #57 | trueno / entrenar | cuBLAS tensor core GEMM integration — replaced PTX GEMMs with TF32 tensor cores | Critical | FIXED | trueno-gpu 0.4.24 (cuBLAS FFI, PR #165 merged), entrenar PR #233 merged. Measured: 1,485 tok/s (4.3% MFU), 1,379ms/step, 3.19x end-to-end speedup. Kernel-level: 74-142 TFLOP/s vs 4.8-6.1 PTX (12-27x). Contract: cublas-gemm-v1.yaml. |
| ALB-076 | #58 | entrenar | Forward RMSNorm per-row kernel launch — 97.1% of GPU time | Critical | FIXED | rms_norm_forward() launched one 32-thread kernel per row (2048 launches/norm × 49 norms = 100,352 launches/step). nsys profiling: 46.6s/50 steps, avg 9.3μs each. Fix: switched to BatchedVectorizedRmsNormKernel (single launch, 256 threads, blockIdx.y batch dispatch). entrenar PR #238 merged. Measured: forward 347ms→14ms (24.8×), step 1357ms→339ms (4×), MFU 4.4%→17.5% (4×). |
| ALB-077 | trueno #170, entrenar #239 | trueno / entrenar | cuBLAS tensor core GEMM produces NaN for transposed backward GEMMs | Critical | FIXED | CUBLAS_GEMM_DEFAULT_TENSOR_OP outputs ALL NaN for Trans/NoTrans and NoTrans/Trans operations when gradient magnitudes reach ~1e5 (block 18 of 24-layer backward). Forward NoTrans/NoTrans unaffected. Five Whys: gradient magnification through 24 layers triggers undocumented tensor core numerical fault. Fix: CUBLAS_DEFAULT_MATH + CUBLAS_COMPUTE_32F + CUBLAS_GEMM_DEFAULT (no tensor cores, SIMD path). Phase 5a (TF32) reverted. Measured: 5,216 tok/s (15.1% MFU), 5.9× over PTX baseline, 0 NaN. |
| ALB-078 | trueno #171, entrenar #240 | trueno / entrenar | Fused GPU gradient clipping — eliminate 26 stream syncs/step | High | IMPLEMENTED | Per-block clip calls stream.synchronize() + D2H 24×/step. New kernels: ClipScaleReduceKernel (single-CTA norm+clip_scale on GPU), GradientClipGpuScaleKernel (element-wise clip reading scale from GPU memory). Pipeline: 9× squared_sum_launch_into → 1× clip_scale_reduce → 9× gradient_clip_gpu_scale. Zero sync, zero D2H. IEEE 754 handles zero-norm (div→+inf, min→1.0). Compiles, awaiting dogfood. Expected: ~20% step time reduction. |
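The clip math that ALB-067/ALB-078 move onto the GPU is small: scale = min(1, max_norm / global_norm), with IEEE 754 division making the zero-norm case safe (max_norm / 0.0 = +inf, and min(+inf, 1.0) = 1.0). A host-side sketch of the arithmetic, illustrative only and not the ClipScaleReduceKernel itself:

```rust
// Host-side sketch of the fused clip-scale arithmetic (ALB-078).
fn clip_scale(grads: &[f32], max_norm: f32) -> f32 {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    // +inf when norm == 0, which min() clamps to 1.0: no branch needed.
    (max_norm / norm).min(1.0)
}

fn clip_in_place(grads: &mut [f32], max_norm: f32) {
    let s = clip_scale(grads, max_norm);
    for g in grads.iter_mut() {
        *g *= s;
    }
}

fn main() {
    let mut g = vec![3.0_f32, 4.0]; // norm = 5, so scale = 0.2
    clip_in_place(&mut g, 1.0);
    assert!((g[0] - 0.6).abs() < 1e-6 && (g[1] - 0.8).abs() < 1e-6);

    let mut z = vec![0.0_f32; 4]; // zero norm: scale clamps to 1.0, no NaN
    clip_in_place(&mut z, 1.0);
    assert!(z.iter().all(|&x| x == 0.0));
    println!("clip ok");
}
```

On the GPU the same two functions become the norm reduction and the element-wise scale kernel, which is what eliminates the per-block sync and D2H traffic.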
## 11.6 Training Quality Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-079 | entrenar #241 | entrenar | CUDA trainer ignores lr_scheduler — constant lr after warmup | Critical | FIXED | CudaTransformerTrainer::current_lr() only had linear warmup; returned constant base_lr after warmup. YAML lr_scheduler: "cosine" parsed but never applied. Five Whys: val_loss plateau at 6.92 + gnorm collapse 3.0→0.13 at constant lr. Fix: cosine decay using max_steps + set_lr() for CPU embed optimizer (entrenar@297308d, PR #241). v4 training launched with cosine decay active. |
| ALB-080 | albor #61 | albor config | Effective batch size 48-128x too small for 350M training | Critical | FIXED | 4,096 tokens/step vs comparable runs: CodeParrot-small 196K, GPT-2 524K. Root cause: gradient_accumulation: 1 in v3 config. Fix: v4 config with gradient_accumulation: 32 → 131K tokens/step. Same wall-clock, 32x better gradient quality. Target: val_ppl < 100 by 1B tokens. v3 stopped at step 28K (val_ppl=1018, plateau); v4 launched with both fixes. |
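The ALB-079 schedule is standard linear warmup plus cosine decay, and ALB-080's effective batch is plain multiplication: 1024 seq_len × 4 batch × 32 accumulation = 131,072 tokens/step. A sketch of the schedule shape, with a hypothetical signature rather than entrenar's current_lr():

```rust
// Warmup + cosine decay sketch in the shape ALB-079 describes
// (hypothetical signature; not entrenar's implementation).
fn lr_at(step: usize, warmup: usize, max_steps: usize, base_lr: f64, min_lr: f64) -> f64 {
    if step < warmup {
        // Linear warmup from ~0 up to base_lr.
        base_lr * (step as f64 + 1.0) / warmup as f64
    } else {
        // Cosine decay from base_lr down to min_lr over the remaining steps.
        let t = (step - warmup) as f64 / (max_steps - warmup) as f64;
        min_lr + 0.5 * (base_lr - min_lr) * (1.0 + (std::f64::consts::PI * t).cos())
    }
}

fn main() {
    let (w, m, base, min) = (100, 5_000, 3e-4, 3e-5);
    assert!(lr_at(0, w, m, base, min) < lr_at(99, w, m, base, min)); // rising warmup
    assert!((lr_at(w, w, m, base, min) - base).abs() < 1e-12); // peak after warmup
    assert!((lr_at(m, w, m, base, min) - min).abs() < 1e-12); // decays to min_lr
    // ALB-080 arithmetic: tokens per optimizer step.
    assert_eq!(1024 * 4 * 32, 131_072);
    println!("schedule ok");
}
```

Without the else branch this collapses to the pre-fix behavior ALB-079 describes: constant base_lr after warmup.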
## 11.7 Data Pipeline Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-081 | aprender#418, realizar#136 | aprender | Streaming APR import + mmap reader — eliminate OOM on large models | Critical | FIXED | apr import loaded entire 67GB model into RAM (134GB as F32) → swap storm. apr tensors loaded entire .apr into Vec<u8> → 89GB RSS. Five Whys: no streaming write path, no mmap read path. Fix: AprV2StreamingWriter (temp file, peak RAM ~5GB), MappedFile + AprV2ReaderRef for reading (10.9MB RSS on 67GB file). Contract: streaming-reader-v1.yaml, FALSIFY-MMAP-001 verified. |
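The ALB-081 fix is the classic bounded-buffer streaming pattern: peak RAM is the size of one copy buffer, not the model. A std-only sketch of the idea (illustrative; AprV2StreamingWriter's actual tensor and format handling is far more involved):

```rust
// Bounded-memory streaming copy: the pattern behind ALB-081's fix.
use std::fs::File;
use std::io::{BufWriter, Read, Write};

fn stream_copy(mut src: impl Read, dst: &mut impl Write, buf_len: usize) -> std::io::Result<u64> {
    let mut buf = vec![0u8; buf_len]; // peak RAM is buf_len, not source size
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            return Ok(total);
        }
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
}

fn main() -> std::io::Result<()> {
    // 4 MB of synthetic "tensor" bytes streamed through a 64 KB buffer.
    let src: Vec<u8> = (0..1_000_000u32).flat_map(|x| x.to_le_bytes()).collect();
    let path = std::env::temp_dir().join("alb081-demo.bin");
    let mut out = BufWriter::new(File::create(&path)?);
    let written = stream_copy(&src[..], &mut out, 64 * 1024)?;
    out.flush()?;
    assert_eq!(written, 4_000_000);
    std::fs::remove_file(&path)?;
    println!("streamed ok");
    Ok(())
}
```

The mmap read path is the mirror image: the OS pages tensor bytes in on demand, so resident memory stays near zero regardless of file size (the 10.9 MB RSS on a 67 GB file cited above).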
## 11.8 Observability Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-082 | entrenar#246 | entrenar | Scaling law predictor — early convergence ceiling detection | High | FIXED | Fits Kaplan scaling law L(D) = a - b × ln(D) to eval checkpoints via OLS after 3+ data points. Predicts val_ppl at max_steps and warns if improvement < 10%. Would have flagged v4 plateau 20 GPU-hours earlier. Contract: scaling-law-prediction-v1.yaml. Implementation: entrenar PR #247 merged. |
| ALB-083 | albor#63 | albor | Data pipeline expansion — ingest CodeSearchNet Python | Medium | IN PROGRESS | CodeSearchNet Python downloaded (455K functions, 133M tokens). Pretokenized to 2048-length sequences (65K seqs). Merged with original data → 180M tokens total. v4 actually used pretokenized-1024-v3 (5.3B tokens from codeparrot-clean-2M), so data wasn’t the bottleneck — insufficient training steps was. |
## 11.9 Evaluation Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-084 | albor#64 | apr (aprender) | HumanEval pass@k evaluation — wire inference into apr eval | Critical | FIXED | apr eval --task humaneval --data humaneval.jsonl loads SafeTensors model via realizar, generates completions with forward_with_cache, truncates at function boundary, executes Python tests with timeout, reports pass@k. Contract: eval-humaneval-v1.yaml. Implementation: aprender PR #429 merged (aprender@a7b1da8c). Temperature sampling, per_problem_results JSON output. Verified end-to-end on v4 checkpoint. |
| ALB-085 | albor#65 | apr (aprender) | MBPP benchmark evaluation | High | FIXED | run_mbpp() in eval.rs. 974 problems, text→completion→test_list execution. Contract: eval-mbpp-v1.yaml. Reuses ALB-084 inference bridge (SafetensorsToAprConverter + forward_with_cache + execute_python_test). max_new_tokens=512, timeout=10s. |
| ALB-086 | albor#66 | entrenar | SafeTensors checkpoint saves 1D shapes — HuggingFace incompatible | Medium | FIXED | Contract falsification found: save_safetensors() saves all tensors as 1D [N] instead of 2D [out, in]. Fix: infer_all_tensor_shapes() derives proper shapes from norm weights + element count. entrenar PR #255 merged. Contract: checkpoint-inference-bridge-v1.yaml. |
| ALB-087 | albor#67 | entrenar | Automatic eval scheduling + best-model checkpoint tracking | High | FIXED | entrenar PR #254 merged. eval_interval + patience in TrainingParams, decoupled eval from save, best-model tracking (model-best.safetensors), early stopping. Will activate in v5 training with updated config. |
| ALB-088 | albor#68 | apr (aprender) | Multi-sample pass@k evaluation (n samples per problem) | High | FIXED | aprender PR #432 merged. --samples N --temperature T flags, unbiased pass@k estimator (Chen et al. 2021). Contract: multi-sample-passk-v1.yaml. Will dogfood on v5 checkpoint. |
| ALB-089 | albor#69 | entrenar/apr | GPU-accelerated inference for eval (CUDA forward pass) | High | DOGFOODING | --device cuda wired into apr eval --task humaneval/mbpp. Uses CudaTransformerTrainer::for_inference() + forward_logits(). No KV cache yet (O(n²) but still 20-40x faster than CPU). Awaiting dogfood when GPU is free from training. |
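The ALB-088 estimator is the unbiased pass@k from Chen et al. 2021: for n samples with c passing, pass@k = 1 - C(n-c, k)/C(n, k), computed in product form for numerical stability. An illustrative re-implementation (not aprender's exact code):

```rust
// Unbiased pass@k estimator (Chen et al. 2021), as adopted by ALB-088.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    if n - c < k {
        // Fewer than k failures: every size-k draw contains a passing sample.
        return 1.0;
    }
    // 1 - C(n-c, k)/C(n, k) = 1 - prod_{i=n-c+1}^{n} (1 - k/i)
    let mut fail_all = 1.0;
    for i in (n - c + 1)..=n {
        fail_all *= 1.0 - k as f64 / i as f64;
    }
    1.0 - fail_all
}

fn main() {
    assert_eq!(pass_at_k(10, 0, 1), 0.0); // nothing passes
    assert_eq!(pass_at_k(10, 10, 1), 1.0); // everything passes
    assert!((pass_at_k(10, 5, 1) - 0.5).abs() < 1e-12); // half pass, k=1
    assert!(pass_at_k(10, 5, 5) > pass_at_k(10, 5, 1)); // larger k helps
    println!("pass@k ok");
}
```

The product form avoids the overflow that a naive binomial-coefficient computation hits at realistic n.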
## 11.10 Training Infrastructure Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-091 | — | entrenar | GPU-resident gradient accumulation — D2H bottleneck kills ga>1 throughput | Critical | FIXED | GpuGradientAccumulator accumulates gradients in GPU memory via inplace_add_gpu() (ResidualAddKernel). Zero D2H during micro-batch loop, ONE stream sync per optimizer step. Dogfooded: ga=8, batch=4 → 8.2K tok/s (23.7% MFU) vs previous CPU-side ga: 2.9K tok/s. VRAM cost: 1,520 MB for 350M model. |
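ALB-091's pattern, stripped of CUDA specifics, is accumulate-in-place across micro-batches followed by one mean and one optimizer hand-off per step. A host-side sketch; the real fix does the same with GPU-resident buffers via inplace_add_gpu(), which is what removes the per-micro-batch D2H transfers:

```rust
// Host-side sketch of GPU-resident gradient accumulation (ALB-091).
// Hypothetical types; illustrates the schedule, not entrenar's API.
struct GradAccumulator {
    acc: Vec<f32>,
    micro_batches: usize,
}

impl GradAccumulator {
    fn new(len: usize) -> Self {
        Self { acc: vec![0.0; len], micro_batches: 0 }
    }

    /// In-place add per micro-batch (on GPU in the real fix: no host copy).
    fn accumulate(&mut self, grad: &[f32]) {
        for (a, g) in self.acc.iter_mut().zip(grad) {
            *a += g;
        }
        self.micro_batches += 1;
    }

    /// Average once per optimizer step, then reset for the next step.
    fn take_mean(&mut self) -> Vec<f32> {
        let inv = 1.0 / self.micro_batches as f32;
        let mean = self.acc.iter().map(|a| a * inv).collect();
        self.acc.iter_mut().for_each(|a| *a = 0.0);
        self.micro_batches = 0;
        mean
    }
}

fn main() {
    let mut ga = GradAccumulator::new(3);
    ga.accumulate(&[1.0, 2.0, 3.0]);
    ga.accumulate(&[3.0, 2.0, 1.0]);
    assert_eq!(ga.take_mean(), vec![2.0, 2.0, 2.0]);
    println!("accumulation ok");
}
```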
Gaps are added as they are discovered during implementation and dogfooding.