11. Gap Register

Every gap discovered during development is tracked here. Each gap maps to a specific upstream component, a GitHub issue, and a clear acceptance criterion.

Lifecycle: Gap discovered → GitHub issue filed → implemented upstream → wired into apr → dogfooded in albor pipeline → FALSIFY/pmat verified → closed.

| Status | Meaning |
|--------|---------|
| OPEN | Gap identified, not yet implemented |
| IN PROGRESS | GitHub issue filed, work underway |
| DOGFOODING | Implemented, being validated in albor pipeline |
| CLOSED | Verified working end-to-end, issue closed |

11.1 Critical Path Gaps (Block the Improvement Ladder)

| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|----|-------|-----------|-----|----------|--------|----------------------|
| ALB-001 | #6 | apr (aprender) | apr tokenize plan/apply subcommand | Medium | FIXED | apr tokenize plan validates inputs + estimates time; apr tokenize apply trains BPE/WordPiece/Unigram tokenizer (aprender@90427205). Writes vocab.json + merges.txt. |
| ALB-006 | #7 | apr (aprender) | apr eval plan/apply benchmark harness | High | FIXED | apr eval --task code --data benchmark.jsonl evaluates code completion with pass@1 scoring. apr eval --task plan validates model + data exist. JSONL format with prompt/test/canonical_solution. Phase 1: structural validation. Phase 2: full inference (ALB-009 prerequisite). (aprender@4e61297e) |
| ALB-007 | #8 | entrenar | Parquet→LMBatch bridge via alimentar | Medium | FIXED | load_lm_batches_from_parquet() reads text or pre-tokenized Parquet (single file or directory of shards) via alimentar. Text columns tokenized with HfTokenizer. Column auto-detection (input_ids/token_ids for pre-tokenized, text/content/code for text). Gated behind parquet feature. (entrenar@a5a2fb7) |
| ALB-009 | #1 | apr (entrenar) | apr train plan/apply for pre-training from scratch | Critical | FIXED | apr train plan --task pretrain --config <yaml> validates config via entrenar, shows model architecture and training params. apr train apply --task pretrain --config <yaml> runs full pre-training via train_from_yaml() (TransformerTrainer + CausalLMLoss). Config updated to match entrenar TrainSpec schema. (aprender@d79ed943) |
| ALB-010 | #2 | realizar | Qwen3.5-35B-A3B MoE inference (teacher for distillation) | Critical | DOGFOODING | Steps 1-5b MERGED (PR #133): types, router, expert dispatch, forward integration, shared expert gate, architecture registration, config fields. Step 6 (PR #135): SafeTensors MoE weight loading — detect_model_prefix (ConditionalGeneration wrapper), extract_layer_generic_with_prefix, load_moe_weights (router, packed experts, shared expert), GPU adapter wiring. 15,054 tests pass. Remaining: end-to-end dogfood with Qwen3.5-35B-A3B model files. |
| ALB-011 | #3 | apr (entrenar + realizar) | apr distill plan/apply (precompute + train stages) | Critical | FIXED | apr distill --config <yaml> --plan validates config, shows teacher/student/training params. apr distill --config <yaml> --stage precompute inspects teacher, writes manifest. apr distill --config <yaml> --stage train validates precompute manifest, sets up KD training. Local DistillYamlConfig matches entrenar schema. (aprender@81dd4432) |
| ALB-018 | #19 | entrenar/alimentar | Fill-in-the-Middle (FIM) data transform (PSM/SPM) | High | FIXED | alimentar fim transform with PSM/SPM formats, configurable rate/seed (alimentar@290582d). Fim struct implements Transform trait for pipeline integration. |
| ALB-019 | #20 | alimentar | alimentar import local for local Python files | Medium | FIXED | alimentar import local subcommand now available (alimentar@265541b). Supports CSV/JSON/JSONL/Parquet format conversion. |
| ALB-020 | #21 | alimentar | alimentar mix with weighted upsampling | Medium | FIXED | alimentar mix with weighted sampling and upsampling now available (alimentar@64b1e92). Syntax: alimentar mix a.parquet:0.8 b.parquet:0.2 -o out.parquet. |
| ALB-021 | #22 | entrenar | Custom model architecture params in YAML | High | FIXED | ArchitectureOverrides struct carries YAML manifest architecture: params through bridge converter to TransformerConfig. Supports all fields: hidden_size, num_layers, num_heads, num_kv_heads, intermediate_size, vocab_size, max_seq_length, rms_norm_eps, rope_theta, use_bias. (entrenar@a414861) |
| ALB-022 | #23 | entrenar | Human-readable value shorthand in YAML configs | Low | FIXED | parse_human_usize() and deserialize_human_usize_opt support SI suffixes (32K, 1M, 10B, 1T), scientific notation (1e6), and fractional suffixes (1.5K). Applied to ArchitectureConfig and DataConfig fields. (entrenar@1cb0950) |
| ALB-023 | #24 | apr (aprender) | Plan/apply contract for all subcommands | High | FIXED | Every apr <cmd> action command now exposes plan mode: merge --plan, export --plan, publish --plan added to join existing train plan/apply, tokenize plan/apply, quantize --plan, finetune --plan, prune --plan, distill --plan, eval --task plan. Pre-dispatch contract validation skipped in plan mode. (aprender@526a1e4b) |
| ALB-024 | #25 | apr (aprender) | apr experiment view — interactive SQLite experiment browser | Medium | FIXED | apr experiment view --global opens ratatui TUI with run table, sparkline, and braille loss chart. --json mode for CI. Reads local or global ~/.entrenar/experiments.db. (aprender@1196d244) |
| ALB-025 | #26 | presentar + apr | apr monitor upgrade — presentar widgets for live training TUI | Medium | FIXED | TrainingDashboard composes presentar-terminal Meter, GpuPanel, Sparkline, Text, Border, Layout (ALB-057). TuiApp handles resize/Ctrl+C/diffing (ALB-047/048). WASM compilation deferred to ALB-026. (entrenar@0ad416e) |
| ALB-026 | #27 | presentar | WASM training dashboard — albor-dashboard.yaml | Medium | OPEN | Declarative YAML dashboard config that renders training metrics, experiment comparison, and model card via presentar serve. Embeddable in HuggingFace model card as static WASM artifact. |
| ALB-027 | #4 | forjar | task resource type for pipeline orchestration | Critical | FIXED | New forjar resource type: runs arbitrary command, tracks exit code, hashes output_artifacts for idempotency via b3sum, supports completion_check and timeout. Handlers: check_script (completion_check or artifact existence), apply_script (set -euo pipefail, working_dir, timeout), state_query_script (b3sum artifacts). Validation: command required, timeout > 0. (forjar@d14e633) |
| ALB-028 | #5 | apr (aprender) | apr pipeline plan/apply wrapping forjar DAG engine | Critical | FIXED | apr pipeline plan shows full DAG with 23 resources across 2 machines. apr pipeline apply converges via forjar engine. apr pipeline status shows state. apr pipeline validate checks manifest. Shells out to forjar binary (decoupled). (aprender@e653d5ca) |
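
The shorthand parsing described in ALB-022 is compact enough to sketch. The following is a hypothetical stand-in, not entrenar's actual `parse_human_usize()`: decimal SI multipliers and the exact error behavior are assumptions.

```rust
/// Hypothetical sketch of SI-suffix shorthand parsing (ALB-022):
/// accepts suffixes (32K, 1M, 10B, 1T), scientific notation (1e6),
/// and fractional suffixes (1.5K). Decimal multipliers are assumed.
fn parse_human_usize(s: &str) -> Result<usize, String> {
    let s = s.trim();
    let (num, mult) = match s.chars().last() {
        Some('K') | Some('k') => (&s[..s.len() - 1], 1_000_f64),
        Some('M') | Some('m') => (&s[..s.len() - 1], 1_000_000_f64),
        Some('B') | Some('b') => (&s[..s.len() - 1], 1_000_000_000_f64),
        Some('T') | Some('t') => (&s[..s.len() - 1], 1_000_000_000_000_f64),
        _ => (s, 1.0),
    };
    // Parse as f64 so "1.5K" and "1e6" both work, then require an integer.
    let v: f64 = num.parse().map_err(|_| format!("invalid number: {s}"))?;
    let out = v * mult;
    if out < 0.0 || out.fract() != 0.0 {
        return Err(format!("not a non-negative integer: {s}"));
    }
    Ok(out as usize)
}

fn main() {
    assert_eq!(parse_human_usize("32K").unwrap(), 32_000);
    assert_eq!(parse_human_usize("1.5K").unwrap(), 1_500);
    assert_eq!(parse_human_usize("1e6").unwrap(), 1_000_000);
    assert_eq!(parse_human_usize("10B").unwrap(), 10_000_000_000);
    println!("shorthand parsing ok");
}
```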

11.2 Distributed Training Gaps (Stretch / Future)

| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|----|-------|-----------|-----|----------|--------|----------------------|
| ALB-002 | #9 | repartir | Ring all-reduce implementation | High | OPEN | Gradient tensors synchronized across 2+ workers with <5% overhead |
| ALB-003 | #10 | entrenar | repartir integration for distributed training | High | OPEN | Training loop calls repartir::GradientSync for multi-worker training |
| ALB-004 | #11 | entrenar | Unified CUDA + wgpu backend dispatch | Medium | OPEN | Same training config runs on CUDA (4090) and wgpu (W5700X) |
| ALB-005 | #12 | trueno | wgpu backward pass (gradient WGSL shaders) | High | OPEN | Compute shaders for matmul_backward, gelu_backward, rmsnorm_backward, attention_backward |
| ALB-008 | #13 | repartir | Heterogeneous worker throughput balancing | Medium | OPEN | Workers with different GPU speeds get proportional workload |
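
ALB-002's acceptance criterion (gradient tensors summed across workers) can be illustrated with a single-process simulation of the textbook ring schedule: a reduce-scatter phase followed by an all-gather phase, each taking n-1 steps. This is a sketch of the algorithm, not repartir's implementation.

```rust
/// Simulated ring all-reduce over in-memory "workers" (ALB-002 sketch;
/// repartir's real API and transport are not shown).
fn ring_all_reduce(workers: &mut [Vec<f32>]) {
    let n = workers.len();
    let len = workers[0].len();
    assert_eq!(len % n, 0, "gradient length must divide into n chunks");
    let chunk = len / n;
    let range = |c: usize| c * chunk..(c + 1) * chunk;

    // Phase 1: reduce-scatter. After n-1 steps, worker i holds the
    // fully summed chunk (i+1) mod n.
    for t in 0..n.saturating_sub(1) {
        for i in 0..n {
            let c = (i + n - t) % n; // chunk worker i forwards this step
            let data: Vec<f32> = workers[i][range(c)].to_vec();
            let dst = (i + 1) % n;
            for (k, v) in data.into_iter().enumerate() {
                workers[dst][c * chunk + k] += v;
            }
        }
    }
    // Phase 2: all-gather. Completed chunks travel around the ring,
    // overwriting stale partial sums everywhere.
    for t in 0..n.saturating_sub(1) {
        for i in 0..n {
            let c = (i + 1 + n - t) % n;
            let data: Vec<f32> = workers[i][range(c)].to_vec();
            let dst = (i + 1) % n;
            workers[dst][range(c)].copy_from_slice(&data);
        }
    }
}

fn main() {
    let mut workers = vec![vec![1.0f32; 8], vec![2.0; 8], vec![3.0; 8], vec![4.0; 8]];
    ring_all_reduce(&mut workers);
    // Every worker now holds the elementwise sum 1+2+3+4 = 10.
    for w in &workers {
        assert!(w.iter().all(|&x| (x - 10.0).abs() < 1e-6));
    }
    println!("all workers hold the summed gradient");
}
```

Each worker sends only `len/n` elements per step, which is why the ring schedule keeps per-link bandwidth constant as workers are added.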

11.3 Quality & Verification Gaps

| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|----|-------|-----------|-----|----------|--------|----------------------|
| ALB-013 | #14 | provable-contracts | Knowledge distillation contract | High | DOGFOODING | knowledge-distillation-kernel-v1.yaml — committed and passes pv validate. 3 equations, 6 obligations, 5 falsification tests, 2 Kani harnesses. Needs binding to entrenar implementation. |
| ALB-014 | #15 | provable-contracts | BPE tokenizer contract | Medium | DOGFOODING | bpe-tokenizer-kernel-v1.yaml — committed and passes pv validate. Roundtrip invariant, FIM sentinel tests. Needs binding to aprender BPE. |
| ALB-015 | #16 | provable-contracts | Model merging contract (SLERP, TIES, DARE) | Medium | DOGFOODING | model-merging-kernel-v1.yaml — committed and passes pv validate. SLERP bound, DARE unbiased estimator. Needs binding. |
| ALB-016 | #17 | provable-contracts | Pruning contract (WANDA, magnitude) | Medium | DOGFOODING | pruning-kernel-v1.yaml — committed and passes pv validate. Sparsity invariant, score ordering. Needs binding. |
| ALB-017 | #18 | provable-contracts | Gradient accumulation contract | High | DOGFOODING | gradient-accumulation-kernel-v1.yaml — committed and passes pv validate. Numerical equivalence, gradient zeroing. Needs binding. |

Contract coverage report (pv coverage contracts): 8 contracts, 31 equations, 51 obligations, 34 falsification tests, 10 Kani harnesses, 100% obligation coverage. All contracts at impl=0/N — waiting for upstream bindings.
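
For reference, the SLERP bound in the model-merging contract (ALB-015) constrains spherical linear interpolation between flattened weight vectors. A minimal sketch of the standard formula follows; it is not aprender's merge code, and the parallel-vector fallback threshold is an assumption.

```rust
/// Spherical linear interpolation between two weight vectors
/// (standard formula, illustrating the operation ALB-015 constrains).
fn slerp(a: &[f32], b: &[f32], t: f32) -> Vec<f32> {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    let cos = (dot / (na * nb)).clamp(-1.0, 1.0);
    let theta = cos.acos();
    if theta.abs() < 1e-6 {
        // Nearly parallel vectors: fall back to linear interpolation.
        return a.iter().zip(b).map(|(x, y)| x + t * (y - x)).collect();
    }
    let wa = ((1.0 - t) * theta).sin() / theta.sin();
    let wb = (t * theta).sin() / theta.sin();
    a.iter().zip(b).map(|(x, y)| wa * x + wb * y).collect()
}

fn main() {
    // Orthogonal unit vectors at t = 0.5: result lies on the unit arc midpoint.
    let out = slerp(&[1.0, 0.0], &[0.0, 1.0], 0.5);
    assert!((out[0] - 0.707_106_78).abs() < 1e-5);
    assert!((out[1] - 0.707_106_78).abs() < 1e-5);
    println!("slerp midpoint: {:?}", out);
}
```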

11.4 Dogfooding-Discovered Gaps

| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|----|-------|-----------|-----|----------|--------|----------------------|
| ALB-029 | #28 | batuta | batuta falsify false positives on project repos | Medium | FIXED | Fixed upstream in batuta@905a862: AI-01 searches configs/, AI-04 excludes book-output/, AI-05 detects pv/forjar validation. Score: 72.2% → 73.1%. |
| ALB-030 | #29 | batuta | batuta stack status fails without Cargo.toml | Low | FIXED | Fixed upstream in batuta@371557a: Falls back to binary detection, discovers 11 installed PAIML tools with versions. |
| ALB-031 | #30 | batuta | batuta hf search returns mock/placeholder data | Low | OPEN | batuta hf search model "code completion" returns live HuggingFace Hub results instead of placeholder models. |
| ALB-033 | #31 | apr (aprender) | apr tokenize → entrenar tokenizer.json format gap | Medium | DOGFOODING | apr tokenize apply produces vocab.json + merges.txt but entrenar expects HuggingFace tokenizer.json. Workaround: Python tokenizers lib. |
| ALB-034 | #32 | entrenar | max_steps config not respected in training loop | Medium | FIXED | max_steps wired through YAML manifest → bridge → TrainingParams → TransformerTrainConfig → trainer loop. Training stops when optimizer step count reaches limit (entrenar@07db101). |
| ALB-035 | #33 | entrenar | Does not write training_state.json during training | Medium | FIXED | Added train_epoch_with_callback() and per-step logging (~100 lines/epoch) in entrenar@5d41a96. |
| ALB-036 | #34 | apr (aprender) | BPE tokenizer normalizes whitespace | Medium | DOGFOODING | split_whitespace() pre-tokenizer destroys Python indentation. Workaround: ByteLevel BPE v2. |
| ALB-037 | #35 | realizar | SafeTensors inference ignores loaded weights | High | FIXED | Root cause chain: ALB-038 (no gradient flow) → ALB-043 (backward_ffn buffer overflow + wrong SwiGLU gradients). Secondary: entrenar didn’t save config.json (entrenar@6097780). Verified e2e: realizar run loads 350M trained checkpoint (218 tensors), generates tokens from learned weights. |
| ALB-038 | #36 | entrenar | Saves initialization weights, not trained weights | Critical | FIXED | Root cause: RMSNorm::forward_batched() created tensors with no backward op, blocking all gradient flow. Attention forward() also broke Q/K/V gradients. Fixed in entrenar@91ba9da (norm backward) and entrenar@1ede409 (attention backward). All 20 model parameters now receive gradients. |
| ALB-040 | #38 | entrenar | GPU-resident pretraining — wire CudaTransformerBlock into TransformerTrainer | Critical | VERIFIED | CudaTransformerTrainer in cuda_trainer.rs follows classify_pipeline.rs pattern. 3 PCIe transfers/step vs 16K. Auto-detect CUDA with graceful CPU fallback. Contract: training-gpu-kernel-v1.yaml. 350M verified: 50-step test loss 10.39→6.07, checkpoint valid, realizar loads + generates. Full training running (seq=1024, batch=4, accum=128). |
| ALB-041 | #39 | entrenar | D2D buffer size mismatch in CudaTransformerBlock backward_attention | High | FIXED | backward_attention() used gate_out (intermediate_size) as temp buffer for grad_hidden accumulation, but D2D copy requires exact size match. Fixed: use o_proj_out (hidden_size). Also added seq_len truncation and error logging in CudaTransformerTrainer. (entrenar@a48e3d2) |
| ALB-042 | #40 | entrenar | CudaTransformerTrainer runtime errors → silent loss=0.0 instead of CPU fallback | Medium | OPEN | When CUDA operations fail during training (e.g., VRAM contention), trainer should detect N consecutive failures and gracefully fall back to CPU mode. Currently reports loss=0.0 and saves garbage checkpoint. Workaround: CUDA_VISIBLE_DEVICES="". |
| ALB-043 | #41 | entrenar | backward_ffn buffer overflow + missing SwiGLU gradients | Critical | FIXED | Two bugs: (1) silu_backward wrote [S,I] output into [S,H] buffer (4× overflow → CUDA_ERROR_ILLEGAL_ADDRESS). (2) SwiGLU backward missing ×up factor in gate gradient; grad_up/grad_w_up completely absent (w_up never trained). Fixed with correct 10-step decomposition using elementwise_mul_forward, silu_forward, silu_backward. (entrenar@f7805f1) |
| ALB-044 | #42 | entrenar | Unclipped activation gradients + CPU optimizer hyperparameter mismatch cause 350M NaN | Critical | FIXED | Two bugs: (1) Activation gradient from block[0] backward (~1e35) unclipped — per-block clipping only applies to weight gradients in CudaGradWorkspace. (2) CPU AdamW used default_params(lr) (β₂=0.999, wd=0.01) instead of YAML config (β₂=0.95, wd=0.1) — 50× bias correction amplification overflows f32. Fixed: C-EMBED-GRAD-001 clips activation gradient before scatter-add; CPU optimizer matches YAML hyperparams. 350M now trains without NaN. |
| ALB-045 | | entrenar | train_loop_cuda does not write training_state.json — apr monitor blind to pretraining | Critical | FIXED | write_training_snapshot() helper in src/config/train/loader.rs writes TrainingSnapshot to training_state.json on every log interval. Both train_loop_cuda and train_loop_cpu now emit Initializing→Running→Completed snapshots. Verified: apr monitor checkpoints/albor-base-350m/ shows live TUI with loss curve, GPU name, tok/s, progress during CUDA 350M pretraining. (entrenar@2ddc11c) |
| ALB-046 | | entrenar | GPU telemetry all zeros in training_state.json — no live NVML/nvidia-smi data | High | FIXED | query_gpu_telemetry() shells out to nvidia-smi --query-gpu with CSV output, populates all GpuTelemetry fields. Wired into write_training_snapshot(). Verified: util=5%, VRAM=12.0G/24.0G, temp=41°C, power=94W/480W during 350M training (entrenar@9b53c13). |
| ALB-047 | | entrenar | TUI monitor hardcodes width=80, no terminal resize handling | Medium | FIXED | Replaced hand-rolled renderer with presentar-terminal TuiApp. Gets terminal resize detection for free from crossterm backend + presentar’s smart diffing. TuiMonitorConfig.width/height retained for headless mode only (entrenar@9b53c13). |
| ALB-048 | | entrenar | No signal handling in TUI monitor — Ctrl+C leaves cursor hidden | Medium | FIXED | presentar-terminal TuiApp::run() handles Ctrl+C/q with clean cursor restore, screen cleanup, and status message. No raw signal handlers needed — crossterm event loop + Drop impl (entrenar@9b53c13). |
| ALB-049 | | entrenar | No keyboard input in TUI monitor — can’t scroll/pause/interact | Low | FIXED | presentar-terminal TuiApp provides crossterm event loop with q quit and Ctrl+C. Scroll/pause deferred to presentar widget-level interaction (GpuPanel, LossCurve already support focus). |
| ALB-050 | | apr (aprender) | No apr runs ls — can’t list past training experiments | High | FIXED | apr runs ls reads local/global SQLite registry, shows table of runs with status, final loss, tok/s, duration. apr runs show <id> shows detailed metrics + hyperparameters. Supports --global, --json, --status filter. (aprender@91641f2e) |
| ALB-051 | | apr (aprender) | No run comparison — can’t overlay loss curves from two runs | Medium | FIXED | apr runs diff <a> <b> shows side-by-side comparison: inline sparklines, loss trajectory overlay, config diff (only changed params), final metric comparison with verdict (winner by final loss). Supports --json for LLM agents. (aprender@9f9e9f63) |
| ALB-052 | | entrenar | SQLite experiment tracking exists but not wired to pretraining | Medium | FIXED | PretrainTracker in config/train/loader.rs writes to both local and global SQLite stores. Uses existing SqliteBackend with ExperimentStorage trait. Logs experiment metadata, hyperparameters, and per-step metrics (loss, lr, tok/s). Best-effort — storage failures never block training. (entrenar@daa0afc) |
| ALB-053 | | entrenar | HeadlessOutput JSON missing fields present in TUI | High | FIXED | HeadlessOutput now has full field parity with TUI: global_step, progress_percent, loss_history, lr_history, elapsed_seconds, optimizer_name, batch_size, model_path, checkpoint_path, executable_path, accuracy, samples_per_second, HeadlessSample. From<&TrainingSnapshot> populates all fields. All 6 headless tests pass. (entrenar@9b53c13) |
| ALB-054 | | entrenar + apr | No multi-job monitoring — can’t watch multiple concurrent training runs | High | FIXED | apr monitor (no args) discovers active training runs from global SQLite registry (~/.entrenar/experiments.db). Checks for live training_state.json in registered output dirs. Lists active runs with experiment name, directory, run ID, start time. apr monitor <dir> attaches to specific run. Supports --json output for LLM agents. (aprender@91641f2e) |
| ALB-055 | | entrenar | No local SQLite experiment DB per training run | High | FIXED | PretrainTracker opens <output_dir>/.entrenar/experiments.db for local per-experiment metrics history. Logs experiment metadata, hyperparameters (task, model, optimizer, lr, epochs, batch_size, seq_len, max_steps, device), and per-step metrics (loss, lr, tok/s). All best-effort via SqliteBackend. (entrenar@daa0afc) |
| ALB-056 | | entrenar | No global SQLite experiment registry | High | FIXED | PretrainTracker opens ~/.entrenar/experiments.db for global cross-machine experiment registry. Same schema as local: experiment + run + hyperparams + per-step metrics. apr runs ls --global reads it. apr monitor (no args) discovers active runs from it. (entrenar@daa0afc) |
| ALB-057 | | entrenar | Dashboard paints raw text instead of composing presentar widgets | Medium | FIXED | TrainingDashboard composes presentar-terminal widgets via Layout::rows(): Border for section panels, Meter for progress bar, GpuPanel for GPU telemetry (with GpuDevice/GpuProcess conversion from entrenar types), Sparkline for loss history, Text for info lines. Widget tree rebuilt each frame from snapshot. Panel verification wired into Brick::verify() via layout_can_render(). (entrenar@0ad416e) |
| ALB-058 | | apr (aprender) | apr monitor --json flag missing | Medium | FIXED | apr monitor --json <dir> streams headless JSON output with full TUI parity (ALB-053). apr monitor --format text <dir> for human-readable log lines. --json flag overrides --format. Routes to HeadlessMonitor for JSON/text, TuiMonitor for TUI. (aprender@91641f2e) |
| ALB-059 | | entrenar | GEMM backward constructor args n/k swapped — buffer overflow into optimizer states | Critical | FIXED | GemmBackwardAKernel::tiled_unrolled(m, k, n, tile) called with k and n swapped vs trueno constructor (m, n, k, tile_size). Bakes wrong stride constants into PTX: output stride = vocab_size (32768) instead of hidden_size (512) for LM head backward. Rows overflow 64× into adjacent VRAM (m_w_k, v_w_k of block 0). Negative values in v_w_k → sqrt(negative) = NaN in AdamW. Same bug in backward_b. Also zero-initialized all optimizer m/v buffers (cuMemAlloc returns uninitialized VRAM). (entrenar@846ae0c) |
| ALB-060 | | entrenar / albor config | epochs: 1 exhausts data before max_steps reached — 350M trains only 43/5000 steps | Critical | CONFIG FIXED | Root cause: 22K seqs, batch=4, accum=128 → 43 steps/epoch, max_steps=5000 unreachable. Fix: C-TRAINCFG-001 contract + v2 config (pretrain-350m-v2.yaml) with 68K seqs, accum=1, steps_per_epoch=16994 >= 5000. v1 config also fixed with epochs=117. V2 training partially completed (ALB-063). |
| ALB-061 | #43 | albor docs | Monolithic spec stale — diverges from mdBook chapters | Medium | FIXED | scripts/generate-spec.sh regenerates docs/specifications/albor-llm-spec.md from mdBook chapters. make spec target added. |
| ALB-062 | #44 | albor docs | Stale spec chapters — §3 VRAM, §15/18 blockers, §16 repro, model card, intro | Medium | FIXED | All chapters updated to match reality: VRAM budget, ALB-025/037 no longer blockers, v2 pipeline in §16, ALB-060 context in model card and introduction. |
| ALB-063 | #45 | albor training | Retrain 350M with v2 config (corrected epochs + expanded data) | Critical | IN PROGRESS | ALB-069→072 all fixed. Training running: PID 1775202, ~4.4s/step (934 tok/s), save_interval=250, 5000 steps, ~11.8 GB VRAM. Loss 10.40→7.13 (step 169)→6.77 (step 338). Step 250 eval: val_loss=6.92, val_ppl=1008. Step 500 checkpoint verified OK (1520 MB). gnorm stable 2-9 range. |
| ALB-064 | #46 | albor / entrenar | Training process dies silently — no crash detection, no watchdog, no recovery | Critical | FIXED | scripts/train-guard.sh: crash-resilient supervisor with exit code classification, GPU state capture, structured JSON crash reports, exponential backoff restart, heartbeat monitoring, pre-flight GPU health checks. Auto-diagnostic mode: detects async CUDA crash pattern, enables CUDA_LAUNCH_BLOCKING=1 on restart. Five Whys: CUDA driver crash → SIGABRT/SIGSEGV → bypasses Rust panic handler → no stderr output → no diagnosis. Root cause: ALB-065. |
| ALB-065 | #47 | entrenar / trueno | Missing stream.synchronize() before D2H gradient transfers — async CUDA crash | Critical | FIXED | compute_workspace_clip_scale() and compute_clip_scale() call cuMemcpyDtoH without synchronizing the non-blocking CUDA stream. cuMemcpyDtoH only synchronizes with the default stream, but trueno creates streams with CU_STREAM_NON_BLOCKING. Result: backward kernels not finished when gradient buffers are read → garbage clip scale → NaN/crash. Fix: stream.synchronize() at 3 locations before D2H transfers (entrenar@d3a3d26). |
| ALB-066 | #48 | albor config | gradient_accumulation: 128 makes training take 68.8 days on single GPU | Critical | FIXED | CudaTransformerTrainer does per-sequence optimizer updates (per-block interleaved backward+optimize). gradient_accumulation just increases sequences per “step” without changing update granularity. Fix: reduced 128→16→1, epochs from 38→5→1. New estimate: ~11.7h at 480 tok/s. |
| ALB-067 | #49 | entrenar / trueno | Per-block weight gradient clipping CPU bottleneck — 864 D2H transfers/step | High | FIXED (via ALB-078) | compute_workspace_clip_scale downloaded 9 buffers × 24 blocks × 4 seqs = 864 D2H transfers/step. Workaround: disabled per-block clipping (entrenar@eaadbc6). Proper fix: ALB-078 fused GPU clip pipeline (zero D2H, zero sync). grad_clip: 1.0 re-enabled in v3 config. |
| ALB-068 | #50 | entrenar | save_interval dead code — no intermediate checkpoint saving during CUDA training | Critical | FIXED | save_interval read from config, validated, but never used in train_loop_cuda(). Checkpoints only saved at training completion. 24h crash = total loss. Fix: manual batch loop with trainer.save() at save_interval boundaries (entrenar@d8dfab7). |
| ALB-069 | #51 | trueno | PTX selp_f32 argument order bug in fused cross-entropy kernels — training produces loss=0.0 | Critical | FIXED | selp_f32(pred, true_val, false_val) called as selp_f32(grad_target, grad_nontarget, is_target) — f32 values in pred slot, predicate in false_val slot. PTX JIT fails: “Arguments mismatch for instruction ‘selp’”. Same class as ALB-059 (constructor arg ordering). Fix: selp_f32(is_target, grad_target, grad_nontarget) at both call sites (trueno@10bec89, trueno#156). |
| ALB-070 | #52 | entrenar / albor config | save_interval YAML field ignored — bridge reads checkpoint.save_every, default=1 causes eval every step | Critical | FIXED | YAML bridge reads training.checkpoint.save_every, not training.save_interval. Default=1 → validation eval runs every step → eval_batch() crashes on long sequences (missing max_seq_len truncation). Two fixes: (1) YAML config moved to checkpoint.save_every: 25 (2) eval_batch() now truncates to max_seq_len (entrenar@5c4c2d8). Same class as ALB-060 (config field mismatch). |
| ALB-071 | #53 | entrenar | Embed gradient clipping disabled when grad_clip=None — NaN weights, loss=0.0 by step ~100 | Critical | FIXED | C-EMBED-GRAD-001 was gated behind if let Some(max_norm) = max_grad_norm. ALB-067 disabled grad_clip → embed activation gradients unclipped → CPU AdamW overflow → 304K NaN in embeddings, block weights ALL NaN. Fix: always clip with unwrap_or(1.0) + always compute LM head grad norm for observability (entrenar@d07d67d). Same class as ALB-044. |
| ALB-072 | #54 | entrenar | fp16 loss scaling causes NaN in early layers — gradient overflow in f32 backward | Critical | FIXED | fp16 GradScaler (scale=65536) multiplied into fused CE kernel’s loss_scale. All backward uses f32 GpuBuffers — no fp16 underflow risk, but 65536x scaling caused activation gradient overflow by layers 0-1. Five Whys: loss=0.0 → NaN blocks 0-1 → first optimizer step NaN → FP32 works/FP16 doesn’t → unnecessary 65536x scaling. Fix: exclude grad_scaler.scale() from loss_scale (entrenar@44d3e74). gnorm now matches FP32 baseline (2.29). |
| ALB-073 | #55 | trueno | fused_cross_entropy PTX selp argument mismatch — JIT compilation failure | High | FIXED | Same class as ALB-069. selp_f32(true_val, false_val, pred) instead of (pred, true_val, false_val) in fused cross-entropy kernel. Training fell back to non-fused path. Fix: trueno@10bec89. |
| ALB-074 | #56 | entrenar | Buffer overflow — 2048-token seq hits 1024-sized GPU buffer during eval | Critical | FIXED | Stale binary missed ALB-070 eval truncation fix. 2048-token pretokenized sequence passed to eval_single_sequence without max_seq_len truncation → slice overflow at cuda_trainer.rs:711 (2096128 > 1048576). Crashed at step 1183. Fix: binary rebuild with entrenar@5c4c2d8. |
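
The ALB-060/ALB-066 failure mode reduces to one line of arithmetic. A sketch, using ceiling division on the assumption that a trailing partial accumulation window still triggers an optimizer step; the real trainer's rounding may differ.

```rust
/// Optimizer steps available in one epoch (ALB-060 arithmetic):
/// each step consumes batch × gradient_accumulation sequences.
fn steps_per_epoch(num_seqs: usize, batch: usize, accum: usize) -> usize {
    let seqs_per_step = batch * accum;
    // Ceiling division: a trailing partial window is assumed to count.
    (num_seqs + seqs_per_step - 1) / seqs_per_step
}

fn main() {
    // v1 config: ~22K sequences, batch=4, accum=128 → 43 steps/epoch,
    // so with epochs: 1, max_steps=5000 can never be reached.
    let v1 = steps_per_epoch(22_000, 4, 128);
    assert_eq!(v1, 43);
    assert!(v1 < 5_000);
    println!("v1 steps/epoch = {v1}");
}
```

A config-validation contract like C-TRAINCFG-001 only has to check `epochs * steps_per_epoch >= max_steps` to catch this class of misconfiguration before training starts.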

11.5 Performance Optimization Gaps

| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|----|-------|-----------|-----|----------|--------|----------------------|
| ALB-075 | #57 | trueno / entrenar | cuBLAS tensor core GEMM integration — replaced PTX GEMMs with TF32 tensor cores | Critical | FIXED | trueno-gpu 0.4.24 (cuBLAS FFI, PR #165 merged), entrenar PR #233 merged. Measured: 1,485 tok/s (4.3% MFU), 1,379ms/step, 3.19x end-to-end speedup. Kernel-level: 74-142 TFLOP/s vs 4.8-6.1 PTX (12-27x). Contract: cublas-gemm-v1.yaml. |
| ALB-076 | #58 | entrenar | Forward RMSNorm per-row kernel launch — 97.1% of GPU time | Critical | FIXED | rms_norm_forward() launched one 32-thread kernel per row (2048 launches/norm × 49 norms = 100,352 launches/step). nsys profiling: 46.6s/50 steps, avg 9.3μs each. Fix: switched to BatchedVectorizedRmsNormKernel (single launch, 256 threads, blockIdx.y batch dispatch). entrenar PR #238 merged. Measured: forward 347ms→14ms (24.8×), step 1357ms→339ms (4×), MFU 4.4%→17.5% (4×). |
| ALB-077 | trueno #170, entrenar #239 | trueno / entrenar | cuBLAS tensor core GEMM produces NaN for transposed backward GEMMs | Critical | FIXED | CUBLAS_GEMM_DEFAULT_TENSOR_OP outputs ALL NaN for Trans/NoTrans and NoTrans/Trans operations when gradient magnitudes reach ~1e5 (block 18 of 24-layer backward). Forward NoTrans/NoTrans unaffected. Five Whys: gradient magnification through 24 layers triggers undocumented tensor core numerical fault. Fix: CUBLAS_DEFAULT_MATH + CUBLAS_COMPUTE_32F + CUBLAS_GEMM_DEFAULT (no tensor cores, SIMD path). Phase 5a (TF32) reverted. Measured: 5,216 tok/s (15.1% MFU), 5.9× over PTX baseline, 0 NaN. |
| ALB-078 | trueno #171, entrenar #240 | trueno / entrenar | Fused GPU gradient clipping — eliminate 26 stream syncs/step | High | IMPLEMENTED | Per-block clip calls stream.synchronize() + D2H 24×/step. New kernels: ClipScaleReduceKernel (single-CTA norm+clip_scale on GPU), GradientClipGpuScaleKernel (element-wise clip reading scale from GPU memory). Pipeline: 9× squared_sum_launch_into → 1× clip_scale_reduce → 9× gradient_clip_gpu_scale. Zero sync, zero D2H. IEEE 754 handles zero-norm (div→+inf, min→1.0). Compiles, awaiting dogfood. Expected: ~20% step time reduction. |
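
The MFU percentages quoted above follow the standard estimate of ~6 FLOPs per parameter per token for training. A sketch of the ratio; the parameter count and peak-TFLOP figure in the example are illustrative inputs, not the exact values behind the reported percentages.

```rust
/// Model FLOPs Utilization: achieved training FLOP/s over hardware peak,
/// using the standard ~6·N FLOPs-per-token estimate.
fn mfu(params: f64, tokens_per_sec: f64, peak_flops_per_sec: f64) -> f64 {
    6.0 * params * tokens_per_sec / peak_flops_per_sec
}

fn main() {
    // Illustrative: a 350M-param model at 5,216 tok/s on hardware
    // with an assumed ~82 peak TFLOP/s.
    let u = mfu(350e6, 5216.0, 82e12);
    println!("MFU ≈ {:.1}%", u * 100.0);
}
```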

11.6 Training Quality Gaps

| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|----|-------|-----------|-----|----------|--------|----------------------|
| ALB-079 | entrenar #241 | entrenar | CUDA trainer ignores lr_scheduler — constant lr after warmup | Critical | FIXED | CudaTransformerTrainer::current_lr() only had linear warmup; returned constant base_lr after warmup. YAML lr_scheduler: "cosine" parsed but never applied. Five Whys: val_loss plateau at 6.92 + gnorm collapse 3.0→0.13 at constant lr. Fix: cosine decay using max_steps + set_lr() for CPU embed optimizer (entrenar@297308d, PR #241). v4 training launched with cosine decay active. |
| ALB-080 | albor #61 | albor config | Effective batch size 48-128x too small for 350M training | Critical | FIXED | 4,096 tokens/step vs comparable runs: CodeParrot-small 196K, GPT-2 524K. Root cause: gradient_accumulation: 1 in v3 config. Fix: v4 config with gradient_accumulation: 32 → 131K tokens/step. Same wall-clock, 32x better gradient quality. Target: val_ppl < 100 by 1B tokens. v3 stopped at step 28K (val_ppl=1018, plateau); v4 launched with both fixes. |
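
The arithmetic behind ALB-080's numbers is just the product of three config fields:

```rust
/// Effective tokens per optimizer step, the quantity ALB-080 corrects:
/// micro-batch size × gradient accumulation × sequence length.
fn tokens_per_step(batch: usize, accum: usize, seq_len: usize) -> usize {
    batch * accum * seq_len
}

fn main() {
    assert_eq!(tokens_per_step(4, 1, 1024), 4_096);    // v3 config
    assert_eq!(tokens_per_step(4, 32, 1024), 131_072); // v4 config
    println!("v4 effective batch: {} tokens/step", tokens_per_step(4, 32, 1024));
}
```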

11.7 Data Pipeline Gaps

| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|----|-------|-----------|-----|----------|--------|----------------------|
| ALB-081 | aprender#418, realizar#136 | aprender | Streaming APR import + mmap reader — eliminate OOM on large models | Critical | FIXED | apr import loaded entire 67GB model into RAM (134GB as F32) → swap storm. apr tensors loaded entire .apr into Vec<u8> → 89GB RSS. Five Whys: no streaming write path, no mmap read path. Fix: AprV2StreamingWriter (temp file, peak RAM ~5GB), MappedFile + AprV2ReaderRef for reading (10.9MB RSS on 67GB file). Contract: streaming-reader-v1.yaml, FALSIFY-MMAP-001 verified. |
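
ALB-081's streaming write path boils down to never materializing the whole payload. A generic bounded-memory copy loop illustrates the principle; the real AprV2StreamingWriter and mmap reader are considerably more involved.

```rust
// Bounded-memory copy in the spirit of ALB-081's streaming writer:
// process arbitrarily large input in fixed-size chunks so peak RAM
// stays at the chunk size, not the payload size.
use std::io::{Read, Write};

fn stream_copy<R: Read, W: Write>(mut src: R, mut dst: W, chunk: usize) -> std::io::Result<u64> {
    let mut buf = vec![0u8; chunk];
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    let src: &[u8] = b"hello world";
    let mut dst: Vec<u8> = Vec::new();
    let n = stream_copy(src, &mut dst, 4)?; // 4-byte chunks, 3 reads
    assert_eq!(n, 11);
    assert_eq!(dst, b"hello world".to_vec());
    println!("copied {n} bytes in bounded memory");
    Ok(())
}
```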

11.8 Observability Gaps

| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|----|-------|-----------|-----|----------|--------|----------------------|
| ALB-082 | entrenar#246 | entrenar | Scaling law predictor — early convergence ceiling detection | High | FIXED | Fits Kaplan scaling law L(D) = a - b × ln(D) to eval checkpoints via OLS after 3+ data points. Predicts val_ppl at max_steps and warns if improvement < 10%. Would have flagged v4 plateau 20 GPU-hours earlier. Contract: scaling-law-prediction-v1.yaml. Implementation: entrenar PR #247 merged. |
| ALB-083 | albor#63 | albor | Data pipeline expansion — ingest CodeSearchNet Python | Medium | IN PROGRESS | CodeSearchNet Python downloaded (455K functions, 133M tokens). Pretokenized to 2048-length sequences (65K seqs). Merged with original data → 180M tokens total. v4 actually used pretokenized-1024-v3 (5.3B tokens from codeparrot-clean-2M), so data wasn’t the bottleneck — insufficient training steps was. |

11.9 Evaluation Gaps

| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|----|-------|-----------|-----|----------|--------|----------------------|
| ALB-084 | albor#64 | apr (aprender) | HumanEval pass@k evaluation — wire inference into apr eval | Critical | FIXED | apr eval --task humaneval --data humaneval.jsonl loads SafeTensors model via realizar, generates completions with forward_with_cache, truncates at function boundary, executes Python tests with timeout, reports pass@k. Contract: eval-humaneval-v1.yaml. Implementation: aprender PR #429 merged (aprender@a7b1da8c). Temperature sampling, per_problem_results JSON output. Verified end-to-end on v4 checkpoint. |
| ALB-085 | albor#65 | apr (aprender) | MBPP benchmark evaluation | High | FIXED | run_mbpp() in eval.rs. 974 problems, text→completion→test_list execution. Contract: eval-mbpp-v1.yaml. Reuses ALB-084 inference bridge (SafetensorsToAprConverter + forward_with_cache + execute_python_test). max_new_tokens=512, timeout=10s. |
| ALB-086 | albor#66 | entrenar | SafeTensors checkpoint saves 1D shapes — HuggingFace incompatible | Medium | FIXED | Contract falsification found: save_safetensors() saves all tensors as 1D [N] instead of 2D [out, in]. Fix: infer_all_tensor_shapes() derives proper shapes from norm weights + element count. entrenar PR #255 merged. Contract: checkpoint-inference-bridge-v1.yaml. |
| ALB-087 | albor#67 | entrenar | Automatic eval scheduling + best-model checkpoint tracking | High | FIXED | entrenar PR #254 merged. eval_interval + patience in TrainingParams, decoupled eval from save, best-model tracking (model-best.safetensors), early stopping. Will activate in v5 training with updated config. |
| ALB-088 | albor#68 | apr (aprender) | Multi-sample pass@k evaluation (n samples per problem) | High | FIXED | aprender PR #432 merged. --samples N --temperature T flags, unbiased pass@k estimator (Chen et al. 2021). Contract: multi-sample-passk-v1.yaml. Will dogfood on v5 checkpoint. |
| ALB-089 | albor#69 | entrenar/apr | GPU-accelerated inference for eval (CUDA forward pass) | High | DOGFOODING | --device cuda wired into apr eval --task humaneval/mbpp. Uses CudaTransformerTrainer::for_inference() + forward_logits(). No KV cache yet (O(n²) but still 20-40x faster than CPU). Awaiting dogfood when GPU is free from training. |
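
The unbiased estimator ALB-088 cites (Chen et al. 2021) is pass@k = 1 - C(n-c, k)/C(n, k), usually computed as a running product for numerical stability. A sketch of the formula itself (not aprender's code):

```rust
/// Unbiased pass@k estimator (Chen et al. 2021), as referenced by ALB-088:
/// given n generated samples of which c pass, the probability that at
/// least one of k randomly drawn samples passes.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    if n.saturating_sub(c) < k {
        return 1.0; // every size-k subset must contain a passing sample
    }
    // 1 - C(n-c, k)/C(n, k), as a stable product avoiding huge factorials.
    let mut fail_all = 1.0;
    for i in (n - c + 1)..=n {
        fail_all *= 1.0 - k as f64 / i as f64;
    }
    1.0 - fail_all
}

fn main() {
    assert!((pass_at_k(2, 1, 1) - 0.5).abs() < 1e-12);
    assert!((pass_at_k(4, 2, 2) - 5.0 / 6.0).abs() < 1e-12);
    assert_eq!(pass_at_k(10, 0, 5), 0.0); // no passing samples at all
    println!("pass@k estimator ok");
}
```

Averaging this quantity over all benchmark problems gives the reported pass@k; the naive "did any of k greedy samples pass" estimate is biased, which is why the combinatorial form is used.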

11.10 Training Infrastructure Gaps

| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|----|-------|-----------|-----|----------|--------|----------------------|
| ALB-091 | | entrenar | GPU-resident gradient accumulation — D2H bottleneck kills ga>1 throughput | Critical | FIXED | GpuGradientAccumulator accumulates gradients in GPU memory via inplace_add_gpu() (ResidualAddKernel). Zero D2H during micro-batch loop, ONE stream sync per optimizer step. Dogfooded: ga=8, batch=4 → 8.2K tok/s (23.7% MFU) vs previous CPU-side ga: 2.9K tok/s. VRAM cost: 1,520 MB for 350M model. |
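
The accumulation pattern behind ALB-091 can be shown with a CPU stand-in: gradients from each micro-batch are added into one resident buffer, and the buffer is only read out (and averaged) once per optimizer step. The type below is a hypothetical sketch, not entrenar's GpuGradientAccumulator; the real path keeps the buffer in VRAM via ResidualAddKernel.

```rust
/// CPU stand-in for GPU-resident gradient accumulation (ALB-091 sketch).
struct GradAccumulator {
    buf: Vec<f32>,
    micro_batches: usize,
}

impl GradAccumulator {
    fn new(len: usize) -> Self {
        Self { buf: vec![0.0; len], micro_batches: 0 }
    }

    /// In-place add of one micro-batch gradient (the VRAM-resident step).
    fn add(&mut self, grad: &[f32]) {
        for (b, g) in self.buf.iter_mut().zip(grad) {
            *b += g;
        }
        self.micro_batches += 1;
    }

    /// Average and reset — the single per-optimizer-step readout.
    fn take_mean(&mut self) -> Vec<f32> {
        let n = self.micro_batches.max(1) as f32;
        let out: Vec<f32> = self.buf.iter().map(|g| g / n).collect();
        self.buf.iter_mut().for_each(|g| *g = 0.0);
        self.micro_batches = 0;
        out
    }
}

fn main() {
    let mut acc = GradAccumulator::new(2);
    acc.add(&[1.0, 2.0]);
    acc.add(&[3.0, 4.0]);
    assert_eq!(acc.take_mean(), vec![2.0, 3.0]);
    println!("accumulated mean gradient ok");
}
```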

Gaps are added as they are discovered during implementation and dogfooding.