# 11. Gap Register
Every gap discovered during development is tracked here. Each gap maps to a specific upstream component, a GitHub issue (where one has been filed), and a clear acceptance criterion.
Lifecycle: Gap discovered → GitHub issue filed → implemented upstream → wired into apr → dogfooded in albor pipeline → FALSIFY/pmat verified → closed.
| Status | Meaning |
|---|---|
| OPEN | Gap identified, not yet implemented |
| IN PROGRESS | GitHub issue filed, work underway |
| IMPLEMENTED | Code landed upstream, awaiting dogfood validation |
| DOGFOODING | Implemented, being validated in albor pipeline |
| FIXED | Implemented upstream and meets the acceptance criterion |
| VERIFIED | Fix confirmed end-to-end in a real training run |
| CLOSED | Verified working end-to-end, issue closed |
## 11.1 Critical Path Gaps (Block the Improvement Ladder)
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-001 | #6 | apr (aprender) | apr tokenize plan/apply subcommand | Medium | FIXED | apr tokenize plan validates inputs + estimates time; apr tokenize apply trains BPE/WordPiece/Unigram tokenizer (aprender@90427205). Writes vocab.json + merges.txt. |
| ALB-006 | #7 | apr (aprender) | apr eval plan/apply benchmark harness | High | FIXED | apr eval --task code --data benchmark.jsonl evaluates code completion with pass@1 scoring. apr eval --task plan validates model + data exist. JSONL format with prompt/test/canonical_solution. Phase 1: structural validation. Phase 2: full inference (ALB-009 prerequisite). (aprender@4e61297e) |
| ALB-007 | #8 | entrenar | Parquet→LMBatch bridge via alimentar | Medium | FIXED | load_lm_batches_from_parquet() reads text or pre-tokenized Parquet (single file or directory of shards) via alimentar. Text columns tokenized with HfTokenizer. Column auto-detection (input_ids/token_ids for pre-tokenized, text/content/code for text). Gated behind parquet feature. (entrenar@a5a2fb7) |
| ALB-009 | #1 | apr (entrenar) | apr train plan/apply for pre-training from scratch | Critical | FIXED | apr train plan --task pretrain --config <yaml> validates config via entrenar, shows model architecture and training params. apr train apply --task pretrain --config <yaml> runs full pre-training via train_from_yaml() (TransformerTrainer + CausalLMLoss). Config updated to match entrenar TrainSpec schema. (aprender@d79ed943) |
| ALB-010 | #2 | realizar | Qwen3.5-35B-A3B MoE inference (teacher for distillation) | Critical | DOGFOODING | Steps 1-5b MERGED (PR #133): types, router, expert dispatch, forward integration, shared expert gate, architecture registration, config fields. Step 6 (PR #135): SafeTensors MoE weight loading — detect_model_prefix (ConditionalGeneration wrapper), extract_layer_generic_with_prefix, load_moe_weights (router, packed experts, shared expert), GPU adapter wiring. 15,054 tests pass. Remaining: end-to-end dogfood with Qwen3.5-35B-A3B model files. |
| ALB-011 | #3 | apr (entrenar + realizar) | apr distill plan/apply (precompute + train stages) | Critical | FIXED | apr distill --config <yaml> --plan validates config, shows teacher/student/training params. apr distill --config <yaml> --stage precompute inspects teacher, writes manifest. apr distill --config <yaml> --stage train validates precompute manifest, sets up KD training. Local DistillYamlConfig matches entrenar schema. (aprender@81dd4432) |
| ALB-018 | #19 | entrenar/alimentar | Fill-in-the-Middle (FIM) data transform (PSM/SPM) | High | FIXED | alimentar fim transform with PSM/SPM formats, configurable rate/seed (alimentar@290582d). Fim struct implements Transform trait for pipeline integration. |
| ALB-019 | #20 | alimentar | alimentar import local for local Python files | Medium | FIXED | alimentar import local subcommand now available (alimentar@265541b). Supports CSV/JSON/JSONL/Parquet format conversion. |
| ALB-020 | #21 | alimentar | alimentar mix with weighted upsampling | Medium | FIXED | alimentar mix with weighted sampling and upsampling now available (alimentar@64b1e92). Syntax: alimentar mix a.parquet:0.8 b.parquet:0.2 -o out.parquet. |
| ALB-021 | #22 | entrenar | Custom model architecture params in YAML | High | FIXED | ArchitectureOverrides struct carries YAML manifest architecture: params through bridge converter to TransformerConfig. Supports all fields: hidden_size, num_layers, num_heads, num_kv_heads, intermediate_size, vocab_size, max_seq_length, rms_norm_eps, rope_theta, use_bias. (entrenar@a414861) |
| ALB-022 | #23 | entrenar | Human-readable value shorthand in YAML configs | Low | FIXED | parse_human_usize() and deserialize_human_usize_opt support SI suffixes (32K, 1M, 10B, 1T), scientific notation (1e6), and fractional suffixes (1.5K). Applied to ArchitectureConfig and DataConfig fields. (entrenar@1cb0950) |
| ALB-023 | #24 | apr (aprender) | Plan/apply contract for all subcommands | High | FIXED | Every apr <cmd> action command now exposes plan mode: merge --plan, export --plan, publish --plan added to join existing train plan/apply, tokenize plan/apply, quantize --plan, finetune --plan, prune --plan, distill --plan, eval --task plan. Pre-dispatch contract validation skipped in plan mode. (aprender@526a1e4b) |
| ALB-024 | #25 | apr (aprender) | apr experiment view — interactive SQLite experiment browser | Medium | FIXED | apr experiment view --global opens ratatui TUI with run table, sparkline, and braille loss chart. --json mode for CI. Reads local or global ~/.entrenar/experiments.db. (aprender@1196d244) |
| ALB-025 | #26 | presentar + apr | apr monitor upgrade — presentar widgets for live training TUI | Medium | FIXED | TrainingDashboard composes presentar-terminal Meter, GpuPanel, Sparkline, Text, Border, Layout (ALB-057). TuiApp handles resize/Ctrl+C/diffing (ALB-047/048). WASM compilation deferred to ALB-026. (entrenar@0ad416e) |
| ALB-026 | #27 | presentar | WASM training dashboard — albor-dashboard.yaml | Medium | OPEN | Declarative YAML dashboard config that renders training metrics, experiment comparison, and model card via presentar serve. Embeddable in HuggingFace model card as static WASM artifact. |
| ALB-027 | #4 | forjar | task resource type for pipeline orchestration | Critical | FIXED | New forjar resource type: runs arbitrary command, tracks exit code, hashes output_artifacts for idempotency via b3sum, supports completion_check and timeout. Handlers: check_script (completion_check or artifact existence), apply_script (set -euo pipefail, working_dir, timeout), state_query_script (b3sum artifacts). Validation: command required, timeout > 0. (forjar@d14e633) |
| ALB-028 | #5 | apr (aprender) | apr pipeline plan/apply wrapping forjar DAG engine | Critical | FIXED | apr pipeline plan shows full DAG with 23 resources across 2 machines. apr pipeline apply converges via forjar engine. apr pipeline status shows state. apr pipeline validate checks manifest. Shells out to forjar binary (decoupled). (aprender@e653d5ca) |
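ALB-022's shorthand grammar (SI suffixes, scientific notation, fractional suffixes) is easiest to grasp from a sketch. The following is a hypothetical re-implementation for illustration only; entrenar's actual parse_human_usize() may differ in detail. It assumes ASCII input and a 64-bit usize.

```rust
// Hypothetical sketch of ALB-022's value shorthand; not entrenar's code.
fn parse_human_usize(s: &str) -> Option<usize> {
    let s = s.trim();
    if s.is_empty() || !s.is_ascii() {
        return None;
    }
    // Plain integers and scientific notation ("1e6") go through f64 parsing.
    if let Ok(f) = s.parse::<f64>() {
        return (f >= 0.0 && f.fract() == 0.0).then(|| f as usize);
    }
    // Otherwise split off a one-letter SI suffix: 32K, 1M, 10B, 1T, 1.5K.
    let (num, suffix) = s.split_at(s.len() - 1);
    let mult: f64 = match suffix.to_ascii_uppercase().as_str() {
        "K" => 1e3,
        "M" => 1e6,
        "B" => 1e9,
        "T" => 1e12,
        _ => return None,
    };
    let base: f64 = num.parse().ok()?; // fractional values like "1.5K" allowed
    Some((base * mult).round() as usize)
}

fn main() {
    assert_eq!(parse_human_usize("32K"), Some(32_000));
    assert_eq!(parse_human_usize("1e6"), Some(1_000_000));
    assert_eq!(parse_human_usize("1.5K"), Some(1_500));
    println!("ok");
}
```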
## 11.2 Distributed Training Gaps (Stretch / Future)
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-002 | #9 | repartir | Ring all-reduce implementation | High | OPEN | Gradient tensors synchronized across 2+ workers with <5% overhead |
| ALB-003 | #10 | entrenar | repartir integration for distributed training | High | OPEN | Training loop calls repartir::GradientSync for multi-worker training |
| ALB-004 | #11 | entrenar | Unified CUDA + wgpu backend dispatch | Medium | OPEN | Same training config runs on CUDA (4090) and wgpu (W5700X) |
| ALB-005 | #12 | trueno | wgpu backward pass (gradient WGSL shaders) | High | OPEN | Compute shaders for matmul_backward, gelu_backward, rmsnorm_backward, attention_backward |
| ALB-008 | #13 | repartir | Heterogeneous worker throughput balancing | Medium | OPEN | Workers with different GPU speeds get proportional workload |
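The core of ALB-002 is the reduce-scatter + all-gather schedule of ring all-reduce. Since the repartir implementation is still OPEN, the following is a single-process simulation of the schedule only: no networking, hypothetical function names, and per-step snapshots standing in for simultaneous sends.

```rust
// Single-process simulation of the ring all-reduce schedule (ALB-002).
// After reduce-scatter + all-gather every worker holds the element-wise sum.
fn ring_all_reduce(grads: &mut [Vec<f32>]) {
    let w = grads.len();
    let n = grads[0].len();
    assert_eq!(n % w, 0, "pad gradients so chunks divide evenly");
    let chunk = n / w;
    let span = |c: usize| c * chunk..(c + 1) * chunk;

    // Phase 1: reduce-scatter. At step s, worker r sends chunk (r - s) mod w
    // to its right neighbor, which accumulates it. After w-1 steps worker r
    // owns the fully reduced chunk (r + 1) mod w.
    for s in 0..w - 1 {
        let sends: Vec<(usize, usize, Vec<f32>)> = (0..w)
            .map(|r| {
                let c = (r + w - s) % w;
                ((r + 1) % w, c, grads[r][span(c)].to_vec())
            })
            .collect();
        for (dst, c, payload) in sends {
            for (i, v) in payload.into_iter().enumerate() {
                grads[dst][c * chunk + i] += v;
            }
        }
    }
    // Phase 2: all-gather. Fully reduced chunks circulate and overwrite.
    for s in 0..w - 1 {
        let sends: Vec<(usize, usize, Vec<f32>)> = (0..w)
            .map(|r| {
                let c = (r + 1 + w - s) % w;
                ((r + 1) % w, c, grads[r][span(c)].to_vec())
            })
            .collect();
        for (dst, c, payload) in sends {
            grads[dst][span(c)].copy_from_slice(&payload);
        }
    }
}

fn main() {
    let mut g = vec![vec![1.0; 4], vec![2.0; 4], vec![3.0; 4], vec![4.0; 4]];
    ring_all_reduce(&mut g);
    assert!(g.iter().all(|v| v.iter().all(|&x| x == 10.0)));
    println!("all workers hold the sum");
}
```

Each of the 2(w-1) steps moves n/w elements per worker, which is what makes the ring schedule bandwidth-optimal and relevant to the "<5% overhead" acceptance criterion.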
## 11.3 Quality & Verification Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-013 | #14 | provable-contracts | Knowledge distillation contract | High | DOGFOODING | knowledge-distillation-kernel-v1.yaml — committed and passes pv validate. 3 equations, 6 obligations, 5 falsification tests, 2 Kani harnesses. Needs binding to entrenar implementation. |
| ALB-014 | #15 | provable-contracts | BPE tokenizer contract | Medium | DOGFOODING | bpe-tokenizer-kernel-v1.yaml — committed and passes pv validate. Roundtrip invariant, FIM sentinel tests. Needs binding to aprender BPE. |
| ALB-015 | #16 | provable-contracts | Model merging contract (SLERP, TIES, DARE) | Medium | DOGFOODING | model-merging-kernel-v1.yaml — committed and passes pv validate. SLERP bound, DARE unbiased estimator. Needs binding. |
| ALB-016 | #17 | provable-contracts | Pruning contract (WANDA, magnitude) | Medium | DOGFOODING | pruning-kernel-v1.yaml — committed and passes pv validate. Sparsity invariant, score ordering. Needs binding. |
| ALB-017 | #18 | provable-contracts | Gradient accumulation contract | High | DOGFOODING | gradient-accumulation-kernel-v1.yaml — committed and passes pv validate. Numerical equivalence, gradient zeroing. Needs binding. |
Contract coverage report (pv coverage contracts): 8 contracts, 31 equations, 51 obligations, 34 falsification tests, 10 Kani harnesses, 100% obligation coverage. All contracts at impl=0/N — waiting for upstream bindings.
## 11.4 Dogfooding-Discovered Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-029 | #28 | batuta | batuta falsify false positives on project repos | Medium | FIXED | Fixed upstream in batuta@905a862: AI-01 searches configs/, AI-04 excludes book-output/, AI-05 detects pv/forjar validation. Score: 72.2% → 73.1%. |
| ALB-030 | #29 | batuta | batuta stack status fails without Cargo.toml | Low | FIXED | Fixed upstream in batuta@371557a: Falls back to binary detection, discovers 11 installed PAIML tools with versions. |
| ALB-031 | #30 | batuta | batuta hf search returns mock/placeholder data | Low | OPEN | batuta hf search model "code completion" returns live HuggingFace Hub results instead of placeholder models. |
| ALB-033 | #31 | apr (aprender) | apr tokenize → entrenar tokenizer.json format gap | Medium | DOGFOODING | apr tokenize apply produces vocab.json + merges.txt but entrenar expects HuggingFace tokenizer.json. Workaround: Python tokenizers lib. |
| ALB-034 | #32 | entrenar | max_steps config not respected in training loop | Medium | FIXED | max_steps wired through YAML manifest → bridge → TrainingParams → TransformerTrainConfig → trainer loop. Training stops when optimizer step count reaches limit (entrenar@07db101). |
| ALB-035 | #33 | entrenar | Does not write training_state.json during training | Medium | FIXED | Added train_epoch_with_callback() and per-step logging (~100 lines/epoch) in entrenar@5d41a96. |
| ALB-036 | #34 | apr (aprender) | BPE tokenizer normalizes whitespace | Medium | DOGFOODING | split_whitespace() pre-tokenizer destroys Python indentation. Workaround: ByteLevel BPE v2. |
| ALB-037 | #35 | realizar | SafeTensors inference ignores loaded weights | High | FIXED | Root cause chain: ALB-038 (no gradient flow) → ALB-043 (backward_ffn buffer overflow + wrong SwiGLU gradients). Secondary: entrenar didn’t save config.json (entrenar@6097780). Verified e2e: realizar run loads 350M trained checkpoint (218 tensors), generates tokens from learned weights. |
| ALB-038 | #36 | entrenar | Saves initialization weights, not trained weights | Critical | FIXED | Root cause: RMSNorm::forward_batched() created tensors with no backward op, blocking all gradient flow. Attention forward() also broke Q/K/V gradients. Fixed in entrenar@91ba9da (norm backward) and entrenar@1ede409 (attention backward). All 20 model parameters now receive gradients. |
| ALB-040 | #38 | entrenar | GPU-resident pretraining — wire CudaTransformerBlock into TransformerTrainer | Critical | VERIFIED | CudaTransformerTrainer in cuda_trainer.rs follows classify_pipeline.rs pattern. 3 PCIe transfers/step vs 16K. Auto-detect CUDA with graceful CPU fallback. Contract: training-gpu-kernel-v1.yaml. 350M verified: 50-step test loss 10.39→6.07, checkpoint valid, realizar loads + generates. Full training running (seq=1024, batch=4, accum=128). |
| ALB-041 | #39 | entrenar | D2D buffer size mismatch in CudaTransformerBlock backward_attention | High | FIXED | backward_attention() used gate_out (intermediate_size) as temp buffer for grad_hidden accumulation, but D2D copy requires exact size match. Fixed: use o_proj_out (hidden_size). Also added seq_len truncation and error logging in CudaTransformerTrainer. (entrenar@a48e3d2) |
| ALB-042 | #40 | entrenar | CudaTransformerTrainer runtime errors → silent loss=0.0 instead of CPU fallback | Medium | OPEN | When CUDA operations fail during training (e.g., VRAM contention), trainer should detect N consecutive failures and gracefully fall back to CPU mode. Currently reports loss=0.0 and saves garbage checkpoint. Workaround: CUDA_VISIBLE_DEVICES="". |
| ALB-043 | #41 | entrenar | backward_ffn buffer overflow + missing SwiGLU gradients | Critical | FIXED | Two bugs: (1) silu_backward wrote [S,I] output into [S,H] buffer (4× overflow → CUDA_ERROR_ILLEGAL_ADDRESS). (2) SwiGLU backward missing ×up factor in gate gradient; grad_up/grad_w_up completely absent (w_up never trained). Fixed with correct 10-step decomposition using elementwise_mul_forward, silu_forward, silu_backward. (entrenar@f7805f1) |
| ALB-044 | #42 | entrenar | Unclipped activation gradients + CPU optimizer hyperparameter mismatch cause 350M NaN | Critical | FIXED | Two bugs: (1) Activation gradient from block[0] backward (~1e35) unclipped — per-block clipping only applies to weight gradients in CudaGradWorkspace. (2) CPU AdamW used default_params(lr) (β₂=0.999, wd=0.01) instead of YAML config (β₂=0.95, wd=0.1) — 50× bias correction amplification overflows f32. Fixed: C-EMBED-GRAD-001 clips activation gradient before scatter-add; CPU optimizer matches YAML hyperparams. 350M now trains without NaN. |
| ALB-045 | — | entrenar | train_loop_cuda does not write training_state.json — apr monitor blind to pretraining | Critical | FIXED | write_training_snapshot() helper in src/config/train/loader.rs writes TrainingSnapshot to training_state.json on every log interval. Both train_loop_cuda and train_loop_cpu now emit Initializing→Running→Completed snapshots. Verified: apr monitor checkpoints/albor-base-350m/ shows live TUI with loss curve, GPU name, tok/s, progress during CUDA 350M pretraining. (entrenar@2ddc11c) |
| ALB-046 | — | entrenar | GPU telemetry all zeros in training_state.json — no live NVML/nvidia-smi data | High | FIXED | query_gpu_telemetry() shells out to nvidia-smi --query-gpu with CSV output, populates all GpuTelemetry fields. Wired into write_training_snapshot(). Verified: util=5%, VRAM=12.0G/24.0G, temp=41°C, power=94W/480W during 350M training (entrenar@9b53c13). |
| ALB-047 | — | entrenar | TUI monitor hardcodes width=80, no terminal resize handling | Medium | FIXED | Replaced hand-rolled renderer with presentar-terminal TuiApp. Gets terminal resize detection for free from crossterm backend + presentar’s smart diffing. TuiMonitorConfig.width/height retained for headless mode only (entrenar@9b53c13). |
| ALB-048 | — | entrenar | No signal handling in TUI monitor — Ctrl+C leaves cursor hidden | Medium | FIXED | presentar-terminal TuiApp::run() handles Ctrl+C/q with clean cursor restore, screen cleanup, and status message. No raw signal handlers needed — crossterm event loop + Drop impl (entrenar@9b53c13). |
| ALB-049 | — | entrenar | No keyboard input in TUI monitor — can’t scroll/pause/interact | Low | FIXED | presentar-terminal TuiApp provides crossterm event loop with q quit and Ctrl+C. Scroll/pause deferred to presentar widget-level interaction (GpuPanel, LossCurve already support focus). |
| ALB-050 | — | apr (aprender) | No apr runs ls — can’t list past training experiments | High | FIXED | apr runs ls reads local/global SQLite registry, shows table of runs with status, final loss, tok/s, duration. apr runs show <id> shows detailed metrics + hyperparameters. Supports --global, --json, --status filter. (aprender@91641f2e) |
| ALB-051 | — | apr (aprender) | No run comparison — can’t overlay loss curves from two runs | Medium | FIXED | apr runs diff <a> <b> shows side-by-side comparison: inline sparklines, loss trajectory overlay, config diff (only changed params), final metric comparison with verdict (winner by final loss). Supports --json for LLM agents. (aprender@9f9e9f63) |
| ALB-052 | — | entrenar | SQLite experiment tracking exists but not wired to pretraining | Medium | FIXED | PretrainTracker in config/train/loader.rs writes to both local and global SQLite stores. Uses existing SqliteBackend with ExperimentStorage trait. Logs experiment metadata, hyperparameters, and per-step metrics (loss, lr, tok/s). Best-effort — storage failures never block training. (entrenar@daa0afc) |
| ALB-053 | — | entrenar | HeadlessOutput JSON missing fields present in TUI | High | FIXED | HeadlessOutput now has full field parity with TUI: global_step, progress_percent, loss_history, lr_history, elapsed_seconds, optimizer_name, batch_size, model_path, checkpoint_path, executable_path, accuracy, samples_per_second, HeadlessSample. From<&TrainingSnapshot> populates all fields. All 6 headless tests pass. (entrenar@9b53c13) |
| ALB-054 | — | entrenar + apr | No multi-job monitoring — can’t watch multiple concurrent training runs | High | FIXED | apr monitor (no args) discovers active training runs from global SQLite registry (~/.entrenar/experiments.db). Checks for live training_state.json in registered output dirs. Lists active runs with experiment name, directory, run ID, start time. apr monitor <dir> attaches to specific run. Supports --json output for LLM agents. (aprender@91641f2e) |
| ALB-055 | — | entrenar | No local SQLite experiment DB per training run | High | FIXED | PretrainTracker opens <output_dir>/.entrenar/experiments.db for local per-experiment metrics history. Logs experiment metadata, hyperparameters (task, model, optimizer, lr, epochs, batch_size, seq_len, max_steps, device), and per-step metrics (loss, lr, tok/s). All best-effort via SqliteBackend. (entrenar@daa0afc) |
| ALB-056 | — | entrenar | No global SQLite experiment registry | High | FIXED | PretrainTracker opens ~/.entrenar/experiments.db for global cross-machine experiment registry. Same schema as local: experiment + run + hyperparams + per-step metrics. apr runs ls --global reads it. apr monitor (no args) discovers active runs from it. (entrenar@daa0afc) |
| ALB-057 | — | entrenar | Dashboard paints raw text instead of composing presentar widgets | Medium | FIXED | TrainingDashboard composes presentar-terminal widgets via Layout::rows(): Border for section panels, Meter for progress bar, GpuPanel for GPU telemetry (with GpuDevice/GpuProcess conversion from entrenar types), Sparkline for loss history, Text for info lines. Widget tree rebuilt each frame from snapshot. Panel verification wired into Brick::verify() via layout_can_render(). (entrenar@0ad416e) |
| ALB-058 | — | apr (aprender) | apr monitor --json flag missing | Medium | FIXED | apr monitor --json <dir> streams headless JSON output with full TUI parity (ALB-053). apr monitor --format text <dir> for human-readable log lines. --json flag overrides --format. Routes to HeadlessMonitor for JSON/text, TuiMonitor for TUI. (aprender@91641f2e) |
| ALB-059 | — | entrenar | GEMM backward constructor args n/k swapped — buffer overflow into optimizer states | Critical | FIXED | GemmBackwardAKernel::tiled_unrolled(m, k, n, tile) called with k and n swapped vs trueno constructor (m, n, k, tile_size). Bakes wrong stride constants into PTX: output stride = vocab_size (32768) instead of hidden_size (512) for LM head backward. Rows overflow 64× into adjacent VRAM (m_w_k, v_w_k of block 0). Negative values in v_w_k → sqrt(negative) = NaN in AdamW. Same bug in backward_b. Also zero-initialized all optimizer m/v buffers (cuMemAlloc returns uninitialized VRAM). (entrenar@846ae0c) |
| ALB-060 | — | entrenar / albor config | epochs: 1 exhausts data before max_steps reached — 350M trains only 43/5000 steps | Critical | CONFIG FIXED | Root cause: 22K seqs, batch=4, accum=128 → 43 steps/epoch, max_steps=5000 unreachable. Fix: C-TRAINCFG-001 contract + v2 config (pretrain-350m-v2.yaml) with 68K seqs, accum=1, steps_per_epoch=16994 >= 5000. v1 config also fixed with epochs=117. V2 training partially completed (ALB-063). |
| ALB-061 | #43 | albor docs | Monolithic spec stale — diverges from mdBook chapters | Medium | FIXED | scripts/generate-spec.sh regenerates docs/specifications/albor-llm-spec.md from mdBook chapters. make spec target added. |
| ALB-062 | #44 | albor docs | Stale spec chapters — §3 VRAM, §15/18 blockers, §16 repro, model card, intro | Medium | FIXED | All chapters updated to match reality: VRAM budget, ALB-025/037 no longer blockers, v2 pipeline in §16, ALB-060 context in model card and introduction. |
| ALB-063 | #45 | albor training | Retrain 350M with v2 config (corrected epochs + expanded data) | Critical | IN PROGRESS | ALB-069→072 all fixed. Training running: PID 1775202, ~4.4s/step (934 tok/s), save_interval=250, 5000 steps, ~11.8 GB VRAM. Loss 10.40→7.13 (step 169)→6.77 (step 338). Step 250 eval: val_loss=6.92, val_ppl=1008. Step 500 checkpoint verified OK (1520 MB). gnorm stable 2-9 range. |
| ALB-064 | #46 | albor / entrenar | Training process dies silently — no crash detection, no watchdog, no recovery | Critical | FIXED | scripts/train-guard.sh: crash-resilient supervisor with exit code classification, GPU state capture, structured JSON crash reports, exponential backoff restart, heartbeat monitoring, pre-flight GPU health checks. Auto-diagnostic mode: detects async CUDA crash pattern, enables CUDA_LAUNCH_BLOCKING=1 on restart. Five Whys: CUDA driver crash → SIGABRT/SIGSEGV → bypasses Rust panic handler → no stderr output → no diagnosis. Root cause: ALB-065. |
| ALB-065 | #47 | entrenar / trueno | Missing stream.synchronize() before D2H gradient transfers — async CUDA crash | Critical | FIXED | compute_workspace_clip_scale() and compute_clip_scale() call cuMemcpyDtoH without synchronizing the non-blocking CUDA stream. cuMemcpyDtoH only synchronizes with the default stream, but trueno creates streams with CU_STREAM_NON_BLOCKING. Result: backward kernels not finished when gradient buffers are read → garbage clip scale → NaN/crash. Fix: stream.synchronize() at 3 locations before D2H transfers (entrenar@d3a3d26). |
| ALB-066 | #48 | albor config | gradient_accumulation: 128 makes training take 68.8 days on single GPU | Critical | FIXED | CudaTransformerTrainer does per-sequence optimizer updates (per-block interleaved backward+optimize). gradient_accumulation just increases sequences per “step” without changing update granularity. Fix: reduced 128→16→1, epochs from 38→5→1. New estimate: ~11.7h at 480 tok/s. |
| ALB-067 | #49 | entrenar / trueno | Per-block weight gradient clipping CPU bottleneck — 864 D2H transfers/step | High | FIXED (via ALB-078) | compute_workspace_clip_scale downloaded 9 buffers × 24 blocks × 4 seqs = 864 D2H transfers/step. Workaround: disabled per-block clipping (entrenar@eaadbc6). Proper fix: ALB-078 fused GPU clip pipeline (zero D2H, zero sync). grad_clip: 1.0 re-enabled in v3 config. |
| ALB-068 | #50 | entrenar | save_interval dead code — no intermediate checkpoint saving during CUDA training | Critical | FIXED | save_interval read from config, validated, but never used in train_loop_cuda(). Checkpoints only saved at training completion. 24h crash = total loss. Fix: manual batch loop with trainer.save() at save_interval boundaries (entrenar@d8dfab7). |
| ALB-069 | #51 | trueno | PTX selp_f32 argument order bug in fused cross-entropy kernels — training produces loss=0.0 | Critical | FIXED | selp_f32(pred, true_val, false_val) called as selp_f32(grad_target, grad_nontarget, is_target) — f32 values in pred slot, predicate in false_val slot. PTX JIT fails: “Arguments mismatch for instruction ‘selp’”. Same class as ALB-059 (constructor arg ordering). Fix: selp_f32(is_target, grad_target, grad_nontarget) at both call sites (trueno@10bec89, trueno#156). |
| ALB-070 | #52 | entrenar / albor config | save_interval YAML field ignored — bridge reads checkpoint.save_every, default=1 causes eval every step | Critical | FIXED | YAML bridge reads training.checkpoint.save_every, not training.save_interval. Default=1 → validation eval runs every step → eval_batch() crashes on long sequences (missing max_seq_len truncation). Two fixes: (1) YAML config moved to checkpoint.save_every: 25 (2) eval_batch() now truncates to max_seq_len (entrenar@5c4c2d8). Same class as ALB-060 (config field mismatch). |
| ALB-071 | #53 | entrenar | Embed gradient clipping disabled when grad_clip=None — NaN weights, loss=0.0 by step ~100 | Critical | FIXED | C-EMBED-GRAD-001 was gated behind if let Some(max_norm) = max_grad_norm. ALB-067 disabled grad_clip → embed activation gradients unclipped → CPU AdamW overflow → 304K NaN in embeddings, block weights ALL NaN. Fix: always clip with unwrap_or(1.0) + always compute LM head grad norm for observability (entrenar@d07d67d). Same class as ALB-044. |
| ALB-072 | #54 | entrenar | fp16 loss scaling causes NaN in early layers — gradient overflow in f32 backward | Critical | FIXED | fp16 GradScaler (scale=65536) multiplied into fused CE kernel’s loss_scale. All backward uses f32 GpuBuffers — no fp16 underflow risk, but 65536x scaling caused activation gradient overflow by layers 0-1. Five Whys: loss=0.0 → NaN blocks 0-1 → first optimizer step NaN → FP32 works/FP16 doesn’t → unnecessary 65536x scaling. Fix: exclude grad_scaler.scale() from loss_scale (entrenar@44d3e74). gnorm now matches FP32 baseline (2.29). |
| ALB-073 | #55 | trueno | fused_cross_entropy PTX selp argument mismatch — JIT compilation failure | High | FIXED | Same class as ALB-069. selp_f32(true_val, false_val, pred) instead of (pred, true_val, false_val) in fused cross-entropy kernel. Training fell back to non-fused path. Fix: trueno@10bec89. |
| ALB-074 | #56 | entrenar | Buffer overflow — 2048-token seq hits 1024-sized GPU buffer during eval | Critical | FIXED | Stale binary missed ALB-070 eval truncation fix. 2048-token pretokenized sequence passed to eval_single_sequence without max_seq_len truncation → slice overflow at cuda_trainer.rs:711 (2096128 > 1048576). Crashed at step 1183. Fix: binary rebuild with entrenar@5c4c2d8. |
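ALB-060's failure mode reduces to integer arithmetic: optimizer steps per epoch equal sequences / (batch_size × gradient_accumulation), and max_steps is reachable only if epochs × steps_per_epoch covers it. A minimal version of the C-TRAINCFG-001 check, with hypothetical helper names:

```rust
// Worked check of the ALB-060 config trap (hypothetical helpers, not the
// actual C-TRAINCFG-001 contract code).
fn steps_per_epoch(sequences: usize, batch: usize, accum: usize) -> usize {
    sequences / (batch * accum)
}

fn max_steps_reachable(
    sequences: usize,
    batch: usize,
    accum: usize,
    epochs: usize,
    max_steps: usize,
) -> bool {
    epochs * steps_per_epoch(sequences, batch, accum) >= max_steps
}

fn main() {
    // v1 config: ~22K seqs, batch=4, accum=128 -> ~43 steps/epoch, so a
    // single epoch can never reach max_steps=5000.
    assert!(!max_steps_reachable(22_000, 4, 128, 1, 5_000));
    // v2 config: ~68K seqs, accum=1 -> ~17K steps/epoch, 5000 is reachable.
    assert!(max_steps_reachable(68_000, 4, 1, 1, 5_000));
    println!("config check ok");
}
```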
## 11.5 Performance Optimization Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-075 | #57 | trueno / entrenar | cuBLAS tensor core GEMM integration — replaced PTX GEMMs with TF32 tensor cores | Critical | FIXED | trueno-gpu 0.4.24 (cuBLAS FFI, PR #165 merged), entrenar PR #233 merged. Measured: 1,485 tok/s (4.3% MFU), 1,379ms/step, 3.19x end-to-end speedup. Kernel-level: 74-142 TFLOP/s vs 4.8-6.1 PTX (12-27x). Contract: cublas-gemm-v1.yaml. |
| ALB-076 | #58 | entrenar | Forward RMSNorm per-row kernel launch — 97.1% of GPU time | Critical | FIXED | rms_norm_forward() launched one 32-thread kernel per row (2048 launches/norm × 49 norms = 100,352 launches/step). nsys profiling: 46.6s/50 steps, avg 9.3μs each. Fix: switched to BatchedVectorizedRmsNormKernel (single launch, 256 threads, blockIdx.y batch dispatch). entrenar PR #238 merged. Measured: forward 347ms→14ms (24.8×), step 1357ms→339ms (4×), MFU 4.4%→17.5% (4×). |
| ALB-077 | trueno #170, entrenar #239 | trueno / entrenar | cuBLAS tensor core GEMM produces NaN for transposed backward GEMMs | Critical | FIXED | CUBLAS_GEMM_DEFAULT_TENSOR_OP outputs ALL NaN for Trans/NoTrans and NoTrans/Trans operations when gradient magnitudes reach ~1e5 (block 18 of 24-layer backward). Forward NoTrans/NoTrans unaffected. Five Whys: gradient magnification through 24 layers triggers undocumented tensor core numerical fault. Fix: CUBLAS_DEFAULT_MATH + CUBLAS_COMPUTE_32F + CUBLAS_GEMM_DEFAULT (no tensor cores, SIMD path). Phase 5a (TF32) reverted. Measured: 5,216 tok/s (15.1% MFU), 5.9× over PTX baseline, 0 NaN. |
| ALB-078 | trueno #171, entrenar #240 | trueno / entrenar | Fused GPU gradient clipping — eliminate 26 stream syncs/step | High | IMPLEMENTED | Per-block clip calls stream.synchronize() + D2H 24×/step. New kernels: ClipScaleReduceKernel (single-CTA norm+clip_scale on GPU), GradientClipGpuScaleKernel (element-wise clip reading scale from GPU memory). Pipeline: 9× squared_sum_launch_into → 1× clip_scale_reduce → 9× gradient_clip_gpu_scale. Zero sync, zero D2H. IEEE 754 handles zero-norm (div→+inf, min→1.0). Compiles, awaiting dogfood. Expected: ~20% step time reduction. |
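The clip math that ALB-067/ALB-078 move onto the GPU is small: scale = min(1, max_norm / global_norm), with IEEE 754 division making the zero-norm case safe (max_norm / 0.0 = +inf, and min(+inf, 1.0) = 1.0). A host-side sketch of the arithmetic, illustrative only and not the ClipScaleReduceKernel itself:

```rust
// Host-side sketch of the fused clip-scale arithmetic (ALB-078).
fn clip_scale(grads: &[f32], max_norm: f32) -> f32 {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    // +inf when norm == 0, which min() clamps to 1.0: no branch needed.
    (max_norm / norm).min(1.0)
}

fn clip_in_place(grads: &mut [f32], max_norm: f32) {
    let s = clip_scale(grads, max_norm);
    for g in grads.iter_mut() {
        *g *= s;
    }
}

fn main() {
    let mut g = vec![3.0_f32, 4.0]; // norm = 5, so scale = 0.2
    clip_in_place(&mut g, 1.0);
    assert!((g[0] - 0.6).abs() < 1e-6 && (g[1] - 0.8).abs() < 1e-6);

    let mut z = vec![0.0_f32; 4]; // zero norm: scale clamps to 1.0, no NaN
    clip_in_place(&mut z, 1.0);
    assert!(z.iter().all(|&x| x == 0.0));
    println!("clip ok");
}
```

On the GPU the same two functions become the norm reduction and the element-wise scale kernel, which is what eliminates the per-block sync and D2H traffic.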
## 11.6 Training Quality Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-079 | entrenar #241 | entrenar | CUDA trainer ignores lr_scheduler — constant lr after warmup | Critical | FIXED | CudaTransformerTrainer::current_lr() only had linear warmup; returned constant base_lr after warmup. YAML lr_scheduler: "cosine" parsed but never applied. Five Whys: val_loss plateau at 6.92 + gnorm collapse 3.0→0.13 at constant lr. Fix: cosine decay using max_steps + set_lr() for CPU embed optimizer (entrenar@297308d, PR #241). v4 training launched with cosine decay active. |
| ALB-080 | albor #61 | albor config | Effective batch size 48-128x too small for 350M training | Critical | FIXED | 4,096 tokens/step vs comparable runs: CodeParrot-small 196K, GPT-2 524K. Root cause: gradient_accumulation: 1 in v3 config. Fix: v4 config with gradient_accumulation: 32 → 131K tokens/step. Same wall-clock, 32x better gradient quality. Target: val_ppl < 100 by 1B tokens. v3 stopped at step 28K (val_ppl=1018, plateau); v4 launched with both fixes. |
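The ALB-079 schedule is standard linear warmup plus cosine decay, and ALB-080's effective batch is plain multiplication: 1024 seq_len × 4 batch × 32 accumulation = 131,072 tokens/step. A sketch of the schedule shape, with a hypothetical signature rather than entrenar's current_lr():

```rust
// Warmup + cosine decay sketch in the shape ALB-079 describes
// (hypothetical signature; not entrenar's implementation).
fn lr_at(step: usize, warmup: usize, max_steps: usize, base_lr: f64, min_lr: f64) -> f64 {
    if step < warmup {
        // Linear warmup from ~0 up to base_lr.
        base_lr * (step as f64 + 1.0) / warmup as f64
    } else {
        // Cosine decay from base_lr down to min_lr over the remaining steps.
        let t = (step - warmup) as f64 / (max_steps - warmup) as f64;
        min_lr + 0.5 * (base_lr - min_lr) * (1.0 + (std::f64::consts::PI * t).cos())
    }
}

fn main() {
    let (w, m, base, min) = (100, 5_000, 3e-4, 3e-5);
    assert!(lr_at(0, w, m, base, min) < lr_at(99, w, m, base, min)); // rising warmup
    assert!((lr_at(w, w, m, base, min) - base).abs() < 1e-12); // peak after warmup
    assert!((lr_at(m, w, m, base, min) - min).abs() < 1e-12); // decays to min_lr
    // ALB-080 arithmetic: tokens per optimizer step.
    assert_eq!(1024 * 4 * 32, 131_072);
    println!("schedule ok");
}
```

Without the else branch this collapses to the pre-fix behavior ALB-079 describes: constant base_lr after warmup.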
## 11.7 Data Pipeline Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-081 | aprender#418, realizar#136 | aprender | Streaming APR import + mmap reader — eliminate OOM on large models | Critical | FIXED | apr import loaded entire 67GB model into RAM (134GB as F32) → swap storm. apr tensors loaded entire .apr into Vec<u8> → 89GB RSS. Five Whys: no streaming write path, no mmap read path. Fix: AprV2StreamingWriter (temp file, peak RAM ~5GB), MappedFile + AprV2ReaderRef for reading (10.9MB RSS on 67GB file). Contract: streaming-reader-v1.yaml, FALSIFY-MMAP-001 verified. |
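The ALB-081 fix is the classic bounded-buffer streaming pattern: peak RAM is the size of one copy buffer, not the model. A std-only sketch of the idea (illustrative; AprV2StreamingWriter's actual tensor and format handling is far more involved):

```rust
// Bounded-memory streaming copy: the pattern behind ALB-081's fix.
use std::fs::File;
use std::io::{BufWriter, Read, Write};

fn stream_copy(mut src: impl Read, dst: &mut impl Write, buf_len: usize) -> std::io::Result<u64> {
    let mut buf = vec![0u8; buf_len]; // peak RAM is buf_len, not source size
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            return Ok(total);
        }
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
}

fn main() -> std::io::Result<()> {
    // 4 MB of synthetic "tensor" bytes streamed through a 64 KB buffer.
    let src: Vec<u8> = (0..1_000_000u32).flat_map(|x| x.to_le_bytes()).collect();
    let path = std::env::temp_dir().join("alb081-demo.bin");
    let mut out = BufWriter::new(File::create(&path)?);
    let written = stream_copy(&src[..], &mut out, 64 * 1024)?;
    out.flush()?;
    assert_eq!(written, 4_000_000);
    std::fs::remove_file(&path)?;
    println!("streamed ok");
    Ok(())
}
```

The mmap read path is the mirror image: the OS pages tensor bytes in on demand, so resident memory stays near zero regardless of file size (the 10.9 MB RSS on a 67 GB file cited above).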
## 11.8 Observability Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-082 | entrenar#246 | entrenar | Scaling law predictor — early convergence ceiling detection | High | FIXED | Fits Kaplan scaling law L(D) = a - b × ln(D) to eval checkpoints via OLS after 3+ data points. Predicts val_ppl at max_steps and warns if improvement < 10%. Would have flagged v4 plateau 20 GPU-hours earlier. Contract: scaling-law-prediction-v1.yaml. Implementation: entrenar PR #247 merged. |
| ALB-083 | albor#63 | albor | Data pipeline expansion — ingest CodeSearchNet Python | Medium | IN PROGRESS | CodeSearchNet Python downloaded (455K functions, 133M tokens). Pretokenized to 2048-length sequences (65K seqs). Merged with original data → 180M tokens total. v4 actually used pretokenized-1024-v3 (5.3B tokens from codeparrot-clean-2M), so data wasn’t the bottleneck — insufficient training steps was. |
## 11.9 Evaluation Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-084 | albor#64 | apr (aprender) | HumanEval pass@k evaluation — wire inference into apr eval | Critical | FIXED | apr eval --task humaneval --data humaneval.jsonl loads SafeTensors model via realizar, generates completions with forward_with_cache, truncates at function boundary, executes Python tests with timeout, reports pass@k. Contract: eval-humaneval-v1.yaml. Implementation: aprender PR #429 merged (aprender@a7b1da8c). Temperature sampling, per_problem_results JSON output. Verified end-to-end on v4 checkpoint. |
| ALB-085 | albor#65 | apr (aprender) | MBPP benchmark evaluation | High | FIXED | run_mbpp() in eval.rs. 974 problems, text→completion→test_list execution. Contract: eval-mbpp-v1.yaml. Reuses ALB-084 inference bridge (SafetensorsToAprConverter + forward_with_cache + execute_python_test). max_new_tokens=512, timeout=10s. |
| ALB-086 | albor#66 | entrenar | SafeTensors checkpoint saves 1D shapes — HuggingFace incompatible | Medium | FIXED | Contract falsification found: save_safetensors() saves all tensors as 1D [N] instead of 2D [out, in]. Fix: infer_all_tensor_shapes() derives proper shapes from norm weights + element count. entrenar PR #255 merged. Contract: checkpoint-inference-bridge-v1.yaml. |
| ALB-087 | albor#67 | entrenar | Automatic eval scheduling + best-model checkpoint tracking | High | FIXED | entrenar PR #254 merged. eval_interval + patience in TrainingParams, decoupled eval from save, best-model tracking (model-best.safetensors), early stopping. Will activate in v5 training with updated config. |
| ALB-088 | albor#68 | apr (aprender) | Multi-sample pass@k evaluation (n samples per problem) | High | FIXED | aprender PR #432 merged. --samples N --temperature T flags, unbiased pass@k estimator (Chen et al. 2021). Contract: multi-sample-passk-v1.yaml. Will dogfood on v5 checkpoint. |
| ALB-089 | albor#69 | entrenar/apr | GPU-accelerated inference for eval (CUDA forward pass) | High | DOGFOODING | --device cuda wired into apr eval --task humaneval/mbpp. Uses CudaTransformerTrainer::for_inference() + forward_logits(). No KV cache yet (O(n²) but still 20-40x faster than CPU). Awaiting dogfood when GPU is free from training. |
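The ALB-088 estimator is the unbiased pass@k from Chen et al. 2021: for n samples with c passing, pass@k = 1 - C(n-c, k)/C(n, k), computed in product form for numerical stability. An illustrative re-implementation (not aprender's exact code):

```rust
// Unbiased pass@k estimator (Chen et al. 2021), as adopted by ALB-088.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    if n - c < k {
        // Fewer than k failures: every size-k draw contains a passing sample.
        return 1.0;
    }
    // 1 - C(n-c, k)/C(n, k) = 1 - prod_{i=n-c+1}^{n} (1 - k/i)
    let mut fail_all = 1.0;
    for i in (n - c + 1)..=n {
        fail_all *= 1.0 - k as f64 / i as f64;
    }
    1.0 - fail_all
}

fn main() {
    assert_eq!(pass_at_k(10, 0, 1), 0.0); // nothing passes
    assert_eq!(pass_at_k(10, 10, 1), 1.0); // everything passes
    assert!((pass_at_k(10, 5, 1) - 0.5).abs() < 1e-12); // half pass, k=1
    assert!(pass_at_k(10, 5, 5) > pass_at_k(10, 5, 1)); // larger k helps
    println!("pass@k ok");
}
```

The product form avoids the overflow that a naive binomial-coefficient computation hits at realistic n.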
## 11.10 Training Infrastructure Gaps
| ID | Issue | Component | Gap | Severity | Status | Acceptance Criterion |
|---|---|---|---|---|---|---|
| ALB-091 | — | entrenar | GPU-resident gradient accumulation — D2H bottleneck kills ga>1 throughput | Critical | FIXED | GpuGradientAccumulator accumulates gradients in GPU memory via inplace_add_gpu() (ResidualAddKernel). Zero D2H during micro-batch loop, ONE stream sync per optimizer step. Dogfooded: ga=8, batch=4 → 8.2K tok/s (23.7% MFU) vs previous CPU-side ga: 2.9K tok/s. VRAM cost: 1,520 MB for 350M model. |
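ALB-091's pattern, stripped of CUDA specifics, is accumulate-in-place across micro-batches followed by one mean and one optimizer hand-off per step. A host-side sketch; the real fix does the same with GPU-resident buffers via inplace_add_gpu(), which is what removes the per-micro-batch D2H transfers:

```rust
// Host-side sketch of GPU-resident gradient accumulation (ALB-091).
// Hypothetical types; illustrates the schedule, not entrenar's API.
struct GradAccumulator {
    acc: Vec<f32>,
    micro_batches: usize,
}

impl GradAccumulator {
    fn new(len: usize) -> Self {
        Self { acc: vec![0.0; len], micro_batches: 0 }
    }

    /// In-place add per micro-batch (on GPU in the real fix: no host copy).
    fn accumulate(&mut self, grad: &[f32]) {
        for (a, g) in self.acc.iter_mut().zip(grad) {
            *a += g;
        }
        self.micro_batches += 1;
    }

    /// Average once per optimizer step, then reset for the next step.
    fn take_mean(&mut self) -> Vec<f32> {
        let inv = 1.0 / self.micro_batches as f32;
        let mean = self.acc.iter().map(|a| a * inv).collect();
        self.acc.iter_mut().for_each(|a| *a = 0.0);
        self.micro_batches = 0;
        mean
    }
}

fn main() {
    let mut ga = GradAccumulator::new(3);
    ga.accumulate(&[1.0, 2.0, 3.0]);
    ga.accumulate(&[3.0, 2.0, 1.0]);
    assert_eq!(ga.take_mean(), vec![2.0, 2.0, 2.0]);
    println!("accumulation ok");
}
```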
Gaps are added as they are discovered during implementation and dogfooding.