
17. Success Criteria

Minimum Viable (Phase 3 complete)

  • 350M base model trained on 4090 to convergence (target: ~10B tokens; current: 139M v2 dataset)
  • FIM (fill-in-the-middle) training implemented and validated (ALB-018 FIXED — alimentar fim verified)
  • HumanEval pass@1 > 8% (baseline Python capability, beat random)
  • HumanEval-FIM working (model can infill Python code)
  • Entire pipeline uses only sovereign stack components
  • All training artifacts reproducible from spec
  • All existing kernel contracts pass pv audit (Level 2+)
  • pmat comply check passes on all modified components
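The FIM criterion above (ALB-018) rests on rewriting training documents into prefix-suffix-middle order so the model learns to infill. As a minimal sketch of that transformation: the sentinel token strings below are hypothetical placeholders, since the actual tokens used by alimentar's FIM implementation are not specified in this document.

```python
import random

# Hypothetical sentinel tokens; the actual sentinels used by alimentar's
# FIM implementation are not specified here.
FIM_PRE, FIM_SUF, FIM_MID = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def fim_transform(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Rewrite a document into PSM (prefix-suffix-middle) order with
    probability fim_rate; otherwise leave it as plain causal-LM text."""
    if rng.random() >= fim_rate or len(doc) < 3:
        return doc
    # Pick two cut points splitting the document into prefix/middle/suffix.
    i, j = sorted(rng.sample(range(1, len(doc)), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # PSM order: the model is trained to produce `middle` after seeing
    # both the prefix and the suffix.
    return f"{FIM_PRE}{prefix}{FIM_SUF}{suffix}{FIM_MID}{middle}"

rng = random.Random(0)
sample = fim_transform("def add(a, b):\n    return a + b\n", rng, fim_rate=1.0)
```

At inference time the prompt carries the prefix and suffix with the middle left for the model to generate, which is what HumanEval-FIM exercises.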

Blocker status for Phase 3 completion (most now resolved):

  • ALB-038 (Critical): entrenar saves initialization weights, not trained weights FIXED (entrenar@91ba9da, @1ede409)
  • ALB-035: No per-step loss logging during training FIXED (entrenar@5d41a96)
  • ALB-041: D2D buffer mismatch in backward_attention FIXED (entrenar@a48e3d2)
  • ALB-037: realizar ignores loaded weights FIXED (e2e verified: realizar run loads 350M trained checkpoint, generates tokens from 218 tensors)
  • ALB-043 (Critical): backward_ffn buffer overflow + missing SwiGLU gradients FIXED (entrenar@f7805f1)
  • ALB-044 (Critical): activation gradient clipping + CPU optimizer hyperparams FIXED (entrenar@86eec38)
  • ALB-059 (Critical): GEMM backward constructor n/k swapped — buffer overflow into optimizer states FIXED (entrenar@846ae0c)
  • ALB-040: GPU-resident pretraining VERIFIED — 350M CUDA test: 50 steps, loss 10.39→5.92, checkpoint valid, realizar inference works
  • ALB-042: CUDA runtime errors produce silent loss=0.0 — OPEN (workaround: CUDA_VISIBLE_DEVICES="")
  • ALB-069 (Critical): PTX selp_f32 argument order in fused cross-entropy FIXED (trueno@10bec89)
  • ALB-060 (Critical): Training ran only 43/5000 steps (epochs=1). CONFIG FIXED: C-TRAINCFG-001 contract + v2 config. V2 training (ALB-063) restarted after ALB-069 fix — PID 106929, loss=10.39 at step 1.

350M CUDA test results (50 steps, post ALB-059 fix):

  • Loss: 10.39 → 5.92 (best: 5.53) — clear convergence with correct GEMM backward
  • Training time: ~400s (~8s/step) with PTX; ~26s (~0.5s/step) with cuBLAS (ALB-075/077)
  • Checkpoint: 1.59 GB SafeTensors, 218 tensors, config.json saved
  • Checkpoint validation: PASS (weights trained, layers distinct)
  • realizar inference: loads model, generates tokens (gibberish at 50 steps — expected)
  • Perplexity: 31,926 (finite; random baseline ~32,768 for vocab 32K)
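The random-baseline figure cited above follows from perplexity equaling the vocabulary size under a uniform next-token distribution. A quick check of the arithmetic:

```python
import math

vocab = 32_768  # 32K vocabulary (2**15)

# A model assigning uniform probability 1/vocab to every token has
# cross-entropy log(vocab) nats/token, so perplexity exp(log(vocab)) = vocab.
uniform_ce = -math.log(1.0 / vocab)
uniform_ppl = math.exp(uniform_ce)
assert round(uniform_ppl) == 32_768

# log(32768) = 10.397 nats, which matches the observed step-1 loss of 10.39.
measured_ppl = 31_926
measured_ce = math.log(measured_ppl)  # ~10.37 nats: barely below random
```

A finite perplexity just under the uniform baseline is exactly what an untrained but numerically healthy model should show, which is why the 50-step result counts as a validity check rather than a capability claim.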

350M v3 training (250K steps, codeparrot-clean, ALB-077 fix) — STOPPED:

  • Final: step 28K, loss=6.43, val_ppl=1018, 6.7K tok/s, 19.3% MFU
  • Plateau since step 12K — val_ppl stalled at ~1000, gnorm collapsed 3.0→0.13
  • Root cause: ALB-079 (constant lr after warmup, no cosine decay) + ALB-080 (4K tokens/step, 48-128x too small)
  • Checkpoints: step 1K-28K (1520 MB each, all verified OK)
  • No NaN in 28K steps (ALB-077: tensor cores disabled, CUBLAS_DEFAULT_MATH)
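ALB-079's root cause was a constant learning rate after warmup. A minimal sketch of the warmup-then-cosine schedule that v4 adopts: the peak lr of 3e-4 matches the v4 log line, while the warmup length, total steps, and floor lr are illustrative assumptions, not entrenar's actual configuration.

```python
import math

def lr_at(step: int, peak_lr: float = 3e-4, warmup: int = 500,
          total: int = 250_000, min_lr: float = 3e-5) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    peak_lr matches the v4 logs; warmup, total, and min_lr are
    illustrative values, not entrenar's real config."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    # Progress through the decay phase, in [0, 1].
    t = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```

With a constant post-warmup lr (the ALB-079 behavior), the schedule never anneals, which is consistent with the gnorm collapse and val_ppl stall observed from step 12K.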

350M v4 training (ALB-079 + ALB-080 fixes) — RESUMED from step 500:

  • Fixes: cosine LR decay (entrenar PR #241) + gradient_accumulation=32 (131K tokens/step)
  • Original run: 500 steps, val_ppl=1032.7 (matched v3's plateau val_ppl using only 57% of its token budget)
  • System reboot at step 553; resumed from step-500 checkpoint
  • Extended resume: step 350 of the resumed run (cumulative step 850), best loss=5.69 at step 262
  • 111M tokens processed (2.1% of 5.3B available); loss plateau at mean ~6.65
  • Cosine decay just engaging (lr 3.00e-4→2.98e-4); expect plateau break at step 1000+
  • ZClip catching gradient spikes (z=2.0–4.0), gnorm healthy 0.05–0.32
  • Throughput: 3,564–3,569 tok/s steady, 10.3% MFU, 14-16 GB / 24 GB VRAM
  • Target: val_ppl < 100 by 1B tokens (~60 hours remaining)
  • Same hardware (RTX 4090), same data (codeparrot-clean, 5.3B tokens available)
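The ALB-080 fix raises the effective batch via gradient accumulation: 4,096 tokens per micro-batch times 32 accumulation steps gives the 131K tokens per optimizer step quoted above. A framework-agnostic sketch of the mechanism, using a toy mean-squared-error gradient (entrenar's actual training loop is not shown in this document):

```python
# Pure-Python sketch of gradient accumulation. With 4,096 tokens per
# micro-batch and 32 accumulation steps, one optimizer step covers
# 131,072 tokens, matching the v4 configuration.
ACCUM_STEPS = 32
TOKENS_PER_MICROBATCH = 4_096
assert ACCUM_STEPS * TOKENS_PER_MICROBATCH == 131_072

def grad(w: float, batch: list[float]) -> float:
    """Gradient of the mean of 0.5*(w - x)**2 over one micro-batch."""
    return sum(w - x for x in batch) / len(batch)

def train_step(w: float, micro_batches: list[list[float]], lr: float) -> float:
    """One optimizer step: average the gradients of all micro-batches,
    then apply a single update -- equivalent to one large batch."""
    g = sum(grad(w, mb) for mb in micro_batches) / len(micro_batches)
    return w - lr * g
```

For equal-size micro-batches, the accumulated update is identical to a single update on their union, so accumulation buys the larger effective batch without the VRAM cost of holding it all at once.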

Good (Phase 5 complete)

  • Distillation from Qwen3.5-35B-A3B demonstrated (ALB-010); fallback: Qwen2.5-Coder-3B (dense)
  • albor-distill-350m outperforms albor-base-350m on all code benchmarks
  • HumanEval pass@1 > 15% (beat CodeGen-350M-mono’s 12.8% via distillation from 35B MoE teacher)
  • MBPP pass@1 > 12%
  • FIM infill working (qualitatively: model can complete Python between prefix and suffix)
  • KD contract at Level 4 (Kani-proved KL non-negativity)
  • All FALSIFY-ALBOR tests pass (001-006)
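The KD contract's proved property is KL non-negativity (Gibbs' inequality). As a numeric sketch of the forward KL between temperature-scaled teacher and student distributions, as used in logit distillation generally; the actual KD loss in the albor pipeline is not specified in this document, and the logits below are arbitrary examples:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax (max-subtracted for numerical stability)."""
    m = max(l / temperature for l in logits)
    exps = [math.exp(l / temperature - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats; non-negative by Gibbs' inequality,
    and zero iff p == q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Arbitrary example logits, not real model outputs.
teacher = softmax([2.0, 1.0, 0.1], temperature=2.0)
student = softmax([1.5, 1.2, 0.3], temperature=2.0)
assert kl(teacher, student) >= 0.0
assert abs(kl(teacher, teacher)) < 1e-12
```

Non-negativity with equality only at p = q is what makes KL a usable distillation objective: the loss bottoms out exactly when the student matches the teacher's distribution, which is the shape of property a Kani harness can check.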

Full Success (Phase 8 complete)

  • All 6 model variants benchmarked (base → distill → instruct → merged → pruned → q4)
  • Benchmark trajectory published showing improvement at each stage
  • Submitted to Big Code Models Leaderboard — first sub-1B model on the board
  • Q4 model: <50ms/token on CPU, <10ms/token on GPU (code completion latency)
  • Critical path gaps (ALB-001, 006, 009, 011, 018) closed with upstream fixes; ALB-010 (Qwen3.5-35B-A3B MoE inference) PR #133 MERGED, weight loading remaining
  • Models published on HuggingFace as paiml/albor-python-*
  • Q4 quantized model < 100MB, runs on consumer hardware
  • All 8 kernel contracts written and verified (ALB-013–017, ALB-039–040, ALB-060)
  • batuta falsify: Toyota Standard grade (≥90/108) — ACHIEVED: 100% (108/108 PASS)
  • pmat TDG: Grade A on all touched components
  • Test coverage ≥ 95%, mutation score ≥ 85% on all new code
  • All 9 FALSIFY-ALBOR tests pass
  • Verification DAG published via pv graph

Stretch Goals

  • HumanEval pass@1 > 20% (strong distillation result at 350M)
  • DS-1000 pass@1 > 10% (data science code generation)
  • Editor integration: VS Code / Neovim / Helix extension using realizar as backend
  • Distributed gradient-parallel training across 4090 + W5700X demonstrated (entrenar DDP #133 infra in place)
  • apr pipeline apply reproduces entire ladder from bare metal to published model
  • BabyLM 2026 submission using constrained data variant
  • All critical kernels at Level 4 (Kani formal proofs)
  • Lean 4 theorem stubs generated for core training loop invariants