17. Success Criteria
Minimum Viable (Phase 3 complete)
- 350M base model trained on an RTX 4090 to convergence (target: ~10B tokens; current: 139M-token v2 dataset)
- FIM (fill-in-the-middle) training implemented and validated (ALB-018 FIXED — `alimentar fim` verified)
- HumanEval pass@1 > 8% (baseline Python capability, beats random)
- HumanEval-FIM working (model can infill Python code)
- Entire pipeline uses only sovereign stack components
- All training artifacts reproducible from spec
- All existing kernel contracts pass
- `pv audit` passes (Level 2+)
- `pmat comply check` passes on all modified components
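The FIM criterion above can be illustrated with a minimal sketch of the standard PSM (prefix-suffix-middle) transformation: a random span of the document is cut out and moved to the end, so a causal LM learns to infill it conditioned on both sides. The sentinel strings `<PRE>`/`<SUF>`/`<MID>` are placeholders for this sketch, not necessarily the tokens `alimentar fim` uses.

```python
import random

def fim_transform(doc: str, rng: random.Random) -> str:
    # Pick two cut points; the span between them becomes the "middle"
    # the model must infill.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # PSM ordering: the middle is moved to the end so a causal LM can
    # predict it conditioned on both prefix and suffix.
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

example = fim_transform("def add(a, b):\n    return a + b\n", random.Random(0))
```

Concatenating prefix + middle + suffix always reconstructs the original document, which is what HumanEval-FIM exercises at evaluation time.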
Current blockers for Phase 3 completion:
- ALB-038 (Critical): entrenar saves initialization weights, not trained weights — FIXED (entrenar@91ba9da, @1ede409)
- ALB-035: No per-step loss logging during training — FIXED (entrenar@5d41a96)
- ALB-041: D2D buffer mismatch in backward_attention — FIXED (entrenar@a48e3d2)
- ALB-037: realizar ignores loaded weights — FIXED (e2e verified: `realizar run` loads the 350M trained checkpoint and generates tokens from 218 tensors)
- ALB-043 (Critical): backward_ffn buffer overflow + missing SwiGLU gradients — FIXED (entrenar@f7805f1)
- ALB-044 (Critical): activation gradient clipping + CPU optimizer hyperparams — FIXED (entrenar@86eec38)
- ALB-059 (Critical): GEMM backward constructor n/k swapped — buffer overflow into optimizer states — FIXED (entrenar@846ae0c)
- ALB-040: GPU-resident pretraining — VERIFIED (350M CUDA test: 50 steps, loss 10.39→5.92, checkpoint valid, realizar inference works)
- ALB-042: CUDA runtime errors produce silent loss=0.0 — OPEN (workaround: `CUDA_VISIBLE_DEVICES=""`)
- ALB-069 (Critical): PTX selp_f32 argument order in fused cross-entropy — FIXED (trueno@10bec89)
- ALB-060 (Critical): Training ran only 43/5000 steps (epochs=1) — CONFIG FIXED (C-TRAINCFG-001 contract + v2 config); v2 training (ALB-063) restarted after the ALB-069 fix — PID 106929, loss=10.39 at step 1
350M CUDA test results (50 steps, post ALB-059 fix):
- Loss: 10.39 → 5.92 (best: 5.53) — clear convergence with correct GEMM backward
- Training time: ~400s (~8s/step) with PTX; ~26s (~0.5s/step) with cuBLAS (ALB-075/077)
- Checkpoint: 1.59 GB SafeTensors, 218 tensors, config.json saved
- Checkpoint validation: PASS (weights trained, layers distinct)
- realizar inference: loads model, generates tokens (gibberish at 50 steps — expected)
- Perplexity: 31,926 (finite; random baseline ~32,768 for vocab 32K)
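The perplexity figures follow directly from cross-entropy loss: a uniform model over a 32K vocabulary has loss ln(32768) ≈ 10.40, which matches the observed starting loss, and perplexity exactly 32,768, which is the random baseline quoted above. A quick check:

```python
import math

vocab = 32_768
# A uniform model assigns probability 1/vocab to each token, so its
# cross-entropy loss is ln(vocab) and its perplexity is vocab itself.
random_loss = math.log(vocab)       # ≈ 10.397, the starting loss above
random_ppl = math.exp(random_loss)  # 32768.0, the random baseline

# Training loss 5.92 at step 50 corresponds to a far lower perplexity;
# the reported 31,926 (from the checkpoint evaluation) sits just below
# the random baseline.
ppl_at_step_50 = math.exp(5.92)     # ≈ 372
```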
350M v3 training (250K steps, codeparrot-clean, ALB-077 fix) — STOPPED:
- Final: step 28K, loss=6.43, val_ppl=1018, 6.7K tok/s, 19.3% MFU
- Plateau since step 12K — val_ppl stalled at ~1000, gnorm collapsed 3.0→0.13
- Root cause: ALB-079 (constant lr after warmup, no cosine decay) + ALB-080 (4K tokens/step, 48-128x too small)
- Checkpoints: step 1K-28K (1520 MB each, all verified OK)
- No NaN in 28K steps (ALB-077: tensor cores disabled, CUBLAS_DEFAULT_MATH)
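The ALB-080 numbers check out arithmetically, assuming (as a sketch) one 4,096-token sequence per optimizer step in v3; the v4 fix multiplies that by gradient accumulation:

```python
seq_len = 4096                 # assumption: one 4K-token sequence per step in v3
v3_tokens_per_step = seq_len   # the "4K tokens/step" cited as the root cause
grad_accum = 32                # the v4 fix (ALB-080)
v4_tokens_per_step = v3_tokens_per_step * grad_accum  # 131,072, the 131K quoted

# The "48-128x too small" range corresponds to common LM batch sizes of
# roughly 0.2M-0.5M tokens per optimizer step:
low, high = 48 * seq_len, 128 * seq_len  # 196,608 to 524,288 tokens
```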
350M v4 training (ALB-079 + ALB-080 fixes) — RESUMED from step 500:
- Fixes: cosine LR decay (entrenar PR #241) + gradient_accumulation=32 (131K tokens/step)
- Original run: 500 steps, val_ppl=1032.7 (matched v3 at 57% token budget)
- System reboot at step 553; resumed from step-500 checkpoint
- Extended resume: step 350 (cum. step 850), best loss=5.69 at step 262
- 111M tokens processed (2.1% of 5.3B available); loss plateau at mean ~6.65
- Cosine decay just engaging (lr 3.00e-4→2.98e-4); expect plateau break at step 1000+
- ZClip catching gradient spikes (z=2.0–4.0), gnorm healthy 0.05–0.32
- Throughput: 3,564–3,569 tok/s steady, 10.3% MFU, 14-16 GB / 24 GB VRAM
- Target: val_ppl < 100 by 1B tokens (~60 hours remaining)
- Same hardware (RTX 4090), same data (codeparrot-clean, 5.3B tokens available)
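A minimal sketch of linear warmup followed by cosine decay, the shape of the ALB-079 fix: only base_lr = 3e-4 is taken from the log above; the warmup length, floor, and horizon are illustrative placeholders, which is why decay is barely visible just after warmup (consistent with "cosine decay just engaging").

```python
import math

def lr_at(step, base_lr=3e-4, min_lr=3e-5, warmup=500, total=250_000):
    # Hypothetical hyperparameters; only base_lr is stated in the log.
    if step < warmup:
        return base_lr * step / warmup                # linear warmup
    progress = (step - warmup) / (total - warmup)     # 0 → 1 over the decay phase
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The schedule peaks at exactly base_lr when warmup ends and anneals to min_lr at the horizon, so early decay is slow and the bulk of the reduction lands in the middle of training.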
Good (Phase 5 complete)
- Distillation from Qwen3.5-35B-A3B demonstrated (ALB-010); fallback: Qwen2.5-Coder-3B (dense)
- albor-distill-350m outperforms albor-base-350m on all code benchmarks
- HumanEval pass@1 > 15% (beat CodeGen-350M-mono’s 12.8% via distillation from 35B MoE teacher)
- MBPP pass@1 > 12%
- FIM infill working (qualitatively: model can complete Python between prefix and suffix)
- KD contract at Level 4 (Kani-proved KL non-negativity)
- All FALSIFY-ALBOR tests pass (001-006)
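The KL non-negativity that the Level 4 KD contract proves (Gibbs' inequality) can be sanity-checked numerically; the actual proof lives in Kani over the Rust kernel, so this Python sketch is only an illustration:

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def kl(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); non-negative by
    # Gibbs' inequality, and zero iff p == q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

rng = random.Random(0)
teacher = softmax([rng.gauss(0, 1) for _ in range(32)])
student = softmax([rng.gauss(0, 1) for _ in range(32)])
d = kl(teacher, student)  # the distillation loss term; always >= 0
```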
Full Success (Phase 8 complete)
- All 6 model variants benchmarked (base → distill → instruct → merged → pruned → q4)
- Benchmark trajectory published showing improvement at each stage
- Submitted to Big Code Models Leaderboard — first sub-1B model on the board
- Q4 model: <50ms/token on CPU, <10ms/token on GPU (code completion latency)
- Critical path gaps (ALB-001, 006, 009, 011, 018) closed with upstream fixes; ALB-010 (Qwen3.5-35B-A3B MoE inference) PR #133 MERGED, weight loading remaining
- Models published on HuggingFace as `paiml/albor-python-*`
- Q4 quantized model < 100MB, runs on consumer hardware
- All 8 kernel contracts written and verified (ALB-013–017, ALB-039–040, ALB-060)
- batuta falsify: Toyota Standard grade (≥90/108) — ACHIEVED: 100% (108/108 PASS)
- pmat TDG: Grade A on all touched components
- Test coverage ≥ 95%, mutation score ≥ 85% on all new code
- All 9 FALSIFY-ALBOR tests pass
- Verification DAG published via `pv graph`
Stretch Goals
- HumanEval pass@1 > 20% (strong distillation result at 350M)
- DS-1000 pass@1 > 10% (data science code generation)
- Editor integration: VS Code / Neovim / Helix extension using realizar as backend
- Distributed gradient-parallel training across 4090 + W5700X demonstrated (entrenar DDP #133 infra in place)
- `apr pipeline apply` reproduces entire ladder from bare metal to published model
- BabyLM 2026 submission using constrained data variant
- All critical kernels at Level 4 (Kani formal proofs)
- Lean 4 theorem stubs generated for core training loop invariants
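A Lean 4 stub for a training-loop invariant might look like the following sketch; all names here are illustrative placeholders, not identifiers from the repo, and the bound mirrors the LR-schedule invariant rather than any proved result:

```lean
-- Hypothetical stub: the learning-rate schedule never leaves
-- [minLr, baseLr]. `lr`, `baseLr`, `minLr` are placeholders.
variable (lr : Nat → Float) (baseLr minLr : Float)

theorem lr_bounded (step : Nat) :
    minLr ≤ lr step ∧ lr step ≤ baseLr := by
  sorry
```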