17. Success Criteria
Minimum Viable (Phase 3 complete)
- 350M base model trained on an RTX 4090 to convergence (target: ~10B tokens; current: 139M-token v2 dataset)
- FIM (fill-in-the-middle) training implemented and validated (ALB-018 FIXED — `alimentar fim` verified)
- HumanEval pass@1 > 8% (baseline Python capability, beats random)
- HumanEval-FIM working (model can infill Python code)
- Entire pipeline uses only sovereign stack components
- All training artifacts reproducible from spec
- All existing kernel contracts pass
- `pv audit` passes (Level 2+)
- `pmat comply check` passes on all modified components
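The FIM criterion above can be illustrated with a minimal sketch of the standard PSM (prefix-suffix-middle) transformation: a random span of the document is cut out and moved to the end, so a causal LM learns to infill it conditioned on both sides. The sentinel strings `<PRE>`/`<SUF>`/`<MID>` are placeholders for this sketch, not necessarily the tokens `alimentar fim` uses.

```python
import random

def fim_transform(doc: str, rng: random.Random) -> str:
    # Pick two cut points; the span between them becomes the "middle"
    # the model must infill.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # PSM ordering: the middle is moved to the end so a causal LM can
    # predict it conditioned on both prefix and suffix.
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

example = fim_transform("def add(a, b):\n    return a + b\n", random.Random(0))
```

Concatenating prefix + middle + suffix always reconstructs the original document, which is what HumanEval-FIM exercises at evaluation time.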
Current blockers for Phase 3 completion:
- ALB-038 (Critical): entrenar saves initialization weights, not trained weights — FIXED (entrenar@91ba9da, @1ede409)
- ALB-035: No per-step loss logging during training — FIXED (entrenar@5d41a96)
- ALB-041: D2D buffer mismatch in backward_attention — FIXED (entrenar@a48e3d2)
- ALB-037: realizar ignores loaded weights — FIXED (e2e verified: `realizar run` loads the 350M trained checkpoint and generates tokens from 218 tensors)
- ALB-043 (Critical): backward_ffn buffer overflow + missing SwiGLU gradients — FIXED (entrenar@f7805f1)
- ALB-044 (Critical): activation gradient clipping + CPU optimizer hyperparams — FIXED (entrenar@86eec38)
- ALB-059 (Critical): GEMM backward constructor n/k swapped — buffer overflow into optimizer states — FIXED (entrenar@846ae0c)
- ALB-040: GPU-resident pretraining — VERIFIED (350M CUDA test: 50 steps, loss 10.39→5.92, checkpoint valid, realizar inference works)
- ALB-042: CUDA runtime errors produce silent loss=0.0 — OPEN (workaround: `CUDA_VISIBLE_DEVICES=""`)
- ALB-069 (Critical): PTX selp_f32 argument order in fused cross-entropy — FIXED (trueno@10bec89)
- ALB-060 (Critical): Training ran only 43/5000 steps (epochs=1) — CONFIG FIXED (C-TRAINCFG-001 contract + v2 config); v2 training (ALB-063) restarted after the ALB-069 fix — PID 106929, loss=10.39 at step 1
350M CUDA test results (50 steps, post ALB-059 fix):
- Loss: 10.39 → 5.92 (best: 5.53) — clear convergence with correct GEMM backward
- Training time: ~400s (~8s/step) with PTX; ~26s (~0.5s/step) with cuBLAS (ALB-075/077)
- Checkpoint: 1.59 GB SafeTensors, 218 tensors, config.json saved
- Checkpoint validation: PASS (weights trained, layers distinct)
- realizar inference: loads model, generates tokens (gibberish at 50 steps — expected)
- Perplexity: 31,926 (finite; random baseline ~32,768 for vocab 32K)
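The perplexity figures follow directly from cross-entropy loss: a uniform model over a 32K vocabulary has loss ln(32768) ≈ 10.40, which matches the observed starting loss, and perplexity exactly 32,768, which is the random baseline quoted above. A quick check:

```python
import math

vocab = 32_768
# A uniform model assigns probability 1/vocab to each token, so its
# cross-entropy loss is ln(vocab) and its perplexity is vocab itself.
random_loss = math.log(vocab)       # ≈ 10.397, the starting loss above
random_ppl = math.exp(random_loss)  # 32768.0, the random baseline

# Training loss 5.92 at step 50 corresponds to a far lower perplexity;
# the reported 31,926 (from the checkpoint evaluation) sits just below
# the random baseline.
ppl_at_step_50 = math.exp(5.92)     # ≈ 372
```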
350M v3 training (250K steps, codeparrot-clean, ALB-077 fix) — STOPPED:
- Final: step 28K, loss=6.43, val_ppl=1018, 6.7K tok/s, 19.3% MFU
- Plateau since step 12K — val_ppl stalled at ~1000, gnorm collapsed 3.0→0.13
- Root cause: ALB-079 (constant lr after warmup, no cosine decay) + ALB-080 (4K tokens/step, 48-128x too small)
- Checkpoints: step 1K-28K (1520 MB each, all verified OK)
- No NaN in 28K steps (ALB-077: tensor cores disabled, CUBLAS_DEFAULT_MATH)
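The ALB-080 numbers check out arithmetically, assuming (as a sketch) one 4,096-token sequence per optimizer step in v3; the v4 fix multiplies that by gradient accumulation:

```python
seq_len = 4096                 # assumption: one 4K-token sequence per step in v3
v3_tokens_per_step = seq_len   # the "4K tokens/step" cited as the root cause
grad_accum = 32                # the v4 fix (ALB-080)
v4_tokens_per_step = v3_tokens_per_step * grad_accum  # 131,072, the 131K quoted

# The "48-128x too small" range corresponds to common LM batch sizes of
# roughly 0.2M-0.5M tokens per optimizer step:
low, high = 48 * seq_len, 128 * seq_len  # 196,608 to 524,288 tokens
```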
350M v4 training (ALB-079 + ALB-080 fixes) — RESUMED from step 500:
- Fixes: cosine LR decay (entrenar PR #241) + gradient_accumulation=32 (131K tokens/step)
- Original run: 500 steps, val_ppl=1032.7 (matched v3 at 57% token budget)
- System reboot at step 553; resumed from step-500 checkpoint
- Extended resume: step 350 (cum. step 850), best loss=5.69 at step 262
- 111M tokens processed (2.1% of 5.3B available); loss plateau at mean ~6.65
- Cosine decay just engaging (lr 3.00e-4→2.98e-4); expect plateau break at step 1000+
- ZClip catching gradient spikes (z=2.0–4.0), gnorm healthy 0.05–0.32
- Throughput: 3,564–3,569 tok/s steady, 10.3% MFU, 14-16 GB / 24 GB VRAM
- Target: val_ppl < 100 by 1B tokens (~60 hours remaining)
- Same hardware (RTX 4090), same data (codeparrot-clean, 5.3B tokens available)
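A minimal sketch of linear warmup followed by cosine decay, the shape of the ALB-079 fix: only base_lr = 3e-4 is taken from the log above; the warmup length, floor, and horizon are illustrative placeholders, which is why decay is barely visible just after warmup (consistent with "cosine decay just engaging").

```python
import math

def lr_at(step, base_lr=3e-4, min_lr=3e-5, warmup=500, total=250_000):
    # Hypothetical hyperparameters; only base_lr is stated in the log.
    if step < warmup:
        return base_lr * step / warmup                # linear warmup
    progress = (step - warmup) / (total - warmup)     # 0 → 1 over the decay phase
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The schedule peaks at exactly base_lr when warmup ends and anneals to min_lr at the horizon, so early decay is slow and the bulk of the reduction lands in the middle of training.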
Good (Phase 5 complete)
- Distillation from Qwen3.5-35B-A3B demonstrated (ALB-010); fallback: Qwen2.5-Coder-3B (dense)
- albor-distill-350m outperforms albor-base-350m on all code benchmarks
- HumanEval pass@1 > 15% (beat CodeGen-350M-mono’s 12.8% via distillation from 35B MoE teacher)
- MBPP pass@1 > 12%
- FIM infill working (qualitatively: model can complete Python between prefix and suffix)
- KD contract at Level 4 (Kani-proved KL non-negativity)
- All FALSIFY-ALBOR tests pass (001-006)
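The KL non-negativity that the Level 4 KD contract proves (Gibbs' inequality) can be sanity-checked numerically; the actual proof lives in Kani over the Rust kernel, so this Python sketch is only an illustration:

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def kl(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); non-negative by
    # Gibbs' inequality, and zero iff p == q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

rng = random.Random(0)
teacher = softmax([rng.gauss(0, 1) for _ in range(32)])
student = softmax([rng.gauss(0, 1) for _ in range(32)])
d = kl(teacher, student)  # the distillation loss term; always >= 0
```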
Full Success (Phase 8 complete)
- All 6 model variants benchmarked (base → distill → instruct → merged → pruned → q4)
- Benchmark trajectory published showing improvement at each stage
- Submitted to Big Code Models Leaderboard — first sub-1B model on the board
- Q4 model: <50ms/token on CPU, <10ms/token on GPU (code completion latency)
- Critical path gaps (ALB-001, 006, 009, 011, 018) closed with upstream fixes; ALB-010 (Qwen3.5-35B-A3B MoE inference) PR #133 MERGED, weight loading remaining
- Models published on HuggingFace as `paiml/albor-python-*`
- Q4 quantized model < 100MB, runs on consumer hardware
- All 8 kernel contracts written and verified (ALB-013–017, ALB-039–040, ALB-060)
- batuta falsify: Toyota Standard grade (≥90/108) — ACHIEVED: 100% (108/108 PASS)
- pmat TDG: Grade A on all touched components
- Test coverage ≥ 95%, mutation score ≥ 85% on all new code
- All 9 FALSIFY-ALBOR tests pass
- Verification DAG published via `pv graph`
Stretch Goals
- HumanEval pass@1 > 20% (strong distillation result at 350M)
- DS-1000 pass@1 > 10% (data science code generation)
- Editor integration: VS Code / Neovim / Helix extension using realizar as backend
- Distributed gradient-parallel training across 4090 + W5700X demonstrated (entrenar DDP #133 infra in place)
- `apr pipeline apply` reproduces entire ladder from bare metal to published model
- BabyLM 2026 submission using constrained data variant
- All critical kernels at Level 4 (Kani formal proofs)
- Lean 4 theorem stubs generated for core training loop invariants
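A Lean 4 stub for a training-loop invariant might look like the following sketch; all names here are illustrative placeholders, not identifiers from the repo, and the bound mirrors the LR-schedule invariant rather than any proved result:

```lean
-- Hypothetical stub: the learning-rate schedule never leaves
-- [minLr, baseLr]. `lr`, `baseLr`, `minLr` are placeholders.
variable (lr : Nat → Float) (baseLr minLr : Float)

theorem lr_bounded (step : Nat) :
    minLr ≤ lr step ∧ lr step ≤ baseLr := by
  sorry
```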