15. Implementation Phases

Phase 0: Pipeline Manifest, Contracts & Quality Baseline (Week 1)

Write configs/pipeline/albor.yaml — full pipeline manifest (infra + data + train + eval + publish)
apr pipeline plan — validate entire DAG, estimate resources
apr pipeline apply --target cuda-driver --target vulkan-driver --target data-dir — provision infra
Verify trueno wgpu on W5700X via Vulkan (not Metal — Linux)
Verify trueno CUDA on 4090
Download Qwen3-Coder-Next to intel box, verify it loads in realizar
pmat tdg baseline create on all stack components
pv coverage contracts/ --binding — establish contract coverage baseline
batuta falsify . --critical-only — initial falsification assessment

Phase 1: Data Pipeline + Tokenizer Contract (Week 1-2)

Ingest local ground truth corpora via alimentar import local (fix ALB-019 if needed)
- depyler: examples/ + tdd-book/tests/ (~1,845 files, ~219K lines)
- hf-ground-truth-corpus (~11,928 files)
- jax-ground-truth-corpus (~2,697 files)
- vllm-ground-truth-corpus (~1,118 files)
Ingest local ML framework code (Tier 2, ~53K files)
Download external datasets via alimentar import hf (StarCoder Python, FineWeb-Edu)
Quality validation via alimentar quality check on all sources
Build weighted training mix with 10x upsampling on Tier 1 (fix ALB-020 if needed)
Write bpe-tokenizer-kernel-v1.yaml contract (ALB-014)
pv probar + pv kani on tokenizer contract
Train BPE tokenizer on mixed corpus (fix ALB-001 if needed)
Verify FALSIFY roundtrip: decode(encode(text)) = text for all test data
Tokenize all data into sharded Parquet
Apply FIM transforms to code sequences (fix ALB-018 if needed)
Create train/val/test splits via alimentar
Record SHA-256 hashes + provenance manifest for all data artifacts
pmat comply check --strict on alimentar changes

Phase 2: Pipeline Validation — 50M Model (Week 2) – COMPLETE

Write gradient-accumulation-kernel-v1.yaml contract (ALB-017)
Write configs/train/pretrain-50m.yaml (model arch + training + monitoring)
Train albor-50M on 4090 — 500 rows, 31 steps, 110.7s, loss 10.3→4.42
Validate apr monitor — ALB-025 FIXED (presentar widget migration complete)
Validate Andon alerts during full training run
~~Fix ALB-009~~ FIXED
Verify FALSIFY-ALBOR-001 (loss decreases) — CORROBORATED
Verify FALSIFY-ALBOR-002 (gradient bounds) — per-step logging now available (~~ALB-035~~ FIXED)
pv audit — PASS: 7/7 contracts, 0 findings
Milestone: Training loop converges ✓, contracts pass ✓

Phase 3: Base Model — 350M Pre-Training (Week 2-4) – IN PROGRESS

Write configs/train/pretrain-350m.yaml — pre-tokenized ByteLevel BPE v2, 22K×2048 tokens
Train albor-base-350m on 4090 — STARTED (2760 batches, ~20h est.)
Build evaluation infrastructure — eval-code.py, eval-perplexity.py, 35 benchmark problems
~~Fix ALB-038~~ FIXED — RMSNorm + attention backward ops, all 20 params receive gradients
~~Fix ALB-041~~ FIXED — D2D buffer size mismatch in backward_attention (entrenar@a48e3d2)
~~Fix ALB-043~~ FIXED — backward_ffn buffer overflow + SwiGLU gradients (entrenar@f7805f1)
~~Fix ALB-044~~ FIXED — activation gradient clipping at GPU-CPU boundary + CPU optimizer hyperparams (entrenar@86eec38)
~~Fix ALB-059~~ FIXED — GEMM backward constructor args n/k swapped, buffer overflow into optimizer states + zero-init optimizer m/v (entrenar@846ae0c)
Write training-memory-kernel-v1.yaml contract (ALB-039) — VRAM budget estimation
Write training-gpu-kernel-v1.yaml contract (ALB-040) — GPU-resident training invariants
Implement CudaTransformerTrainer (ALB-040) — 3 PCIe transfers/step vs ~16K
Dogfood CUDA training — 50M test: 3 steps, loss 10.4→11.7, GPU forward+backward working
~~ALB-037~~ FIXED — realizar loads trained SafeTensors checkpoint, generates tokens (e2e verified)
350M CUDA test training — 50 steps, loss 10.39→5.92 (best 5.53), checkpoint valid
realizar inference verified — 218 tensors loaded, generates from trained weights
Checkpoint validation: PASS (weights trained, not initialization)
Perplexity eval: 31,926 (finite, consistent with 50-step model — random baseline ~32,768)
~~Fix ALB-060~~ CONFIG FIXED — epochs=1 only ran 43/5000 steps. C-TRAINCFG-001 contract written. Config fixed (v1: epochs=117, v2: epochs=1 with 68K seqs)
Expand training data: Tier 1 10x + 8 Tier 2 repos → v2 dataset (67,977 seqs, 139M tokens)
~~Fix ALB-071~~ FIXED — embed gradient clipping decoupled from weight grad_clip (entrenar@d07d67d)
~~Fix ALB-072~~ FIXED — fp16 loss scaling (65536x) removed from fused CE kernel; all backward uses f32, no underflow risk (entrenar@44d3e74)
Full 350M v2 training — reached step 1183/5000, loss 10.40→6.85, val_ppl=1008. Crashed: ALB-073 (PTX selp) + ALB-074 (buffer overflow from stale binary). Step 1000 checkpoint saved (1520 MB).
~~Fix ALB-073~~ FIXED — fused_cross_entropy selp arg order, same class as ALB-069 (trueno@10bec89)
~~Fix ALB-074~~ FIXED — stale binary missed eval truncation fix. Rebuilt with entrenar@5c4c2d8.
Monitor training via apr monitor (ALB-025 FIXED)
Data scaling: Download codeparrot-clean (2M files, ~4.4B tokens) → pretokenize at 1024 → ~5.2M sequences
Full 350M v3 training — PENDING: 250K steps on ~1B tokens from codeparrot-clean. Config: pretrain-350m-v3.yaml. ETA ~10 days.
Validate loss curve, perplexity convergence
HumanEval pass@1 evaluation (target >8%)
Verify FALSIFY-ALBOR-003 (checkpoint determinism)
pmat tdg check-regression on all touched components
Milestone: HumanEval pass@1 > 8%, Perplexity < 30, TDG grade A maintained

Phase 4: Teacher Setup & Logit Pre-Computation (Week 3-5)

Fix ALB-010: Add Qwen3-Coder-Next support to realizar (stretch — 3-4 week blocker)
Download Qwen2.5-Coder-3B interim teacher (5.75 GiB, Apache 2.0) — unblocks distillation without ALB-010
Validate 3B teacher: apr distill --stage precompute works, RosettaStone handles sharded SafeTensors
Create distillation config: configs/train/distill-qwen3b.yaml (T=4.0, α=0.5, LoRA r=16)
Validate teacher inference on intel (CPU, fp16, 300GB RAM) — for 80B stretch goal
Write knowledge-distillation-kernel-v1.yaml contract (ALB-013) — DOGFOODING
pv kani on KD loss contract (KL non-negativity, temperature scaling)
~~Fix ALB-011~~ FIXED — apr distill --config --stage precompute|train works
Pre-compute 3B teacher logits on v2 dataset (background, 4-8h CPU)
Verify FALSIFY-ALBOR-006 (teacher logit integrity)
Store as sharded Parquet via alimentar
pmat comply check --strict on realizar changes
Milestone: Teacher logits verified, KD contract at Level 4

Phase 5: Knowledge Distillation (Week 5-6)

Implement apr distill apply with KD loss
Distill albor-base-350m → albor-distill-350m
Verify FALSIFY-ALBOR-004 (KL non-negativity in production)
Verify FALSIFY-ALBOR-005 (distillation improves benchmarks)
Benchmark: measure improvement over base
pv probar --binding on KD contract with actual training data
Milestone: >5% avg benchmark improvement, KD contract fully wired

Phase 6: Post-Training Optimization (Week 6-8)

Write model-merging-kernel-v1.yaml contract (ALB-015) — DOGFOODING
Write pruning-kernel-v1.yaml contract (ALB-016) — DOGFOODING
Fine-tune with LoRA: apr finetune → albor-instruct
Merge variants: apr merge --method slerp → albor-merged
Verify FALSIFY-ALBOR-007 (SLERP interpolation bound)
Prune: apr prune --method wanda → albor-pruned
Verify FALSIFY-ALBOR-008 (sparsity guarantee)
Quantize: apr quantize --method q4_k → albor-q4
Verify FALSIFY-ALBOR-009 (quantization fidelity)
Benchmark every variant
pv coverage contracts/ --binding — final contract coverage report
Milestone: Full ladder complete, all post-training contracts pass

Phase 7: Quality Assurance & Falsification Sweep (Week 8)

batuta falsify . --min-grade toyota-standard --verbose — full 108-item assessment
pmat rust-project-score --full on all touched components
pmat tdg check-regression --baseline — no quality regressions
pv graph contracts/ --format mermaid — publish verification DAG
pv status contracts/ — all contracts at Level 3+, critical at Level 4
cargo mutants --no-times on all new code — mutation score ≥ 85%
cargo llvm-cov — coverage ≥ 95% on all new code
Address any falsification failures or contract violations
Milestone: Toyota Standard grade, all quality gates green

Phase 8: Evaluation, Leaderboard Submission & Publication (Week 8-9)

Final eval on all benchmark tasks (all 6 model variants)
Run bigcode-evaluation-harness with leaderboard-standard params on best model
Submit PR to Big Code Models Leaderboard (community_results/ folder)
Export all models: SafeTensors + GGUF
apr publish to HuggingFace Hub as paiml/albor-*
Write model card with full reproducibility details + leaderboard results
Publish training logs, loss curves, eval trajectories
Publish verification report (contract status, falsification results)
batuta falsify . --format markdown --output docs/falsification-report.md
Milestone: Models on HuggingFace, leaderboard submission live, quality evidence published

Phase 9: Distributed Training — Stretch (Week 9+)

entrenar native DDP infrastructure (TCP wire protocol v2, GradientServer, WorkerClient, PerBlockGradientAccumulator, RingAllReduce) — entrenar#133
Wire DDP train_batch() into DistributedCudaTrainer — COMPLETE (train_loop_cuda_distributed, allreduce_impl, spawn_coordinator_thread)
Multi-process launcher — COMPLETE (rank 0 auto-spawns GradientServer, all ranks connect as WorkerClient via --distributed CLI flags)
wgpu backward pass in trueno (ALB-005) — for cross-vendor GPU support
Full distributed training: 4090 + W5700X x2
Milestone: Multi-GPU training demonstrated

Keyboard shortcuts

Albor LLM Specification