Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

1. Objectives

1.1 Primary Goal

Train, distill, and optimize a 350M-parameter decoder-only transformer using exclusively the Sovereign AI stack:

  • apr for training, distillation, merging, pruning, quantization, eval, export
  • alimentar for data loading and preprocessing
  • forjar for pipeline orchestration (DAG engine, multi-machine, state tracking)
  • bashrs (Rash) for shell fragment validation in pipeline task resources
  • repartir for distributed compute
  • entrenar for the training engine (autograd, optimizers, checkpointing)
  • trueno for SIMD/GPU tensor operations
  • realizar for inference (teacher model, eval, serving)
  • presentar for training visualization (TUI dashboards, experiment browser, WASM)
  • batuta for orchestration, stack coordination, and falsification
  • pv (provable-contracts) for design-by-contract verification of every kernel
  • pmat for TDG scoring, compliance, fault pattern analysis, and coverage gaps
  • certeza for three-tier test effectiveness (unit → property → formal)

1.2 Secondary Goal (Stack Validation)

Identify every implementation gap that blocks the primary goal. Fix each gap in the correct upstream component. The model is the proof; the stack improvements are the lasting value.

1.3 Multi-Stage Improvement Ladder

The model is not a single training run — it is iteratively improved through every post-training technique available in apr. Each stage exercises a different part of the stack, produces a benchmarked checkpoint, and may reveal new gaps.

Stage 1: Pre-train base model         → albor-base
Stage 2: Distill from Qwen3-Coder-Next → albor-distill
Stage 3: Instruction fine-tune (LoRA)  → albor-instruct
Stage 4: Merge with complementary model → albor-merged
Stage 5: Prune for efficiency          → albor-pruned
Stage 6: Quantize for deployment       → albor-q4

1.4 Target Use Cases

Primary: Sovereign Code Assist

A tiny, fast, zero-dependency code completion model that runs entirely locally. No API calls, no Python runtime, no telemetry, no cloud. Distillation from Qwen3-Coder-Next gives it coding capability far above what 350M parameters normally achieve.

CapabilityDescription
Python code completionLeft-to-right next-token prediction in .py files
Fill-in-the-middle (FIM)Insert Python code between existing prefix and suffix (PSM/SPM)
Single-line infillComplete the current line given surrounding context
Multi-line body generationGenerate function bodies, loop contents, comprehensions, decorators
On-device inferenceRuns on laptops, Raspberry Pi, browsers (WASM via trueno)
Latency target<50ms per token on CPU (Q4), <10ms on GPU

Language: Python only. Following the phi-1 playbook — maximum concentration on a single language produces dramatically better results at small param counts than spreading tokens across many languages. A 350M model that completes Python well is more useful than a 350M model that completes 10 languages poorly.

What Albor is NOT: It is not a chat model, not an instruction follower, not a reasoning engine, not a polyglot code model. It is a fast, local Python code completion kernel — the kind of model that lives inside an editor extension and fires on every keystroke.

Secondary: Stack Demonstration & Teaching Artifact

The model exists equally to prove the Sovereign AI stack can train, distill, optimize, and serve an LLM end-to-end in pure Rust. The HuggingFace model card is a tour of the stack. The reproducibility protocol means anyone can retrain from scratch using only apr commands.

AudienceWhat They Get
DevelopersA code completion model they can self-host with zero dependencies
ResearchersA fully reproducible training recipe with provable quality contracts
Stack usersProof that aprender/entrenar/trueno/realizar handle real LLM workloads
EducatorsA case study in first-principles LLM training (data → deploy in Rust)

1.5 What Albor Builds

Albor is a project repo, not a library. It contains no production Rust code. All Rust changes happen upstream in the sovereign stack components. Albor drives the upstream work, validates it end-to-end, and produces the model.

1.5.1 What Lives in Albor (This Repo)

albor/
├── docs/
│   ├── specifications/albor-llm-spec.md    # This spec
│   ├── model-card.md                       # HuggingFace model card
│   └── falsification-report.md             # batuta falsify output
├── configs/
│   ├── train/
│   │   ├── pretrain-50m.yaml              # 50M: model arch + training (pipeline validation)
│   │   ├── pretrain-125m.yaml             # 125M: model arch + training (intermediate)
│   │   ├── pretrain-350m.yaml             # 350M: model arch + training (final)
│   │   ├── distill.yaml                   # Distillation config
│   │   └── finetune-lora.yaml             # LoRA fine-tuning config
│   ├── pipeline/
│   │   └── albor.yaml                      # THE manifest: infra + data + train + eval + publish
│   ├── dashboard/
│   │   └── albor-dashboard.yaml            # presentar dashboard (TUI + WASM)
│   └── data-mix.yaml                       # Data source weights + upsampling
├── contracts/
│   ├── knowledge-distillation-kernel-v1.yaml  # ALB-013
│   ├── bpe-tokenizer-kernel-v1.yaml           # ALB-014
│   ├── model-merging-kernel-v1.yaml           # ALB-015
│   ├── pruning-kernel-v1.yaml                 # ALB-016
│   └── gradient-accumulation-kernel-v1.yaml   # ALB-017
├── tests/
│   ├── falsify/                            # FALSIFY-ALBOR-001 through 009
│   ├── integration/                        # End-to-end pipeline tests
│   └── smoke/                              # Quick sanity checks (50M model)
├── state/                                  # (gitignored) forjar state + locks
│   ├── lambda/state.lock.yaml              # Per-machine resource state
│   ├── intel/state.lock.yaml
│   └── forjar.lock.yaml                    # Global pipeline state
├── data/                                   # (gitignored) Training data
├── checkpoints/                            # (gitignored) Model checkpoints
└── eval/                                   # (gitignored) Evaluation results

1.5.2 apr as Unified Entry Point

apr is the single CLI for all model operations. It delegates to sibling projects (entrenar, alimentar, realizar, etc.) under the hood. If a subcommand doesn’t exist yet, we file a GitHub issue, implement it in the correct upstream repo, wire it into apr, dogfood it in albor, and close the issue.

Design Principle: Plan/Apply Everywhere

Every apr subcommand that touches data, compute, or infrastructure follows a plan/apply contract inspired by Terraform and forjar:

plan   → Validate inputs, estimate cost, show what WILL happen. No side effects.
apply  → Execute the plan. Mutates state (files, models, infrastructure).

This is not optional. It is the unifying design principle of the CLI. Every expensive operation gets a free dry-run. Every destructive operation shows you what it will do before it does it. Users never commit GPU hours, disk space, or network bandwidth without seeing the plan first.

The contract:

  1. apr <cmd> plan <config> — Parse config, validate paths, estimate resources (VRAM, disk, time, tokens), print a human-readable execution plan. Exit 0 if valid, exit 1 with diagnostics if not. No GPU, no writes, no network.
  2. apr <cmd> apply <config> — Execute. Reads the same config, does the work. Can be interrupted and resumed.
  3. apr <cmd> validate <config> — Alias for plan with --strict schema-only checking (no resource estimation). Fast enough for CI.

Why this matters for albor: Training a 350M model for 7 days on a 4090 is not something you retry casually. A config typo caught at plan time saves days. A VRAM overestimate caught at plan time prevents OOM crashes at step 15,000. Plan/apply turns “hope it works” into “prove it will work, then run it.”

Dispatch Table
apr <subcommand>
├── pipeline plan/apply      → forjar DAG engine (THE entry point — runs everything)
├── tokenize plan/apply      → aprender BPE tokenizer
├── train plan/apply         → entrenar TransformerTrainer
├── distill plan/apply       → entrenar + realizar (precompute + student training)
├── finetune plan/apply      → entrenar LoRA/QLoRA
├── eval plan/apply          → aprender eval harness
├── merge plan/apply         → entrenar SLERP/TIES/DARE
├── prune plan/apply         → entrenar WANDA/magnitude
├── quantize plan/apply      → entrenar Q4/Q8
├── export plan/apply        → entrenar SafeTensors/GGUF
├── publish plan/apply       → entrenar HuggingFace Hub
├── bench plan/apply         → realizar latency benchmarks
├── provision plan/apply     → forjar infrastructure convergence
├── experiment view/export   → presentar TUI + entrenar SQLite
└── monitor                  → presentar live TUI (reads training_state.json)

apr pipeline is the top-level command. It reads a single YAML manifest that describes infrastructure resources AND training tasks in one DAG. Forjar’s engine resolves dependencies (Kahn’s toposort), tracks state (BLAKE3 hashes), and dispatches each step — calling back into apr subcommands for ML tasks. Individual subcommands (apr train, apr eval, etc.) still work standalone for development and debugging.

Plan Output Format

Every plan subcommand prints a structured summary:

$ apr train plan configs/train/pretrain-350m.yaml

  Albor Train Plan
  ─────────────────────────────────────────────
  Model:        llama (24L, 1024H, 16A, 4KV)
  Parameters:   354,267,136 (~354M)
  Precision:    fp16 mixed
  ─────────────────────────────────────────────
  VRAM Budget:
    Weights       700 MB
    Optimizer   2,800 MB   (AdamW fp32 m+v)
    Gradients     700 MB
    Activations 9,200 MB   (grad ckpt, batch=8, seq=2048)
    Total      13,400 MB   (55.8% of 24,576 MB)
    Headroom   11,176 MB   ✓
  ─────────────────────────────────────────────
  Data:
    Train shards  data/tokenized/train/ (47 files, 8.2 GB)
    Val shards    data/tokenized/val/   (3 files, 410 MB)
    Tokenizer     models/albor-tokenizer/tokenizer.json ✓
    Vocab match   32,768 = model.vocab_size ✓
  ─────────────────────────────────────────────
  Training:
    Global batch  524,288 tokens (8 × 32 × 2048)
    Total tokens  10,000,000,000 (~10B)
    Total steps   19,073
    Warmup        2,000 steps (10.5%)
    Checkpoints   19 (every 1,000 steps)
    Disk est.     ~13.3 GB (19 × 700 MB)
  ─────────────────────────────────────────────
  Estimated wall time: 5.2 days on RTX 4090
  ─────────────────────────────────────────────
  ✓ Plan valid. Run `apr train apply configs/train/pretrain-350m.yaml` to start.

Forjar already does this (forjar plan -f albor.yaml). Entrenar has the TrainingPlan module (training_plan.rs) that mirrors forjar’s architecture. Albor’s job is to close the loop: every apr subcommand gets plan/apply, and every gap (ALB-XXX) that adds a new subcommand must implement both phases.

What Plan Validates Per Subcommand
SubcommandPlan Checks
tokenizeInput Parquet exists, vocab size valid, output dir writable, estimated time
trainYAML schema, model arch sanity (divisibility, KV ratio), VRAM budget, data paths, tokenizer vocab match, checkpoint disk estimate
distillTeacher model loadable (RAM check), student checkpoint exists, logit output dir writable, temperature/alpha valid
finetuneBase model exists, LoRA rank/alpha valid, dataset format, VRAM with adapters
evalModel checkpoint exists, benchmark tasks recognized, output dir writable
mergeAll input models exist and have compatible architectures, merge method valid
pruneModel exists, sparsity ratio in [0,1], method recognized, output size estimate
quantizeModel exists, target format valid (Q4/Q8), output size estimate
exportModel exists, format valid (SafeTensors/GGUF), output path writable
publishModel + model card exist, HF token valid, repo name available
provisionforjar plan: SSH reachable, packages installable, GPU drivers, disk space

1.5.3 Development Workflow: Issue-Driven Dogfooding

When albor hits a wall — a missing subcommand, a broken feature, a gap in a sibling project — the workflow is:

1. Hit wall       → apr <subcommand> doesn't exist or fails
2. File issue     → GitHub issue on correct repo (aprender, entrenar, alimentar, etc.)
3. Implement      → Fix upstream in the correct component
4. Wire into apr  → Add/update apr subcommand if needed
5. Dogfood        → Run the blocked albor pipeline step
6. Prove          → Tests pass, FALSIFY test passes, pmat comply check
7. Close issue    → Link to albor gap ID (ALB-XXX)

Every ALB-XXX gap in the gap register (§11) maps to a GitHub issue. The gap is not “closed” until the apr subcommand works end-to-end in the albor pipeline.

1.5.4 What Lives Upstream (Other Repos)

Upstream RepoWhat Albor Adds to ItGaps
aprender (apr)pipeline plan/apply, tokenize plan/apply, distill plan/apply, eval plan/apply, train plan/apply, plan/apply contract enforcementALB-001, 006, 009, 011, 023, 028
alimentarimport local, mix with upsampling, FIM transforms, streaming to entrenarALB-007, 018, 019, 020
realizarQwen3-Coder-Next / DeltaNet / MoE architecture supportALB-010
entrenarTraining engine, model merging, pruning, quantization, LoRA, custom YAML model arch, human-readable config valuesALB-003, 004, 021, 022
forjartask resource type for ML pipeline orchestration, DAG engine for apr pipelineALB-027
presentarSQLite experiment viewer, live training TUI, WASM dashboard, apr experiment CLIALB-024, 025, 026
bashrsShell fragment validation for all task resource command: fields(used by ALB-027)
truenowgpu backward pass (stretch)ALB-005
repartirRing all-reduce (stretch), heterogeneous balancingALB-002, 008
provable-contracts5 new kernel contracts (KD, BPE, merging, pruning, grad accum)ALB-013–017

1.5.5 Where Quality Constraints Apply

ConstraintApplies ToNOT To
95% test coverageUpstream Rust code we modify (aprender, entrenar, alimentar, etc.)Albor’s shell scripts and YAML configs
85% mutation scoreUpstream Rust code we modifyAlbor configs
500-line file limitALL files: upstream Rust, albor scripts, YAML configs, contractsGenerated output (eval results, logs)
TDG grade AUpstream Rust code via pmatAlbor shell scripts
Zero clippy warningsUpstream Rust codeN/A
pmat comply checkEach upstream repo after modificationAlbor repo itself
Contract verificationUpstream kernel implementationsAlbor orchestration
FALSIFY-ALBOR testsThe albor pipeline end-to-endIndividual upstream unit tests

The albor repo has no Rust code to cover. Its quality is measured by:

  • Do the configs work? (integration tests)
  • Do the FALSIFY tests pass? (end-to-end validation)
  • Are the contracts complete? (pv status)
  • Does the pipeline reproduce? (deterministic re-run)

1.6 Constraints

  • Zero Python dependencies — Pure Rust from data to deployment
  • Scientifically reproducible — Fixed seeds, versioned data, deterministic training
  • Publicly auditable — All data, code, hyperparameters, and training logs published
  • apr only — Every model operation uses an apr <subcommand>. Missing commands are gaps to implement.
  • Plan/apply everywhere — Every apr subcommand implements plan (dry-run, no side effects) and apply (execute). No GPU time without a passing plan.
  • One manifest, one DAGapr pipeline plan/apply configs/pipeline/albor.yaml orchestrates the entire pipeline. No Makefiles, no shell scripts. Forjar’s DAG engine handles dependency resolution, state tracking, multi-machine dispatch, and resumability.
  • bashrs linted — All shell fragments in forjar task resources are validated by bashrs (Rash). No unvalidated shell.
  • No file over 500 lines — Applies to all code, scripts, configs, and contracts (not docs/specs)
  • Provably correct — Every kernel has a YAML contract with falsification tests and Kani proofs
  • pmat compliant — Upstream changes: TDG grade A, 95% coverage, 85% mutation score, zero SATD
  • Falsifiable — Every claim in this spec has a concrete test that could disprove it

1.7 Sovereign Stack vs. Standard ML Stack

Most LLM training stacks depend on a deep tower of NVIDIA and Python libraries:

Standard ML Stack              Sovereign Stack (albor)
─────────────────              ──────────────────────
Python                         Rust (no Python runtime)
PyTorch / JAX                  entrenar (training engine)
cuDNN                          trueno PTX kernels + cuBLAS FFI
NCCL                           (not needed — single GPU)
torch.distributed              repartir (stretch goal)
Weights & Biases               presentar + renacer tracing
HuggingFace Transformers       realizar (inference)

What each replaced component does — and why we don’t use it:

ComponentWhat It DoesWhy Albor Doesn’t Use It
PyTorchAutograd, tensor ops, training loopentrenar implements autograd, AdamW, checkpointing in Rust. No Python GIL, no dynamic graph overhead.
cuDNNOptimized GPU kernels for conv, norm, attentiontrueno provides hand-written PTX kernels (RMSNorm, SiLU, softmax, cross-entropy) and cuBLAS FFI for GEMM. Every kernel has a provable contract.
NCCLMulti-GPU collective communication (all-reduce, broadcast, scatter)Albor trains on a single RTX 4090. No multi-GPU communication needed. For future multi-GPU work, repartir would implement ring all-reduce directly.
torch.distributedDistributed training orchestration (DDP, FSDP)Single-GPU training. The model (370M params, ~1.5 GB) fits entirely in 24 GB VRAM with optimizer states.
Weights & BiasesExperiment tracking, dashboardsrenacer provides structured tracing with BrickTracer spans. presentar provides TUI dashboards and WASM visualization.

The GPU interface: The sovereign stack talks to NVIDIA hardware through two interfaces only:

  1. CUDA Driver API (libcuda.so) — Memory allocation, kernel launch, stream management, device queries. This is the lowest stable NVIDIA API. trueno binds it directly via Rust FFI — no CUDA Runtime API (libcudart) dependency.

  2. cuBLAS (libcublas.so) — Matrix multiplication (GEMM). The only NVIDIA library used for compute. trueno wraps it with a safe Rust API (CublasHandle, CublasGemm) that enforces correct argument order at the type level. cuBLAS replaced hand-written PTX GEMMs in ALB-075, improving throughput from 890 tok/s to 6,700 tok/s (7.5x).

What this means in practice: The entire training binary is a single statically-linked Rust executable (~15 MB). It has no Python interpreter, no pip packages, no conda environment, no Docker container, no version conflicts between PyTorch and CUDA toolkit. cargo build --release produces a binary that runs training. The only runtime dependencies are libcuda.so (NVIDIA driver) and libcublas.so (ships with the driver).