Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Model Card: albor-base-350m

Model Details

FieldValue
Namealbor-base-350m
Version1.0 (base pre-training)
TypeDecoder-only Transformer (Qwen2-style)
Parameters398.5M
Architecturehidden=1024, layers=24, heads=16, kv_heads=4, ffn=4096
Vocab Size32,768 (ByteLevel BPE v2, whitespace-preserving)
Context Length2,048 tokens
Training Datav1: 22,079 seqs (45.2M tokens); v2: 67,977 seqs (139M tokens, Tier 1 10x + 8 Tier 2 repos + 50% FIM)
Training Time~20 hours on RTX 4090 (full run); 396s for 50-step test
Frameworkentrenar + realizar (CUDA, CudaTransformerTrainer)

Intended Use

Base pre-training model. This model learns Python code patterns from pre-tokenized data. It serves as the foundation for:

  1. Knowledge distillation from Qwen3-Coder-Next (Phase 4)
  2. Fine-tuning with LoRA (Phase 6)
  3. Post-training optimization: pruning, merging, quantization (Phase 6)

Training Details

  • Optimizer: AdamW (lr=3e-4, beta1=0.9, beta2=0.95, wd=0.1)
  • Scheduler: Cosine with warmup (v1: 2000 steps; v2: 500 steps per C-TRAINCFG-001)
  • Gradient Accumulation: 128 (effective batch = 4 × 128 × 1024 = 512K tokens)
  • Mixed Precision: fp16
  • Epochs: v1: 117 (22K seqs); v2: 38 (68K seqs) — ALB-060: original epochs=1 was fatal
  • Max Steps: 5,000
  • Loss (50-step test): 10.39 → 5.92 (best 5.53) — convergence verified (post ALB-059 GEMM backward fix)
  • Perplexity (50-step test): ~31,926 (finite; random baseline ~32,768)
  • Loss (full run): TBD — first run failed (ALB-060), retraining with v2 config
  • Perplexity (full run): TBD
  • CUDA Mode: GPU-resident training via CudaTransformerTrainer (ALB-040), 3 PCIe transfers/step

Tokenizer

  • Type: ByteLevel BPE (v2)
  • Vocab: 32,768 tokens
  • Preserves: Whitespace, indentation, newlines (critical for Python)
  • Source: Trained with Python tokenizers library on 100K lines of Python code
  • Location: models/albor-tokenizer-v2/tokenizer.json

FALSIFY Predictions

IDPredictionStatus
FALSIFY-ALBOR-001Loss decreases monotonicallyCORROBORATED (50M: 10.3→4.42; 350M CUDA 50-step: 10.39→5.92)
FALSIFY-ALBOR-002Gradient norms boundedPENDING (per-step logging available via ALB-035)
FALSIFY-ALBOR-003Checkpoint determinismUNTESTED

Evaluation

BenchmarkMetricResult
Training loss (50-step test)cross-entropy10.39 → 5.92 (best 5.53)
Training perplexity (50-step test)exp(loss)~31,926 (finite)
Checkpoint validationweights trained?PASS (layers distinct, not init)
realizar inferenceloads + generates?PASS (218 tensors, 50 tokens generated)
HumanEval (20 problems)pass@1TBD (after full training)
Python intermediate (15 problems)pass@1TBD (after full training)

Limitations

  1. 139M tokens on v2 (typical base models train on 10B+ tokens)
  2. Python-only training data (no multilingual code)
  3. v2 dataset includes 50% FIM (PSM format via alimentar fim)
  4. Checkpoint broken by ALB-038 FIXED — entrenar now saves trained weights correctly
  5. Evaluation blocked by ALB-037 FIXED — realizar loads trained checkpoint, generates tokens

Known Gaps

  • ALB-035 (FIXED): Per-step loss logging via train_epoch_with_callback() (entrenar@5d41a96)
  • ALB-037 (FIXED): realizar now loads trained checkpoint, generates tokens (e2e verified with 350M)
  • ALB-038 (FIXED): Broken autograd in RMSNorm::forward_batched() and MultiHeadAttention::forward(). Fixed in entrenar@91ba9da and entrenar@1ede409. All 20 model parameters now receive gradients.
  • ALB-040 (VERIFIED): GPU-resident pretraining via CudaTransformerTrainer. 3 PCIe transfers/step vs ~16K. 350M CUDA test: 50 steps, loss 10.39→5.92 (best 5.53), checkpoint valid.
  • ALB-060 (FIXED): Training config epochs=1 only ran 43/5000 steps. C-TRAINCFG-001 contract written. v2 config uses epochs=38 with expanded 68K-sequence dataset.
  • ALB-041 (FIXED): D2D buffer size mismatch in backward_attention(). Fixed in entrenar@a48e3d2. Was blocking GPU backward pass.
  • ALB-043 (FIXED): backward_ffn buffer overflow + missing SwiGLU gradients. Fixed in entrenar@f7805f1.
  • ALB-044 (FIXED): Activation gradient clipping at GPU-CPU boundary + CPU optimizer hyperparams (beta2/wd mismatch). Fixed in entrenar@86eec38.
  • ALB-059 (FIXED): GEMM backward constructor args n/k swapped — output stride baked wrong into PTX, rows overflow 64× into adjacent optimizer states (m_w_k, v_w_k). Negative v values → sqrt(neg) = NaN in AdamW. Also zero-initialized all optimizer m/v buffers (cuMemAlloc returns uninitialized VRAM). Fixed in entrenar@846ae0c.

Data Provenance

See docs/PROVENANCE.md for full SHA-256 hashes of all data artifacts.

Checkpoint

  • Test checkpoint: checkpoints/albor-350m-cuda-test/model.safetensors (1.59 GB, 218 tensors)
  • Full checkpoint: checkpoints/albor-base-350m/model.safetensors (TBD — training in progress)
  • Metadata: checkpoints/albor-base-350m/final_model.json
  • Config (test): configs/train/pretrain-350m-cuda-test.yaml
  • Config (full): configs/train/pretrain-350m.yaml