Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

5. Training Data

5.1 Data Philosophy

  • All datasets either locally owned (MIT/Apache 2.0) or publicly available with permissive licenses
  • Local-first: Sovereign ground truth corpora are our highest-quality data — curated, tested, type-annotated, and owned. They are upsampled to punch above their token weight.
  • Exact download URLs, versions, and SHA-256 hashes recorded for all external data
  • Preprocessing pipeline is deterministic (fixed seed, recorded transforms)
  • Quality validated by alimentar quality check

5.2 Data Mix (Target: ~10B tokens)

Current status (2026-03-05): v3 dataset in preparation — 2M Python files from codeparrot-clean (~4.4B tokens raw, ~5.3B pretokenized at seq_len=1024). v2 dataset had only 139M tokens (67,977 sequences × 2048), which is 0.9% of Chinchilla-minimum for 350M params. v3 provides sufficient data for 1B+ token training runs. See §5.4.2 for the v3 pipeline.

Following the phi-1 playbook: maximum concentration on Python. phi-1 proved that a small model (1.3B) with focused data and distillation can hit 50% HumanEval — outperforming models 10x its size trained on diluted multi-language corpora.

Key insight from phi-1: Data quality matters more than quantity at small param counts. A 350M model trained on 1B tokens of textbook-quality code can outperform a 350M model trained on 100B tokens of raw GitHub scrapes. We have ~71K curated Python files locally — this is our unfair advantage.

SourceTokens (est.)WeightLicenseRationale
StarCoder Python subset (HF)~4B40%Apache 2.0Bulk Python code diversity; aligns with Qwen3-Coder teacher
Local ground truth corpora (upsampled 10x)~50-100M raw → ~500M-1B effective10%MITHighest-quality anchor — see §5.2.1
Local ML framework code~200-400M10%MIT / Apache 2.0ML/AI Python patterns — see §5.2.2
FineWeb-Edu (subset)~2B20%ODC-BYEducational web text for docstring understanding
Python textbooks + tutorials (HF)~1B10%Apache 2.0 / CC“Textbooks Are All You Need” — public educational code
Python docs + PEPs + Stack Overflow~1B10%CC BY-SAAPI knowledge, idiomatic patterns

Total: ~10B tokens. Chinchilla-optimal for 350M params is ~7B; we slightly overtrain for benchmark performance (common practice in SmolLM, Phi-1.5).

Python concentration: 80% of training data is Python or Python-adjacent (code, textbooks, docs). The remaining 20% (FineWeb-Edu) provides general language understanding for docstrings, comments, and natural language prompts.

5.2.1 Local Ground Truth Corpora (Tier 1 — Upsampled)

These are our “textbook-quality” data — the phi-1 equivalent. Every file has been curated, tested to 98%+ coverage, and validated by CI. They are upsampled 10x during training because their per-token teaching signal is 10-100x higher than raw GitHub code.

CorpusPathFilesLines (est.)Quality Signal
depyler examples + tdd-book../depyler/examples/, ../depyler/tdd-book/1,845~219KType-annotated, transpiler-validated, 27 stdlib modules, property-tested
hf-ground-truth-corpus../hf-ground-truth-corpus/11,928~500K+98%+ test coverage, zero lint violations, production HF recipes
jax-ground-truth-corpus../jax-ground-truth-corpus/2,697~200K+100% test coverage, full type checking, numerical computing
vllm-ground-truth-corpus../vllm-ground-truth-corpus/1,118~100K+Production inference optimization code
Total17,588~1M+All MIT licensed, all CI-validated

Why upsampling works: phi-1’s “textbook” data was <10% of total tokens but had outsized impact on HumanEval. Our ground truth corpora share the same properties: clean types, complete docstrings, tested correctness, educational structure. The model sees these examples multiple times, reinforcing correct patterns over noisy GitHub code.

depyler corpus is uniquely valuable: Every Python function in the depyler corpus was validated by a transpiler — it has clear types, clean control flow, and provably correct semantics. The tdd-book covers 27 stdlib modules (json, datetime, collections, itertools, os, pathlib, re, etc.) with property-based tests. This teaches the model Python’s standard library idioms at a depth no scraped dataset matches.

5.2.2 Local ML Framework Code (Tier 2)

Large, high-quality Python codebases from our local repos. Not upsampled — used at natural frequency for pattern diversity.

CorpusPathFilesNotes
huggingface-fine-tuning../huggingface-fine-tuning/12,274Fine-tuning recipes and examples
llms-with-huggingface../llms-with-huggingface/13,869LLM integration patterns
HF-Hub-Ecosystem../HF-Hub-Ecosystem/16,978Comprehensive HF Hub code
pytorch../pytorch/4,217ML framework fundamentals
vllm../vllm/2,400Inference serving
databricks-data-engineering../databricks-data-engineering/3,038Data engineering patterns
algorithm-competition-corpus../algorithm-competition-corpus/201Algorithms + data structures
coursera-stats../coursera-stats/430Statistical modeling
cuda-python../cuda-python/161GPU computing
Total53,568All MIT / Apache 2.0

5.2.3 Pre-Built Local Datasets

FilePathFormatSize
hf_gtc_corpus.parquet../hf-ground-truth-corpus/hf_gtc_corpus.parquetParquet2 MB
corpus_manifest_v1.json../depyler/corpus_manifest_v1.jsonJSONTier metadata
corpus_tiers.json../depyler/corpus_tiers.jsonJSONComplexity metrics

5.2.4 Data Sourcing Summary

Local owned data (~71K files, ~1-2M lines):
├── Tier 1: Ground truth corpora (17,588 files) → upsampled 10x
├── Tier 2: ML framework code   (53,568 files) → natural frequency
└── Pre-built: Parquet + JSON manifests

External data (HuggingFace, ~8B tokens):
├── StarCoder Python subset     (~4B tokens)   → bulk diversity
├── FineWeb-Edu                 (~2B tokens)   → general language
├── Python textbooks/tutorials  (~1B tokens)   → educational code
└── Python docs + PEPs + SO     (~1B tokens)   → API knowledge

Sovereign data advantage: 20% of training tokens come from data we own, curate, and can improve. Unlike scraped web data, we know the provenance, license, and quality of every file. If benchmarks reveal weaknesses in specific Python patterns, we can add targeted examples to our ground truth corpora and retrain — a feedback loop no public-dataset-only approach can match.

5.3 Fill-in-the-Middle (FIM) Training

Code completion requires fill-in-the-middle capability, not just left-to-right generation. During training, a fraction of code sequences are transformed using the PSM (Prefix-Suffix-Middle) format:

<fim_prefix>def fibonacci(n):<fim_suffix>    return fib_sequence<fim_middle>
    fib_sequence = [0, 1]
    for i in range(2, n):
        fib_sequence.append(fib_sequence[-1] + fib_sequence[-2])
ParameterValueRationale
FIM rate50% of code sequencesSantaCoder/StarCoder standard
FIM formatPSM (Prefix-Suffix-Middle)Most common, best tooling support
Special tokens<fim_prefix>, <fim_suffix>, <fim_middle>Added to BPE vocabulary
Context splitRandom split point per sequenceUniform distribution over valid positions

Gap ALB-018: FIXEDalimentar fim supports PSM/SPM transforms. Verified: alimentar fim mixed.parquet -o out.parquet --rate 0.5 --format psm --seed 42 produces correct FIM-encoded sequences. Used in v2 data pipeline.

This is critical — without FIM, the model is a text generator, not a code completion engine.

5.4 Data Pipeline

# ── Step 1: Ingest local ground truth corpora (Tier 1 — highest quality) ──
alimentar import local ../depyler/examples/ ../depyler/tdd-book/tests/ \
  --lang python --output ./data/local/depyler.parquet
alimentar import local ../hf-ground-truth-corpus/ \
  --lang python --output ./data/local/hf-gtc.parquet
alimentar import local ../jax-ground-truth-corpus/ \
  --lang python --output ./data/local/jax-gtc.parquet
alimentar import local ../vllm-ground-truth-corpus/ \
  --lang python --output ./data/local/vllm-gtc.parquet

# ── Step 2: Ingest local ML framework code (Tier 2) ──
alimentar import local \
  ../huggingface-fine-tuning/ ../llms-with-huggingface/ ../HF-Hub-Ecosystem/ \
  ../pytorch/ ../vllm/ ../databricks-data-engineering/ \
  ../algorithm-competition-corpus/ ../coursera-stats/ ../cuda-python/ \
  --lang python --output ./data/local/ml-frameworks.parquet

# ── Step 3: Download external data (on intel — 300GB RAM) ──
alimentar import hf bigcode/starcoderdata --lang python --output ./data/starcoder-python/
alimentar import hf HuggingFaceFW/fineweb-edu --output ./data/fineweb-edu/

# ── Step 4: Quality validation ──
alimentar quality check ./data/local/ --profile ml-training
alimentar quality check ./data/starcoder-python/ --profile ml-training
alimentar quality check ./data/fineweb-edu/ --profile ml-training

# ── Step 5: Filter, dedup, shard ──
alimentar filter ./data/starcoder-python/ --lang python --min-tokens 32 --max-tokens 8192 \
  --dedup --output ./data/processed/starcoder-python.parquet
alimentar convert ./data/fineweb-edu/ ./data/processed/fineweb-edu.parquet

# ── Step 6: Build training mix with upsampling ──
alimentar mix \
  --input ./data/processed/starcoder-python.parquet --weight 0.40 \
  --input ./data/local/depyler.parquet --weight 0.025 --upsample 10 \
  --input ./data/local/hf-gtc.parquet --weight 0.025 --upsample 10 \
  --input ./data/local/jax-gtc.parquet --weight 0.025 --upsample 10 \
  --input ./data/local/vllm-gtc.parquet --weight 0.025 --upsample 10 \
  --input ./data/local/ml-frameworks.parquet --weight 0.10 \
  --input ./data/processed/fineweb-edu.parquet --weight 0.20 \
  --input ./data/processed/textbooks.parquet --weight 0.10 \
  --input ./data/processed/python-docs.parquet --weight 0.10 \
  --output ./data/mixed/ \
  --seed 42 --shuffle

# ── Step 7: Record provenance ──
alimentar provenance ./data/mixed/ --output ./data/provenance.json

Gap ALB-019: FIXEDalimentar import local expects data files (CSV/JSON/Parquet), not source code directories. Workaround: scripts/source-to-parquet.py converts Python source repos to Parquet with the Tier 1 schema (file, source, text columns). Used for all Tier 2 imports.

Gap ALB-020: FIXEDalimentar mix supports weighted proportional sampling. Syntax: alimentar mix file1.parquet:10.0 file2.parquet:1.0 -o out.parquet.

5.4.1 Actual Pipeline (v2 Dataset — 2026-03-03)

The pipeline below produced the v2 dataset (139M tokens, 67,977 sequences):

# ── Step 1: Convert Tier 2 repos to Parquet (alimentar can't read source dirs) ──
for repo in pytorch hf-repos mlflow vllm-full tgi algo-corpus cuda-python llms-with-hf; do
    python3 scripts/source-to-parquet.py ~/src/$repo $repo data/parquet/tier2/$repo.parquet
done
# Result: 28,553 Python files across 8 repos

# ── Step 2: Mix Tier 1 (10x) + Tier 2 (1x) ──
alimentar mix \
  data/parquet/depyler/shard_0000.parquet:10.0 \
  data/parquet/hf-ground-truth/shard_0000.parquet:10.0 \
  data/parquet/jax/shard_0000.parquet:10.0 \
  data/parquet/vllm/shard_0000.parquet:10.0 \
  data/parquet/tier2/pytorch.parquet:1.0 \
  data/parquet/tier2/hf-repos.parquet:1.0 \
  data/parquet/tier2/mlflow.parquet:1.0 \
  data/parquet/tier2/vllm-full.parquet:1.0 \
  data/parquet/tier2/tgi.parquet:1.0 \
  data/parquet/tier2/algo-corpus.parquet:1.0 \
  data/parquet/tier2/cuda-python.parquet:1.0 \
  data/parquet/tier2/llms-with-hf.parquet:1.0 \
  -o data/staging/mixed-expanded.parquet --seed 42
# Result: 45,420 mixed rows

# ── Step 3: Apply FIM (50% PSM) ──
alimentar fim data/staging/mixed-expanded.parquet \
  -o data/staging/mixed-expanded-fim.parquet --rate 0.5 --format psm --seed 42
# Result: 45,420 rows with ~50% FIM-encoded

# ── Step 4: Pretokenize into 2048-length sequences ──
python3 scripts/pretokenize.py \
  --input data/staging/mixed-expanded-fim.parquet \
  --tokenizer models/albor-tokenizer-v2/tokenizer.json \
  --seq-len 2048 \
  --output data/pretokenized-2048-v2/train/train.parquet
# Result: 67,977 sequences × 2048 = 139,218,944 tokens (191 MiB)

# Validation set: reuse v1
cp data/pretokenized-2048/val/val.parquet data/pretokenized-2048-v2/val/val.parquet

5.4.2 v3 Dataset Pipeline — codeparrot-clean (2026-03-05)

The v3 dataset scales from 139M to ~5.3B tokens using codeparrot/codeparrot-clean (5M Python files on HuggingFace, no gating). Quality filtered and pretokenized at seq_len=1024 for the 350M model’s max_position_embeddings.

# Step 1: Stream and filter from HuggingFace (2M files, ~8 min)
python3 scripts/download-codeparrot.py \
  --output /mnt/nvme-raid0/albor-data/codeparrot-clean/ \
  --max-rows 2000000
# Filters: skip autogenerated, alpha_frac < 0.25, files > 100KB, < 50 chars
# Result: 2,000,000 files in 20 shards (6.1 GB), ~4.4B raw tokens est.

# Step 2: Pretokenize at seq_len=1024 (streaming shard-by-shard)
python3 scripts/pretokenize.py \
  --input /mnt/nvme-raid0/albor-data/codeparrot-clean/ \
  --tokenizer models/albor-tokenizer-v2/tokenizer.json \
  --seq-len 1024 \
  --output data/pretokenized-1024-v3/train/ \
  --text-column text --shard-output
# Result: ~5.2M sequences × 1024 = ~5.3B tokens in 20 output shards

# Validation set: reuse v1 (814 sequences)

5.5 Tokenizer

Existing capability: aprender::text::tokenize::BpeTokenizer with full train() / encode() / decode() support. entrenar::tokenizer::BPETokenizer provides the training-pipeline integration.

# Plan: validate inputs, estimate vocab training time
apr tokenize plan \
  --input ./data/processed/*.parquet \
  --vocab-size 32768 \
  --algorithm bpe \
  --output ./models/albor-tokenizer/

# Apply: train the tokenizer
apr tokenize apply \
  --input ./data/processed/*.parquet \
  --vocab-size 32768 \
  --algorithm bpe \
  --output ./models/albor-tokenizer/ \
  --seed 42

Gap ALB-001: Verify apr tokenize plan/apply exists as a CLI subcommand. If not, wire aprender::text::tokenize::BpeTokenizer::train() into apr with the plan/apply contract (see §1.5.2).