16. Reproducibility Protocol

Every artifact in the albor pipeline is reproducible from source. This chapter documents the exact commands, seeds, and checksums needed to reproduce the full training pipeline from raw code corpora to a trained model.

16.1 Artifact Tracking

Artifact | How Recorded
-------- | ------------
Random seed | 42 (global); per-component seeds derived from it (sketched below)
Data versions | HuggingFace dataset commit SHAs + local repo git SHAs
Data provenance | docs/PROVENANCE.md (source path, git SHA, file count, token count per source)
Data checksums | SHA-256 of every Parquet shard (recorded in PROVENANCE.md)
Tokenizer v1 | models/albor-tokenizer/ (vocab.json + merges.txt + tokenizer.json)
Tokenizer v2 | models/albor-tokenizer-v2/tokenizer.json (ByteLevel BPE)
Training config | YAML checked into git (configs/train/*.yaml)
Checkpoint hashes | SHA-256 of model.safetensors
Software versions | apr --version, alimentar --version, pv --version
Hardware | nvidia-smi + free -h captured in training logs
Training logs | checkpoints/*/training.log + final_model.json
Eval results | configs/eval/*.jsonl (benchmarks) + eval scripts
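
The exact per-component seed derivation lives in the pipeline code; as a purely hypothetical illustration, components can obtain stable, distinct seeds by hashing their name together with the global seed:

# Hypothetical sketch only: derive stable per-component seeds from the global
# seed 42 by hashing the component name. The actual derivation used by the
# albor pipeline may differ.
import hashlib

GLOBAL_SEED = 42

def component_seed(name: str, global_seed: int = GLOBAL_SEED) -> int:
    digest = hashlib.sha256(f"{global_seed}:{name}".encode()).digest()
    return int.from_bytes(digest[:4], "little")  # 32-bit seed

component_seed("data-mixer")   # same value on every run
component_seed("model-init")   # distinct from other components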

16.2 Full Reproduction Commands

Step 1: Corpus Preparation

v1 pipeline (Tier 1 only, 17K rows):

# Import Tier 1 ground truth corpora
alimentar import local /path/to/depyler -o data/raw/depyler.parquet
alimentar import local /path/to/hf-ground-truth-corpus -o data/raw/hf.parquet
alimentar import local /path/to/jax-ground-truth-corpus -o data/raw/jax.parquet
alimentar import local /path/to/vllm-ground-truth-corpus -o data/raw/vllm.parquet

# Mix training split (weighted sampling)
alimentar mix \
    data/raw/depyler.parquet:0.4 \
    data/raw/hf.parquet:0.3 \
    data/raw/jax.parquet:0.15 \
    data/raw/vllm.parquet:0.15 \
    -o data/tokenized/train/mixed.parquet \
    --seed 42

v2 pipeline (Tier 1 upsampled 10x + 8 Tier 2 repos; 45K rows → 68K sequences):

# Convert Tier 2 source repos to Parquet (alimentar can't read source dirs)
for repo in pytorch hf-repos mlflow vllm-full tgi algo-corpus cuda-python llms-with-hf; do
    python3 scripts/source-to-parquet.py ~/src/$repo $repo data/parquet/tier2/$repo.parquet
done
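
scripts/source-to-parquet.py is not reproduced here; a minimal sketch of what such a converter does, assuming a simple repo/path/text schema (the real column names may differ):

# Minimal sketch, not the actual scripts/source-to-parquet.py. Walks a source
# tree, reads UTF-8 text files, and writes one row per file to a Parquet shard.
import sys
from pathlib import Path
import pyarrow as pa
import pyarrow.parquet as pq

def convert(src_dir: str, repo_name: str, out_path: str) -> None:
    rows = {"repo": [], "path": [], "text": []}
    for path in Path(src_dir).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary and unreadable files
        rows["repo"].append(repo_name)
        rows["path"].append(str(path.relative_to(src_dir)))
        rows["text"].append(text)
    pq.write_table(pa.table(rows), out_path)

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2], sys.argv[3])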

# Mix Tier 1 (10x upsampled) + Tier 2 (1x)
alimentar mix \
    data/parquet/depyler/shard_0000.parquet:10.0 \
    data/parquet/hf-ground-truth/shard_0000.parquet:10.0 \
    data/parquet/jax/shard_0000.parquet:10.0 \
    data/parquet/vllm/shard_0000.parquet:10.0 \
    data/parquet/tier2/pytorch.parquet:1.0 \
    data/parquet/tier2/hf-repos.parquet:1.0 \
    data/parquet/tier2/mlflow.parquet:1.0 \
    data/parquet/tier2/vllm-full.parquet:1.0 \
    data/parquet/tier2/tgi.parquet:1.0 \
    data/parquet/tier2/algo-corpus.parquet:1.0 \
    data/parquet/tier2/cuda-python.parquet:1.0 \
    data/parquet/tier2/llms-with-hf.parquet:1.0 \
    -o data/staging/mixed-expanded.parquet --seed 42

# Apply FIM (50% PSM)
alimentar fim data/staging/mixed-expanded.parquet \
    -o data/staging/mixed-expanded-fim.parquet --rate 0.5 --format psm --seed 42
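
For reference, a sketch of the PSM (prefix-suffix-middle) transform the fim step applies at a 50% rate; the sentinel strings and cut-point strategy below are illustrative assumptions, not alimentar's exact behavior:

# Illustrative sketch of a 50% PSM fill-in-the-middle transform. Sentinel
# tokens and split logic are assumptions; alimentar's implementation may differ.
import random

def fim_psm(text: str, rng: random.Random, rate: float = 0.5) -> str:
    if len(text) < 2 or rng.random() >= rate:
        return text  # leave roughly (1 - rate) of documents untouched
    a, b = sorted(rng.sample(range(len(text)), 2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    # PSM order: prefix, then suffix, then the middle the model learns to fill.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

rng = random.Random(42)  # matches the pipeline's --seed 42
fim_psm("def add(a, b):\n    return a + b\n", rng)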

Step 2: Tokenizer Training

# v1 tokenizer (whitespace-split BPE — has ALB-036 limitation)
apr tokenize apply \
    --data data/staging/corpus-raw.txt \
    --vocab-size 32768 \
    --algorithm bpe \
    -o models/albor-tokenizer/ \
    --max-lines 100000

# v2 tokenizer (ByteLevel BPE — preserves whitespace)
python scripts/train-tokenizer-v2.py \
    --corpus data/staging/corpus-raw.txt \
    --vocab-size 32768 \
    --output models/albor-tokenizer-v2/
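
The core of scripts/train-tokenizer-v2.py is a standard ByteLevel BPE training run; a sketch assuming it uses the HuggingFace tokenizers library (the special-token list is an assumption):

# Sketch of a ByteLevel BPE tokenizer training run with the HuggingFace
# `tokenizers` library; the special-token list is an assumption.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32768,
    special_tokens=["<|endoftext|>", "<fim_prefix>", "<fim_middle>", "<fim_suffix>"],
)
tokenizer.train(["data/staging/corpus-raw.txt"], trainer)
tokenizer.save("models/albor-tokenizer-v2/tokenizer.json")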

Step 3: Pre-Tokenization

# Pre-tokenize full training data (v2 tokenizer, 2048-token chunks)
python scripts/pretokenize.py \
    --input data/tokenized/train/mixed.parquet \
    --tokenizer models/albor-tokenizer-v2/tokenizer.json \
    --seq-len 2048 \
    --output data/pretokenized-2048/train/train.parquet

# Pre-tokenize validation data
python scripts/pretokenize.py \
    --input data/tokenized/val/val.parquet \
    --tokenizer models/albor-tokenizer-v2/tokenizer.json \
    --seq-len 2048 \
    --output data/pretokenized-2048/val/val.parquet
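
Conceptually, pre-tokenization encodes every document with the v2 tokenizer and packs the token stream into fixed 2048-token sequences; a simplified sketch (the column names and the drop-last packing strategy are assumptions, not the real script):

# Simplified sketch of the pre-tokenization step: encode every document and
# pack the concatenated token stream into fixed-length 2048-token sequences.
# The "text"/"input_ids" column names are assumptions about the real schema.
import pyarrow as pa
import pyarrow.parquet as pq
from tokenizers import Tokenizer

SEQ_LEN = 2048
tokenizer = Tokenizer.from_file("models/albor-tokenizer-v2/tokenizer.json")
table = pq.read_table("data/tokenized/train/mixed.parquet")

stream = []
for text in table.column("text").to_pylist():
    stream.extend(tokenizer.encode(text).ids)

# Drop the trailing partial chunk so every sequence is exactly SEQ_LEN tokens.
chunks = [stream[i:i + SEQ_LEN] for i in range(0, len(stream) - SEQ_LEN + 1, SEQ_LEN)]
pq.write_table(pa.table({"input_ids": chunks}), "data/pretokenized-2048/train/train.parquet")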

Step 4: Model Training

# 50M pipeline validation (< 2 minutes)
make train-50m
# Equivalent to:
# apr train apply --task pretrain --config configs/train/pretrain-50m.yaml

# 350M base model, v2 data (~20 hours on RTX 4090)
apr train apply --task pretrain --config configs/train/pretrain-350m-v2.yaml
# v2 config: epochs=38, warmup=500, 67977 seqs, 5000 max_steps
# C-TRAINCFG-001 verified: steps_per_epoch=132, 38×132=5016 >= 5000

# Legacy v1 (22K seqs, fixed epochs=117 post ALB-060)
# apr train apply --task pretrain --config configs/train/pretrain-350m.yaml
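
The C-TRAINCFG-001 comment above reduces to a few lines of arithmetic; the effective batch size of 512 below is inferred from 67977 // 512 = 132 steps per epoch, not read from the config:

# Sanity check behind C-TRAINCFG-001: the epoch count must cover max_steps.
# batch_size = 512 is an inference from steps_per_epoch = 132, not a config value.
num_seqs, batch_size = 67977, 512
epochs, max_steps = 38, 5000

steps_per_epoch = num_seqs // batch_size      # 132
assert epochs * steps_per_epoch >= max_steps  # 38 * 132 = 5016 >= 5000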

Step 5: Checkpoint Conversion (for evaluation)

# Convert entrenar 1D-flat SafeTensors to realizar 2D format
python scripts/convert-checkpoint.py checkpoints/albor-base-350m/ \
    --config configs/train/pretrain-350m.yaml
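
A rough sketch of what the conversion amounts to, assuming the target 2-D shape for each parameter can be looked up by name (the real script derives the shapes from the YAML config; the shape map and output filename here are hypothetical):

# Rough sketch, not the actual scripts/convert-checkpoint.py: reshape each
# flat 1-D tensor into the 2-D shape the config implies. The shape map and
# output filename are hypothetical.
from safetensors.torch import load_file, save_file

flat = load_file("checkpoints/albor-base-350m/model.safetensors")
shapes = {"embed.weight": (32768, 1024)}  # hypothetical entry; real map comes from the YAML

converted = {
    name: tensor.reshape(shapes[name]) if name in shapes else tensor
    for name, tensor in flat.items()
}
save_file(converted, "checkpoints/albor-base-350m/model-2d.safetensors")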

Step 6: Evaluation

# Validate all benchmarks (no model needed)
make eval-validate

# Perplexity evaluation (needs trained model)
make eval-perplexity-350m

# Monitor active training
make training-status
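
Perplexity is the exponential of the mean per-token negative log-likelihood over the validation set; a minimal helper illustrating the definition (not the actual eval harness):

# Perplexity = exp(mean per-token negative log-likelihood). Minimal helper
# illustrating the definition; the real eval harness computes NLL with the model.
import math

def perplexity(nll_per_token):
    """nll_per_token: negative log-likelihood (nats) for each validation token."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))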

16.3 Key SHA-256 Checksums

See docs/PROVENANCE.md for complete checksums. Key artifacts:

Artifact | SHA-256 (first 8 hex)
-------- | ---------------------
Training data (mixed.parquet) | bdfe8742
Val data (val.parquet) | 6be03768
v1 tokenizer (vocab.json) | aca6fa72
v2 tokenizer (tokenizer.json) | d999cc9e
Pre-tokenized train (2048) | 4f54e422
Pre-tokenized val (2048) | c9c1d093

16.4 Verification

# Verify data checksums
sha256sum data/tokenized/train/mixed.parquet
sha256sum data/pretokenized-2048/train/train.parquet
sha256sum models/albor-tokenizer-v2/tokenizer.json

# Verify training config reproducibility
apr train plan --task pretrain --config configs/train/pretrain-350m.yaml

# Verify contract integrity
pv validate contracts/*.yaml
pv coverage contracts
pv audit contracts/*.yaml
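
The prefixes in section 16.3 can also be checked programmatically; a small hypothetical helper (full digests live in docs/PROVENANCE.md):

# Hypothetical helper: compare the first 8 hex digits of each artifact's
# SHA-256 against the prefixes listed in section 16.3.
import hashlib

EXPECTED = {
    "data/tokenized/train/mixed.parquet": "bdfe8742",
    "data/pretokenized-2048/train/train.parquet": "4f54e422",
    "models/albor-tokenizer-v2/tokenizer.json": "d999cc9e",
}

for path, prefix in EXPECTED.items():
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    status = "OK" if digest.startswith(prefix) else "MISMATCH"
    print(f"{status}  {path}  {digest[:8]}")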