Appendix G: Data Pipeline

This appendix documents the Phase 1 data ingestion, tokenization, and augmentation pipeline.

Source Corpora

| Source | Repository | Files | Rows | Parquet Size |
|---|---|---|---|---|
| depyler | depyler examples + TDD book | 1,843 | 1,843 | 6 MB |
| hf-ground-truth | HuggingFace ground truth corpus | 11,928 | 11,493 | 197 MB |
| jax-ground-truth | JAX ground truth corpus | 2,697 | 2,637 | 50 MB |
| vllm-ground-truth | vLLM ground truth corpus | 1,118 | 1,100 | 18 MB |

All sources are Python code, collected via alimentar import local.

Training Mix

Weighted sampling with Tier 1 (depyler) upsampled:

alimentar mix \
  depyler.parquet:0.4 \
  hf.parquet:0.3 \
  jax.parquet:0.15 \
  vllm.parquet:0.15 \
  --output mixed.parquet \
  --seed 42

Result: 17,070 rows (depyler upsampled 3.7x from 1,843 to ~6,829).
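
The weight arithmetic behind that result can be sketched in a few lines of Python. This is illustrative only; alimentar's actual sampling logic may differ, and the polars-based mix_parquet helper below is hypothetical:

import polars as pl

def mix_parquet(weights: dict[str, float], total_rows: int, seed: int, out: str) -> None:
    """Weight-proportional sampling with upsampling (sketch, not alimentar itself)."""
    parts = []
    for name, weight in weights.items():
        df = pl.read_parquet(f"{name}.parquet")
        n_target = round(total_rows * weight)  # depyler: 0.4 * 17,070 ~= 6,828 (~3.7x of 1,843)
        # Sample with replacement whenever the target exceeds the source size (upsampling).
        parts.append(df.sample(n=n_target, with_replacement=n_target > df.height, seed=seed))
    pl.concat(parts).sample(fraction=1.0, shuffle=True, seed=seed).write_parquet(out)

mix_parquet({"depyler": 0.40, "hf": 0.30, "jax": 0.15, "vllm": 0.15},
            total_rows=17_070, seed=42, out="mixed.parquet")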

Data Splits

| Split | Rows | Size | Seed | Weights |
|---|---|---|---|---|
| train | 17,070 | 201 MB | 42 | depyler:0.4 hf:0.3 jax:0.15 vllm:0.15 |
| val | 500 | 7 MB | 123 | equal (0.25 each) |
| test | 200 | 2.4 MB | 456 | equal (0.25 each) |
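
Using the hypothetical mix_parquet helper sketched above, the val and test splits amount to two further independent mixes with equal weights and their own seeds:

# Illustrative: distinct seeds keep each split independently reproducible.
equal = {"depyler": 0.25, "hf": 0.25, "jax": 0.25, "vllm": 0.25}
mix_parquet(equal, total_rows=500, seed=123, out="val.parquet")
mix_parquet(equal, total_rows=200, seed=456, out="test.parquet")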

FIM Augmentation

Fill-in-the-Middle (FIM) transforms are applied via alimentar fim:

alimentar fim mixed.parquet \
  --output mixed-fim.parquet \
  --column text \
  --rate 0.5 \
  --format psm \
  --seed 42

  • Format: PSM (Prefix-Suffix-Middle)
  • Rate: 50% of rows receive the FIM transform
  • Sentinel tokens: <|fim_prefix|>, <|fim_suffix|>, <|fim_middle|>
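
For intuition, a hypothetical PSM transform on one row could look like the following; the actual split-point policy inside alimentar fim is not documented here, so the random cut points are an assumption:

import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def fim_psm(text: str, rate: float, rng: random.Random) -> str:
    """Rearrange a row into prefix/suffix/middle order for `rate` of rows (sketch)."""
    if rng.random() >= rate or len(text) < 2:
        return text  # row passes through unchanged
    a, b = sorted(rng.sample(range(len(text)), 2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    # PSM ordering: the model sees prefix and suffix, then predicts the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

rng = random.Random(42)
print(fim_psm("def add(a, b):\n    return a + b\n", rate=0.5, rng=rng))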

BPE Tokenizer

Trained via apr tokenize apply:

apr tokenize apply \
  --data corpus-raw.txt \
  --vocab-size 32768 \
  --algorithm bpe \
  --max-lines 100000 \
  -o tokenizer/

Results:

  • Final vocab size: 32,768
  • Merges: 32,518
  • Training time: 2022.5s (~33.7 min)
  • Training data: 100K lines of Python code
  • Special tokens: <unk>, <s>, </s>, <pad>
  • Python pattern coverage: 8/8 (def, return, self, import, class, for, if, in)
  • Output: tokenizer/vocab.json + tokenizer/merges.txt
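
The pattern-coverage figure can be sanity-checked directly against the emitted vocab.json (sketch; it assumes the keywords appear as standalone vocab entries, possibly carrying the </w> end-of-word suffix):

import json

with open("tokenizer/vocab.json") as f:
    vocab = json.load(f)  # token -> id mapping written by apr tokenize apply

keywords = ["def", "return", "self", "import", "class", "for", "if", "in"]
covered = [kw for kw in keywords if kw in vocab or f"{kw}</w>" in vocab]
print(f"Python pattern coverage: {len(covered)}/{len(keywords)}")  # expected: 8/8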

HuggingFace tokenizer.json Conversion

Entrenar requires HuggingFace tokenizer.json format, but apr tokenize apply produces raw vocab.json + merges.txt. A Python conversion step bridges the gap (ALB-033):

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# vocab: dict of token -> id loaded from tokenizer/vocab.json
# merges: list of (left, right) pairs parsed from tokenizer/merges.txt
bpe = models.BPE(vocab=vocab, merges=merges, end_of_word_suffix='</w>')
tokenizer = Tokenizer(bpe)
# Split on spaces to mirror aprender's split_whitespace() pre-tokenization.
tokenizer.pre_tokenizer = pre_tokenizers.Split(pattern=' ', behavior='removed')
tokenizer.decoder = decoders.BPEDecoder(suffix='</w>')
tokenizer.save('models/albor-tokenizer/tokenizer.json')

Key details:

  • Merges must be string format ("i n") not array format (["i", "n"])
  • Pre-tokenizer matches aprender’s split_whitespace() behavior
  • </w> end-of-word suffix matches aprender’s BPE encoding
  • Regular vocab: 32,768 tokens (IDs 0-32767)
  • FIM special tokens: 3 additional (IDs 32768-32770)
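
A short usage sketch for the FIM sentinels; the exact registration call used in the project is an assumption, but appending them after the 32,768-entry base vocab is what yields IDs 32768-32770:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file('models/albor-tokenizer/tokenizer.json')
# Special tokens are appended after the base vocab, so they land at 32768-32770.
tokenizer.add_special_tokens(['<|fim_prefix|>', '<|fim_suffix|>', '<|fim_middle|>'])
print(tokenizer.token_to_id('<|fim_prefix|>'))  # expected: 32768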

Parquet Schema

All data files use a consistent schema:

{
  text: Utf8,    -- Python source code
  source: Utf8,  -- Corpus name (depyler, hf, jax, vllm)
  file: Utf8     -- Original file path
}
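
For reference, the same schema can be declared in Python with pyarrow (illustrative; the pipeline itself writes these files via alimentar and the conversion scripts):

import pyarrow as pa

# Every data artifact in the pipeline carries these three string columns.
SCHEMA = pa.schema([
    ("text", pa.string()),    # Python source code
    ("source", pa.string()),  # corpus name: depyler, hf, jax, vllm
    ("file", pa.string()),    # original file path
])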

Provenance

SHA-256 hashes for all data artifacts are recorded in docs/PROVENANCE.md. Each split uses a different random seed for reproducibility.
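
A minimal sketch of how such hashes can be produced for the Parquet artifacts (the actual recording script and directory layout are not shown in this appendix, so the data/ glob is an assumption):

import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large Parquet artifacts stay memory-friendly."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

for artifact in sorted(Path("data").glob("**/*.parquet")):
    print(f"{sha256_of(artifact)}  {artifact}")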

ByteLevel BPE Tokenizer (v2)

The v1 tokenizer (from apr tokenize apply) normalizes whitespace, which loses Python indentation. The v2 tokenizer uses ByteLevel BPE (like GPT-2/CodeLlama):

from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32768, special_tokens=[...])
tokenizer.train(["corpus-raw.txt"], trainer)
tokenizer.save("models/albor-tokenizer-v2/tokenizer.json")

  • Vocab: 32,768 (same size, different encoding)
  • Roundtrip: 6/6 PASS (preserves newlines, indentation, blank lines)
  • Merges: 32,557
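
A roundtrip check of the kind summarized above can be reproduced like so (sketch; the six snippets used for the official 6/6 result are not listed here):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("models/albor-tokenizer-v2/tokenizer.json")

snippet = "def add(a, b):\n    return a + b\n\n\nclass Adder:\n    pass\n"
ids = tokenizer.encode(snippet).ids
# ByteLevel BPE restores the exact bytes: indentation, newlines, and blank lines survive.
assert tokenizer.decode(ids) == snippet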

Pre-Tokenized Data

Training data pre-tokenized and chunked for efficient training:

| Dataset | Sequences | Seq Length | Total Tokens | Format |
|---|---|---|---|---|
| pretokenized-2048/train (v1) | 22,079 | 2048 | 45.2M | Parquet (input_ids: List<u32>) |
| pretokenized-2048/val | 814 | 2048 | 1.7M | Parquet (input_ids: List<u32>) |
| pretokenized-2048-v2/train | 67,977 | 2048 | 139M | Parquet (input_ids: List<u32>) |
| pretokenized-2048-v2/val | 814 | 2048 | 1.7M | Parquet (reused from v1) |

Pre-tokenization avoids the entrenar↔aprender BPE compatibility issue (ALB-033) and enables direct input_ids column loading.
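
A simplified version of that pre-tokenization step (hypothetical; the real pretokenize.py may pack, pad, or deduplicate sequences differently):

import polars as pl
from tokenizers import Tokenizer

SEQ_LEN = 2048
tokenizer = Tokenizer.from_file("models/albor-tokenizer-v2/tokenizer.json")

texts = pl.read_parquet("mixed-fim.parquet")["text"].to_list()
# Concatenate all token ids into one stream, then slice it into fixed 2048-token sequences.
stream: list[int] = []
for text in texts:
    stream.extend(tokenizer.encode(text).ids)
sequences = [stream[i:i + SEQ_LEN] for i in range(0, len(stream) - SEQ_LEN + 1, SEQ_LEN)]

pl.DataFrame({"input_ids": sequences}, schema={"input_ids": pl.List(pl.UInt32)}) \
    .write_parquet("pretokenized-2048-v2/train.parquet")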

v2 Data Expansion (2026-03-03)

The v2 dataset expands from Tier 1 only to Tier 1 (10x upsampled) + 8 Tier 2 repos:

| Source | Type | Files | Weight |
|---|---|---|---|
| depyler | Tier 1 | 1,843 | 10x |
| hf-ground-truth | Tier 1 | 11,493 | 10x |
| jax-ground-truth | Tier 1 | 2,637 | 10x |
| vllm-ground-truth | Tier 1 | 1,100 | 10x |
| pytorch | Tier 2 | 3,801 | 1x |
| hf-repos | Tier 2 | 19,781 | 1x |
| mlflow | Tier 2 | 1,780 | 1x |
| vllm-full | Tier 2 | 2,239 | 1x |
| tgi | Tier 2 | 372 | 1x |
| algo-corpus | Tier 2 | 186 | 1x |
| cuda-python | Tier 2 | 157 | 1x |
| llms-with-hf | Tier 2 | 37 | 1x |

Pipeline: source-to-parquet.py → alimentar mix → alimentar fim (50% PSM) → pretokenize.py

Key finding: alimentar import local expects data files (CSV/JSON/Parquet), not source code directories. The workaround script scripts/source-to-parquet.py converts Python repos to Parquet with the Tier 1 schema (file, source, text columns).
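
A sketch of what such a conversion looks like (the real scripts/source-to-parquet.py may add filtering, size limits, or error handling beyond this):

import sys
from pathlib import Path
import polars as pl

def repo_to_parquet(repo_dir: str, source_name: str, out_path: str) -> None:
    """Collect every .py file under repo_dir into the Tier 1 schema (file, source, text)."""
    records = []
    for path in sorted(Path(repo_dir).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip files that are not valid UTF-8
        records.append({"file": str(path), "source": source_name, "text": text})
    pl.DataFrame(records).write_parquet(out_path)

if __name__ == "__main__":
    repo_to_parquet(sys.argv[1], sys.argv[2], sys.argv[3])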

Result: 45,420 mixed rows → 67,977 pretokenized sequences × 2048 = 139M tokens (191 MiB).

Tools Used

  • alimentar import local — JSONL to Parquet conversion
  • alimentar mix — weighted sampling with upsampling
  • alimentar fim — Fill-in-the-Middle augmentation
  • apr tokenize plan/apply — BPE vocabulary training (v1, whitespace-split)
  • Python tokenizers — ByteLevel BPE training (v2, whitespace-preserving)
  • scripts/source-to-parquet.py — Python source code to Parquet (for Tier 2 repos)
  • entrenar (parquet feature) — Parquet-to-LMBatch bridge for training